Feature #26 error handling for processes

georgemccabe commented 1 year ago

Closes #26

We should create a related issue to enhance UI to allow users to easily view the log files.

Side Note: It looks like there is a lot of duplicate code in the run_ functions for each Process sub-class that could be pulled up to the Process class. We could add a bool class variable self.ALLOW_BATCH to determine if the submitjob logic can be called for the process if number of cores != 1. Also, the processes call os.chdir which changes the directory that the commands are run from. This can make the logic confusing and was the reason that the logs.zip file was not found in the expected directory on the cluster. It would be nice to refactor the logic to write files to explicit directories to avoid switching directories this way. I know some applications must be run from a specific directory, but it would be easier to follow if the directory was changed only in the shell/batch command or if it was done in the run (rename to run_app?) function and immediately change directories back to the original to prevent confusion.

Expected Differences

[X] Do these changes modify the system output in any way? [Yes]

Failures from each process are caught and reported to UI. Status message displayed on UI for a failed job will include the process name, e.g. GeoGrid failed, if the process returned a non-zero error code, if none of the expected output files exist, or if the log file does not contain text that signifies a successful run. If something else goes wrong in the run logic, the status message will just say Failed.

Pull Request Testing

[X] Describe testing already performed for these changes:
Modified expected file paths, log file path, and string to find in log file to force runs to fail -- confirmed the expected error message is shown as the job status message
WABE408C1B3: Set small 100x100 domain configuration (midwest_test_cores_copy) to use 96 cores, which previously caused a failure, and confirmed that the failure is now caught from the application that failed (Real).
W4C3299C43F: Changed num_metgrid_levels=34 to num_metgrid_levels=27 in config (midwest_test_cores) and confirmed that failure is now caught in Real
W146C654FDD: This run should succeed with no errors.
[X] Recommend testing for the reviewer(s) to perform, including the location of input datasets, and any additional instructions:
Review code changes
(Optionally) create a bad configuration (namelist) that should cause a failure and confirm that the expected process is listed in the status message for the failure.
[X] Do these changes include sufficient documentation updates, ensuring that no errors or warnings exist in the build of the documentation? [No]

We should provide more information on expected status messages on failure (see above).

[X] Do these changes include sufficient testing updates? [Yes]
[X] Will this PR result in changes to the test suite? [No]
[X] Please complete this pull request review by 5/11/2023.

Pull Request Checklist

[X] Review the source issue metadata (labels, project, and milestone).
[X] Complete the PR definition above.
[X] Ensure the PR title matches the feature or bugfix branch name.
[X] Define the PR metadata, as permissions allow. Select: Reviewer(s) Select: Project Select: Milestone as the version that will include these changes Select: Development to link to the original development issue.
[ ] After the PR is approved, merge your changes. If permissions do not allow this, request that the reviewer do the merge.
[ ] Close the linked issue and delete your feature or bugfix branch from GitHub.

fossell commented 1 year ago

@georgemccabe - Can you make two new Issues to capture the enhancements noted in this PR, e.g. 1) Reduce duplicate code in the run_* functions for each Process sub-class that could be pulled up to the Process class. and 2) refactor the logic to write files to explicit directories to avoid switching directories. These can be placed in the backlog milestone so we don't lose track of these improvements. I added an addition issue #200 for documentation related to this PR and job status, but as noted in that issue I'm holding off in case work on #198 changes documentation as well.

fossell commented 1 year ago

@georgemccabe - Also thanks for finding and fixing that readthedocs issue! I was having the same problem but hadn't figured out how to fix it.

NCAR / wrfcloud