NCAR / wrfcloud

WRF Cloud Framework
Apache License 2.0
14 stars 6 forks source link

Feature #26 error handling for processes #197

Closed georgemccabe closed 1 year ago

georgemccabe commented 1 year ago

Closes #26

We should create a related issue to enhance UI to allow users to easily view the log files.

Side Note: It looks like there is a lot of duplicate code in the run_ functions for each Process sub-class that could be pulled up to the Process class. We could add a bool class variable self.ALLOW_BATCH to determine if the submitjob logic can be called for the process if number of cores != 1. Also, the processes call os.chdir which changes the directory that the commands are run from. This can make the logic confusing and was the reason that the logs.zip file was not found in the expected directory on the cluster. It would be nice to refactor the logic to write files to explicit directories to avoid switching directories this way. I know some applications must be run from a specific directory, but it would be easier to follow if the directory was changed only in the shell/batch command or if it was done in the run (rename to run_app?) function and immediately change directories back to the original to prevent confusion.

Expected Differences

Failures from each process are caught and reported to UI. Status message displayed on UI for a failed job will include the process name, e.g. GeoGrid failed, if the process returned a non-zero error code, if none of the expected output files exist, or if the log file does not contain text that signifies a successful run. If something else goes wrong in the run logic, the status message will just say Failed.

Pull Request Testing

We should provide more information on expected status messages on failure (see above).

Pull Request Checklist

fossell commented 1 year ago

@georgemccabe - Can you make two new Issues to capture the enhancements noted in this PR, e.g. 1) Reduce duplicate code in the run_* functions for each Process sub-class that could be pulled up to the Process class. and 2) refactor the logic to write files to explicit directories to avoid switching directories. These can be placed in the backlog milestone so we don't lose track of these improvements. I added an addition issue #200 for documentation related to this PR and job status, but as noted in that issue I'm holding off in case work on #198 changes documentation as well.

fossell commented 1 year ago

@georgemccabe - Also thanks for finding and fixing that readthedocs issue! I was having the same problem but hadn't figured out how to fix it.