Wrapper logic for production jobs

cwhitlock-NOAA commented 2 months ago

The thing-currently-named-wrapper (wrapper.py) works for testing fre-cli, where all history files are already available at run time. In production jobs, however, the models will send over bundles of history files and post-process them in parallel. This breaks two assumptions in the current wrapper flow: that the current set of history files is not being added to a pre-existing experiment belonging to the same user, and that no experiment with that name is already running.

For the production jobs, we need the logic outlined in the README for the fre pp tools (see end of this issue). This involves new logic (and tests of the new logic) in several tools: fre pp checkout, fre pp validate, fre pp run and fre pp status.

fre pp checkout:

[ ] Can we check out a branch other than main?
[ ] Is there a preexisting experiment with the same -e -p -t as the current experiment?
[ ] If so, do the remote and the local / branch specifications match?

fre pp validate:

[ ] Is the pp.yaml up-to-date with the current configuration?
[ ] If not, update pp.yaml

fre pp run:

[x] Is the -e -p -t workflow installed? If not, install. (current behavior)
[ ] If it is installed, does the workflow match the existing config? Check the codebase too.
If answer is yes:
- [ ] If the workflow is not running, start running - cylc run
- [ ] If the workflow is already running (i.e. a previous set of history files started the experiment) - cylc trigger
If answer is no:
- [ ] Do not automatically exit - prompt user for more info. The current experiment may be overriding a previous configuration.
- [ ] If we want new config and nothing is running - cylc run
- [ ] If we want new config and there's already an experiment running (i.e. you noticed a config error too late) - cylc reload
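
For discussion, here is a rough sketch of the fre pp run decision tree above as a pure function, so the branching can be unit tested separately from any cylc calls. Everything in it is hypothetical and not existing fre-cli code: the `Action` enum, the `decide_run_action` name, and its boolean inputs are placeholders. The `ABORT` branch (user declines the new config) is an assumption, since the checklist does not say what should happen in that case.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    """Hypothetical labels for the outcomes named in the checklist."""
    INSTALL = "install workflow"
    RUN = "cylc run"
    TRIGGER = "cylc trigger"
    RELOAD = "cylc reload"
    PROMPT = "prompt user"
    ABORT = "abort"  # assumption: not covered by the checklist


def decide_run_action(
    installed: bool,
    config_matches: bool,
    running: bool,
    want_new_config: Optional[bool] = None,
) -> Action:
    """Map the checklist questions to a single next step.

    want_new_config is None until the user has been prompted.
    """
    if not installed:
        # Current behavior: install the -e -p -t workflow first.
        return Action.INSTALL
    if config_matches:
        # "Yes" branch: add to a running workflow, or start it.
        return Action.TRIGGER if running else Action.RUN
    # "No" branch: never exit silently; ask the user first.
    if want_new_config is None:
        return Action.PROMPT
    if not want_new_config:
        # Assumption: declining the new config stops here.
        return Action.ABORT
    return Action.RELOAD if running else Action.RUN


# A later bundle of history files arrives for a matching, running workflow.
print(decide_run_action(installed=True, config_matches=True, running=True).value)
# prints "cylc trigger"
```

Keeping the decision separate from the side effects would let both the tool and the wrapper call the same function.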

fre pp status:

[ ] Has the job completed?
[ ] Is the job running or stalled?
[ ] If stalled: exit with error (we MIGHT be able to correct stalled jobs in the future)
[ ] If running: wait and check again
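
The status checks above amount to a polling loop. A minimal sketch, assuming a hypothetical `get_state` callable that returns one of the strings "completed", "running" or "stalled" (how the state is actually queried from the workflow is left open here, and these state names are placeholders):

```python
import time
from typing import Callable


def wait_for_completion(
    get_state: Callable[[], str],
    poll_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll until the job completes; raise if it stalls.

    get_state is a hypothetical hook, not an existing fre-cli function.
    sleep is injectable so tests do not have to wait in real time.
    """
    while True:
        state = get_state()
        if state == "completed":
            return
        if state == "stalled":
            # Exit with error; correcting stalled jobs is future work.
            raise RuntimeError("workflow is stalled")
        if state != "running":
            raise ValueError(f"unexpected workflow state: {state!r}")
        # Still running: wait and check again.
        sleep(poll_seconds)
```

A production version would probably also want an overall timeout so a job that never changes state does not block the wrapper forever.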

Chris and I have discussed adding the decision logic to the wrapper script itself versus the individual tools that it calls; we came down on the side of the logic being useful at a tool level as well as the wrapper level and thus a good addition to the tools themselves.

Please note: this logic is only strictly necessary for production jobs. For our tests, where we can assume that all the history files are already present, this is not needed - and the existing wrapper script works fine, according to the two people who have used it.

cwhitlock-NOAA commented 1 month ago

Closing this issue to divide it into 4 smaller issues - see the "wrapper logic: $tool" issues for more.

ceblanton commented 1 week ago

agree this for 2025.01

NOAA-GFDL / fre-cli

Wrapper logic for production jobs #171