Preamble:
The thing currently named wrapper (wrapper.py) works for testing fre-cli when all history files are already available at run time. In production, however, the models will be sending over bundles of history files, and those bundles will be post-processed in parallel. This breaks two assumptions in the current wrapper flow: that the user does not already have an existing experiment to which the current set of history files is being added, and that no experiment with that name is already running.
The logic we need is encapsulated in a flowchart at the end of this issue; this breaks it down by tool.
The tool:
fre pp status needs some of the logic that we would normally apply to Slurm jobs: what's your status? Are you done yet? The logic won't differ much from anything else that checks on a running job, but building tests for artificially stalled experiments is going to take some effort. It may be possible to test this section with experiments kicked off by fre pp checkout or fre pp validate, though that makes debugging harder.
fre pp status:
[ ] Has the job completed?
[ ] Is the job running or stalled?
    [ ] If stalled: exit with error (we MIGHT be able to correct stalled jobs in the future)
        [ ] corresponding test: stall experiment and check the status
    [ ] If running: wait and check again
        [ ] corresponding test: check on a happily-running experiment
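
The checklist above can be sketched as a small polling loop. This is only a rough illustration, not the fre-cli implementation: `JobState`, `StalledJobError`, `wait_for_completion`, and the `get_state` callback are all hypothetical names, and how the real tool would determine an experiment's state is deliberately left out. The poll interval and the cap on the number of checks are also assumptions.

```python
import time
from enum import Enum
from typing import Callable


class JobState(Enum):
    """Hypothetical states an experiment could report."""
    COMPLETED = "completed"
    RUNNING = "running"
    STALLED = "stalled"


class StalledJobError(RuntimeError):
    """Raised when the experiment is reported as stalled."""


def wait_for_completion(
    get_state: Callable[[], JobState],
    poll_interval: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll until the job completes and return the number of checks made.

    get_state is a stand-in for whatever actually queries the experiment;
    sleep is injectable so tests do not have to wait in real time.
    """
    for attempt in range(1, max_polls + 1):
        state = get_state()
        if state is JobState.COMPLETED:
            return attempt
        if state is JobState.STALLED:
            # Per the checklist: exit with an error rather than try to recover.
            raise StalledJobError("experiment is stalled")
        # Still running: wait and check again.
        sleep(poll_interval)
    raise TimeoutError(f"job still running after {max_polls} status checks")
```

Passing the state lookup in as a callback means both checklist tests could, in principle, be driven by a canned sequence of states (for example running, running, stalled) instead of a genuinely stalled experiment, which might reduce the effort of setting up artificially stalled runs.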