mtanneau opened this issue 4 months ago
Could we have each job submit the next one upon completion? That would also let us decide how much time/memory to give the sampler job based on the ref job.
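The chained-submission idea could look like the sketch below: the tail of the ref job's batch script submits the sampler job once the ref step has succeeded. Script names (`make_ref.sbatch`, `sampler.sbatch`) and the resource values are hypothetical, not the repo's actual scripts.

```shell
#!/usr/bin/env bash
#SBATCH --job-name=make_ref
#SBATCH --time=01:00:00        # hypothetical resource request for the ref job
set -euo pipefail

config=$1
julia --project=. slurm/make_ref.jl "$config"

# Because of `set -e`, this point is only reached if make_ref succeeded.
# The sampler job's time/memory could be chosen here, e.g. based on how
# large the reference case turned out to be.
sbatch --time=04:00:00 sampler.sbatch "$config"
```

A failed ref job then simply never submits the sampler, so nothing is left pending in the queue.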
Honestly, the current behavior isn't too bad. Instead of failing silently and leaving the user to investigate why the jobs finished but the results files are not there, the user simply checks the queue, sees where it failed, and can fix & re-submit.
Using this issue as a catch-all for potential future pipeline improvements...
Right now the pipeline only works out-of-the-box if submitting to a SLURM cluster.
Currently, to run locally, one can use the commands below and then delete the `$export_dir/res_h5` folder. One problem with this is that it doesn't (automatically) parallelize.

```shell
julia --project=. slurm/make_ref.jl path/to/config.toml
julia --project=. sampler.jl path/to/config.toml 1 100
julia --project=. slurm/merge.jl path/to/config.toml
julia --project=. slurm/cleanup.jl path/to/config.toml
```
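A minimal sketch of how the sampler step could be parallelized locally, assuming the sampler's two trailing arguments are an inclusive seed range (as in the `1 100` call above). The `seed_chunks` helper is hypothetical, and the loop below only prints the commands as a dry run; dropping the `echo` and backgrounding each command would actually run them.

```shell
#!/usr/bin/env bash

# seed_chunks SMIN SMAX NPROC: print one "lo hi" pair per worker,
# splitting the inclusive seed range into NPROC contiguous chunks.
seed_chunks() {
    local smin=$1 smax=$2 nproc=$3
    local chunk=$(( (smax - smin + nproc) / nproc ))  # ceiling division
    local i lo hi
    for ((i = 0; i < nproc; i++)); do
        lo=$(( smin + i * chunk ))
        hi=$(( lo + chunk - 1 ))
        if (( hi > smax )); then hi=$smax; fi
        if (( lo > smax )); then break; fi
        echo "$lo $hi"
    done
}

# Dry run: print one sampler invocation per chunk. Dropping `echo` and
# appending ` &` (plus a final `wait`) would run them in parallel.
seed_chunks 1 100 4 | while read -r lo hi; do
    echo julia --project=. sampler.jl path/to/config.toml "$lo" "$hi"
done
```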
I think it would require only minor edits to provide a similar experience to the SLURM pipeline when running locally, but it's not high priority.
> Right now the pipeline only works out-of-the-box if submitting to a SLURM cluster.
I'm OK with that; we are the primary users of this workflow.
> to run locally [...] one problem is that it doesn't (automatically) parallelize.
I'm also OK with that for now. To me, the main limitation of the current setup is that we are not able to unit-test it fully. But the SLURM pipeline works great and I have nothing bad to say about it :)
> it's not high priority
Agreed. Our energy is better spent elsewhere (e.g. building documentation or supporting additional OPF formulations). I'll also point out that one goal is that we generate the datasets so that people can readily use them. Hence my focus on improving the experience of downstream data users, rather than our own experience generating the data.
I ran into an issue when using the new slurm automated workflow: my sysimage job failed, which prevented the rest of the jobs from ever running. I don't know how, but having a failsafe would be nice.
I did not see anything in SLURM that allows something like "if job B depends on job A and job A failed, then fail job B", rather than "if job B depends on job A and job A failed, then job B will wait forever" (the latter is SLURM's current behavior).
Some possibilities:

- use `--dependency=afterany` and track job progress differently?

TBH, this would be more of a convenience, not the highest priority.
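Another option worth verifying is `sbatch`'s `--kill-on-invalid-dep=yes` flag: with an `afterok` dependency, a failed upstream job makes the dependency impossible to satisfy, and this flag should cancel the dependent job instead of leaving it queued forever. A sketch, with hypothetical script names:

```shell
# Submit the ref job and capture its job ID.
ref_id=$(sbatch --parsable make_ref.sbatch)

# afterok: start only if the ref job completed successfully.
# --kill-on-invalid-dep=yes: if the ref job fails, cancel this job
# rather than leaving it pending forever.
sbatch --dependency="afterok:${ref_id}" \
       --kill-on-invalid-dep=yes \
       sampler.sbatch
```

This keeps the "check the queue, see where it failed, fix & re-submit" workflow, but without stale jobs accumulating behind a failed dependency.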