AI4OPT / OPFGenerator

Instance generator for OPF problems

Fool-proofing automated slurm workflow #39

Open mtanneau opened 4 months ago

mtanneau commented 4 months ago

I ran into an issue when using the new slurm automated workflow: my sysimage job failed, which prevented the rest of the jobs from ever running. I don't know the best way to handle this, but having a failsafe would be nice.

I did not see anything in slurm that allows something like "if job B depends on job A and job A failed, then fail job B"; slurm's current behavior is "if job B depends on job A and job A failed, then job B will wait forever".
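For reference, the dependency chain behaves roughly like this (the job script names here are placeholders, not the actual scripts in this repo):

ref_id=$(sbatch --parsable make_ref.sbatch)    # sysimage / ref job
sbatch --dependency=afterok:$ref_id sampler.sbatch
# If the ref job fails, the afterok dependency can never be satisfied, so the
# sampler job sits in the queue (squeue reports DependencyNeverSatisfied)
# rather than failing on its own.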

Some possibilities:

TBH, this would be more of a convenience, not the highest priority.

klamike commented 4 months ago

We can have each job submit the next one upon completion? This lets us decide how much time/memory to give the sampler job based on the ref job too.
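Roughly something like this at the end of each job script (sampler.sbatch and the #SBATCH header below are placeholders, not the actual scripts in this repo):

#!/bin/bash
#SBATCH --job-name=make_ref
# ... resource requests for the ref job ...
julia --project=. slurm/make_ref.jl path/to/config.toml || exit 1
# Only reached if make_ref succeeded; submit the next stage from here,
# picking the sampler's time/memory based on what the ref job observed.
sbatch sampler.sbatch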

Honestly, the current behavior isn't too bad. Instead of failing silently and leaving the user to investigate why the jobs finished but the results files are not there, the user simply checks the queue, sees where it failed, and can fix & re-submit.

klamike commented 2 months ago

Using this issue as a catch-all for potential future pipeline improvements...

Right now the pipeline only works out-of-the-box if submitting to a SLURM cluster.

Currently, to run locally, one can use the commands below and then delete the $export_dir/res_h5 folder. One problem with this is that it doesn't (automatically) parallelize; a rough sketch of a manual split is shown after the commands.

julia --project=. slurm/make_ref.jl path/to/config.toml
julia --project=. sampler.jl path/to/config.toml 1 100
julia --project=. slurm/merge.jl path/to/config.toml
julia --project=. slurm/cleanup.jl path/to/config.toml
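Not something the pipeline provides today, but as a rough sketch, assuming the two trailing sampler arguments are the first and last sample indices, the samples could be split across local processes by hand:

# Manual split of samples 1..100 across 4 local sampler processes.
julia --project=. sampler.jl path/to/config.toml 1 25 &
julia --project=. sampler.jl path/to/config.toml 26 50 &
julia --project=. sampler.jl path/to/config.toml 51 75 &
julia --project=. sampler.jl path/to/config.toml 76 100 &
wait    # block until all sampler processes have finished
julia --project=. slurm/merge.jl path/to/config.toml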

I think it would require only minor edits to provide a similar experience to the SLURM pipeline when running locally, but it's not high priority.

mtanneau commented 2 months ago

> Right now the pipeline only works out-of-the-box if submitting to a SLURM cluster.

I'm OK with that; we are the primary users of this workflow.

> to run locally [...] one problem is that it doesn't (automatically) parallelize.

I'm also OK with that for now. To me, the main limitation of the current setup is that we are not able to unit-test it fully. But the slurm pipeline works great and I have nothing bad to say about it :)

> it's not high priority

Agreed. Our energy is better spent elsewhere (e.g. building documentation or supporting additional OPF formulations). I'll also point out that one goal is for us to generate the datasets so that people can readily use them. Hence my focus on improving the experience of downstream data users, rather than our own experience generating the data.