Closed: robert-verkuil closed this issue 3 years ago.

Thanks again for the great tool. Recently I have been running Dora with many (1k-10k) small single-GPU experiments that run for a few hours each. Is there a way to launch these jobs via a Slurm job array? Otherwise the scheduler load is too high.
Yes this is definitely possible, I can work on that :)
Might I ask, though, whether in your case it wouldn't be more appropriate to schedule those small jobs directly with submitit? Given that they are short, it seems you might not need all the capabilities Dora provides for long-running jobs (cancellation and resumability). Are these your main model trainings or some kind of post-training evaluation? Are you actually using the Dora command-line interface to read the results?
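For reference, scheduling them directly with submitit could look roughly like the sketch below (the `train_one` function, the log folder, and the resource numbers are all placeholders); `map_array` batches every run into a single Slurm job array, which also keeps the scheduler load low:

```python
import submitit


def train_one(cfg: dict) -> float:
    # Placeholder for one short single-GPU run; returns a dummy metric.
    return sum(cfg.values())


# Hypothetical folder and resource values; adjust to your cluster.
executor = submitit.AutoExecutor(folder="submitit_logs")
executor.update_parameters(
    timeout_min=12 * 60,          # each run lasts up to a few hours
    gpus_per_node=1,
    slurm_array_parallelism=256,  # cap on concurrently running array tasks
)

configs = [{"lr": 10 ** -k, "seed": s} for k in range(3, 5) for s in range(4)]
# map_array submits all runs as a single Slurm job array rather than one
# job per run, which keeps the scheduler load low.
jobs = executor.map_array(train_one, configs)
results = [job.result() for job in jobs]
```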
Hey Alex, we are using the Dora command-line interface at the moment, although it's a bit hard to use due to the many lines in the table.
Our jobs typically run a heavy model 10-20K times in a row for the purpose of optimizing protein sequences. They finish in anywhere from a few hours up to a few days, depending on the model being run.
We do need Dora's resuming and cancellation capabilities and are constantly using them. Does this require a significant change in Dora?
I see. In any case it is a good idea to add, but I wanted to make sure it was worth it. It does require a few non-trivial changes in how things are done, but I should be able to get a first version out sometime next week :)
One thing I was wondering about for the longer term (in particular for the brainmagick evaluation kind of workflows) was how to design a Dora-like generic API for any target function in Python, as long as there is some way to derive a reliable job signature from the function arguments. You would then get an API similar to submitit or ProcessPool, except it would handle resumability and job de-duplication for you transparently, and it would be much easier to apply to something that is not your main training script. Anyway, just curious whether something like that would be useful to you guys in the long run.
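To make that concrete, here is a purely hypothetical sketch (the `DedupExecutor` name, the cache layout, and the blocking `submit` are all made up for illustration; none of this exists in Dora or submitit today):

```python
import hashlib
import json
import pickle
from pathlib import Path

import submitit


def _run_and_store(fn, args, kwargs, result_file):
    # Executed on the cluster: run the target function and persist its output.
    out = fn(*args, **kwargs)
    Path(result_file).write_bytes(pickle.dumps(out))
    return out


class DedupExecutor:
    """Hypothetical sketch, not an existing Dora API: submit any function and
    skip it transparently if a result for the same arguments already exists."""

    def __init__(self, folder: str):
        self.folder = Path(folder)
        self.folder.mkdir(parents=True, exist_ok=True)
        self.executor = submitit.AutoExecutor(folder=str(self.folder / "logs"))
        # Per-job resources would be configured here, e.g. timeout and GPUs.
        self.executor.update_parameters(timeout_min=60, gpus_per_node=1)

    def submit(self, fn, *args, **kwargs):
        # Assumes args/kwargs are JSON-serializable so the signature is stable,
        # similar in spirit to how Dora derives an XP signature from the config.
        payload = json.dumps([fn.__module__, fn.__name__, args, kwargs], sort_keys=True)
        sig = hashlib.sha1(payload.encode()).hexdigest()[:16]
        result_file = self.folder / f"{sig}.pkl"
        if result_file.exists():
            # De-duplication: reuse the stored result instead of resubmitting.
            return pickle.loads(result_file.read_bytes())
        job = self.executor.submit(_run_and_store, fn, args, kwargs, result_file)
        return job.result()
```

A real version would return futures, checkpoint inside the function for resumability, and batch submissions into job arrays, but the argument-signature de-duplication is the core idea.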
Amazing, if you think the job_array switch could land quickly! For @kwanUm and me, that's currently our biggest blocker to using Dora (more than git_save). We've canceled our latest jobs and will probably use hydra directly as a workaround until job_array support is added (losing all the nice Dora benefits in the meantime).
For the more general Dora-like functionality with a concurrent.futures-like interface, that could be interesting; I'll think about it more. Currently, submitit/Dask with manual checkpointing to the file-system solves many problems for the alternative workloads I'm thinking of (roughly the pattern sketched below). IIUC, Dora would additionally bring multi-node support, easy retries of failures, easy grid-style sweeping with grouping of results, and command-line display of results for quick monitoring?
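For concreteness, the manual checkpointing pattern I mean is roughly the sketch below, using submitit's `Checkpointable` helper; `CheckpointedTask`, the checkpoint file path, and the resource values are placeholders, and the actual work is elided:

```python
from pathlib import Path

import submitit


class CheckpointedTask(submitit.helpers.Checkpointable):
    # Sketch of the manual-checkpointing pattern: progress is written to the
    # shared file-system, and on preemption or timeout submitit resubmits the
    # same callable, which then resumes from the last recorded step.
    def __call__(self, ckpt_path: str, total_steps: int = 1000) -> int:
        ckpt = Path(ckpt_path)
        start = int(ckpt.read_text()) if ckpt.exists() else 0
        for step in range(start, total_steps):
            ...  # one unit of real work would go here
            ckpt.write_text(str(step + 1))  # persist progress after each step
        return total_steps


executor = submitit.AutoExecutor(folder="ckpt_logs")
executor.update_parameters(timeout_min=60, gpus_per_node=1)
job = executor.submit(CheckpointedTask(), "step_checkpoint.txt")
```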
We've been using this successfully on the development branch and now master. Thanks so much @adefossez! ❤️ This enables us to do at-scale sweeps again, and was put together lightning-fast. 💪