NVIDIA / NeMo-Curator

Scalable data preprocessing and curation toolkit for LLMs
Apache License 2.0
559 stars 76 forks

Explore dask-jobqueue's Slurm runner for multi-node Slurm setups. #215

Open ayushdg opened 2 months ago

ayushdg commented 2 months ago

Is your feature request related to a problem? Please describe.

Our current Slurm setup is a combination of two bash scripts that can be difficult to understand and customize in other user environments, because they bake in assumptions such as enroot/pyxis for containers and a specific cluster setup. There have been advancements to dask-jobqueue's Slurm runner that should make it easier to launch multi-node jobs in an environment similar to ours. In theory, it should simplify launching multi-node Slurm jobs, since all the setup information is shared as part of the runner API.

It could be worth exploring if this makes our multi-node slurm setup a bit simpler.
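To make the idea concrete, here is a rough sketch of what a runner-based launch script might look like. It assumes a recent dask-jobqueue release that ships `SLURMRunner` in `dask_jobqueue.slurm`, and a filesystem shared across all nodes. The `slurm_layout` helper, the `SHARED_DIR` variable, and the toy `sum` task are hypothetical and only there for illustration; none of this is how Curator launches jobs today.

```python
import os


def slurm_layout(env=None):
    """Hypothetical helper: read the task layout Slurm sets for each srun task."""
    env = os.environ if env is None else env
    return {
        "job_id": env.get("SLURM_JOB_ID", "local"),
        "rank": int(env.get("SLURM_PROCID", "0")),
        "world_size": int(env.get("SLURM_NTASKS", "1")),
    }


def main(shared_dir):
    # Imported lazily so the helper above works without Dask installed.
    from dask.distributed import Client
    from dask_jobqueue.slurm import SLURMRunner

    # Assumption: shared_dir is visible from every node in the allocation,
    # since all tasks coordinate through this scheduler file.
    scheduler_file = os.path.join(shared_dir, "scheduler-{job_id}.json")

    # Every task runs this same script; the runner decides which task
    # becomes the scheduler, which runs the client code, and which are workers.
    with SLURMRunner(scheduler_file=scheduler_file) as runner:
        with Client(runner) as client:
            # Toy task standing in for a real curation pipeline.
            print(client.submit(sum, [1, 2, 3]).result())


# Only start the cluster when running inside a Slurm allocation.
if __name__ == "__main__" and "SLURM_JOB_ID" in os.environ:
    main(os.environ.get("SHARED_DIR", os.getcwd()))
```

The script would be started once per task from inside an allocation, for example with `srun python script.py` in an sbatch job, so the container and cluster specifics stay in the sbatch wrapper rather than in the Dask startup logic.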

Thanks @jacobtomlinson for the suggestion!

jacobtomlinson commented 2 months ago

Very happy to pair/collaborate on this work! I'd love to see Curator using the new dask-jobqueue tools.