NVIDIA / NeMo-Framework-Launcher

Provides end-to-end model development pipelines for LLMs and Multimodal models that can be launched on-prem or cloud-native.
Apache License 2.0

Which versions of Pyxis, Slurm, and enroot are needed to run NeMo-Megatron-Launcher on one node with 8 x A100? #73

Open starlitsky2010 opened 1 year ago

starlitsky2010 commented 1 year ago

Environment:

Pyxis: v0.14.0, Slurm: 19.05.5, enroot: enroot+caps_3.4.1

How I checked the versions:

# cd pyxis
# git log
commit ea7bb88a4f31f3535334f92cbcc1324d60b113d8 (HEAD -> master, tag: v0.14.0)

# srun -V
slurm-wlm 19.05.5

Error Info:

launcher_scripts# cat results/download_gpt3_pile/download/log-nemo-megatron-download_gpt3_pile_23_0.err

The error I ran into:
srun: unrecognized option '--container-image'
srun: unrecognized option '--container-image'
Try "srun --help" for more information

Thanks, Aaron

roclark commented 1 year ago

Hey @starlitsky2010! Are Pyxis and enroot installed on all nodes in the cluster and at the same version as well? The Slurm version is a bit older than what we've tested previously so it's possible that would benefit from an update if practical. The oldest version we've documented with NeMo Framework on Slurm that I'm aware of was the following:

So your Pyxis version should be fine, but Slurm could potentially be updated, though I can't say that's definitively the problem at the moment.

Was Pyxis/enroot installed recently? Have the Slurm daemons been restarted?
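
One quick way to check whether the running Slurm has actually picked up Pyxis is to look for the container options in the srun help text. A minimal sketch, assuming a bash shell on a node where srun is on the PATH; the helper name has_container_option is made up for illustration:

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads help text on stdin and succeeds (exit 0)
# only if it mentions the --container-image option that Pyxis adds.
has_container_option() {
    # The "--" stops grep from treating the pattern as one of its own flags.
    grep -q -- '--container-image'
}

# Only query srun when it is actually installed on this machine.
if command -v srun >/dev/null 2>&1; then
    if srun --help 2>&1 | has_container_option; then
        echo "srun lists --container-image: the Pyxis plugin appears to be loaded"
    else
        echo "srun does not list --container-image: Pyxis is not loaded by this Slurm"
    fi
fi
```

If the option is missing, srun rejects --container-image as unrecognized, which matches the error in the log above.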

starlitsky2010 commented 1 year ago

Hi @roclark,

Pyxis and enroot are installed on all nodes. The problem is probably that the Slurm version (19.05.5) is too old and is not compatible with the latest version of Pyxis.

I've tested v0.7.0, and with it srun --help shows the container-related options. On Ubuntu 20.04, the command below installs slurm-wlm 19.05.5 automatically: sudo apt install slurmd slurmctld -y

Is there an Ubuntu version you recommend? How did you install Slurm? Could you share some links about it?

I'll try the following versions later: Slurm 20.11.7, Pyxis 0.9.1.

Thanks, Aaron