Open RachitBansal opened 1 month ago
Hi, for Slurm, we currently only support clusters that have https://github.com/NVIDIA/pyxis enabled. With pyxis, we automatically mount the configs at /nemo_run/configs/ using the --container-mounts option in srun. Does your cluster have pyxis enabled?
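One rough way to answer that question is to check whether srun advertises the container options that pyxis adds. The sketch below is not part of NeMo-Run; the helper names `has_pyxis_flags` and `srun_supports_pyxis` are hypothetical, and it assumes that pyxis options such as --container-image and --container-mounts show up in the output of `srun --help` when the plugin is loaded.

```python
import shutil
import subprocess

# Options that pyxis adds to srun (assumption: they are listed by `srun --help`).
PYXIS_FLAGS = ("--container-image", "--container-mounts")


def has_pyxis_flags(help_text: str) -> bool:
    """Return True if the given srun help text lists every pyxis option."""
    return all(flag in help_text for flag in PYXIS_FLAGS)


def srun_supports_pyxis() -> bool:
    """Run `srun --help` and look for the pyxis options.

    Returns False when srun is not on PATH (for example, off the cluster).
    """
    if shutil.which("srun") is None:
        return False
    result = subprocess.run(
        ["srun", "--help"], capture_output=True, text=True, check=False
    )
    # Some builds print help to stderr, so search both streams.
    return has_pyxis_flags(result.stdout + result.stderr)
```

Calling `srun_supports_pyxis()` on a login node gives a quick yes/no; a False result only means the options were not listed, so it is worth confirming with the cluster admins.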
Running into this deserialization issue in src/nemo_run/core/runners/fdl_runner.py.
Context: I am running a pretraining example job:
python scripts/llm/llama3_pretraining.py --size=8b --slurm
The srun command being launched looks like this:
srun --output /path/to/logfile.out python -m nemo_run.core.runners.fdl_runner -n llama3-8b \
    -p /nemo_run/configs/llama3-8b_packager /nemo_run/configs/llama3-8b_fn_or_script
A probable cause of the issue is that the config passed in this srun command, /nemo_run/configs/llama3-8b_fn_or_script, does not exist. I can't find it anywhere in my filesystem. However, I don't understand what to change. Any pointers would be useful!
CC: @hemildesai
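For anyone debugging the same symptom, a minimal sketch for checking the two config paths from the srun command above. Because /nemo_run/configs/ is a mount target that pyxis sets up for the job, the paths are not expected to exist on the host filesystem, so the check is only meaningful when run inside the launched job. The helper name `missing_paths` is hypothetical and not part of NeMo-Run.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the subset of the given paths that do not exist."""
    return [p for p in paths if not Path(p).exists()]


# Paths copied from the srun command in this issue.
CONFIG_PATHS = [
    "/nemo_run/configs/llama3-8b_packager",
    "/nemo_run/configs/llama3-8b_fn_or_script",
]
```

Printing `missing_paths(CONFIG_PATHS)` from inside the job shows which of the two files the runner cannot see; if both are missing, the mount itself likely did not happen.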