Open RachitBansal opened 3 weeks ago
Can you try adding executor.srun_args = ["--mpi=pmix"] and see if that works?
@hemildesai How do I set that argument? I don't see it among the arguments listed by nemorun llm pretrain --help.
In the current version of the repo, srun_args (here) already seems to include the --mpi=pmix flag by default. However, I still see the same error as before.
I am trying to run a simple pretraining job with nemorun:
nemorun llm pretrain --factory llama3_8b
However, I see the following error before the training starts:
I understand that this is a problem with the way OpenMPI is configured on my cluster, but I can't change that. An alternative I have been told about is to replace the srun command with something like mpirun -np. I want to try this, but I can't find where the srun command is actually invoked. I dug into the slurm.py file in NeMo-Run, but changing the commands there made no difference.

Are there any other potential alternatives?