NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0
12.12k stars, 2.52k forks

srun issue with nemorun #10997

Open RachitBansal opened 3 weeks ago

RachitBansal commented 3 weeks ago

I am trying to run a simple pretraining job with nemorun: nemorun llm pretrain --factory llama3_8b

However, I see the following error before the training starts:

The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.

I understand that this is a problem with how Open MPI is configured on my cluster, but I can't change that. One alternative I have been told about is to replace the srun command with something like mpirun -np. I want to try this, but I am unable to locate where the srun command is actually invoked. I dug into the slurm.py file in NeMo-Run, but changing the commands there made no difference.

Are there any other potential alternatives to this?

hemildesai commented 3 weeks ago

Can you try adding executor.srun_args = ["--mpi=pmix"] and see if that works?

RachitBansal commented 3 weeks ago

@hemildesai How do I set that argument? I don't see it among the arguments listed by nemorun llm pretrain --help:

[Image: screenshot of the nemorun llm pretrain --help output]

RachitBansal commented 2 weeks ago

In the current version of the repo, srun_args seems to include the --mpi=pmix flag by default. However, I still see the same error as before.
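
Since --mpi=pmix appears to be passed already, one possible next check is whether the cluster's Slurm offers a pmix plugin at all. srun --mpi=list prints the MPI plugin types that srun supports. The snippet below is only a rough sketch using the Python standard library, not part of NeMo or NeMo-Run. The helper names and the sample listing are made up for illustration, and the exact output format of srun --mpi=list differs between Slurm versions, so the parsing may need adjusting.

```python
import re
import shutil
import subprocess

# Illustrative sample only (not captured from a real cluster); the real
# output of `srun --mpi=list` differs between Slurm versions.
SAMPLE_LISTING = """MPI plugin types are...
	none
	pmi2
	pmix
specific pmix plugin versions available: pmix_v4
"""


def parse_mpi_plugins(listing):
    """Collect plugin names from text printed by `srun --mpi=list`.

    Assumes each plugin name sits alone on its own line, optionally
    prefixed with "srun: " as some Slurm versions do.
    """
    plugins = set()
    for raw_line in listing.splitlines():
        line = re.sub(r"^srun:\s*", "", raw_line.strip())
        # Keep single-token lines; skip headers and full sentences.
        if re.fullmatch(r"[A-Za-z0-9_]+", line):
            plugins.add(line.lower())
    return plugins


def available_mpi_plugins():
    """Run `srun --mpi=list` and return the plugin names it reports."""
    result = subprocess.run(
        ["srun", "--mpi=list"], capture_output=True, text=True, check=False
    )
    # Read both streams, since the listing may not always be on stdout.
    return parse_mpi_plugins(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    if shutil.which("srun") is None:
        print("srun not found on PATH; run this on a Slurm login node")
    else:
        plugins = available_mpi_plugins()
        print("MPI plugins reported by srun:", sorted(plugins))
        if "pmix" not in plugins:
            print("pmix is not listed, so --mpi=pmix is unlikely to work here")
```

If pmix is missing from the list but pmi2 is present, trying executor.srun_args = ["--mpi=pmi2"] might be worth a shot, although whether that helps still depends on how the Open MPI build inside the job environment was configured.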