NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

Slurm "run" is failing to find "munge" component when run with "container" #139

Closed RamHPC closed 1 month ago

RamHPC commented 1 month ago

Slurm "srun" is failing to find "munge" component when run with "--container-name" option. There are no issues with running (srun) MPI applications. I can clearly in the debug log, munge component is found, opened, init and creating credentials etc. I am trying to run simple NVIDIA benchmarking application called "image segmentation".

Environment:
- Slurm 23.11.5
- OpenMPI 5.0.3
- PMIx 5.0.2
- Enroot 3.4.1-1

$ srun --container-name=image_segmentation_105400 bash -c "all_reduce_perf_mpi -b 62M -e 62M -d half"

A requested component was not found, or was unable to be opened. This means that this component is either not installed or is unable to be used on your system (e.g., sometimes this means that shared libraries that the component requires are unable to be found/loaded). Note that PMIX stopped checking at the first component that it did not find.

Host:      gpu1
Framework: psec
Component: munge


It looks like pmix_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during pmix_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an PMIX developer):

pmix_psec_base_open failed --> Returned value -46 instead of PMIX_SUCCESS

[gpu1:926665] PMIX ERROR: NOT-FOUND in file client/pmix_client.c at line 562
[gpu1:926665] OPAL ERROR: Not found in file pmix3x_client.c at line 112

The application appears to have been direct launched using "srun", but OMPI was not built with SLURM's PMI support and therefore cannot execute. There are several options for building PMI support under SLURM, depending upon the SLURM version you are using:

version 16.05 or later: you can use SLURM's PMIx support. This requires that you configure and build SLURM --with-pmix.

Versions earlier than 16.05: you must use either SLURM's PMI-1 or PMI-2 support. SLURM builds PMI-1 by default, or you can manually install PMI-2. You must then build Open MPI using --with-pmi pointing to the SLURM PMI library location.

Please configure as appropriate and try again.

An error occurred in MPI_Init on a NULL communicator
MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort, and potentially your MPI job)
[gpu1:926665] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: gpu1: task 0: Exited with exit code 1

flx42 commented 1 month ago

Did you try using srun --mpi=pmix?
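For reference, that amounts to selecting the PMIx plugin explicitly on the command line used above (same container and benchmark, only --mpi=pmix added):

$ srun --mpi=pmix --container-name=image_segmentation_105400 bash -c "all_reduce_perf_mpi -b 62M -e 62M -d half"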

RamHPC commented 1 month ago

Yes! It is also configured in slurm.conf with "MpiDefault=pmix". I don't have any ENROOT_ENVIRON configured; does this cause issues? How do I set this up so it works across the cluster? OpenMPI and PMIx are both installed on an NFS mount.
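For context, the slurm.conf line in question is just:

MpiDefault=pmix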

I tried adding PATH/LD_LIBRARY_PATH as suggested in https://github.com/NVIDIA/pyxis/issues/88, but that is not working (sudo also doesn't work):

$ echo "UCX_TLS=tcp,cuda,cuda_copy,cuda_ipc" >> "${ENROOT_ENVIRON}" -bash: : No such file or directory

RamHPC commented 1 month ago

Added ompi.sh to the /etc/enroot/hooks.d folder with PATH and LD_LIBRARY_PATH. It is still failing with the same error. The container also seems to have problems with mounting NFS.
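A hook along these lines might look like the sketch below; the /nfs/... prefixes are placeholders standing in for the NFS-mounted OpenMPI and PMIx installs mentioned earlier, not the actual paths on this cluster.

#!/bin/bash
# Hypothetical /etc/enroot/hooks.d/ompi.sh: expose the NFS-hosted OpenMPI and
# PMIx installs inside the container by writing PATH and LD_LIBRARY_PATH into
# the container's environment file.
echo "PATH=/nfs/openmpi/bin:/usr/local/bin:/usr/bin:/bin" >> "${ENROOT_ENVIRON}"
echo "LD_LIBRARY_PATH=/nfs/openmpi/lib:/nfs/pmix/lib" >> "${ENROOT_ENVIRON}"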

flx42 commented 1 month ago

Maybe try setting PMIX_MCA_psec=none, as mentioned in https://github.com/NVIDIA/pyxis/wiki/Setup#slurmd-configuration
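For slurmd specifically, one common way to set such a variable (a sketch, assuming slurmd runs under systemd; the drop-in path is an assumption) is a drop-in file:

# /etc/systemd/system/slurmd.service.d/pmix.conf
[Service]
Environment=PMIX_MCA_psec=none

followed by systemctl daemon-reload and a restart of slurmd. Exporting PMIX_MCA_psec=none in the job environment before calling srun can also serve as a quick first test.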

RamHPC commented 1 month ago

Thank you! I had to comment out some "sudo" commands in the bash script. Now, I am getting this error:

RamHPC commented 1 month ago

Using an interactive shell, I found that the library inside the container is at /usr/local/lib. I need to add this to LD_LIBRARY_PATH somehow; what is the correct way to do this? I am currently mounting the user's home directory.
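One possibility (an assumption, not a confirmed resolution from this thread) is to append the in-container path from the same enroot hook approach used above:

# Hypothetical addition to /etc/enroot/hooks.d/ompi.sh; keep all needed library
# directories in a single LD_LIBRARY_PATH entry (paths other than /usr/local/lib
# are placeholders).
echo "LD_LIBRARY_PATH=/usr/local/lib:/nfs/openmpi/lib:/nfs/pmix/lib" >> "${ENROOT_ENVIRON}"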