NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

Multi-node jobs fail #107

Closed: rormseth closed this issue 1 year ago

rormseth commented 1 year ago

I am running a small cluster with Slurm 22.05.5-1, Enroot 3.4.0-2, and Pyxis 0.14.0 on Rocky 9 compute nodes with the stock 5.14.0-70.22.1 kernel. I can run basic single-node jobs with containers, but when I try to run multi-node MPI jobs, they fail. The container I am testing is Rocky 8.7 with OpenMPI 4.1.4 installed inside it. Here is my sample job script:

$ cat myhello.slurm 
#!/bin/sh 
#SBATCH -J mpitest
#SBATCH --container-image=docker://hpcreid/openmpi:230130
#SBATCH --container-mount-home
#SBATCH --container-mounts=/etc/slurm,/scratch/rormseth,/var/run/munge
#SBATCH -N 3
#SBATCH -n 9
#SBATCH -o %j.o
#SBATCH -e %j.e

grep PRETTY /etc/os-release

srun --mpi=pmix /workspace/mpihello.exe

The job output I receive is:

$ cat 96.o 
PRETTY_NAME="Rocky Linux 8.7 (Green Obsidian)"
$ cat 96.e 
pyxis: imported docker image: docker://hpcreid/openmpi:230130
srun: error: slurm_receive_msgs: [[n1.mytestdomain]:6818] failed: Socket timed out on send/recv operation
srun: error: Task launch for StepId=96.0 failed on node n2: No such process
srun: error: Task launch for StepId=96.0 failed on node n3: No such process
srun: error: Task launch for StepId=96.0 failed on node n1: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted
[rormseth@n0 ~]$

Slurm logs on the first compute node show Pyxis starting the container:

[2023-03-14T22:13:40.747] task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 96
[2023-03-14T22:13:40.747] task/affinity: batch_bind: job 96 CPU input mask for node: 0x7
[2023-03-14T22:13:40.747] task/affinity: batch_bind: job 96 CPU final HW mask for node: 0x7
[2023-03-14T22:13:40.842] Launching batch job 96 for UID 1001
[2023-03-14T17:13:50.410] [96.batch] pyxis: imported docker image: docker://hpcreid/openmpi:230130
[2023-03-14T17:13:50.410] [96.batch] pyxis: creating container filesystem: pyxis_96.4294967291
[2023-03-14T17:13:51.058] [96.batch] pyxis: starting container: pyxis_96.4294967291
[2023-03-14T22:13:51.278] launch task StepId=96.0 request from UID:1001 GID:1001 HOST:10.82.91.220 PORT:60362
[2023-03-14T22:13:51.278] task/affinity: lllp_distribution: JobId=96 implicit auto binding: sockets,one_thread, dist 2
[2023-03-14T22:13:51.278] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic 
[2023-03-14T22:13:51.278] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [96]: mask_cpu,one_thread, 0x1,0x2,0x4
[2023-03-14T22:13:51.279] error: _send_slurmstepd_init failed
[2023-03-14T17:14:11.320] [96.batch] pyxis: removing container filesystem: pyxis_96.4294967291
[2023-03-14T17:14:11.609] [96.batch] done with job

However, the secondary nodes do not start the container via Pyxis:

[2023-03-14T22:13:31.964] launch task StepId=96.0 request from UID:1001 GID:1001 HOST:10.82.91.220 PORT:53952
[2023-03-14T22:13:31.964] task/affinity: lllp_distribution: JobId=96 implicit auto binding: sockets,one_thread, dist 2
[2023-03-14T22:13:31.964] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic 
[2023-03-14T22:13:31.964] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [96]: mask_cpu,one_thread, 0x1,0x2,0x4
[2023-03-14T22:13:32.048] error: _send_slurmstepd_init failed
[2023-03-14T22:13:32.048] error: Unable to init slurmstepd
[2023-03-14T17:13:32.051] fatal: Failed to read MPI conf from slurmd

flx42 commented 1 year ago

Getting srun to work inside a containerized sbatch script (because of #SBATCH --container-image) is tricky, probably even more so if PMIx is involved.

Is there any reason why you must do that? Running the sbatch script uncontainerized and then using --container-image only for the srun should be simpler and work out of the box without any need for bind-mounts of Slurm files / sockets.
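
Something along these lines (an untested sketch reusing your image, binary, and mounts from above; adjust as needed):

#!/bin/sh
#SBATCH -J mpitest
#SBATCH -N 3
#SBATCH -n 9
#SBATCH -o %j.o
#SBATCH -e %j.e

# The batch script itself runs uncontainerized on the host; only the srun
# step enters the container, so /etc/slurm and /var/run/munge no longer
# need to be bind-mounted into it.
srun --mpi=pmix --container-image=docker://hpcreid/openmpi:230130 --container-mount-home --container-mounts=/scratch/rormseth /workspace/mpihello.exe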

rormseth commented 1 year ago

If I move all those container flags to the srun, that doesn't work either. Here's my submit script:

#!/bin/sh 
#SBATCH -J mpitest
#SBATCH -N 3
#SBATCH -n 9
#SBATCH -o %j.o
#SBATCH -e %j.e

srun --container-image=docker://hpcreid/openmpi:230130 --container-mount-home --container-mounts=/scratch/rormseth --mpi=pmix /workspace/mpihello.exe

I've attached my output error file: 98_e.txt

flx42 commented 1 year ago

This could be a problem with how MPI is installed inside the container image. Could you try the TensorFlow container image published by NVIDIA? (Warning: it's a large image.)

$ srun -N1 --ntasks=8 --mpi=pmix --container-image=nvcr.io#nvidia/tensorflow:23.02-tf2-py3 all_reduce_perf_mpi -b 1G -e 1G -c 1
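
Since the failure only shows up across nodes, a multi-node variant of the same test would be the interesting one. A rough sketch, assuming each node has at least one GPU for the NCCL test binary (scale --ntasks-per-node to the GPU count):

$ srun -N3 --ntasks-per-node=1 --mpi=pmix --container-image=nvcr.io#nvidia/tensorflow:23.02-tf2-py3 all_reduce_perf_mpi -b 1G -e 1G -c 1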

flx42 commented 1 year ago

Closing as I didn't get an answer; feel free to reopen.