Closed rormseth closed 1 year ago
Getting `srun` to work inside a containerized sbatch script (because of `#SBATCH --container-image`) is tricky, probably even more so if PMIx is involved. Is there any reason why you must do that? Running the sbatch script uncontainerized and then using `--container-image` only for the `srun` should be simpler and work out of the box, without any need for bind-mounts of Slurm files / sockets.
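As a rough sketch of the two layouts (image name and binary path are taken from the submit script quoted in this thread; treat them as placeholders for your own):

```shell
#!/bin/sh
# Problematic layout: a containerized batch script, e.g.
##SBATCH --container-image=docker://hpcreid/openmpi:230130
# runs the whole script inside the container, so the inner srun would need
# the host's Slurm binaries and sockets bind-mounted in.

# Recommended layout: the batch script runs on the host, and only the
# MPI step is containerized by passing the container flags to srun.
#SBATCH -J mpitest
#SBATCH -N 3
#SBATCH -n 9
srun --container-image=docker://hpcreid/openmpi:230130 --mpi=pmix /workspace/mpihello.exe
```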
If I move all those container flags to the `srun`, that doesn't work either. Here's my submit script:
```sh
#!/bin/sh
#SBATCH -J mpitest
#SBATCH -N 3
#SBATCH -n 9
#SBATCH -o %j.o
#SBATCH -e %j.e
srun --container-image=docker://hpcreid/openmpi:230130 --container-mount-home --container-mounts=/scratch/rormseth --mpi=pmix /workspace/mpihello.exe
```
I've attached my output error file: 98_e.txt
This could be a problem with how MPI is installed inside the container image. Could you try the TensorFlow container image published by NVIDIA? (warning: it's a large image)

```sh
$ srun -N1 --ntasks=8 --mpi=pmix --container-image=nvcr.io#nvidia/tensorflow:23.02-tf2-py3 all_reduce_perf_mpi -b 1G -e 1G -c 1
```
Closing as I didn't get an answer, feel free to reopen.
I am running a small cluster with Slurm 22.05.5-1, Enroot 3.4.0-2, and Pyxis 0.14.0 on Rocky 9 compute nodes with a stock 5.14.0-70.22.1 kernel. I can run basic single-node jobs with containers, but when I try to run multi-node MPI jobs, they fail. The container I am testing is Rocky 8.7 with OpenMPI 4.1.4 installed inside it. Here is my sample job script:
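A minimal job script of the kind described (image name, mounts, and binary path assumed from the `srun` invocation quoted earlier in this thread) would look like:

```shell
#!/bin/sh
#SBATCH -J mpitest
#SBATCH -N 3
#SBATCH -n 9
#SBATCH -o %j.o
#SBATCH -e %j.e

# Container is Rocky 8.7 with OpenMPI 4.1.4; Pyxis starts it for each task.
srun --container-image=docker://hpcreid/openmpi:230130 \
     --container-mount-home \
     --container-mounts=/scratch/rormseth \
     --mpi=pmix \
     /workspace/mpihello.exe
```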
The job output I receive is:
Slurm logs on the first compute node show Pyxis starting the container:
However, the secondary nodes do not start the container via Pyxis: