Closed verdimrc closed 1 year ago
What is the container image? Could you try with an NGC image as a sanity check?
$ srun -N1 --ntasks=8 --mpi=pmix --container-image=nvcr.io#nvidia/tensorflow:23.02-tf2-py3 all_reduce_perf_mpi -b 1G -e 1G -c 1
Side note: you don't need to mount the IB devices or the gdrcopy device; enroot takes care of this. Also, pyxis/Slurm doesn't work like Docker: only the last --container-mounts argument is kept.
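Since only the last --container-mounts argument is kept, several mounts have to go into a single comma-separated argument. A minimal sketch, assuming two hypothetical host paths (/fsx and /scratch, which are not from this thread) mounted at the same locations inside the container; the command is only printed, not run:

```shell
# Hypothetical mount pairs (SRC:DST); replace with your own paths.
mount_arg=""
for m in /fsx:/fsx /scratch:/scratch; do
  # Append each pair, separated by commas, to build ONE argument value.
  mount_arg="${mount_arg:+${mount_arg},}${m}"
done

# Compose the srun command from the comment above with a single --container-mounts flag.
cmd="srun -N1 --ntasks=8 --mpi=pmix --container-image=nvcr.io#nvidia/tensorflow:23.02-tf2-py3 --container-mounts=${mount_arg} all_reduce_perf_mpi -b 1G -e 1G -c 1"
echo "$cmd"
```

Passing two separate --container-mounts flags instead would leave only the second one in effect.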
Thank you @flx42.
I can get pyxis+MPI to work when using NGC's TensorFlow container. I'll debug what happens with my container (it's PyTorch-DLC-1.13-ec2).
Also +1 to your side notes.
I have set up pyxis-0.15.0, slurm-22.05.5, and enroot-3.4.1. I've added the enroot extra hook 50-slurm-pmi.sh to /etc/enroot/hooks.d/50-slurm-pmi.sh.
Unfortunately, my container job (which uses openmpi-4.1.4) failed with the following error:
Here's the fragment of my .sbatch file:
Is there any other configuration that I'm still missing? Appreciate any insight and/or help.