NVIDIA / pyxis

Container plugin for Slurm Workload Manager

OpenMPI mpirun received unexpected process identifier #114

Closed: verdimrc closed this issue 1 year ago

verdimrc commented 1 year ago

I have set up pyxis-0.15.0, slurm-22.05.5, and enroot-3.4.1, and I've added the enroot extra hook 50-slurm-pmi.sh as /etc/enroot/hooks.d/50-slurm-pmi.sh.
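
For reference, the hook was enabled roughly like this (a minimal sketch; the source path assumes enroot's packaged extra hooks live under /usr/share/enroot/hooks.d, which may differ on your install):

# Copy the extra PMI hook into the active hooks directory and make it
# executable (source path is an assumption; adjust to your enroot install).
sudo cp /usr/share/enroot/hooks.d/50-slurm-pmi.sh /etc/enroot/hooks.d/50-slurm-pmi.sh
sudo chmod +x /etc/enroot/hooks.d/50-slurm-pmi.sh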

Unfortunately, my container job (which uses openmpi-4.1.4) failed with this error:

[ip-26-0-164-166][[16971,28321],8][btl_tcp_endpoint.c:625:mca_btl_tcp_endpoint_recv_connect_ack] received unexpected process identifier [[16971,28321],13]
slurmstepd: error: *** JOB 148 ON ip-26-0-160-44 CANCELLED AT 2023-05-24T08:57:43 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 148.0 ON ip-26-0-160-44 CANCELLED AT 2023-05-24T08:57:43 ***

Here's the relevant fragment of my .sbatch file:

declare -a ARGS=(
    --container-image $IMAGE
    --container-mounts /dev/infiniband/uverbs0:/dev/infiniband/uverbs0
    --container-mounts /dev/infiniband/uverbs1:/dev/infiniband/uverbs1
    --container-mounts /dev/infiniband/uverbs2:/dev/infiniband/uverbs2
    --container-mounts /dev/infiniband/uverbs3:/dev/infiniband/uverbs3
    --container-mounts /dev/gdrdrv:/dev/gdrdrv
    --container-mount-home
    --container-mounts $FSX_MOUNT
)
srun --mpi=pmix "${ARGS[@]}" /opt/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100

Is there any other configuration that I'm missing? I'd appreciate any insight and/or help.

flx42 commented 1 year ago

What is the container image? Could you try with an NGC image, just as a sanity check?

$ srun -N1 --ntasks=8 --mpi=pmix --container-image=nvcr.io#nvidia/tensorflow:23.02-tf2-py3 all_reduce_perf_mpi -b 1G -e 1G -c 1

Side note: you don't need to mount the IB devices or the gdrcopy device; enroot takes care of that. Also, pyxis/Slurm doesn't work like Docker: only the last --container-mounts argument is kept (see the sketch below).
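
For example, here's a sketch of your fragment with the device mounts dropped; if you ever do need several mounts, pass them comma-separated in a single --container-mounts argument ($IMAGE and $FSX_MOUNT are taken from your script):

# Same job step, without the redundant IB/gdrcopy mounts.
declare -a ARGS=(
    --container-image "$IMAGE"
    --container-mount-home
    --container-mounts "$FSX_MOUNT"
)
srun --mpi=pmix "${ARGS[@]}" /opt/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100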

verdimrc commented 1 year ago

Thank you @flx42.

I can get pyxis+MPI working when using NGC's TensorFlow container. I'll debug what happens with my container (it's PyTorch-DLC-1.13-ec2).
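
As a first check, I'll inspect how the container's OpenMPI was built with respect to PMIx (a quick sketch; it assumes ompi_info is on the image's PATH):

# List PMIx-related components of the container's OpenMPI build.
srun -N1 --container-image=$IMAGE bash -c 'ompi_info | grep -i pmix'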

Also +1 to your side notes.