NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

NCCL cannot use GPU RDMA inside pyxis-enroot container with Slurm #117

Closed szhengac closed 12 months ago

szhengac commented 1 year ago

Hi,

I was running a Megatron-LM test job with Slurm and the image nvcr.io/ea-bignlp/bignlp-training:23.03-py3, but the NCCL log shows that GPU RDMA is not enabled during the training job. However, if I use docker run to launch a container from the same image, I can see that GPU RDMA is used. So I think there is something wrong with how I use pyxis+enroot. The following is the relevant part of my command to launch a Megatron-LM job on a single node:

export CUDA_DEVICE_MAX_CONNECTIONS=1
export NCCL_DEBUG=WARN

var_UCX_NET_DEVICES=mlx5_0:1
var_NCCL_IB_HCA="=mlx5_5,mlx5_6,mlx5_7,mlx5_8,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_14,mlx5_15,mlx5_16,mlx5_17,mlx5_9,mlx5_10,mlx5_11,mlx5_12"

export UCX_TLS=ud,self,sm \
       NCCL_DEBUG=INFO \
       NCCL_IB_CUDA_SUPPORT=1 \
       NCCL_IB_SL=0 \
       NCCL_IB_TC=41 \
       NCCL_IB_QPS_PER_CONNECTION=4 \
       UCX_NET_DEVICES=${var_UCX_NET_DEVICES} \
       HCOLL_ENABLE_MCAST_ALL=0 \
       coll_hcoll_enable=0 \
       NCCL_IB_GID_INDEX=3 \
       NCCL_IB_HCA="${var_NCCL_IB_HCA}" \
       NCCL_ALGO=Ring \
       OMPI_MCA_coll=^hcoll \
       OMPI_MCA_pml=ucx \
       ENROOT_RESTRICT_DEV=y

DEVICE_MOUNT="/dev/infiniband/rdma_cm:/dev/infiniband/rdma_cm"

for i in {0..17}
do
  DEVICE_MOUNT="$DEVICE_MOUNT,/dev/infiniband/uverbs$i:/dev/infiniband/uverbs$i"
done

echo $DEVICE_MOUNT

srun -l \
     --container-image /fsx/enroot_images/nvcr.io+ea-bignlp+bignlp-training+23.03-py3.sqsh \
     --container-mounts /fsx:/fsx,$DEVICE_MOUNT \
     --exclusive \
     --output=$DIR/logs/%x_%j_$DATETIME.log bash -c "${run_cmd}"

set +x

I am not sure whether the additional DEVICE_MOUNT is needed; removing it makes no difference.
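For reference, the loop above produces a 19-entry mount list (rdma_cm plus uverbs0 through uverbs17, each mapped to the same path inside the container); a standalone sketch that can be checked outside Slurm:

```shell
# Rebuild the --container-mounts device list exactly as the loop above
# does: /dev/infiniband/rdma_cm plus all 18 uverbs devices, each mounted
# at the same path inside the container.
DEVICE_MOUNT="/dev/infiniband/rdma_cm:/dev/infiniband/rdma_cm"
for i in $(seq 0 17); do
  DEVICE_MOUNT="$DEVICE_MOUNT,/dev/infiniband/uverbs$i:/dev/infiniband/uverbs$i"
done
echo "$DEVICE_MOUNT"
```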

flx42 commented 1 year ago

> But NCCL log shows GPU RDMA is not enabled during the training job. [...] The following is the relevant part of my command to launch a Megatron-LM job on a single node:

Were you always testing on 1 node? It might be normal: NCCL could be using only NVLink for a single-node application.
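One way to check is to force traffic onto the network by running across two nodes; a hedged sketch using the nccl-tests all_reduce_perf binary (the binary name and its availability in the image are assumptions; the image path and mounts are carried over from the setup above):

```shell
# Run an all-reduce across 2 nodes (8 GPUs each) so NCCL must use the
# IB network rather than NVLink, then inspect the debug log.
NCCL_DEBUG=INFO srun -N 2 --ntasks-per-node=8 \
     --container-image /fsx/enroot_images/nvcr.io+ea-bignlp+bignlp-training+23.03-py3.sqsh \
     --container-mounts /fsx:/fsx,$DEVICE_MOUNT \
     all_reduce_perf -b 8 -e 1G -f 2 -g 1
# Look for lines mentioning "NET/IB/.../GDRDMA" in the log, which
# indicate GPU Direct RDMA is in use.
```

This is a cluster job-submission fragment and is not runnable outside a Slurm+pyxis environment with GPUs.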

szhengac commented 12 months ago

@flx42 I have fixed the issue. I was running the experiment on 1 node. The additional device mount is necessary for enabling RDMA; it's just that RDMA is not printed in the log for a single-node experiment under a Slurm job. When I ran the NCCL test by manually launching an enroot container and forcing the connection to go through the NIC, I could see RDMA printed out.
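For single-node verification along these lines, the network path can be forced explicitly; a hedged sketch (NCCL_P2P_DISABLE and NCCL_SHM_DISABLE are standard NCCL environment variables, while the all_reduce_perf binary from nccl-tests is an assumption about what is available in the container):

```shell
# Disable the P2P (NVLink/PCIe) and shared-memory transports so NCCL
# falls back to the NET/IB path even within a single node.
export NCCL_P2P_DISABLE=1
export NCCL_SHM_DISABLE=1
export NCCL_DEBUG=INFO
# Inside the enroot container, run a small all-reduce and check the log
# for the IB transport (GDRDMA indicates GPU Direct RDMA).
all_reduce_perf -b 8 -e 128M -f 2 -g 8
```

Note this fragment requires GPUs, an InfiniBand fabric, and the nccl-tests binaries inside the container, so it cannot be run standalone.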