Closed · szhengac closed this issue 12 months ago
szhengac:

Hi,

I was running a Megatron-LM test job with Slurm and the image nvcr.io/ea-bignlp/bignlp-training:23.03-py3, but the NCCL log shows that GPU RDMA is not enabled during the training job. However, if I use docker run to launch a container with the same image, I can see that GPU RDMA is used. So I think there is something wrong with how I use pyxis+enroot. The following is the relevant part of my command to launch a Megatron-LM job on a single node:
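(The command itself is not reproduced in the report; the following is only a hypothetical sketch of a pyxis launch of this shape. The node and task counts, the /workspace mount, the pretrain_gpt.py entry point, and the /dev/infiniband mount are all assumptions, not the actual flags from this job.)

```bash
# Hypothetical sketch only; none of these flags are from the original report.
# Pyxis hands the image to enroot, whose URI uses '#' after the registry.
# NCCL_DEBUG=INFO makes NCCL log which transport it picks (NVLink, SHM,
# NET/IB, and GDRDMA when GPUDirect RDMA is active).
srun --nodes=1 --ntasks-per-node=8 --gpus-per-node=8 \
     --container-image='nvcr.io#ea-bignlp/bignlp-training:23.03-py3' \
     --container-mounts=/dev/infiniband:/dev/infiniband,"$PWD":/workspace \
     bash -c 'NCCL_DEBUG=INFO python /workspace/pretrain_gpt.py ...'
```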
I am not sure if the additional DEVICE_MOUNT is needed, but it makes no difference if I remove it.
flx42:

Were you always testing on 1 node? It might be normal: NCCL could be using only NVLink for a single-node application.
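(A quick way to sanity-check this on a single node, not from the original thread but a standard diagnostic, is to look at the GPU connectivity matrix.)

```bash
# "NV#" entries in the matrix indicate NVLink paths between GPUs; NCCL will
# normally prefer those over the IB NIC for traffic within one node.
nvidia-smi topo -m
```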
szhengac:

@flx42 I have fixed the issue. I was running the experiment on 1 node. The additional device_mount is necessary for enabling RDMA; RDMA is just not printed in the log for a single-node experiment run as a Slurm job. If I run the NCCL test by manually launching an enroot container and force the connection to go through the NIC, I can see RDMA printed out.
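(For reference, the manual check described above might look like the sketch below. This is an assumption-laden example: it assumes nccl-tests is available inside the image, which may need to be built first, and the all_reduce_perf invocation is a placeholder.)

```bash
# Hypothetical verification sketch; file names and paths are placeholders.
# Import the image and create a container instance with enroot.
enroot import -o bignlp.sqsh 'docker://nvcr.io#ea-bignlp/bignlp-training:23.03-py3'
enroot create --name bignlp bignlp.sqsh
# Disabling P2P and SHM forces NCCL off NVLink/shared memory, so the traffic
# must cross the NIC. With GPUDirect RDMA active, the NCCL log shows transport
# lines ending in "GDRDMA", e.g. "[send] via NET/IB/0/GDRDMA".
enroot start --rw bignlp bash -c \
  'NCCL_DEBUG=INFO NCCL_P2P_DISABLE=1 NCCL_SHM_DISABLE=1 \
   all_reduce_perf -b 8 -e 128M -f 2 -g 8'
```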