I am working on upgrading PyTorch to the latest stable release (2.5.1) and have observed NCCL issues that I believe are tied to our use of the GPUDirect-TCPX + NCCL DaemonSets for A3 High VMs on GKE.
For context, I have a working PyTorch 2.1.2 setup that uses the DaemonSet and the other configuration described in the docs, and multi-node communication via NCCL works there. The PyTorch library was installed from a pre-built wheel, which pulled in the nvidia-nccl-cu12==2.18.1 dependency. With NCCL_DEBUG set, the version printed by torch.distributed is 2.18.1. The libraries mounted from the DaemonSet appear to be libnccl.so.2.18.5.
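For reference, this is the small check I run inside the container to see which NCCL version the wheel itself reports; the environment variables are the ones our setup happens to use, so treat the paths as assumptions about our mounts rather than anything the DaemonSet guarantees:

```python
# Report the NCCL version the PyTorch wheel ships with, plus the env vars
# that decide which libnccl.so actually gets loaded at runtime.
# (Assumption: the DaemonSet-mounted libraries end up on LD_LIBRARY_PATH.)
import os
import torch

print("torch:", torch.__version__)
print("wheel NCCL:", torch.cuda.nccl.version())  # e.g. (2, 18, 1)
print("NCCL_DEBUG:", os.environ.get("NCCL_DEBUG"))
print("LD_LIBRARY_PATH:", os.environ.get("LD_LIBRARY_PATH"))
```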
For the PyTorch 2.5.1 attempt, the base image was pytorch/pytorch:2.5.1-cuda12-cudnn9-devel, which pulls in the nvidia-nccl-cu12==2.21.5 dependency. With this image, the NCCL backend works single-node across all 8 devices, but multi-node runs fail with NCCL errors:
[5]:torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, internal error - please report this issue to the NCCL developers, NCCL version 2.21.5
[5]:ncclInternalError: Internal check failed.
[5]:Last error:
[5]:NET/GPUDirectTCPX failed to connect socket
[5]:Exception raised from create at ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317 (most recent call first):
I'm happy to provide a more detailed stack trace if helpful.
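If it helps, the failure reproduces with essentially the smallest multi-node program. A sketch of what we launch (one process per GPU via torchrun, which is assumed to set RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT, and LOCAL_RANK) looks roughly like this:

```python
# Minimal multi-node repro sketch. Assumes launch via torchrun with one
# process per GPU; the NCCL/GPUDirect-TCPX env vars from the GKE docs are
# set in the pod spec and omitted here.
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # The first cross-node collective is where the NET/GPUDirectTCPX
    # transport gets exercised.
    x = torch.ones(1, device="cuda")
    dist.all_reduce(x)
    torch.cuda.synchronize()

    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce ok")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```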
I was hoping the maintainers could help with a few questions regarding the DaemonSet and its version constraints. The docs for installing the GPUDirect binary and configuring NCCL via the DaemonSet state that the DaemonSet installs a specific NCCL library version.
Questions
1. Does this mean that containers which request GPUs and use hostPath volume mounts to pick up the library and binary from the /home/kubernetes/bin/nvidia/lib64 directory on the VM will always use the specific NCCL library version hard-coded in the nccl-installer?
2. Does this mean that applications compiled against a newer NCCL version (2.21.5) cannot run on A3 VMs with GPUDirect-TCPX?
3. Is there a way to control the NCCL version installed by the DaemonSet? Inspecting the container, I see that the installer entrypoint /scripts/container_entry.sh install --install-nccl says that it installs the "NCCL main branch". As far as I can tell, the NCCL installation simply copies a pre-built libnccl.so.2.18.5 from /var/lib/tcpx/lib64/ to /usr/local/nvidia/lib64.
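For what it's worth, to double-check which libnccl the training process actually maps at runtime (the DaemonSet-mounted copy vs. the pip-installed nvidia-nccl-cu12 wheel), I've been using a small diagnostic along these lines; the interpretation of the paths is an assumption based on our mounts:

```python
# Diagnostic sketch: list every libnccl.so the current process has mapped.
# With the pip-built wheels, torch appears to link dynamically against
# libnccl.so.2, so the library should already be loaded after importing
# torch; the CUDA op is just there to be safe.
import torch

torch.ones(1, device="cuda")

with open("/proc/self/maps") as f:
    nccl_paths = sorted({line.split()[-1] for line in f if "libnccl" in line})

for path in nccl_paths:
    print("mapped:", path)
```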