This patch introduces support for running multi-process, multi-node NCCL tests using the PyTorch c10d distributed framework with the Gloo backend.
Previously, running multi-node NCCL tests required MPI, which relies on SSH or Kubexec (in Kubernetes) to access worker nodes. This setup posed deployment and security challenges due to the need for maintaining SSH keys or Kubexec RBAC policies.
With the introduction of c10d Gloo, worker nodes now communicate with the master node over TCP transport. This simplifies deployment, making it similar to running multi-node PyTorch training jobs. Users only need to set the following environment variables to start the test:
MASTER_ADDR
RANK
WORLD_SIZE
Dependencies
PyTorch C++ APIs and libraries are required. Download LibTorch with the following commands:
cd /tmp/
wget https://download.pytorch.org/libtorch/nightly/cpu/libtorch-shared-with-deps-latest.zip
unzip libtorch-shared-with-deps-latest.zip
sudo mv libtorch /usr/local/
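Depending on how the test binaries are linked, the LibTorch shared libraries may also need to be on the runtime library search path. Assuming the install location used above, something like the following should suffice:
export LD_LIBRARY_PATH=/usr/local/libtorch/lib:$LD_LIBRARY_PATH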
Build instructions
To build the NCCL test binaries supporting both MPI and c10d Gloo, use the build options added by this patch's Makefile changes.
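A hypothetical invocation is sketched below; MPI=1, MPI_HOME, and CUDA_HOME are the standard nccl-tests build variables, while GLOO=1 and LIBTORCH_HOME are placeholder names assumed here for illustration and may not match the actual option names:
# Sketch only: GLOO=1 and LIBTORCH_HOME are assumed names, not confirmed options
make MPI=1 MPI_HOME=/usr/local/mpi CUDA_HOME=/usr/local/cuda GLOO=1 LIBTORCH_HOME=/usr/local/libtorch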
Run instructions
On each node, set the environment variables and then execute the test, as shown below for a two-node run.
Node 1:
Set environment variables:
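For example, treating Node 1 as the rank-0 master node (the address 192.168.0.10 is a placeholder for the master node's reachable IP or hostname):
export MASTER_ADDR=192.168.0.10  # placeholder: address of the master (rank-0) node
export RANK=0                    # this node's rank
export WORLD_SIZE=2              # total number of participating processes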
Execute the test:
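Any of the nccl-tests binaries can then be launched directly; the example below uses all_reduce_perf with a placeholder message-size sweep and GPU count, assuming the patched binaries pick up the c10d Gloo bootstrap from the environment variables above:
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8  # adjust sizes and -g to the hardware under test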
Node 2:
Set environment variables:
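On the second node only RANK changes; MASTER_ADDR still points at Node 1 (same placeholder address as above):
export MASTER_ADDR=192.168.0.10  # address of the master (rank-0) node
export RANK=1
export WORLD_SIZE=2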
Execute the test:
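Then run the same command as on Node 1:
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8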