NVIDIA / nccl-tests


Enhance Multi-Node NCCL Testing with Torch C10D Gloo Framework #243

Open hexinw opened 1 month ago

hexinw commented 1 month ago

This patch introduces support for running multi-process, multi-node NCCL tests using the Torch c10d Gloo distributed framework.

Previously, running multi-node NCCL tests required MPI, which relies on SSH or Kubexec (in Kubernetes) to reach worker nodes. This posed deployment and security challenges, since SSH keys or Kubexec RBAC policies had to be provisioned and maintained.

With C10D Gloo, worker nodes communicate with the master node over TCP transport. This simplifies deployment and makes the workflow similar to launching a multi-node PyTorch training job: users only need to set a few environment variables on each node to start the test.
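A minimal sketch of the per-process environment, based on the variables used in the Usage examples below (placeholder values and comments are illustrative):

  # Per-process environment for the C10D Gloo rendezvous
  export MASTER_ADDR=<master_node_ip_address>   # address of the master node all ranks connect to
  export RANK=<global_rank>                      # this process's rank, 0..WORLD_SIZE-1
  export WORLD_SIZE=<total_process_count>        # total number of test processes across all nodes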

Dependencies

PyTorch C++ APIs and libraries are required. Download LibTorch with the following commands:

  cd /tmp/
  wget https://download.pytorch.org/libtorch/nightly/cpu/libtorch-shared-with-deps-latest.zip
  unzip libtorch-shared-with-deps-latest.zip
  sudo mv libtorch /usr/local/
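As a quick sanity check (assuming the archive unpacks to the usual LibTorch layout), confirm the headers and shared libraries landed under /usr/local/libtorch:

  ls /usr/local/libtorch/include /usr/local/libtorch/lib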

Build instructions

To build the NCCL test binaries supporting both MPI and C10D Gloo, use:

  MPI=1 GLOO=1 make
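To confirm the resulting binaries picked up LibTorch at link time (assuming the default build/ output directory of nccl-tests and the LibTorch install location above), a quick check is:

  ldd ./build/all_reduce_perf | grep -Ei 'torch|c10'

If nothing matches, the binary was likely built without GLOO=1 or against a different LibTorch path.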

Usage

Run a Single-Node, 8-GPU NCCL Test:

  1. Set environment variables:

    export NCCL_TOPO_FILE=<topo_file_location>
    export LD_LIBRARY_PATH=/usr/local/libtorch/lib:$LD_LIBRARY_PATH
  2. Execute the test:

    #!/bin/bash

    # Launch 8 local ranks; each process drives a single GPU (-g1),
    # together forming one WORLD_SIZE=8 job rendezvousing at localhost.
    for i in {0..7}; do
      MASTER_ADDR=localhost RANK=$i WORLD_SIZE=8 ./all_reduce_perf -b1G -e2G -f2 -t1 -g1 &
    done

    # Wait for all background ranks to finish.
    wait

Run a Two-Node NCCL Test:

Node 1:

  1. Set environment variables:

    export NCCL_TOPO_FILE=<topo_file_location>
    export MASTER_ADDR=<master_node_ip_address>
    export LD_LIBRARY_PATH=/usr/local/libtorch/lib:$LD_LIBRARY_PATH
  2. Execute the test:

    # Rank 0 runs on the master node; -g8 drives all 8 local GPUs from one process.
    RANK=0 WORLD_SIZE=2 /tmp/all_reduce_perf -b1G -e2G -f2 -t1 -g8

Node 2:

  1. Set environment variables:

    export NCCL_TOPO_FILE=<topo_file_location>
    export MASTER_ADDR=<master_node_ip_address>
    export LD_LIBRARY_PATH=/usr/local/libtorch/lib:$LD_LIBRARY_PATH
  2. Execute the test:

    # Rank 1 connects to MASTER_ADDR and joins the same 2-rank job.
    RANK=1 WORLD_SIZE=2 /tmp/all_reduce_perf -b1G -e2G -f2 -t1 -g8
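For convenience, the two per-node launches above can be wrapped in a small helper script that takes the rank as its only argument. This is just a sketch; the script name is hypothetical and the placeholders are the same ones used in the steps above:

    #!/bin/bash
    # run_two_node_test.sh <rank> -- hypothetical wrapper around the steps above
    export NCCL_TOPO_FILE=<topo_file_location>
    export MASTER_ADDR=<master_node_ip_address>
    export LD_LIBRARY_PATH=/usr/local/libtorch/lib:$LD_LIBRARY_PATH
    RANK=$1 WORLD_SIZE=2 /tmp/all_reduce_perf -b1G -e2G -f2 -t1 -g8

Node 1 would then run ./run_two_node_test.sh 0 and Node 2 would run ./run_two_node_test.sh 1.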