NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Enroot #1467

Open dobiup opened 2 weeks ago

dobiup commented 2 weeks ago

Intra-node collective communication works well via NCCL (H100 HGX server with NVSwitch), but for inter-node communication (GPUDirect RDMA) we hit the InfiniBand device error below.

EXPORT REMOVED

```bash
NCCL_IB_HCA="=mlx5_0,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8,mlx5_9"
NCCL_SOCKET_IFNAME=eth0
NCCL_DEBUG=INFO
NCCL_IB_HCA="mlx5"
NCCL_ALGO=Tree,Ring
MELLANOX_VISIBLE_DEVICES=all
NVIDIA_VISIBLE_DEVICES=void
NCCL_IB_SL=1
UCX_NET_DEVICES=mlx_5
LD_LIBRARY_PATH=${NCCL_DIR}/build/lib:${LD_LIBRARY_PATH}
```
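For reference, a trimmed-down starting point for these variables while debugging the transport selection (NCCL_DEBUG_SUBSYS is an addition here, not part of the job above):

```bash
NCCL_DEBUG=INFO
NCCL_DEBUG_SUBSYS=INIT,NET     # log how NCCL selects its network devices and transports
NCCL_SOCKET_IFNAME=eth0        # bootstrap/TCP interface, same as in the settings above
NCCL_IB_HCA="=mlx5_0,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8,mlx5_9"
```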

```bash
# Log the assigned nodes
echo "Using nodes: $SLURM_JOB_NODELIST"

export NCCL_TESTS_DIR="/testdir/shared/users/wayne/nccl-tests"

srun --mpi=pmi2 \
    --container-image="/testdir/shared/sqsh/nvidia+pytorch+24.09-py3.sqsh" \
    --container-mounts=$NCCL_TESTS_DIR \
    --container-remap-root --no-container-mount-home \
    "${NCCL_TESTS_DIR}/build/all_reduce_perf" \
    --minbytes 1G --maxbytes 8G --stepfactor 2 --ngpus 8 --warmup_iters 5 --iters 40 -c 0
```

Output:

```
Using nodes:
ib_device.c:1171 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.0.0.40 flow_label=0xffffffff sgid_index=3 traffic_class=106) for RC DEVX QP connect on mlx5_bond_0 failed: Connection timed out
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: JOB 150 ON hgx CANCELLED AT 2024-10-02T03:03:04
slurmstepd: error: STEP 150.0 ON hgx CANCELLED AT 2024-10-02T03:03:04
```

I will try to run rdma/perftest inside enroot to check this, but I would like to know the right approach, and any documentation or tips for it.
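A rough sketch of what I have in mind, in case it helps others (node names, the device name, and the presence of rdma-core/perftest tools inside the image are assumptions, not taken from the job above):

```bash
# List the RDMA devices that are actually visible inside the enroot container.
srun -N 2 --ntasks-per-node=1 \
    --container-image="/testdir/shared/sqsh/nvidia+pytorch+24.09-py3.sqsh" \
    --container-remap-root --no-container-mount-home \
    ibv_devices

# Point-to-point RDMA bandwidth with perftest: server on nodeA, client on nodeB.
srun -w nodeA -N 1 --container-image="/testdir/shared/sqsh/nvidia+pytorch+24.09-py3.sqsh" \
    ib_write_bw -d mlx5_0 --report_gbits &
srun -w nodeB -N 1 --container-image="/testdir/shared/sqsh/nvidia+pytorch+24.09-py3.sqsh" \
    ib_write_bw -d mlx5_0 --report_gbits nodeA
```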

kiskra-nvidia commented 2 weeks ago

You are getting an error from UCX, not NCCL. Quite likely, it's MPI initialization that's failing and you never even get to the NCCL init.

So your first step should be to confirm whether any simple MPI program works. You can also try configuring your MPI differently, say, to use TCP, which may be simpler to get working (this will not affect NCCL's performance, as NCCL performs its own inter-node communication independently of MPI).
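For example, something along these lines (a sketch only; it assumes Open MPI and mpicc are available inside the container image, and it reuses $NCCL_TESTS_DIR simply as a shared workspace):

```bash
# A trivial MPI program: if MPI_Init fails here, the problem is in the MPI/UCX setup, not NCCL.
cat > "${NCCL_TESTS_DIR}/mpi_hello.c" <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);               /* the step suspected to fail */
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF

# Build it with the container's MPI, then run one task per node across two nodes.
srun -N 1 --container-image="/testdir/shared/sqsh/nvidia+pytorch+24.09-py3.sqsh" \
    --container-mounts=$NCCL_TESTS_DIR \
    mpicc "${NCCL_TESTS_DIR}/mpi_hello.c" -o "${NCCL_TESTS_DIR}/mpi_hello"

srun --mpi=pmi2 -N 2 --ntasks-per-node=1 \
    --container-image="/testdir/shared/sqsh/nvidia+pytorch+24.09-py3.sqsh" \
    --container-mounts=$NCCL_TESTS_DIR \
    "${NCCL_TESTS_DIR}/mpi_hello"

# To take UCX/InfiniBand out of the picture for MPI itself (NCCL still talks to the IB
# devices directly), Open MPI can be forced onto plain TCP:
export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=tcp,self
export OMPI_MCA_btl_tcp_if_include=eth0
```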