NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Enroot #1467

Open dobiup opened 2 weeks ago

dobiup commented 2 weeks ago

Intra-node collective communication works well via NCCL (H100 HGX server with NVSwitch), but for inter-node communication (GPUDirect RDMA) we hit the InfiniBand device error below.

EXPORT REMOVED

```bash
NCCL_IB_HCA="=mlx5_0,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8,mlx5_9"
NCCL_SOCKET_IFNAME=eth0
NCCL_DEBUG=INFO
NCCL_IB_HCA="mlx5"
NCCL_ALGO=Tree,Ring
MELLANOX_VISIBLE_DEVICES=all
NVIDIA_VISIBLE_DEVICES=void
NCCL_IB_SL=1
UCX_NET_DEVICES=mlx_5
LD_LIBRARY_PATH=${NCCL_DIR}/build/lib:${LD_LIBRARY_PATH}
```
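For reference, a trimmed-down starting point for these variables while debugging the transport selection (NCCL_DEBUG_SUBSYS is an addition here, not part of the job above):

```bash
NCCL_DEBUG=INFO
NCCL_DEBUG_SUBSYS=INIT,NET     # log how NCCL selects its network devices and transports
NCCL_SOCKET_IFNAME=eth0        # bootstrap/TCP interface, same as in the settings above
NCCL_IB_HCA="=mlx5_0,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8,mlx5_9"
```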

```bash
# Log the assigned nodes
echo "Using nodes: $SLURM_JOB_NODELIST"

export NCCL_TESTS_DIR="/testdir/shared/users/wayne/nccl-tests"

srun --mpi=pmi2 \
    --container-image="/testdir/shared/sqsh/nvidia+pytorch+24.09-py3.sqsh" \
    --container-mounts=$NCCL_TESTS_DIR \
    --container-remap-root --no-container-mount-home \
    "${NCCL_TESTS_DIR}/build/all_reduce_perf" \
    --minbytes 1G --maxbytes 8G --stepfactor 2 --ngpus 8 --warmup_iters 5 --iters 40 -c 0
```

Output:

```
Using nodes:
ib_device.c:1171 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.0.0.40 flow_label=0xffffffff sgid_index=3 traffic_class=106) for RC DEVX QP connect on mlx5_bond_0 failed: Connection timed out
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: JOB 150 ON hgx CANCELLED AT 2024-10-02T03:03:04
slurmstepd: error: STEP 150.0 ON hgx CANCELLED AT 2024-10-02T03:03:04
```

I will try to run rdma/perftest inside enroot to check this, but I would like to know the right approach, and any documentation or tips for it.
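A rough sketch of what I have in mind, in case it helps others (node names, the device name, and the presence of rdma-core/perftest tools inside the image are assumptions, not taken from the job above):

```bash
# List the RDMA devices that are actually visible inside the enroot container.
srun -N 2 --ntasks-per-node=1 \
    --container-image="/testdir/shared/sqsh/nvidia+pytorch+24.09-py3.sqsh" \
    --container-remap-root --no-container-mount-home \
    ibv_devices

# Point-to-point RDMA bandwidth with perftest: server on nodeA, client on nodeB.
srun -w nodeA -N 1 --container-image="/testdir/shared/sqsh/nvidia+pytorch+24.09-py3.sqsh" \
    ib_write_bw -d mlx5_0 --report_gbits &
srun -w nodeB -N 1 --container-image="/testdir/shared/sqsh/nvidia+pytorch+24.09-py3.sqsh" \
    ib_write_bw -d mlx5_0 --report_gbits nodeA
```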

kiskra-nvidia commented 2 weeks ago

You are getting an error from UCX, not NCCL. Quite likely, it's MPI initialization that's failing and you never even get to the NCCL init.

So your first step should be to confirm whether any simple MPI program works. You can also try configuring your MPI differently, say, to use TCP, which may be simpler to get working (this will not affect NCCL's performance, as NCCL performs its own inter-node communication independently of MPI).
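For example, something along these lines (a sketch only; it assumes Open MPI and mpicc are available inside the container image, and it reuses $NCCL_TESTS_DIR simply as a shared workspace):

```bash
# A trivial MPI program: if MPI_Init fails here, the problem is in the MPI/UCX setup, not NCCL.
cat > "${NCCL_TESTS_DIR}/mpi_hello.c" <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);               /* the step suspected to fail */
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF

# Build it with the container's MPI, then run one task per node across two nodes.
srun -N 1 --container-image="/testdir/shared/sqsh/nvidia+pytorch+24.09-py3.sqsh" \
    --container-mounts=$NCCL_TESTS_DIR \
    mpicc "${NCCL_TESTS_DIR}/mpi_hello.c" -o "${NCCL_TESTS_DIR}/mpi_hello"

srun --mpi=pmi2 -N 2 --ntasks-per-node=1 \
    --container-image="/testdir/shared/sqsh/nvidia+pytorch+24.09-py3.sqsh" \
    --container-mounts=$NCCL_TESTS_DIR \
    "${NCCL_TESTS_DIR}/mpi_hello"

# To take UCX/InfiniBand out of the picture for MPI itself (NCCL still talks to the IB
# devices directly), Open MPI can be forced onto plain TCP:
export OMPI_MCA_pml=ob1
export OMPI_MCA_btl=tcp,self
export OMPI_MCA_btl_tcp_if_include=eth0
```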