dobiup opened 1 month ago
You are getting an error from UCX, not NCCL. Quite likely, it's MPI initialization that's failing and you never even get to the NCCL init.
So your first step should be to confirm whether any simple MPI program works. You can also try configuring your MPI differently, say, to use TCP, which may be simpler to get working (this will not affect NCCL's performance, as NCCL performs its own inter-node communication independently of MPI).
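As a concrete sketch of that first step: compile a trivial MPI program and run it across both nodes with the transport forced to TCP. The commands below assume Open MPI with UCX; the host names and exact flags are illustrative, so adjust them for your cluster.

```shell
# Minimal MPI sanity check, independent of NCCL.
# Assumes Open MPI; node1/node2 are placeholder host names.

# 1) A trivial MPI program:
cat > mpi_hello.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
EOF
mpicc mpi_hello.c -o mpi_hello

# 2) Run it across both nodes, forcing TCP instead of InfiniBand:
#    UCX_TLS=tcp,self restricts UCX to TCP transports.
mpirun -np 2 -H node1,node2 -x UCX_TLS=tcp,self ./mpi_hello

# Alternatively, bypass UCX entirely (Open MPI's ob1/tcp path):
mpirun -np 2 -H node1,node2 --mca pml ob1 --mca btl tcp,self ./mpi_hello
```

If this fails the same way, the problem is in MPI/UCX bootstrap rather than in NCCL or the nccl-tests binary.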
Intra-node collective communication works well via NCCL (H100 HGX server with NVSwitch), but we encountered the error below, an InfiniBand device error, for inter-node communication (GPUDirect RDMA).
EXPORT REMOVED

```shell
NCCL_IB_HCA="=mlx5_0,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8,mlx5_9" \
NCCL_SOCKET_IFNAME=eth0 \
NCCL_DEBUG=INFO \
NCCL_IB_HCA="mlx5" \
NCCL_ALGO=Tree,Ring \
MELLANOX_VISIBLE_DEVICES=all \
NVIDIA_VISIBLE_DEVICES=void \
NCCL_IB_SL=1 \
UCX_NET_DEVICES=mlx_5 \
LD_LIBRARY_PATH=${NCCL_DIR}/build/lib:${LD_LIBRARY_PATH}
```
```shell
# Log the assigned nodes
echo "Using nodes: $SLURM_JOB_NODELIST"

export NCCL_TESTS_DIR="/testdir/shared/users/wayne/nccl-tests"

srun --mpi=pmi2 \
  --container-image="/testdir/shared/sqsh/nvidia+pytorch+24.09-py3.sqsh" \
  --container-mounts=$NCCL_TESTS_DIR \
  --container-remap-root --no-container-mount-home \
  "${NCCL_TESTS_DIR}/build/all_reduce_perf" \
  --minbytes 1G --maxbytes 8G --stepfactor 2 --ngpus 8 --warmup_iters 5 --iters 40 -c 0
```

Output:

```
Using nodes: hgx
ib_device.c:1171 UCX ERROR ibv_create_ah(dlid=49152 sl=0 port=1 src_path_bits=0 dgid=::ffff:10.0.0.40 flow_label=0xffffffff sgid_index=3 traffic_class=106) for RC DEVX QP connect on mlx5_bond_0 failed: Connection timed out
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: JOB 150 ON hgx CANCELLED AT 2024-10-02T03:03:04
slurmstepd: error: STEP 150.0 ON hgx CANCELLED AT 2024-10-02T03:03:04
```
I will try to check rdma/perftest inside enroot, but I would like to know the right approach, or any manual or tips for this.
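For the rdma/perftest check, a rough sketch could look like the following. The device name `mlx5_0` and the peer address are placeholders; substitute the HCAs from your `NCCL_IB_HCA` list and the actual IPoIB/Ethernet address of the other node.

```shell
# Raw RDMA sanity check with perftest, outside of NCCL and MPI.
# mlx5_0 and 10.0.0.40 are placeholders for your HCA and peer address.

# On node A (server side):
ib_write_bw -d mlx5_0 -i 1 --report_gbits

# On node B (client side), pointing at node A:
ib_write_bw -d mlx5_0 -i 1 --report_gbits 10.0.0.40

# If this also times out, the issue is fabric/GID configuration
# (subnet manager, RoCE GID index, bonding), not NCCL itself.
# These help confirm port state and which GID index maps to which port:
ibv_devinfo -d mlx5_0
show_gids
```

Since the UCX error mentions `sgid_index=3` on `mlx5_bond_0`, comparing the `show_gids` output on both nodes is a reasonable next step.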