Closed gim4moon closed 4 months ago
We can only guess what is wrong with your system without seeing the NCCL_DEBUG=INFO file.
Do simple MPI programs work between the two nodes?
Can you try experiments with smaller message sizes, e.g. -b 8 -e 8?
Maybe try with just one GPU per node (-g 1)?
And perhaps try with NCCL_ALGO=RING NCCL_PROTO=SIMPLE to rule out some issues.
Are you using GPUDirect RDMA (GDRDMA)? How many network adapters are there per node?
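The suggestions above can be folded into one minimal run. This is only a sketch: the host names, binary path and base flags are copied from the reported command, `check_mpi` and `run_minimal` are hypothetical helper names, and `NCCL_DEBUG_FILE` is an addition not mentioned in the thread (NCCL expands `%h` to the host name and `%p` to the process ID).

```shell
# Each helper takes an optional prefix: pass `echo` for a dry run,
# or nothing to actually execute the command.

# Step 1: does plain MPI launch work between the two nodes?
check_mpi() {
  "$@" mpirun --allow-run-as-root -np 2 -H host1,host2 hostname
}

# Step 2: smallest NCCL case: one GPU per node, a single 8-byte message,
# ring algorithm and simple protocol, with the debug log written to a file.
run_minimal() {
  "$@" mpirun --allow-run-as-root -np 2 -H host1,host2 \
    -x LD_LIBRARY_PATH \
    -x NCCL_DEBUG=INFO -x NCCL_DEBUG_FILE=nccl.%h.%p.log \
    -x NCCL_ALGO=RING -x NCCL_PROTO=SIMPLE \
    -x NCCL_IB_HCA=mlx5 \
    /nccl-tests/build/all_reduce_perf -b 8 -e 8 -g 1
}

# Dry run: print both commands without executing them.
check_mpi echo
run_minimal echo
```

If the minimal case passes, restoring one setting at a time (first `-g 8`, then the larger `-e` value, then the default algorithm and protocol) should narrow down which change brings the hang back.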
After downgrading HPC-X to version 2.14 to match the installed MLNX OFED version, I ran nccl-tests again and it worked.
However, the bandwidth is lower than I expected for an NDR environment.
command: mpirun -np 4 --allow-run-as-root --bind-to socket -H test01,test02,test03,test04 -x NCCL_CUMEM_ENABLE=0 -x LD_LIBRARY_PATH -x NCCL_UCX_TLS=rc_x,cuda_copy -x NCCL_UCX_RNDV_THRESH=0 -x UCX_MEMTYPE_CACHE=n -x NCCL_COLLNET_ENABLE=0 -x NCCL_PLUGIN_P2P=ucx -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=NET -x NCCL_IB_HCA=mlx5 ./build/all_reduce_perf -b 128M -e 2G -f 2 -g 8
 134217728   33554432  float  sum  -1  1455.5   92.21  178.66  0  1455.4   92.22  178.68  0
 268435456   67108864  float  sum  -1  2871.7   93.48  181.11  0  2872.7   93.44  181.05  0
 536870912  134217728  float  sum  -1  5287.1  101.54  196.74  0  5284.9  101.59  196.82  0
1073741824  268435456  float  sum  -1   10533  101.94  197.51  0   10532  101.95  197.53  0
2147483648  536870912  float  sum  -1   21101  101.77  197.18  0   21260  101.01  195.70  0
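To read these numbers, it helps to recompute the two bandwidth columns. nccl-tests reports algorithm bandwidth as size divided by time, and for all-reduce it scales that by 2(n-1)/n to get bus bandwidth, where n is the number of ranks. The sketch below assumes n = 32 (4 processes with -g 8 each, as in the command above); `allreduce_bw` is a hypothetical helper name, not part of nccl-tests.

```shell
# Recompute the nccl-tests all_reduce bandwidth columns from size and time.
#   $1 = message size in bytes, $2 = time in microseconds, $3 = number of ranks
allreduce_bw() {
  awk -v size="$1" -v us="$2" -v n="$3" 'BEGIN {
    algbw = size / us / 1000          # bytes per microsecond -> GB/s
    busbw = algbw * 2 * (n - 1) / n   # all-reduce correction factor
    printf "%.2f %.2f\n", algbw, busbw
  }'
}

# 1 GiB row of the run above (out-of-place time 10533 us, 32 ranks)
allreduce_bw 1073741824 10533 32   # prints "101.94 197.51"
```

The result matches the reported row, so the roughly 197 GB/s figure is bus bandwidth, not raw link throughput. One NDR port carries 400 Gb/s, i.e. 50 GB/s, so whether this is below expectation depends on how many adapters each node has, which the thread does not state.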
Hello.
While running nccl-tests across multiple nodes with 8 GPUs each, the test hangs for an unknown reason.
command:
mpirun --allow-run-as-root -np 2 -H host1,host2 -x LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_CUMEM_ENABLE=1 -x NCCL_IB_HCA=mlx5 /nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
(hang here)
The environment is as follows.
OS: Ubuntu 22.04
OS Kernel: 5.15.0-25-generic
nvidia-driver: 535.86.10
CUDA toolkit: 12.2
MLNX OFED version: 5.8-1.1.2.1
HPC-X version: 2.17.1-CUDA12.x-LTS mlnx_ofed Ubuntu 22.04
NCCL version: 2.19.3-cuda12.2_1.0-1
It is a bare-metal environment with two worker nodes.
I am testing multi-node communication using InfiniBand NDR adapters.
Can you help me figure out what the problem is?