NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

Multi node test hang phenomenon #200

Closed gim4moon closed 4 months ago

gim4moon commented 4 months ago

hello.

While running nccl-tests across two nodes with 8 GPUs each, the test hangs for an unknown reason.

command:

mpirun --allow-run-as-root -np 2 -H host1,host2 -x LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_CUMEM_ENABLE=1 -x NCCL_IB_HCA=mlx5 /nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8

(hang here)

The environment is as follows:

- OS: Ubuntu 22.04
- OS kernel: 5.15.0-25-generic
- NVIDIA driver: 535.86.10
- CUDA toolkit: 12.2
- MLNX OFED version: 5.8-1.1.2.1 (Ubuntu 22.04)
- HPC-X version: 2.17.1-CUDA12.x-LTS
- NCCL version: 2.19.3-cuda12.2_1.0-1

This is a bare-metal environment with two worker nodes.

I am testing multi-node communication over InfiniBand NDR adapters.

Can you help me figure out what the problem is?

AddyLaddy commented 4 months ago

Without seeing the NCCL_DEBUG=INFO log, we can only guess what is wrong with your system.

Do simple MPI programs work between the two nodes?

Can you try a few experiments to narrow it down?

- Smaller message sizes, e.g. -b 8 -e 8
- Just one GPU per node: -g 1
- NCCL_ALGO=RING NCCL_PROTO=SIMPLE, to rule out some issues

Are you using GDRDMA? How many network adapters per node?
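The narrowing-down steps above can be combined into a single reduced-scope rerun of the hanging command. A sketch, reusing the hostnames, HCA selection, and binary path from the original invocation (the RUN guard is an illustrative addition so the command can be inspected without a cluster):

```shell
# Hypothetical reduced-scope rerun: tiny messages (-b 8 -e 8), one GPU
# per node (-g 1), and a pinned algorithm/protocol (NCCL_ALGO=RING,
# NCCL_PROTO=SIMPLE) to rule out algorithm- or protocol-specific issues.
CMD="mpirun --allow-run-as-root -np 2 -H host1,host2 \
 -x LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_IB_HCA=mlx5 \
 -x NCCL_ALGO=RING -x NCCL_PROTO=SIMPLE \
 /nccl-tests/build/all_reduce_perf -b 8 -e 8 -g 1"

# RUN=1 actually launches the job; by default just print it for review.
if [ "${RUN:-0}" = "1" ]; then
  eval "$CMD"
else
  echo "$CMD"
fi
```

If this minimal configuration still hangs, the problem is likely in the network setup rather than in a specific NCCL algorithm or message size.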

gim4moon commented 4 months ago

After downgrading HPC-X to version 2.14 to match the installed MLNX OFED release, I reran nccl-tests and it worked well.

However, the speed is not as fast as expected, even though this is an NDR environment.

command: mpirun -np 4 --allow-run-as-root --bind-to socket -H test01,test02,test03,test04 -x NCCL_CUMEM_ENABLE=0 -x LD_LIBRARY_PATH -x NCCL_UCX_TLS=rc_x,cuda_copy -x NCCL_UCX_RNDV_THRESH=0 -x UCX_MEMTYPE_CACHE=n -x NCCL_COLLNET_ENABLE=0 -x NCCL_PLUGIN_P2P=ucx -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=NET -x NCCL_IB_HCA=mlx5 ./build/all_reduce_perf -b 128M -e 2G -f 2 -g 8

                                                      out-of-place                       in-place
       size         count      type   redop    root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
  134217728      33554432     float     sum      -1   1455.5   92.21  178.66       0   1455.4   92.22  178.68       0
  268435456      67108864     float     sum      -1   2871.7   93.48  181.11       0   2872.7   93.44  181.05       0
  536870912     134217728     float     sum      -1   5287.1  101.54  196.74       0   5284.9  101.59  196.82       0
 1073741824     268435456     float     sum      -1    10533  101.94  197.51       0    10532  101.95  197.53       0
 2147483648     536870912     float     sum      -1    21101  101.77  197.18       0    21260  101.01  195.70       0

Out of bounds values : 0 OK

Avg bus bandwidth : 190.098
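For reference, nccl-tests derives bus bandwidth from algorithm bandwidth; for AllReduce the factor is 2(n-1)/n, where n is the total number of ranks (here 4 nodes x 8 GPUs = 32). A minimal sketch checking the numbers in the table above:

```python
# nccl-tests AllReduce bus-bandwidth relation: busbw = algbw * 2*(n-1)/n,
# where n is the total number of ranks participating in the collective.

def allreduce_busbw(algbw_gbs: float, nranks: int) -> float:
    """Bus bandwidth (GB/s) implied by an AllReduce algorithm bandwidth."""
    return algbw_gbs * 2 * (nranks - 1) / nranks

# 4 nodes x 8 GPUs = 32 ranks; algbw value from the 1 GiB row above.
print(round(allreduce_busbw(101.94, 32), 2))  # 197.51, matching the table
```

So the ~197 GB/s busbw figures are consistent with ~102 GB/s of per-rank algorithm bandwidth at 32 ranks.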