NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

HGX 2-node test with different NIC topologies and different network card names hangs, no results #1277

Open superLiben opened 4 months ago

superLiben commented 4 months ago

I have an HGX H100 server with 2 nodes, and I'm performing node bandwidth testing. After running the command, it hangs. My NCCL is the latest version, and OpenMPI is 4.1.7. I found that the NIC topology is different between the two nodes, which may cause the hang. If I test two nodes with the same IB card topology, there is no issue. My run command is as follows:

/root/nccl_apps/openmpi-4.1.7a1/bin/mpirun --allow-run-as-root -np 16 -H 100.64.24.75:8,100.64.24.76:8 --timestamp-output --mca btl_tcp_if_include enp25s0np0 --mca oob_tcp_if_include enp25s0np0 -x NCCL_IB_GID_INDEX=3 -x NCCL_DEBUG=WARN -x NCCL_DEBUG_SUBSYS=INIT,NET,GRAPH -x NCCL_IB_QPS_PER_CONNECTION=4 -x NCCL_PXN_DISABLE=0 -x NCCL_CROSS_NIC=1 -x LD_LIBRARY_PATH=/root/nccl_apps/nccl/lib:/root/nccl_apps/openmpi-4.1.7a1/lib /root/nccl_apps/nccl-test/all_reduce_perf -b 1M -e 20G -g 1 -f 2

Each host configuration:
- OpenMPI version: 4.1.7a1
- NCCL version: 2.21.5
- H100 GPUs: 8
- 400G CX7 NICs: 4 (inter-GPU communication through the switch)
- Other NICs: 1 x 200G and 4 x 25G (management and storage traffic)

The two images show the NICs on each node: one node has mlx5_0/3/5/8 and the other has mlx5_0/1/4/5. They are all 400Gb single-port IB cards, and my network is RoCE v2. Note: mlx5 is the driver for Mellanox (NVIDIA) ConnectX NICs, and names such as mlx5_0 or mlx5_3 are RDMA device names. The numbers are device indices assigned by the driver as devices are enumerated, not ports or lanes on a card. RoCE v2 stands for RDMA over Converged Ethernet version 2, a protocol for remote direct memory access over Ethernet.

Can the different NIC topology between the two nodes cause the hang? [images: 20240508121515, 20240508121536]
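To make the naming difference described above concrete, here is a minimal Python sketch that compares the two device lists from the issue. The labels `node_a` and `node_b` are hypothetical (the issue does not say which node has which list), and pairing the lists by position assumes they are written in slot order, which the issue does not confirm.

```python
# 400G device names as reported in the issue.
# Which node holds which list is an assumption; node_a/node_b are hypothetical labels.
node_a = ["mlx5_0", "mlx5_3", "mlx5_5", "mlx5_8"]
node_b = ["mlx5_0", "mlx5_1", "mlx5_4", "mlx5_5"]


def compare_nic_names(a, b):
    """Return (names only in a, names only in b, positional pairs whose names differ)."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    # Assumes both lists are in the same (slot) order, so index i on each
    # node refers to the NIC that would be expected to talk to its peer.
    mismatched = [(x, y) for x, y in zip(a, b) if x != y]
    return only_a, only_b, mismatched


only_a, only_b, mismatched = compare_nic_names(node_a, node_b)
print("only on node_a:", only_a)
print("only on node_b:", only_b)
print("positional mismatches:", mismatched)
```

Note that mlx5_5 exists on both nodes but in a different position, so a name that matches on both sides does not necessarily refer to the same slot.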

superLiben commented 4 months ago

[image] I would like to ask about the NICs marked in red and pink, both of which are 400G network cards. Can network cards with different names be tested against each other? Will NCCL become unresponsive if the network card names differ?

[image] There is no response after executing the command.

[image]

superLiben commented 4 months ago

Regarding the above question: if the two nodes have the same topology, the same network card slot positions, and the same network card names, the NCCL test runs and reaches a bus bandwidth of 360 GB/s. If the two nodes have different network card slot positions, so that the network card names change, can the NCCL cluster still communicate and run the test? The command I am running currently hangs, and I am unsure how to resolve it.

sjeaugey commented 4 months ago

Configuring multiple RoCE NICs is complicated, because of the IP subnets which may prevent NICs from communicating with each other. Adding the fact that NICs are not numbered in the same way amplifies the complexity by a new order of magnitude.
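The subnet concern can be illustrated with a small Python sketch using the standard `ipaddress` module. All IP addresses and the device-to-address mapping below are hypothetical, since the issue does not list them. Treating "same subnet" as "can communicate" is a simplification: routed setups can reach across subnets.

```python
import ipaddress


def same_subnet(if_a, if_b):
    """True if two interfaces (CIDR strings such as '10.0.1.1/24') share an IP network."""
    a = ipaddress.ip_interface(if_a)
    b = ipaddress.ip_interface(if_b)
    return a.network == b.network


def reachable_pairs(node_a, node_b):
    """List (dev_on_a, dev_on_b) pairs whose addresses are in the same subnet."""
    return sorted(
        (dev_a, dev_b)
        for dev_a, ip_a in node_a.items()
        for dev_b, ip_b in node_b.items()
        if same_subnet(ip_a, ip_b)
    )


# Hypothetical addressing: one subnet per NIC slot, with different device
# names on each node for the second slot.
node_a = {"mlx5_0": "10.0.0.1/24", "mlx5_3": "10.0.1.1/24"}
node_b = {"mlx5_0": "10.0.0.2/24", "mlx5_1": "10.0.1.2/24"}

pairs = reachable_pairs(node_a, node_b)
print(pairs)
```

With this layout each NIC has exactly one same-subnet peer, and the peer of mlx5_3 on one node is mlx5_1 on the other, so pairing NICs by name alone would pick the wrong partner.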

Making this setup work should be possible, but would require a lot of time and effort (which we can't provide over GitHub issues). I would suggest you ensure all nodes are exactly identical.

superLiben commented 4 months ago

Thanks. Can changing the NIC names using "rdma set name" solve this problem?
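As a sketch of what such a renaming could look like, the Python snippet below only prints rename commands and does not run them. It assumes the iproute2 syntax `rdma dev set DEV name NEWNAME`. The `roce` prefix and the choice of neutral target names (to avoid colliding with existing mlx5_N devices) are hypothetical. Whether renaming resolves the hang is not answered in this thread.

```python
def rename_commands(current_names, prefix="roce"):
    """Build one 'rdma dev set' command per device, mapping slot order to prefix0, prefix1, ...

    The prefix is a hypothetical choice; neutral names avoid clashing with
    other mlx5_N devices that already exist on the host.
    """
    return [
        f"rdma dev set {old} name {prefix}{i}"
        for i, old in enumerate(current_names)
    ]


# 400G device names from the issue, assumed to be listed in slot order.
node_a_cmds = rename_commands(["mlx5_0", "mlx5_3", "mlx5_5", "mlx5_8"])
node_b_cmds = rename_commands(["mlx5_0", "mlx5_1", "mlx5_4", "mlx5_5"])

for cmd in node_a_cmds + node_b_cmds:
    print(cmd)
```

After such a rename, both nodes would expose the same four names in the same order, but IP subnet configuration would still need to match on each pair.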