NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Inter-node communication with Rail-optimized design for RoCEv2 #1109

Open skullq opened 9 months ago

skullq commented 9 months ago

Hi Team,

I have 3 nodes, and each node has 8 x H100 GPUs and 8 Ethernet NICs. Intra-node GPU communication works great over NVLink. For inter-node communication, I have allocated one /24 subnet to the GPU NICs of all 3 nodes, which means all NICs are placed in the same broadcast domain. As a result, NCCL picks only one 400G Ethernet NIC, and from the switch's perspective MAC and ARP replies are seen on only one interface per node.
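(For reference, this is roughly how I check which interface NCCL ends up using, with the standard NCCL debug variables; the launch line is only a placeholder for my real job.)

```bash
# Show NCCL's NIC selection during init (standard NCCL debug variables).
# The launcher and binary below are placeholders for the real job.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET
mpirun -np 24 ./my_nccl_app
```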

After watching a video presented by @sjeaugey that introduced the rail-optimized design, I want to change the IP scheme to avoid inter-node routing collisions. In his presentation there are 4 leaf switches and 4 nodes, and each node's 4 bonding interfaces are connected to a different leaf.

My question is how to allocate IP subnets across the 3 nodes.

Case-1: 4 subnets, one per bonding interface, shared by all nodes. e.g. bond1 on all nodes is in the same subnet1 (10.1.1.0/24), bond2 on all nodes is in the same subnet2 (10.1.2.0/24), and so on.

Case-2: a dedicated subnet per bonding interface, not shared across nodes, so 16 subnets in total. This requires 0.0.0.0 default gateway routes to reach the other subnets, so 8 extra default gateway routes are required.
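To make the two cases concrete, a minimal sketch of Node-1's addressing (interface names, addresses and the gateway are placeholders):

```bash
# Case-1 on Node-1: one shared /24 per rail, same subnet on every node
ip addr add 10.1.1.1/24 dev bond1   # rail 1: 10.1.1.0/24 on all nodes
ip addr add 10.1.2.1/24 dev bond2   # rail 2: 10.1.2.0/24 on all nodes
ip addr add 10.1.3.1/24 dev bond3   # rail 3: 10.1.3.0/24 on all nodes
ip addr add 10.1.4.1/24 dev bond4   # rail 4: 10.1.4.0/24 on all nodes

# Case-2 on Node-1: a dedicated subnet per interface, so any cross-node
# traffic has to go through a gateway (the .254 gateway is a placeholder)
ip addr add 10.2.1.1/24 dev bond1
ip route add default via 10.2.1.254 dev bond1
```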

What is the rail-optimized design concept for RoCEv2?

Thanks in advance

sjeaugey commented 9 months ago

The rail-optimization is about physical switch wiring. Not about the IP subnet design.

Allocating a different subnet for each rail (case-1) might work for most use cases, like full allreduce or alltoall with PXN enabled, but there can be corner cases where it will cause failures because we'll need to communicate across NICs. Ideally, each subnet should be able to communicate with others at full speed. Your switches might allow that, or not.
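For instance, if you go with per-rail subnets and want to reduce the chance of NCCL crossing rails for its rings/trees, the NCCL_CROSS_NIC variable is worth looking at (a sketch; check the documentation for your NCCL version):

```bash
# Keep each ring/tree on the same NIC index on both sides, so traffic stays
# on its rail instead of crossing over to another rail's subnet
export NCCL_CROSS_NIC=0
```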

Allocating a subnet per interface (case-2) makes little sense to me. Not sure why you'd want to do that.

The last alternative is to use a single subnet for all NICs. That would ensure all NICs can talk to each other, but it requires complex configuration in Linux to make sure RoCE will select the right NIC.
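For reference, that "complex configuration" usually boils down to per-source policy routing plus ARP-related sysctls, along these lines (placeholder addresses and device names, only two NICs shown):

```bash
# Single subnet 10.1.1.0/24 shared by two local NICs: give each NIC its own
# routing table so traffic sourced from its address always leaves through it
ip route add 10.1.1.0/24 dev eth1 src 10.1.1.1 table 101
ip rule add from 10.1.1.1 table 101
ip route add 10.1.1.0/24 dev eth2 src 10.1.1.2 table 102
ip rule add from 10.1.1.2 table 102
```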

skullq commented 9 months ago

Really appreciate your quick reply! ;-)

I think you fully understand my situation, judging from your last alternative of using a single subnet for all NICs. My research also started from the fact that ARP replies for all the NICs' ARP requests are seen on one interface of the node. I need your advice on the best sysctl kernel options for an NVIDIA GPU Linux box for better RoCEv2 communication. (I found that it's ARP flux prevention.) BTW, the L3 switch is not a problem at all because it has all the gateways, lol.
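The ARP-flux related settings I found are roughly these (a sketch, values still to be validated on my boxes):

```bash
# Typical ARP flux mitigation when several NICs share one subnet
sysctl -w net.ipv4.conf.all.arp_ignore=1    # answer ARP only if the target IP is on the receiving interface
sysctl -w net.ipv4.conf.all.arp_announce=2  # use the best local source address in ARP requests
sysctl -w net.ipv4.conf.all.rp_filter=2     # loose reverse-path filtering for the GPU NICs
```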

Another point is a suggestion of mine. To keep it easy to follow, let's reduce the scenario to 4 NICs per node, with every subnet a /24 and no extra routes configured.

| Node | NIC1 | NIC2 | NIC3 | NIC4 |
| --- | --- | --- | --- | --- |
| Node-1 | 10.1.1.1 | 10.1.2.1 | 10.1.3.1 | 10.1.4.1 |
| Node-2 | 10.1.1.2 | 10.1.2.2 | 10.1.3.2 | 10.1.4.2 |
| Node-3 | 10.1.1.3 | 10.1.2.3 | 10.1.3.3 | 10.1.4.3 |

As you pointed out for my case-1, if Node-1 NIC1 needs to communicate with Node-2 NIC4, there is no route when NCCL picks the Node-1 NIC1 IP, but Node-1 NIC4 is able to talk to Node-2 NIC4 because they are in the same subnet. If NCCL picks the Node-1 NIC4 IP as the source address, then Node-1 can make it work without any complex configuration.
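A quick way to see the difference is to force the source address when testing reachability (addresses taken from the table above):

```bash
# From Node-1, with no extra routes configured:
ping -I 10.1.4.1 10.1.4.2   # NIC4 -> Node-2 NIC4: same subnet, works
ping -I 10.1.1.1 10.1.4.2   # NIC1 -> Node-2 NIC4: no route without a gateway
```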

Cheers!