
NCCL socket performance over multiple NICs #1447

Open iojw opened 1 week ago

iojw commented 1 week ago

How does NCCL decide which NIC to use for inter-node send / recv operations over sockets?

I'm running benchmarks where multiple GPUs on a node send data to the corresponding GPU on another node (gpu0 to gpu0, gpu1 to gpu1, etc.) over sockets. It appears that only eth0 is being used, and bandwidth seems to max out at ~60 Gbps. However, each GPU has multiple 100 Gbps NICs attached to it, so I'd expect to be able to improve performance by using at least one NIC per GPU. Why does NCCL only use eth0?
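
For reference, the communication pattern is roughly the following (a minimal MPI + NCCL sketch, with error handling omitted; the GPUs-per-node count and message size are placeholders, not my exact benchmark):

```cpp
// Sketch of the per-GPU send/recv pattern: rank i on node 0 exchanges data
// with rank i on node 1 over a single NCCL communicator.
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  const int gpusPerNode = 8;                 // placeholder: assumes ranks fill node by node
  cudaSetDevice(rank % gpusPerNode);

  // Rank i on node 0 pairs with rank i on node 1.
  int peer = (rank < gpusPerNode) ? rank + gpusPerNode : rank - gpusPerNode;

  ncclUniqueId id;
  if (rank == 0) ncclGetUniqueId(&id);
  MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

  ncclComm_t comm;
  ncclCommInitRank(&comm, nranks, id, rank);

  size_t count = 256 * 1024 * 1024;          // placeholder message size (floats)
  float *sendbuf, *recvbuf;
  cudaMalloc(&sendbuf, count * sizeof(float));
  cudaMalloc(&recvbuf, count * sizeof(float));
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Bidirectional exchange with the corresponding GPU on the other node.
  ncclGroupStart();
  ncclSend(sendbuf, count, ncclFloat, peer, comm, stream);
  ncclRecv(recvbuf, count, ncclFloat, peer, comm, stream);
  ncclGroupEnd();
  cudaStreamSynchronize(stream);

  cudaFree(sendbuf); cudaFree(recvbuf);
  ncclCommDestroy(comm);
  MPI_Finalize();
  return 0;
}
```

Each rank drives one GPU and exchanges data with the matching rank on the other node; this is the path where I only see traffic on eth0.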

sjeaugey commented 1 week ago

How did you configure your NICs? Did you use the same IP subnet for all NICs or different ones?

iojw commented 1 week ago

I'm on an AWS EC2 instance where the NICs all appear to be on the same subnet.

However, I've also noticed that multiple NICs are used with collectives like all-reduce, so does this issue only occur with send/recv?
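
For what it's worth, I've been checking which interface NCCL picks by enabling network debug output before communicator init; a sketch is below (these are standard NCCL environment variables as far as I know, and the interface names are just placeholders for whatever `ip addr` reports on the instance):

```cpp
// Sketch: enable NCCL network debug output (and optionally pin the NIC list)
// before the communicator is created, so the socket interface selection is
// printed during init. Interface names are placeholders.
#include <cstdlib>

void configure_nccl_env() {
  setenv("NCCL_DEBUG", "INFO", 1);            // print transport/interface info
  setenv("NCCL_DEBUG_SUBSYS", "INIT,NET", 1); // limit output to init + network
  // Optional: explicitly list the NICs NCCL may use for its socket transport.
  setenv("NCCL_SOCKET_IFNAME", "eth0,eth1,eth2,eth3", 1);
}
```

Calling this before ncclCommInitRank makes the INFO output show which interfaces the socket transport binds to, which is how I can tell only eth0 is in use for send/recv.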