Open iojw opened 1 week ago
How did you configure your NICs? Did you use the same IP subnet for all NICs or different ones?
I'm on an AWS EC2 instance where the NICs all appear to be on the same subnet.
However, I've also noticed that multiple NICs are used with collectives like all-reduce, so this issue seems to occur only with send / recv operations?
How does NCCL decide which NIC to use for inter-node send / recv operations over sockets?
I'm running benchmarks where multiple GPUs on a node are sending data to the corresponding GPU on another node (gpu0 to gpu0, gpu1 to gpu1, etc.) over sockets. It appears that only `eth0` is being used, and bandwidth seems to max out at ~60 Gbps. However, each GPU has multiple 100 Gbps NICs attached to it, so I'd expect to be able to improve performance by using at least one NIC per GPU. Why is it that NCCL only uses `eth0`?
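For anyone trying to reproduce the observation, here is a minimal sketch of one way to check per-interface traffic on Linux by sampling `/proc/net/dev` while the benchmark runs. It is an illustration only, not necessarily how the numbers above were measured, and the helper names (`parse_proc_net_dev`, `tx_gbps`, `report`) are hypothetical.

```python
import time


def parse_proc_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev contents."""
    counters = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        if ":" not in line:
            continue
        name, stats = line.split(":", 1)
        fields = stats.split()
        if len(fields) < 16:  # 8 receive columns followed by 8 transmit columns
            continue
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def tx_gbps(before, after, seconds):
    """Transmit rate in Gbit/s per interface between two counter snapshots."""
    return {
        nic: (after[nic][1] - before[nic][1]) * 8 / seconds / 1e9
        for nic in after
        if nic in before
    }


def read_counters(path="/proc/net/dev"):
    """Read and parse the kernel's per-interface byte counters (Linux only)."""
    with open(path) as f:
        return parse_proc_net_dev(f.read())


def report(interval=5.0):
    """Print the TX rate of every interface over `interval` seconds."""
    first = read_counters()
    time.sleep(interval)
    second = read_counters()
    for nic, rate in sorted(tx_gbps(first, second, interval).items()):
        print(f"{nic}: {rate:.2f} Gbit/s TX")
```

Calling `report()` on the sending node during the send / recv benchmark, and again during an all-reduce run, should show whether traffic is confined to `eth0` in one case and spread across NICs in the other.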