NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NICs on same subnet #1269

Open samsamoa opened 6 months ago

samsamoa commented 6 months ago

Is there a canonical way to use the Socket transport when all NICs are on the same subnet?

I'm running into the issue mentioned here, where all traffic is routed through a single interface on the client side: https://github.com/NVIDIA/nccl/issues/601#issuecomment-979088321

[edit: might have a solution, will post here if i can get it working]

sjeaugey commented 6 months ago

It's tricky. I think there is a way to set up routing tables so that the right interface is picked based on the source/destination IP, but it requires serious expertise in Linux networking and routing-table management.
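
For readers unfamiliar with the idea, here is a minimal sketch of what source-based policy routing could look like. It is not a confirmed NCCL recipe: the interface names, IP addresses, subnet, table numbers, and the helper `policy_routing_commands` are all hypothetical. The helper only builds standard iproute2 `ip route` / `ip rule` command strings, one routing table per NIC, and does not run them.

```python
import ipaddress


def policy_routing_commands(nics, subnet, first_table=100):
    """Build `ip` commands that give each NIC its own routing table.

    nics: list of (interface_name, ipv4_address) pairs (hypothetical values).
    subnet: the shared subnet in CIDR form, e.g. "10.0.0.0/24".
    first_table: number of the first routing table to use (an assumption;
                 pick numbers that are unused on the host).
    """
    network = ipaddress.ip_network(subnet)
    commands = []
    for offset, (iface, addr) in enumerate(nics):
        if ipaddress.ip_address(addr) not in network:
            raise ValueError(f"{addr} is not in {subnet}")
        table = first_table + offset
        # In this NIC's table, the shared subnet is reachable only via this NIC.
        commands.append(
            f"ip route add {network} dev {iface} src {addr} table {table}"
        )
        # Packets sent from this NIC's address are looked up in that table.
        commands.append(f"ip rule add from {addr} table {table}")
    return commands


if __name__ == "__main__":
    # Hypothetical two-NIC host; print the commands for review instead of running them.
    example_nics = [("eth0", "10.0.0.11"), ("eth1", "10.0.0.12")]
    for cmd in policy_routing_commands(example_nics, "10.0.0.0/24"):
        print(cmd)
```

The generated commands would need root privileges to apply. On hosts with several NICs in one subnet, the ARP-related sysctls `arp_ignore` and `arp_announce` may also need adjusting so that each interface answers only for its own address; whether this is enough for a given NCCL setup is untested here.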