wrong devices picked by NCCL

NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

I don't see a trace in the bug with NCCL_DEBUG=INFO so I can't say for sure which IP interface NCCL picked. If it did pick the vet* interface, I'd guess that's because those had an IPv6 address set.
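To get the trace mentioned above, the job needs to run with `NCCL_DEBUG=INFO` in its environment. Here is a minimal sketch of building such an environment in Python. The helper name `nccl_debug_env` and the `train.py` script in the usage comment are hypothetical; only the `NCCL_DEBUG` and `NCCL_SOCKET_IFNAME` variables come from NCCL itself.

```python
import os


def nccl_debug_env(base=None, ifname=None):
    """Return a copy of the environment with NCCL debug logging enabled.

    `base` defaults to the current process environment. `ifname`, when given,
    is passed through as NCCL_SOCKET_IFNAME to restrict interface selection.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"
    if ifname is not None:
        env["NCCL_SOCKET_IFNAME"] = ifname
    return env


# Hypothetical usage: launch the job with the debug environment.
# import subprocess
# subprocess.run(["python", "train.py"], env=nccl_debug_env(), check=True)
```

The resulting log should show which interface was picked, which is the piece of information missing from the report.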

NCCL has a very simple IP selection system: it just lists the interfaces that have an IP set (IPv4 or IPv6) and picks the first one that matches the filter, i.e. ^lo,docker by default, or whatever the user passed as NCCL_SOCKET_IFNAME.
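The selection described above can be sketched roughly as follows. This is an illustration in Python, not NCCL's actual code (which is C++). The function names and the example interface list are made up, and the filter handling (`^` to exclude, `=` for an exact match, prefix match otherwise) is a simplified reading of the documented `NCCL_SOCKET_IFNAME` syntax.

```python
def parse_filter(spec):
    """Split a filter such as "^lo,docker" into (exclude, exact, names)."""
    exclude = spec.startswith("^")
    if exclude:
        spec = spec[1:]
    exact = spec.startswith("=")
    if exact:
        spec = spec[1:]
    names = [name for name in spec.split(",") if name]
    return exclude, exact, names


def pick_interface(interfaces, spec="^lo,docker"):
    """Return the first interface name that passes the filter, or None.

    `interfaces` is an ordered list of names that already have an IPv4 or
    IPv6 address set (an assumption: the caller has done that listing).
    """
    exclude, exact, names = parse_filter(spec)
    for ifname in interfaces:
        if exact:
            hit = ifname in names
        else:
            hit = any(ifname.startswith(name) for name in names)
        # With "^", a hit means "skip"; without it, a hit means "use".
        if hit != exclude:
            return ifname
    return None


# Hypothetical listing order on a container host.
candidates = ["lo", "docker0", "veth1a2b", "eth0"]
print(pick_interface(candidates))           # default filter
print(pick_interface(candidates, "=eth0"))  # explicit user choice
```

With the default filter, the first call returns the virtual interface because it is listed before `eth0` and is not excluded. Passing an explicit filter, as in the second call, is the way to force the intended interface.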

I'm not sure why that behavior would have changed between NCCL versions, though. I don't recall making any changes to the IP interface selection logic recently.

wrong devices picked by NCCL #519