NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

wrong devices picked by NCCL #519

Open pseudotensor opened 3 years ago

pseudotensor commented 3 years ago

https://github.com/dmlc/xgboost/issues/7019

In particular, for network setup: https://github.com/dmlc/xgboost/issues/7019#issuecomment-855657869

NCCL wrongly uses a non-IP-based vet device instead of the one associated with the IP, but only on one system and not on another, very similar system.

As a workaround, I have to specify the interface names directly.

This is a regression from the NCCL associated with rapids 0.14 conda install, where this did not occur.

The problematic NCCL version is 2.8.3+cuda11.2.

sjeaugey commented 3 years ago

I don't see a trace in the bug with NCCL_DEBUG=INFO so I can't say for sure which IP interface NCCL picked. If it did pick the vet* interface, I'd guess that's because those had an IPv6 address set.
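For readers who want to capture such a trace, here is a minimal sketch of preparing the environment for a launched job. `NCCL_DEBUG` and `NCCL_SOCKET_IFNAME` are the variables named in this thread; the helper name `nccl_debug_env` and the interface name `eth0` are assumptions made purely for illustration.

```python
import os
from typing import Dict, Optional


def nccl_debug_env(base: Optional[Dict[str, str]] = None,
                   ifname: Optional[str] = None) -> Dict[str, str]:
    """Return a copy of an environment with NCCL info-level logging enabled.

    Hypothetical helper, not part of NCCL. If `ifname` is given, it is
    passed through as NCCL_SOCKET_IFNAME to pin the socket interface.
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"  # ask NCCL to print which interface it picked
    if ifname is not None:
        env["NCCL_SOCKET_IFNAME"] = ifname  # e.g. "eth0" (assumed name)
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the training process, so the trace lands in that process's output.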

NCCL has a very simple IP selection system: it just lists the interfaces with an IP set (IPv4 or IPv6) and picks the first one that matches the filtering, i.e. ^lo,docker by default, or what the user passed as NCCL_SOCKET_IFNAME.
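The selection rule described above can be sketched roughly as follows. This is an illustrative Python model, not NCCL's actual code. It assumes the documented filter syntax (comma-separated prefixes, a leading `^` to exclude, a leading `=` for exact names), and the interface names in the usage note are made up.

```python
from typing import Iterable, List, Optional, Tuple

DEFAULT_FILTER = "^lo,docker"  # default filter quoted in the comment above


def pick_interface(interfaces: Iterable[Tuple[str, bool]],
                   ifname_filter: str = DEFAULT_FILTER) -> Optional[str]:
    """Return the first interface that has an IP and passes the filter.

    `interfaces` is an ordered iterable of (name, has_ip) pairs, where
    has_ip is True if any IPv4 or IPv6 address is set on the interface.
    """
    spec = ifname_filter
    exclude = spec.startswith("^")  # "^..." means: skip matching names
    if exclude:
        spec = spec[1:]
    exact = spec.startswith("=")    # "=..." means: exact name, not prefix
    if exact:
        spec = spec[1:]
    names: List[str] = [p for p in spec.split(",") if p]

    for name, has_ip in interfaces:
        if not has_ip:
            continue  # interfaces without any address are never candidates
        if exact:
            hit = name in names
        else:
            hit = any(name.startswith(p) for p in names)
        if hit != exclude:  # keep matches, or non-matches when excluding
            return name
    return None
```

With a hypothetical list such as `[("lo", True), ("docker0", True), ("vetabc", True), ("eth0", True)]`, the default filter returns `"vetabc"` because it is the first non-excluded interface with an address, while passing `"eth"` as the filter returns `"eth0"`. If the vet interface had no address at all (`("vetabc", False)`), the default filter would also return `"eth0"`, which is consistent with the guess that an IPv6 address on the vet* interfaces explains the difference between the two systems.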

I'm not sure why that behavior would have changed between different NCCL versions though; I don't recall having made changes to the IP interface selection logic recently.