Open pseudotensor opened 3 years ago
I don't see a trace in the bug with NCCL_DEBUG=INFO, so I can't say for sure which IP interface NCCL picked. If it did pick the vet* interface, I'd guess that's because those had an IPv6 address set.
NCCL has a very simple IP selection system: it just lists the interfaces with an IP set (IPv4 or IPv6) and picks the first one that matches the filtering, i.e. ^lo,docker by default, or what the user passed as NCCL_SOCKET_IFNAME.
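To make the selection rule described above concrete, here is a minimal Python sketch. It is not NCCL's actual code: the function name `select_interface` and the interface lists are hypothetical, and the handling of the `^` (exclude) and `=` (exact match) prefixes follows my reading of the documented NCCL_SOCKET_IFNAME syntax, so treat it as an assumption.

```python
def select_interface(interfaces, ifname_filter="^lo,docker"):
    """Return the first interface name that passes the filter, or None.

    interfaces: ordered list of names of interfaces that have an IPv4 or
    IPv6 address set (hypothetical input; NCCL enumerates these itself).
    ifname_filter: a NCCL_SOCKET_IFNAME-style string (assumed syntax).
    """
    # A leading "^" means "exclude interfaces matching the list".
    exclude = ifname_filter.startswith("^")
    spec = ifname_filter[1:] if exclude else ifname_filter

    # A leading "=" means exact name match instead of prefix match.
    exact = spec.startswith("=")
    if exact:
        spec = spec[1:]

    patterns = [p for p in spec.split(",") if p]

    def matches(name):
        if exact:
            return name in patterns
        return any(name.startswith(p) for p in patterns)

    # Pick the first interface whose match result agrees with the filter mode.
    for name in interfaces:
        if matches(name) != exclude:
            return name
    return None


# With the default filter, a virtual device listed before the real NIC wins.
print(select_interface(["lo", "docker0", "veth1a2b", "eth0"]))         # veth1a2b
# Passing an explicit prefix selects the intended interface instead.
print(select_interface(["lo", "docker0", "veth1a2b", "eth0"], "eth"))  # eth0
```

Under these assumptions, any interface that carries an address and is not excluded can be chosen simply because it is listed first, which would be consistent with the behavior reported in the issue.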
I'm not sure why that behavior would have changed between different NCCL versions though; I don't recall having made changes to the IP interface selection logic recently.
https://github.com/dmlc/xgboost/issues/7019
In particular, for network setup: https://github.com/dmlc/xgboost/issues/7019#issuecomment-855657869
NCCL is wrongly using a non-IP-based vet device instead of the one associated with the IP, but it does so on only one system and not on another, very similar system. I have to specify the interface names directly.
This is a regression from the NCCL associated with rapids 0.14 conda install, where this did not occur.
The problematic NCCL version is NCCL 2.8.3+cuda11.2.
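The workaround mentioned above, specifying the interface names directly, could look roughly like this in Python. The interface name `eth0` is a placeholder, not taken from the report; substitute whichever interface carries the host's IP. The variables need to be set before NCCL is initialized in the process.

```python
import os

# Placeholder interface name (assumption): use the interface that actually
# holds the host's IP address on the affected machine.
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"

# Ask NCCL for informational logs, so the trace can confirm which
# interface was selected during initialization.
os.environ["NCCL_DEBUG"] = "INFO"
```

Setting these in the launching shell or the job's environment should work equally well, as long as every worker process inherits them.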