Open zhangciba opened 1 year ago
The error here is due to MPI struggling to find the right interface. And it's not even easy to tell MPI which interface to use, given that the interface names differ from node to node.
Now even if you got past the MPI issue, I'm not sure how you'd make NCCL communicate across the right interfaces. There is no way to specify a particular network connectivity, and even if we had that information, I'm not sure how we'd be supposed to communicate between GPU X on one node and GPU Y on another node if their NICs are not connected to each other.
So, you should really make sure nodes are identical, and all NICs can talk to all others.
I have two nodes, named nodea and nodeb; both have 8 A800 GPUs.
nodea has 5 RoCE network interfaces: xgbe0 for CPU, xgbe2/4/6/8 for GPU
nodeb has 5 RoCE network interfaces: xgbe4 for CPU, xgbe0/2/6/8 for GPU
The network connectivity is as follows:
nodea xgbe0 <-> nodeb xgbe4
nodea xgbe2/6 <-> nodeb xgbe0/6
nodea xgbe4/8 <-> nodeb xgbe2/8
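To make the mismatch concrete, here is a minimal Python sketch that encodes the connectivity listed above and checks it against the "all NICs can talk to all others" requirement. It rests on an assumption: a line such as "xgbe2/6 <-> xgbe0/6" is read as "each listed nodea NIC can reach each listed nodeb NIC", which the post does not state explicitly. The names `LINKS`, `reachable`, `same_name_mismatches` and `fully_connected` are hypothetical helpers for illustration, not part of NCCL or MPI.

```python
# Connectivity as listed in the post: (nodea NICs, nodeb NICs) per link group.
# Assumption: every nodea NIC in a group can reach every nodeb NIC in that group.
LINKS = [
    ({"xgbe0"}, {"xgbe4"}),
    ({"xgbe2", "xgbe6"}, {"xgbe0", "xgbe6"}),
    ({"xgbe4", "xgbe8"}, {"xgbe2", "xgbe8"}),
]


def reachable(nic_a, nic_b, links=LINKS):
    """Return True if nodea's nic_a is wired to nodeb's nic_b."""
    return any(nic_a in a and nic_b in b for a, b in links)


def same_name_mismatches(links=LINKS):
    """List nodea NICs whose same-named NIC on nodeb is not reachable."""
    nics_a = sorted(set().union(*(a for a, _ in links)))
    return [nic for nic in nics_a if not reachable(nic, nic, links)]


def fully_connected(links=LINKS):
    """Return True only if every nodea NIC can reach every nodeb NIC."""
    nics_a = set().union(*(a for a, _ in links))
    nics_b = set().union(*(b for _, b in links))
    return all(reachable(a, b, links) for a in nics_a for b in nics_b)
```

Under that reading, `same_name_mismatches()` reports the interfaces whose names do not line up across the two nodes, and `fully_connected()` is false, so this layout does not satisfy the requirement.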
On nodea I set: export NCCL_SOCKET_IFNAME=xgbe4
On nodeb I set: export NCCL_SOCKET_IFNAME=xgbe0
and I am running the NCCL test on these two nodes
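One way to keep such per-node settings in a single place is a small lookup keyed by hostname, sketched below in Python. The mapping holds the values quoted above; `SOCKET_IFNAME_BY_HOST` and `nccl_env` are hypothetical names used only for illustration. This only prepares the environment. It does not by itself solve MPI's interface selection or NCCL's routing between mismatched NICs.

```python
import os

# Per-node values as quoted in the post (hypothetical table for illustration).
SOCKET_IFNAME_BY_HOST = {
    "nodea": "xgbe4",
    "nodeb": "xgbe0",
}


def nccl_env(hostname, base_env=None):
    """Return a copy of the environment with NCCL_SOCKET_IFNAME set for this host."""
    env = dict(os.environ if base_env is None else base_env)
    short_name = hostname.split(".")[0]  # tolerate fully qualified hostnames
    if short_name not in SOCKET_IFNAME_BY_HOST:
        raise KeyError(f"no interface configured for host {short_name!r}")
    env["NCCL_SOCKET_IFNAME"] = SOCKET_IFNAME_BY_HOST[short_name]
    return env
```

A wrapper started on each node could call `nccl_env(socket.gethostname())` and pass the result as `env=` to `subprocess.run` when starting the test binary.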
node_nums=2, and the IP file content is:
and I get the result:
Can NCCL use different network cards to communicate?