Open flyree opened 2 years ago
It seems you're running on 4 nodes. NCCL by default picked the ib0 interface, which has IPs 172.17.120.71 to 172.17.120.74, but those interfaces apparently can't talk to each other: the log shows dlv04 failing to connect to 172.17.120.71 with "No route to host". If you ping 172.17.120.71 from dlv04, it will probably fail.
So you should either fix that so your nodes can communicate with each other over ib0, or set NCCL_SOCKET_IFNAME to another interface that works.
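A quick way to check the connectivity described above from each node is a small TCP probe. This is a minimal sketch, not part of NCCL: the function name `route_status`, the default port 22 (assuming an SSH daemon is listening on the peers), and the 2-second timeout are all assumptions made for illustration.

```python
import socket


def route_status(host: str, port: int = 22, timeout: float = 2.0) -> str:
    """Classify a TCP connection attempt to host:port.

    Returns:
      "open"        - the connection succeeded
      "refused"     - the host answered but nothing listens on that port,
                      so a route to the host exists
      "unreachable" - timeout, "No route to host", or another network error
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        # Covers timeouts, EHOSTUNREACH ("No route to host"), DNS failures, etc.
        return "unreachable"


# Example (run on each node; the addresses are the ib0 IPs from the log):
#   for ip in ("172.17.120.71", "172.17.120.72", "172.17.120.73", "172.17.120.74"):
#       print(ip, route_status(ip))
```

Running it on dlv04 against 172.17.120.71 should report "unreachable" if the ib0 route is indeed broken. If another interface does work, NCCL_SOCKET_IFNAME can be exported to every rank, for example with Open MPI's `-x` option (`mpirun -x NCCL_SOCKET_IFNAME=eth0 ...`, where `eth0` is only a placeholder for whatever interface works on your cluster).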
Hi,
I encountered this error:
ncclGroupEnd() : 0002 unhandled system error
when I attempted to run a job across 4 GPU nodes with OpenMPI 4.1.2. MPI itself is working (I tested it with an MPI-only job).
I compiled NCCL from the source code downloaded directly from the GitHub repo, and my CUDA version is 11.1.
Here is the debug info printed:
dlv01:25797:25797 [0] NCCL INFO Bootstrap : Using ib0:172.17.120.71<0>
dlv01:25797:25797 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
dlv01:25797:25797 [0] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB ib0:172.17.120.71<0>
dlv01:25797:25797 [0] NCCL INFO Using network IB
NCCL version 2.11.4+cuda11.1
dlv04:8736:8736 [0] NCCL INFO Bootstrap : Using ib0:172.17.120.74<0>
dlv03:14376:14376 [0] NCCL INFO Bootstrap : Using ib0:172.17.120.73<0>
dlv02:27184:27184 [0] NCCL INFO Bootstrap : Using ib0:172.17.120.72<0>
dlv04:8736:8736 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
dlv03:14376:14376 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
dlv02:27184:27184 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
dlv04:8736:8736 [0] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB ib0:172.17.120.74<0>
dlv04:8736:8736 [0] NCCL INFO Using network IB
dlv03:14376:14376 [0] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB ib0:172.17.120.73<0>
dlv03:14376:14376 [0] NCCL INFO Using network IB
dlv02:27184:27184 [0] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB ib0:172.17.120.72<0>
dlv02:27184:27184 [0] NCCL INFO Using network IB
ncclGroupEnd() in file 0002 unhandled system error
dlv04:8736:8745 [0] include/socket.h:409 NCCL WARN Net : Connect to 172.17.120.71<59941> failed : No route to host
dlv04:8736:8745 [0] NCCL INFO bootstrap.cc:360 -> 2
dlv04:8736:8745 [0] NCCL INFO init.cc:525 -> 2
dlv04:8736:8745 [0] NCCL INFO init.cc:941 -> 2
dlv04:8736:8745 [0] NCCL INFO group.cc:72 -> 2 [Async thread]
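In a longer debug log, the failing connection can be picked out mechanically. The following is a rough sketch: the regular expression is inferred only from the WARN line shown above, not from any documented NCCL log format, and the names `FAILED_CONNECT` and `find_failed_connects` are hypothetical helpers.

```python
import re
from typing import Dict, List

# Pattern inferred from the single WARN line in this issue (an assumption,
# not an official format): "<host>:<pid>:<tid> [<dev>] <file:line> NCCL WARN
# Net : Connect to <ip><<port>> failed : <reason>".
FAILED_CONNECT = re.compile(
    r"(?P<host>[\w.-]+):(?P<pid>\d+):(?P<tid>\d+) \[\d+\] \S+ NCCL WARN Net : "
    r"Connect to (?P<peer>[\d.]+)<(?P<port>\d+)> failed : "
    # The reason ends at the next "<host>:<pid>:<tid> [<dev>]" prefix or at
    # end of line, so flattened single-line logs are handled as well.
    r"(?P<reason>.+?)(?=\s+[\w.-]+:\d+:\d+ \[\d+\]|$)",
    re.MULTILINE,
)


def find_failed_connects(log_text: str) -> List[Dict[str, str]]:
    """Return one dict per failed connection attempt found in the log text."""
    return [m.groupdict() for m in FAILED_CONNECT.finditer(log_text)]
```

Applied to the log above, this would single out dlv04 as the node that could not reach 172.17.120.71, which is the pair of hosts worth testing first.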