NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

ncclGroupEnd() error #621

Open flyree opened 2 years ago

flyree commented 2 years ago

Hi,

I encountered the following error when I attempted to run a job across 4 GPU nodes with OpenMPI 4.1.2:

ncclGroupEnd() : 0002 unhandled system error

MPI itself is working (I tested it with an MPI-only job).

I compiled NCCL from the source code downloaded directly from the GitHub repo, and my CUDA version is 11.1.

Here is the debug info printed:


dlv01:25797:25797 [0] NCCL INFO Bootstrap : Using ib0:172.17.120.71<0>
dlv01:25797:25797 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
dlv01:25797:25797 [0] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB ib0:172.17.120.71<0>
dlv01:25797:25797 [0] NCCL INFO Using network IB
NCCL version 2.11.4+cuda11.1
dlv04:8736:8736 [0] NCCL INFO Bootstrap : Using ib0:172.17.120.74<0>
dlv03:14376:14376 [0] NCCL INFO Bootstrap : Using ib0:172.17.120.73<0>
dlv02:27184:27184 [0] NCCL INFO Bootstrap : Using ib0:172.17.120.72<0>
dlv04:8736:8736 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
dlv03:14376:14376 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
dlv02:27184:27184 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
dlv04:8736:8736 [0] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB ib0:172.17.120.74<0>
dlv04:8736:8736 [0] NCCL INFO Using network IB
dlv03:14376:14376 [0] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB ib0:172.17.120.73<0>
dlv03:14376:14376 [0] NCCL INFO Using network IB
dlv02:27184:27184 [0] NCCL INFO NET/IB : Using [0]mlx4_0:1/IB ; OOB ib0:172.17.120.72<0>
dlv02:27184:27184 [0] NCCL INFO Using network IB
ncclGroupEnd() in file 0002 unhandled system error

dlv04:8736:8745 [0] include/socket.h:409 NCCL WARN Net : Connect to 172.17.120.71<59941> failed : No route to host
dlv04:8736:8745 [0] NCCL INFO bootstrap.cc:360 -> 2
dlv04:8736:8745 [0] NCCL INFO init.cc:525 -> 2
dlv04:8736:8745 [0] NCCL INFO init.cc:941 -> 2
dlv04:8736:8745 [0] NCCL INFO group.cc:72 -> 2 [Async thread]

sjeaugey commented 2 years ago

It seems you're running on 4 nodes. NCCL by default uses the ib0 interface, which has IPs 172.17.120.71 to 172.17.120.74. Now it seems those interfaces can't talk to each other. If you ping 172.17.120.71 from dlv04 it will probably fail.
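Since the failing call in the log is a TCP connect rather than an ICMP ping, it may also help to test TCP reachability directly. Here is a minimal Python sketch of such a check. The helper names `connect_error` and `check_peers` are made up for illustration, and the port is an assumption: the port 59941 in the log is an ephemeral one chosen by NCCL for that run, so you would need some port known to be listening on the peer (for example 22, if sshd is bound to the ib0 address).

```python
import socket
from typing import List, Optional


def connect_error(host: str, port: int, timeout: float = 3.0) -> Optional[str]:
    """Try to open a TCP connection; return None on success, else the OS error text."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as exc:
        # "No route to host", "Connection refused", timeouts, etc. all land here.
        return str(exc)


def check_peers(peers: List[str], port: int) -> None:
    """Print one reachability line per peer address (hypothetical helper)."""
    for ip in peers:
        err = connect_error(ip, port)
        status = "reachable" if err is None else "FAILED: " + err
        print(ip + ":" + str(port) + " " + status)
```

Running something like `check_peers(["172.17.120.71", "172.17.120.72", "172.17.120.73"], 22)` on dlv04 would show whether the ib0 addresses accept TCP connections at all. A "Connection refused" result still means the route works; "No route to host" points at routing or a firewall on that interface.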

So you should either fix that and allow your nodes to communicate with each other using ib0, or you should set NCCL_SOCKET_IFNAME to another interface which works.
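For the second option, the variable has to be set in the environment of every rank, not only on the node where mpirun is started. Below is a rough Python sketch that builds an OpenMPI command line forwarding it with `-x`. The interface name `eth0`, the process count, and the binary name `./my_nccl_app` are placeholders, not values from this issue; use whichever interface actually routes between your nodes, plus your own host list.

```python
from typing import List


def mpirun_command(ifname: str, app: str, num_procs: int = 4) -> List[str]:
    """Build an OpenMPI launch command that exports NCCL settings to all ranks."""
    return [
        "mpirun",
        "-np", str(num_procs),
        # -x NAME=value exports the variable to the remote processes.
        "-x", "NCCL_SOCKET_IFNAME=" + ifname,
        # Keep debug output on so the "Bootstrap : Using ..." line can be verified.
        "-x", "NCCL_DEBUG=INFO",
        app,
    ]


# Placeholder values; replace with your own interface and binary.
cmd = mpirun_command("eth0", "./my_nccl_app")
print(" ".join(cmd))
```

After relaunching with the printed command, the `NCCL INFO Bootstrap : Using ...` lines should name the new interface instead of ib0.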