NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

NCCL test fails on 8 x V100: misc/socket.cc:483 NCCL WARN socketStartConnect: Connect to xxx failed : Software caused connection abort #162

Closed hacker-jerry closed 1 year ago

hacker-jerry commented 1 year ago

Hi, I am getting an error while using nccl test.

Versions: NCCL 2.18.3, CUDA 11.1.

NCCL_DEBUG=INFO NCCL_IBEXT_DISABLE=1 NCCL_IB_DISABLE=1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  47853 on    beijing device  0 [0x1a] NVIDIA Tesla V100-PCIE-32GB
#  Rank  1 Group  0 Pid  47853 on    beijing device  1 [0x1b] NVIDIA Tesla V100-PCIE-32GB
#  Rank  2 Group  0 Pid  47853 on    beijing device  2 [0x3d] NVIDIA Tesla V100-PCIE-32GB
#  Rank  3 Group  0 Pid  47853 on    beijing device  3 [0x3e] NVIDIA Tesla V100-PCIE-32GB
#  Rank  4 Group  0 Pid  47853 on    beijing device  4 [0x88] NVIDIA Tesla V100-PCIE-32GB
#  Rank  5 Group  0 Pid  47853 on    beijing device  5 [0x89] NVIDIA Tesla V100-PCIE-32GB
#  Rank  6 Group  0 Pid  47853 on    beijing device  6 [0xb1] NVIDIA Tesla V100-PCIE-32GB
#  Rank  7 Group  0 Pid  47853 on    beijing device  7 [0xb2] NVIDIA Tesla V100-PCIE-32GB
beijing:47853:47853 [0] NCCL INFO Bootstrap : Using eno1:avahi:xxx.xxx.xxx.xxx<0>
beijing:47853:47853 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
beijing:47853:47853 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
beijing:47853:47853 [7] NCCL INFO cudaDriverVersion 11030
NCCL version 2.18.3+cuda11.1
beijing:47853:47912 [7] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
beijing:47853:47912 [7] NCCL INFO NET/Socket : Using [0]eno1:avahi:xxx.xxx.xxx.xxx<0> [1]eno2:xxx.xxx.xxx.xxx<0> [2]br-9df229593f2c:172.21.0.1<0> [3]br-a4438331b2aa:172.18.0.1<0> [4]br-a45ba72c1a15:172.24.0.1<0> [5]br-dbd02fdc9737:172.20.0.1<0> [6]br-0c73c6652639:192.168.1.1<0> [7]veth95ad967:fexxxxxad967<0> [8]vethbf5be72:fexxxxbf5be72<0> [9]vetha340738xxxxtha340738<0> [10]veth6aef8ab:xxxxx6aef8ab<0> [11]veth3fb926axxxx%veth3fb926a<0> [12]vethb4fdxxxxxxxaaf<0> [13]vethaf919a7xxxxvethaf919a7<0> [14]vethf341af8:fexxxxxf341af8<0>
beijing:47853:47912 [7] NCCL INFO Using network Socket
beijing:47853:47905 [0] NCCL INFO Using network Socket
beijing:47853:47907 [2] NCCL INFO Using network Socket
beijing:47853:47906 [1] NCCL INFO Using network Socket
beijing:47853:47908 [3] NCCL INFO Using network Socket
beijing:47853:47909 [4] NCCL INFO Using network Socket
beijing:47853:47910 [5] NCCL INFO Using network Socket
beijing:47853:47911 [6] NCCL INFO Using network Socket

beijing:47853:47912 [7] misc/socket.cc:483 NCCL WARN socketStartConnect: Connect to 169.254.6.81<50301> failed : Software caused connection abort

beijing:47853:47905 [0] misc/socket.cc:483 NCCL WARN socketStartConnect: Connect to 169.254.6.81<50301> failed : Software caused connection abort
beijing:47853:47905 [0] NCCL INFO misc/socket.cc:564 -> 2
beijing:47853:47912 [7] NCCL INFO misc/socket.cc:564 -> 2
beijing:47853:47905 [0] NCCL INFO misc/socket.cc:618 -> 2
beijing:47853:47905 [0] NCCL INFO bootstrap.cc:270 -> 2
beijing:47853:47912 [7] NCCL INFO misc/socket.cc:618 -> 2
beijing:47853:47912 [7] NCCL INFO bootstrap.cc:270 -> 2
beijing:47853:47912 [7] NCCL INFO init.cc:1350 -> 2
beijing:47853:47905 [0] NCCL INFO init.cc:1350 -> 2
beijing:47853:47912 [7] NCCL INFO group.cc:65 -> 2 [Async thread]
beijing:47853:47905 [0] NCCL INFO group.cc:65 -> 2 [Async thread]
beijing:47853:47906 [1] NCCL INFO bootstrap.cc:270 -> 3
beijing:47853:47906 [1] NCCL INFO init.cc:1350 -> 3
beijing:47853:47906 [1] NCCL INFO group.cc:65 -> 3 [Async thread]
beijing:47853:47909 [4] NCCL INFO bootstrap.cc:270 -> 3
beijing:47853:47909 [4] NCCL INFO init.cc:1350 -> 3
beijing:47853:47907 [2] NCCL INFO bootstrap.cc:270 -> 3
beijing:47853:47907 [2] NCCL INFO init.cc:1350 -> 3
beijing:47853:47909 [4] NCCL INFO group.cc:65 -> 3 [Async thread]
beijing:47853:47907 [2] NCCL INFO group.cc:65 -> 3 [Async thread]
beijing:47853:47908 [3] NCCL INFO bootstrap.cc:270 -> 3
beijing:47853:47908 [3] NCCL INFO init.cc:1350 -> 3
beijing:47853:47908 [3] NCCL INFO group.cc:65 -> 3 [Async thread]
beijing:47853:47911 [6] NCCL INFO bootstrap.cc:270 -> 3
beijing:47853:47911 [6] NCCL INFO init.cc:1350 -> 3
beijing:47853:47911 [6] NCCL INFO group.cc:65 -> 3 [Async thread]
beijing:47853:47910 [5] NCCL INFO bootstrap.cc:270 -> 3
beijing:47853:47910 [5] NCCL INFO init.cc:1350 -> 3
beijing:47853:47910 [5] NCCL INFO group.cc:65 -> 3 [Async thread]
beijing:47853:47853 [7] NCCL INFO group.cc:406 -> 2
beijing:47853:47853 [7] NCCL INFO group.cc:96 -> 2
beijing:47853:47853 [7] NCCL INFO init.cc:1691 -> 2
beijing: Test NCCL failure common.cu:951 'unhandled system error (run with NCCL_DEBUG=INFO for details)'
 .. beijing pid 47853: Test failure common.cu:842

I noticed that my NCCL INFO NET/Socket output is not quite the same as what others get. How can I fix the communication problem here?

Thanks!

sjeaugey commented 1 year ago

Is that running on WSL2? The "Software caused connection abort" error message is a Windows error IIRC.

Did you check that your NICs (eno1) can talk to each other on any port, like 50301 (in the log)?

hacker-jerry commented 1 year ago

> Is that running on WSL2? The "Software caused connection abort" error message is a Windows error IIRC.
>
> Did you check that your NICs (eno1) can talk to each other on any port, like 50301 (in the log)?

Sorry, I forgot to mention that my platform is Ubuntu 16.04.

hacker-jerry commented 1 year ago

BTW, my machine is a single node, why are there so many extra ports here in NCCL INFO NET/Socket?

Using [0]eno1:avahi:xxx.xxx.xxx.xxx<0> [1]eno2:xxx.xxx.xxx.xxx<0> [2]br-9df229593f2c:172.21.0.1<0> [3]br-a4438331b2aa:172.18.0.1<0> [4]br-a45ba72c1a15:172.24.0.1<0> [5]br-dbd02fdc9737:172.20.0.1<0> [6]br-0c73c6652639:192.168.1.1<0> [7]veth95ad967:fexxxxxad967<0> [8]vethbf5be72:fexxxxbf5be72<0> [9]vetha340738xxxxtha340738<0> [10]veth6aef8ab:xxxxx6aef8ab<0> [11]veth3fb926axxxx%veth3fb926a<0> [12]vethb4fdxxxxxxxaaf<0> [13]vethaf919a7xxxxvethaf919a7<0> [14]vethf341af8:fexxxxxf341af8<0>

sjeaugey commented 1 year ago

> Sorry, I forgot to mention that my platform is Ubuntu 16.04.

Ah, my bad then, I guess that error message is not OS-specific. Still, it means what it means: some software (typically a firewall) prevented the network communication.

> my machine is a single node, why are there so many extra ports here in NCCL INFO NET/Socket?

Even single-node, NCCL uses sockets to communicate between ranks (we don't have a different implementation for single and multi-node). Just pick one that works intra-node, even lo if needed, and set it as NCCL_SOCKET_IFNAME.
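
As a rough sketch of how one might shortlist candidates for NCCL_SOCKET_IFNAME on a Linux host like the one in the log: the helper names list_ifaces and filter_ifaces are made up for illustration, and the exclusion pattern (br-*, docker*, veth*) is an assumption based on the Docker bridge and veth interfaces visible in the NET/Socket line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one network interface name per line (Linux only).
list_ifaces() {
  for path in /sys/class/net/*; do
    [ -e "$path" ] && basename "$path"
  done
  return 0
}

# Hypothetical helper: drop Docker bridges and container veth pairs.
# grep exits non-zero when every line is filtered out, hence the "|| true".
filter_ifaces() {
  grep -Ev '^(br-|docker|veth)' || true
}

# Print the remaining candidates; pick one that works intra-node.
list_ifaces | filter_ifaces
```

The chosen name is then exported as NCCL_SOCKET_IFNAME before launching the test. NCCL treats the value as a prefix match, and its documentation also describes a leading ^ to exclude interfaces and a leading = to require an exact name.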

hacker-jerry commented 1 year ago

@sjeaugey Thank you for your reply. How do I determine what is preventing the network communication? I am not familiar with firewalls, so how do I check whether the eno devices can communicate with each other?

hacker-jerry commented 1 year ago

Or can I set up a way to communicate between different ports on a single NIC for nccl testing by modifying environment variables?

sjeaugey commented 1 year ago

You can call nc -l 12345 on one node and nc <node1> 12345 on the other node.

You can also try to disable the firewall to see if it works better: sudo ufw disable seems to be the first answer when googling for "disable ubuntu16 firewall".
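
The nc check needs a listener and a client. For the client side, a single-command variant is sketched below, assuming bash (for its /dev/tcp redirection) and the coreutils timeout command are available. The name can_connect is a hypothetical helper, and nothing in it requires root access.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a TCP connection to host:port can be
# opened within 2 seconds. Relies on bash's /dev/tcp redirection.
can_connect() {
  local host="$1" port="$2"
  timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Example (start a listener first, e.g. `nc -l 12345 &`):
#   can_connect 127.0.0.1 12345 && echo "reachable" || echo "blocked"
```

Running the same check against each interface address from the NET/Socket line shows which ones accept connections; an address that stays blocked is not a usable choice for NCCL_SOCKET_IFNAME.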

hacker-jerry commented 1 year ago

> You can call nc -l 12345 on one node and nc <node1> 12345 on the other node.
>
> You can also try to disable the firewall to see if it works better: sudo ufw disable seems to be the first answer when googling for "disable ubuntu16 firewall".

Thanks @sjeaugey, I tested it: only communication between different ports on 127.0.0.1 is allowed, and I don't have root access, so I can't change the rules for the other network segments. Can NCCL be told which IP address to use for communication, for example 127.0.0.1?

hacker-jerry commented 1 year ago

> NET/Socket : Using

I got it. Just set NCCL_SOCKET_IFNAME=lo!
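
Put together with the command from the original report, the fix might look like the sketch below. It assumes nccl-tests was built under ./build as in that report; the guard around the binary is only there so the snippet is harmless when run elsewhere.

```shell
#!/usr/bin/env bash
# Keep NCCL's socket traffic on the loopback interface.
export NCCL_SOCKET_IFNAME=lo

# Same settings as in the original report.
export NCCL_DEBUG=INFO
export NCCL_IBEXT_DISABLE=1
export NCCL_IB_DISABLE=1

# Same benchmark invocation as in the original report; skipped when the
# nccl-tests binary has not been built in the current directory.
if [ -x ./build/all_reduce_perf ]; then
  ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
fi
```

With NCCL_DEBUG=INFO still set, the Bootstrap and NET/Socket lines in the output should then name lo instead of eno1:avahi, which is a quick way to confirm the variable took effect.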