Closed hacker-jerry closed 1 year ago
Is that running on WSL2? The "Software caused connection abort" error message is a Windows error, IIRC.
Did you check that your NICs (eno1) can talk to each other on an arbitrary port such as 50301 (the one in the log)?
Sorry, I forgot to mention that my platform is Ubuntu 16.04.
BTW, my machine is a single node, why are there so many extra ports here in NCCL INFO NET/Socket?
Using [0]eno1:avahi:xxx.xxx.xxx.xxx<0> [1]eno2:xxx.xxx.xxx.xxx<0> [2]br-9df229593f2c:172.21.0.1<0> [3]br-a4438331b2aa:172.18.0.1<0> [4]br-a45ba72c1a15:172.24.0.1<0> [5]br-dbd02fdc9737:172.20.0.1<0> [6]br-0c73c6652639:192.168.1.1<0> [7]veth95ad967:fexxxxxad967<0> [8]vethbf5be72:fexxxxbf5be72<0> [9]vetha340738xxxxtha340738<0> [10]veth6aef8ab:xxxxx6aef8ab<0> [11]veth3fb926axxxx%veth3fb926a<0> [12]vethb4fdxxxxxxxaaf<0> [13]vethaf919a7xxxxvethaf919a7<0> [14]vethf341af8:fexxxxxf341af8<0>
Ah, my bad then, I guess that error message is not OS-specific. Still, it means what it means: some software (typically a firewall) prevented the network communication.
my machine is a single node, why are there so many extra ports here in NCCL INFO NET/Socket?
Even on a single node, NCCL uses sockets to communicate between ranks (we don't have a different implementation for single and multi-node). Just pick one that works intra-node, even lo if needed, and set it as NCCL_SOCKET_IFNAME.
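To see which interface names are available to pick from, something like the following may help. It is only a sketch: it assumes a Linux system where /sys/class/net is mounted, and list_ifaces is a hypothetical helper name, not part of NCCL.

```shell
# Sketch: print each network interface name with its link state.
# Assumes Linux with /sys/class/net; list_ifaces is a hypothetical helper.
list_ifaces() {
    for dev in /sys/class/net/*; do
        name=${dev##*/}
        # operstate is typically "up", "down" or "unknown" (lo usually reports "unknown").
        state=$(cat "$dev/operstate" 2>/dev/null) || state=unknown
        echo "$name $state"
    done
}

list_ifaces
```

Any name printed here (eno1, lo, and so on) is a candidate value for NCCL_SOCKET_IFNAME; the br-* and veth* entries in the log above typically come from Docker networks.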
@sjeaugey Thank you for your reply, how do I determine what is preventing the network from communicating? I am not familiar with firewalls, how do I determine if these devices eno can communicate with each other?
Or can I set up a way to communicate between different ports on a single NIC for nccl testing by modifying environment variables?
You can call nc -l 12345 on one node and nc <node1> 12345 on the other node.
You can also try to disable the firewall to see if it works better: sudo ufw disable seems to be the first answer when googling for "disable ubuntu16 firewall".
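If nc is not installed on the connecting side, bash can make the same client-side connection attempt on its own. The sketch below assumes a bash build with /dev/tcp redirections enabled (the default on Ubuntu, as far as I know) and the coreutils timeout command; port_open is a hypothetical helper name, and 12345 is just the port used above.

```shell
# Sketch: succeed (exit 0) if a TCP connection to <host> <port> can be opened.
# port_open is a hypothetical helper; it relies on bash's /dev/tcp feature.
port_open() {
    host=$1
    port=$2
    # bash interprets /dev/tcp/<host>/<port> in a redirection as "open a TCP
    # connection"; timeout gives up after 2 seconds if packets are silently dropped.
    timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}
```

With nc -l 12345 still listening on the other side, `port_open <node1> 12345 && echo reachable || echo blocked` reports whether the connection went through. The same call with 127.0.0.1 or the eno1 address as the host checks a single machine against itself.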
Thanks @sjeaugey, I tested it: only communication between different ports on 127.0.0.1 is allowed, and I don't have root access, so I can't change the rules for the other network segments. Can NCCL's communication be made to use a specific IP, for example 127.0.0.1?
NET/Socket : Using
I got it. Just set NCCL_SOCKET_IFNAME=lo!
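For anyone landing here with the same problem, the fix can be sketched as below. The all_reduce_perf path, its size arguments, and the GPU count in the comment are assumptions for illustration, not taken from this thread.

```shell
# Restrict NCCL's socket transport to the loopback interface.
# This only makes sense for single-node runs.
export NCCL_SOCKET_IFNAME=lo
# Keep INFO logging on so the "NET/Socket : Using" line shows which interface was picked.
export NCCL_DEBUG=INFO

# Then launch the test as before, for example (path and GPU count are assumptions):
# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
```

If the setting took effect, the NET/Socket line in the log should list only lo instead of the eno, br-* and veth* entries.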
Hi, I am getting an error while using nccl test.
The versions are: NCCL 2.18.3, CUDA 11.1.
I noticed that the output at NCCL INFO NET/Socket is not quite the same as others. How can I fix the communication problem here?
Thanks!