Closed isaacgerg closed 2 months ago
You're running on a single GPU, so I'm not sure what you're trying to do, but it looks like the NIC enp3s0f0 is not letting NCCL communicate. If you don't intend to run across multiple nodes, maybe try setting NCCL_SOCKET_IFNAME=lo?
@sjeaugey My intent was to demonstrate that even on a single GPU, the test hangs. It also hangs when more than 1 GPU is specified.
I tried your suggestion and it still hangs. Here is the output:
:~/nccl-tests$ export NCCL_SOCKET_IFNAME=lo
:~/nccl-tests$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 70918 on NAME device 0 [0x01] NVIDIA RTX A6000
NAME :70918:70918 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
NAME :70918:70918 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
NAME :70918:70918 [0] NCCL INFO NET/Plugin: Using internal network plugin.
NAME :70918:70918 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
NAME :70918:70928 [0] NCCL INFO NET/IB : No device found.
NAME :70918:70928 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
NAME :70918:70928 [0] NCCL INFO Using non-device net plugin version 0
NAME :70918:70928 [0] NCCL INFO Using network Socket
NAME :70918:70927 [0] misc/socket.cc:50 NCCL WARN socketProgress: Connection closed by remote peer localhost<56086>
NAME :70918:70927 [0] NCCL INFO misc/socket.cc:752 -> 6
NAME :70918:70927 [0] NCCL INFO misc/socket.cc:428 -> 6
NAME :70918:70927 [0] NCCL INFO misc/socket.cc:564 -> 6
NAME :70918:70927 [0] NCCL INFO misc/socket.cc:668 -> 6
NAME :70918:70927 [0] NCCL INFO bootstrap.cc:129 -> 6
Alright, understood.
Would you have a very aggressive firewall that would prevent any network communication, even through lo?
@sjeaugey We have no firewall on the machine. Is there more debug I could provide you to help diagnose?
Mysterious... Given that it's reproducible within a single process, I would run all_reduce_perf under strace -f to get some insight into what might be going on with these socket operations... Brace yourself for a lot of output...
@kiskra-nvidia You were not kidding. Is there something specific I should grep for? Here are some lines around the warning we receive regarding socketProgress.
[pid 78857] <... accept resumed>0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78851] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=1000}, <unfinished ...>
[pid 78856] lseek(67, 0, SEEK_CUR <unfinished ...>
[pid 78857] accept(66, <unfinished ...>
[pid 78856] <... lseek resumed>) = 0
[pid 78857] <... accept resumed>0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78856] fstat(67, <unfinished ...>
[pid 78857] accept(66, <unfinished ...>
[pid 78856] <... fstat resumed>{st_mode=S_IFREG|0644, st_size=247, ...}) = 0
[pid 78857] <... accept resumed>0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78851] <... clock_nanosleep resumed>NULL) = 0
[pid 78857] accept(66, <unfinished ...>
[pid 78856] read(67, <unfinished ...>
[pid 78851] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=1000}, <unfinished ...>
[pid 78857] <... accept resumed>0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78856] <... read resumed>"127.0.0.1 localhost\n127.0.1.1 hi"..., 4096) = 247
[pid 78857] accept(66, 0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78856] close(67 <unfinished ...>
[pid 78857] accept(66, <unfinished ...>
[pid 78856] <... close resumed>) = 0
[pid 78857] <... accept resumed>0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78851] <... clock_nanosleep resumed>NULL) = 0
[pid 78857] accept(66, <unfinished ...>
[pid 78851] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=1000}, <unfinished ...>
[pid 78856] gettid( <unfinished ...>
[pid 78857] <... accept resumed>0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78856] <... gettid resumed>) = 78856
[pid 78857] accept(66, 0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78856] write(1, "\nhighgarden:78851:78856 [0] misc"..., 122 <unfinished ...>
highgarden:78851:78856 [0] misc/socket.cc:50 NCCL WARN socketProgress: Connection closed by remote peer localhost<48506>
[pid 78857] accept(66, <unfinished ...>
[pid 78851] <... clock_nanosleep resumed>NULL) = 0
[pid 78856] <... write resumed>) = 122
[pid 78857] <... accept resumed>0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78851] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=1000}, <unfinished ...>
[pid 78857] accept(66, <unfinished ...>
[pid 78856] write(1, "highgarden:78851:78856 [0] NCCL "..., 61 <unfinished ...>
highgarden:78851:78856 [0] NCCL INFO misc/socket.cc:752 -> 6
[pid 78857] <... accept resumed>0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78856] <... write resumed>) = 61
[pid 78857] accept(66, 0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78856] write(1, "highgarden:78851:78856 [0] NCCL "..., 61 <unfinished ...>
highgarden:78851:78856 [0] NCCL INFO misc/socket.cc:428 -> 6
[pid 78851] <... clock_nanosleep resumed>NULL) = 0
[pid 78857] accept(66, <unfinished ...>
[pid 78851] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=1000}, <unfinished ...>
[pid 78856] <... write resumed>) = 61
[pid 78857] <... accept resumed>0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78857] accept(66, <unfinished ...>
[pid 78856] write(1, "highgarden:78851:78856 [0] NCCL "..., 61 <unfinished ...>
highgarden:78851:78856 [0] NCCL INFO misc/socket.cc:564 -> 6
[pid 78857] <... accept resumed>0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78856] <... write resumed>) = 61
[pid 78857] accept(66, <unfinished ...>
[pid 78851] <... clock_nanosleep resumed>NULL) = 0
[pid 78856] write(1, "highgarden:78851:78856 [0] NCCL "..., 61 <unfinished ...>
highgarden:78851:78856 [0] NCCL INFO misc/socket.cc:668 -> 6
[pid 78851] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=1000}, <unfinished ...>
[pid 78857] <... accept resumed>0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78856] <... write resumed>) = 61
UPDATE 1: I also see that we don't have nccl.conf defined anywhere. Would putting something in here help:
[pid 78851] openat(AT_FDCWD, "/home/local/NAME/isaac.gerg/.nccl.conf", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 78851] openat(AT_FDCWD, "/etc/nccl.conf", O_RDONLY) = -1 ENOENT (No such file or directory)
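For reference, the two paths probed above are where NCCL looks for an optional plain KEY=VALUE configuration file. A minimal sketch of what one could contain is below; the values shown are just the settings already being tried via environment variables in this thread, not a proposed fix:

```
NCCL_DEBUG=INFO
NCCL_SOCKET_IFNAME=lo
```

Environment variables take precedence over the file, so its absence is not itself an error; ENOENT here is expected on a machine that never created one.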
The excerpt you attached shows just after-effects (the thread is busy looking up the hostname to include in the debug output); the culprit must be earlier... You may want to attach the whole thing just to be sure (GitHub has a semi-hidden "Paste, drop, or click to add files" button underneath the comment box).
@kiskra-nvidia
The log was too large to upload so I truncated it near where it started to repeat.
The first 8% or so of that log would've been enough 😄.
Still, it's very weird. It looks like a message just disappears?! I'm wondering if your node has its TCP/IP stack configured in some unusual (broken?) way. Can you try compiling (gcc -pthread -o cs cs.c) and running the attached source?
cs.c.txt
It should generate output such as this:
Server listening...
Client calling connect
Client calling send
Client calling shutdown
Client calling close
Client calling pthread_join...
Server calling accept...
Server calling recv...
Terminating the server
Terminating the client
but I'm curious if on your system it will trigger an assertion failure in line 39...
Here is the output. What do you think?
(base) isaac.gerg@highgarden:~/nccl-tests$ ./cs
Server listening...
Client calling connect
Client calling send
Client calling shutdown
Client calling close
Client calling pthread_join...
Server calling accept...
Server calling recv...
cs: cs.c:39: server_f: Assertion `recv(client, &id, 8, MSG_DONTWAIT) == 8' failed.
Aborted (core dumped)
As I suspected. For some reason, on your system, when a client sends a message to the server over a newly established TCP connection and then closes the connection before the server even accepts the connection, the message just disappears. That's rather unexpected and frankly it strikes me as broken, but at least it fails consistently so hopefully we can get to the bottom of this.
First, what distro is this? Are you running a distro kernel (and what version?) or something custom? Anything you can think of regarding any tweaks to the networking stack? In particular, any customizations to /etc/sysctl.conf / /etc/sysctl.d/* or the kernel command line (cat /proc/cmdline)? Let's double-check on that firewall config that @sjeaugey asked you about -- does [sudo] iptables-save print any output?
Also, let's experiment with a few tweaks to that client-server code from https://github.com/NVIDIA/nccl-tests/issues/217#issuecomment-2125875092 to get a better feeling for what triggers the problem:
1. Remove the shutdown call in line 78. Any change?
2. Move the sleep(5); from line 31 to line 36 (before the recv call). Any change? If not, also add a sleep(1); after the connect call (line 72 in the original version) and try again.
3. Add a sleep(10); to line 76 (before the shutdown call). Any change?
4. Replace the MSG_DONTWAIT in the recv call in line 39 with 0 (zero). Any change?
(base) isaac.gerg@highgarden:~$ hostnamectl
 Static hostname: highgarden
       Icon name: computer-server
         Chassis: server
      Machine ID: 078b15e05dfd4619adc5dd6738b608e8
         Boot ID: 7601aa74c4234ecea1bdbcf4ef3e1910
Operating System: Ubuntu 20.04.6 LTS
          Kernel: Linux 5.15.0-105-generic
    Architecture: x86-64
Anything seem out of the ordinary here?
Will work out experiments shortly....
We've got the very same kernel version on a couple of boxes though they seem to be running Ubuntu 22.04.4 LTS? On the 20.04.6 LTS boxes we're running various 5.4.0 kernel variants. Could it be that you are running a kernel version that's not meant for your distro version? Though how that would result in this strange network stack behavior, I have no idea...
Either way, I tried with the same kernel version on Ubuntu 22.04.4 LTS and could not reproduce it there...
localadmin@highgarden:~$ sudo iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain FORWARD (policy ACCEPT)
target prot opt source destination
DOCKER-USER all -- anywhere anywhere
DOCKER-ISOLATION-STAGE-1 all -- anywhere anywhere
ACCEPT all -- anywhere anywhere ctstate RELATED,ESTABLISHED
DOCKER all -- anywhere anywhere
ACCEPT all -- anywhere anywhere
ACCEPT all -- anywhere anywhere
ACCEPT all -- anywhere anywhere ctstate RELATED,ESTABLISHED
DOCKER all -- anywhere anywhere
ACCEPT all -- anywhere anywhere
ACCEPT all -- anywhere anywhere
ACCEPT all -- anywhere anywhere ctstate RELATED,ESTABLISHED
DOCKER all -- anywhere anywhere
ACCEPT all -- anywhere anywhere
ACCEPT all -- anywhere anywhere
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
Chain DOCKER (3 references)
target prot opt source destination
Chain DOCKER-ISOLATION-STAGE-1 (1 references)
target prot opt source destination
DOCKER-ISOLATION-STAGE-2 all -- anywhere anywhere
DOCKER-ISOLATION-STAGE-2 all -- anywhere anywhere
DOCKER-ISOLATION-STAGE-2 all -- anywhere anywhere
RETURN all -- anywhere anywhere
Chain DOCKER-ISOLATION-STAGE-2 (3 references)
target prot opt source destination
DROP all -- anywhere anywhere
DROP all -- anywhere anywhere
DROP all -- anywhere anywhere
RETURN all -- anywhere anywhere
Chain DOCKER-USER (1 references)
target prot opt source destination
RETURN all -- anywhere anywhere
localadmin@highgarden:~$
We reset the ESET service (which is our antivirus) and now the C test you gave us works. So I am thinking this issue may be linked to our antivirus. Are there any other files you want us to examine?
Hey, if you figured out a way to make the C test work, no point in doing the other crazy stuff -- just try again with NCCL's all_reduce_perf!
all_reduce_perf now works okay after ESET service restart.
Thanks for the confirmation @isaacgerg. Feel free to close the issue if the problem is solved. Thanks.
Quick update: Disabling ESET's Web Access Protection fixed this issue.
I think I encountered the same issue as you. Are you using the EEA version? Could you please briefly tell me how to disable ESET's Web Access Protection using commands?
I am unable to provide you this information.
Sorry to bother you. Thank you.
Downloaded, built, and installed NCCL and NCCL-tests today. When I attempt to run on my Linux machine which contains 4 GPUs, the test hangs, even when running with a single GPU. Below is an example output that I get after setting some debug flags in an attempt to diagnose.
I have provided the topology of the system below:
NVIDIA-SMI gives me the following below: