all_reduce_perf hangs; using a single GPU on a 4GPU machine

isaacgerg commented 2 months ago

Downloaded, built, and installed NCCL and NCCL-tests today. When I attempt to run on my Linux machine which contains 4 GPUs, the test hangs, even when running with a single GPU. Below is an example output that I get after setting some debug flags in an attempt to diagnose.

:~/nccl-tests$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 va                                                                                                                                  lidation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  69438 on NAME device  0 [0x01] NVIDIA RTX A6000
NAME :69438:69438 [0] NCCL INFO Bootstrap : Using enp3s0f0:10.50.56.33<0>
NAME :69438:69438 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
NAME :69438:69438 [0] NCCL INFO NET/Plugin: Using internal network plugin.
NAME :69438:69438 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
NAME :69438:69444 [0] NCCL INFO NET/IB : No device found.
NAME :69438:69444 [0] NCCL INFO NET/Socket : Using [0]enp3s0f0:10.50.56.33<0> [1]br-50a6a39f12ff:172.1                                                                                                                                  8.0.1<0> [2]br-93ef2f1539c5:172.25.0.1<0>
NAME :69438:69444 [0] NCCL INFO Using non-device net plugin version 0
NAME :69438:69444 [0] NCCL INFO Using network Socket

NAME :69438:69443 [0] misc/socket.cc:50 NCCL WARN socketProgress: Connection closed by remote peer NAME<59364>
NAME :69438:69443 [0] NCCL INFO misc/socket.cc:752 -> 6
NAME :69438:69443 [0] NCCL INFO misc/socket.cc:428 -> 6
NAME :69438:69443 [0] NCCL INFO misc/socket.cc:564 -> 6
NAME :69438:69443 [0] NCCL INFO misc/socket.cc:668 -> 6
NAME :69438:69443 [0] NCCL INFO bootstrap.cc:129 -> 6

I have provided the topology of the system below:

:~$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     SYS     SYS     0-63    0               N/A
GPU1    SYS      X      SYS     SYS     0-63    0               N/A
GPU2    SYS     SYS      X      SYS     0-63    0               N/A
GPU3    SYS     SYS     SYS      X      0-63    0               N/A

NVIDIA-SMI gives me the following below:

:~/nccl-tests$ nvidia-smi
Wed May 22 11:59:25 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX A6000               Off |   00000000:01:00.0 Off |                  Off |
| 30%   31C    P8             26W /  300W |       3MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX A6000               Off |   00000000:41:00.0 Off |                  Off |
| 30%   29C    P8             20W /  300W |       3MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA RTX A6000               Off |   00000000:81:00.0 Off |                  Off |
| 30%   29C    P8             22W /  300W |       3MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA RTX A6000               Off |   00000000:C1:00.0 Off |                  Off |
| 30%   28C    P8             16W /  300W |       3MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

sjeaugey commented 2 months ago

You're running on a single GPU, so I'm not sure what you're trying to do, but it looks like the NIC enp3s0f0 is not letting NCCL communicate. If you don't intend to run on across multiple nodes, maybe try setting NCCL_SOCKET_IFNAME=lo?

isaacgerg commented 2 months ago

@sjeaugey My intent was to demonstrate that even on a single GPU, the test hangs. It also hangs when more than 1 GPU is specified.

I tried your suggestion and it still hangs. Here is the output:

:~/nccl-tests$ export NCCL_SOCKET_IFNAME=lo
:~/nccl-tests$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  70918 on NAME device  0 [0x01] NVIDIA RTX A6000
NAME :70918:70918 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
NAME :70918:70918 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
NAME :70918:70918 [0] NCCL INFO NET/Plugin: Using internal network plugin.
NAME :70918:70918 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.21.5+cuda12.4
NAME :70918:70928 [0] NCCL INFO NET/IB : No device found.
NAME :70918:70928 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
NAME :70918:70928 [0] NCCL INFO Using non-device net plugin version 0
NAME :70918:70928 [0] NCCL INFO Using network Socket

NAME :70918:70927 [0] misc/socket.cc:50 NCCL WARN socketProgress: Connection closed by remote peer localhost<56086>
NAME :70918:70927 [0] NCCL INFO misc/socket.cc:752 -> 6
NAME:70918:70927 [0] NCCL INFO misc/socket.cc:428 -> 6
NAME :70918:70927 [0] NCCL INFO misc/socket.cc:564 -> 6
NAME :70918:70927 [0] NCCL INFO misc/socket.cc:668 -> 6
NAME :70918:70927 [0] NCCL INFO bootstrap.cc:129 -> 6

sjeaugey commented 2 months ago

Alright, understood.

Would you have a very aggressive firewall that would prevent any network communication even through lo?

isaacgerg commented 2 months ago

@sjeaugey We have no firewall on the machine. Is there more debug I could provide you to help diagnose?

kiskra-nvidia commented 2 months ago

Mysterious... Given that it's reproducible within a single process, I would run all_reduce_perf under strace -f to get some insight what might be going on with these socket operations... Brace yourself for a lot of output...

isaacgerg commented 2 months ago

@kiskra-nvidia You were not kidding. Is there something I should specific grep for? Here are some lines around the warning we receive regarding socketProgress.

pid 78857] <... accept resumed>0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78851] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=1000},  <unfinished ...>
[pid 78856] lseek(67, 0, SEEK_CUR <unfinished ...>
[pid 78857] accept(66,  <unfinished ...>
[pid 78856] <... lseek resumed>)        = 0
[pid 78857] <... accept resumed>0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78856] fstat(67,  <unfinished ...>
[pid 78857] accept(66,  <unfinished ...>
[pid 78856] <... fstat resumed>{st_mode=S_IFREG|0644, st_size=247, ...}) = 0
[pid 78857] <... accept resumed>0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78851] <... clock_nanosleep resumed>NULL) = 0
[pid 78857] accept(66,  <unfinished ...>
[pid 78856] read(67,  <unfinished ...>
[pid 78851] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=1000},  <unfinished ...>
[pid 78857] <... accept resumed>0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78856] <... read resumed>"127.0.0.1 localhost\n127.0.1.1 hi"..., 4096) = 247
[pid 78857] accept(66, 0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78856] close(67 <unfinished ...>
[pid 78857] accept(66,  <unfinished ...>
[pid 78856] <... close resumed>)        = 0
[pid 78857] <... accept resumed>0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78851] <... clock_nanosleep resumed>NULL) = 0
[pid 78857] accept(66,  <unfinished ...>
[pid 78851] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=1000},  <unfinished ...>
[pid 78856] gettid( <unfinished ...>
[pid 78857] <... accept resumed>0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78856] <... gettid resumed>)       = 78856
[pid 78857] accept(66, 0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78856] write(1, "\nhighgarden:78851:78856 [0] misc"..., 122 <unfinished ...>

highgarden:78851:78856 [0] misc/socket.cc:50 NCCL WARN socketProgress: Connection closed by remote peer localhost<48506>
[pid 78857] accept(66,  <unfinished ...>
[pid 78851] <... clock_nanosleep resumed>NULL) = 0
[pid 78856] <... write resumed>)        = 122
[pid 78857] <... accept resumed>0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78851] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=1000},  <unfinished ...>
[pid 78857] accept(66,  <unfinished ...>
[pid 78856] write(1, "highgarden:78851:78856 [0] NCCL "..., 61 <unfinished ...>
highgarden:78851:78856 [0] NCCL INFO misc/socket.cc:752 -> 6
[pid 78857] <... accept resumed>0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78856] <... write resumed>)        = 61
[pid 78857] accept(66, 0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78856] write(1, "highgarden:78851:78856 [0] NCCL "..., 61 <unfinished ...>
highgarden:78851:78856 [0] NCCL INFO misc/socket.cc:428 -> 6
[pid 78851] <... clock_nanosleep resumed>NULL) = 0
[pid 78857] accept(66,  <unfinished ...>
[pid 78851] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=1000},  <unfinished ...>
[pid 78856] <... write resumed>)        = 61
[pid 78857] <... accept resumed>0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78857] accept(66,  <unfinished ...>
[pid 78856] write(1, "highgarden:78851:78856 [0] NCCL "..., 61 <unfinished ...>
highgarden:78851:78856 [0] NCCL INFO misc/socket.cc:564 -> 6
[pid 78857] <... accept resumed>0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78856] <... write resumed>)        = 61
[pid 78857] accept(66,  <unfinished ...>
[pid 78851] <... clock_nanosleep resumed>NULL) = 0
[pid 78856] write(1, "highgarden:78851:78856 [0] NCCL "..., 61 <unfinished ...>
highgarden:78851:78856 [0] NCCL INFO misc/socket.cc:668 -> 6
[pid 78851] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=1000},  <unfinished ...>
[pid 78857] <... accept resumed>0x7f0ab230c6e0, [28]) = -1 EAGAIN (Resource temporarily unavailable)
[pid 78856] <... write resumed>)        = 61

UPDATE 1: I also see that we don't have nccl.conf defined anywhere. Would putting something in here help:

[pid 78851] openat(AT_FDCWD, "/home/local/NAME/isaac.gerg/.nccl.conf", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 78851] openat(AT_FDCWD, "/etc/nccl.conf", O_RDONLY) = -1 ENOENT (No such file or directory)

kiskra-nvidia commented 2 months ago

The excerpt you attached are just after-effects (the thread is busy looking up the hostname to include in the debug output); the culprit must be earlier... You may want to attach the whole thing just to be sure (GitHub has a semi-hidden "Paste, drop, or click to add files" button underneath the comment box).

isaacgerg commented 2 months ago

@kiskra-nvidia

The log was too large to upload so I truncated it near where it started to repeat.

log_snip.txt

kiskra-nvidia commented 2 months ago

The first 8% or so of that log would've been enough 😄.

Still, it's very weird. It looks like a message just disappears?! I'm wondering if your node has its TCP/IP stack configured in some unusual (broken?) way. Can you try compiling (gcc -pthread -o cs cs.c) and running the attached source? cs.c.txt It should generate output such as this:

Server listening...
Client calling connect
Client calling send
Client calling shutdown
Client calling close
Client calling pthread_join...
Server calling accept...
Server calling recv...
Terminating the server
Terminating the client

but I'm curious if on your system it will trigger an assertion failure in line 39...

isaacgerg commented 2 months ago

Here is the output. What do you think?

(base) isaac.gerg@highgarden:~/nccl-tests$ ./cs
Server listening...
Client calling connect
Client calling send
Client calling shutdown
Client calling close
Client calling pthread_join...
Server calling accept...
Server calling recv...
cs: cs.c:39: server_f: Assertion `recv(client, &id, 8, MSG_DONTWAIT) == 8' failed.
Aborted (core dumped)

kiskra-nvidia commented 2 months ago

As I suspected. For some reason, on your system, when a client sends a message to the server over a newly established TCP connection and then closes the connection before the server even accepts the connection, the message just disappears. That's rather unexpected and frankly it strikes me as broken, but at least it fails consistently so hopefully we can get to the bottom of this.

First, what distro is this? Are you running a distro kernel (and what version?) or something custom? Anything you can think of regarding any tweaks to the networking stack? In particular, any customizations to /etc/sysctl.conf / /etc/sysctl.d/* or the kernel command line (cat /proc/cmdline)? Let's double-check on that firewall config that @sjeaugey asked you about -- does [sudo] iptables-save print any output?

Also, let's experiment with a few tweaks to that client-server code from https://github.com/NVIDIA/nccl-tests/issues/217#issuecomment-2125875092 to get a better feeling what triggers the problem:

Comment out the shutdown call in line 78. Any change?
Go back to the original version and move the sleep(5); from line 31 to line 36 (before the recv call). Any change? If not, also add a sleep(1); after the connect call (line 72 in the original version) and try again.
Go back to the original version and add a sleep(10); to line 76 (before the shutdown call). Any change?
Go back to the original version and replace MSG_DONTWAIT in the recv call in line 39 with 0 (zero). Any change?

isaacgerg commented 2 months ago

(base) isaac.gerg@highgarden:~$ hostnamectl Static hostname: highgarden Icon name: computer-server Chassis: server Machine ID: 078b15e05dfd4619adc5dd6738b608e8 Boot ID: 7601aa74c4234ecea1bdbcf4ef3e1910 Operating System: Ubuntu 20.04.6 LTS Kernel: Linux 5.15.0-105-generic Architecture: x86-64

Anything seem out of the ordinary here?

Will work out experiments shortly....

kiskra-nvidia commented 2 months ago

We've got the very same kernel version on a couple of boxes though they seem to be running Ubuntu 22.04.4 LTS? On the 20.04.6 LTS boxes we're running various 5.4.0 kernel variants. Could it be that you are running a kernel version that's not meant for your distro version? Though how that would result in this strange network stack behavior, I have no idea...

Either way, I tried with the same kernel version on Ubuntu 22.04.4 LTS and could not reproduce it there...

isaacgerg commented 2 months ago

localadmin@highgarden:~$ sudo iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         
DOCKER-USER  all  --  anywhere             anywhere            
DOCKER-ISOLATION-STAGE-1  all  --  anywhere             anywhere            
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
DOCKER     all  --  anywhere             anywhere            
ACCEPT     all  --  anywhere             anywhere            
ACCEPT     all  --  anywhere             anywhere            
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
DOCKER     all  --  anywhere             anywhere            
ACCEPT     all  --  anywhere             anywhere            
ACCEPT     all  --  anywhere             anywhere            
ACCEPT     all  --  anywhere             anywhere             ctstate RELATED,ESTABLISHED
DOCKER     all  --  anywhere             anywhere            
ACCEPT     all  --  anywhere             anywhere            
ACCEPT     all  --  anywhere             anywhere            

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         

Chain DOCKER (3 references)
target     prot opt source               destination         

Chain DOCKER-ISOLATION-STAGE-1 (1 references)
target     prot opt source               destination         
DOCKER-ISOLATION-STAGE-2  all  --  anywhere             anywhere            
DOCKER-ISOLATION-STAGE-2  all  --  anywhere             anywhere            
DOCKER-ISOLATION-STAGE-2  all  --  anywhere             anywhere            
RETURN     all  --  anywhere             anywhere            

Chain DOCKER-ISOLATION-STAGE-2 (3 references)
target     prot opt source               destination         
DROP       all  --  anywhere             anywhere            
DROP       all  --  anywhere             anywhere            
DROP       all  --  anywhere             anywhere            
RETURN     all  --  anywhere             anywhere            

Chain DOCKER-USER (1 references)
target     prot opt source               destination         
RETURN     all  --  anywhere             anywhere            
localadmin@highgarden:~$

We reset the ESET service (which is our antivirus) and now the C test you gave us works. So I am thinking this issue may be linked to our antivirus. Are there any other files you want us to examine?

kiskra-nvidia commented 2 months ago

Hey, if you figured out a way to make the C test work, no point in doing the other crazy stuff work -- just try again with NCCL's all_reduce_perf!

isaacgerg commented 2 months ago

all_reduce_perf now works okay after ESET service restart.

sjeaugey commented 2 months ago

Thanks for the confirmation @isaacgerg. Feel free to close the issue if the problem is solved. Thanks.

isaacgerg commented 2 months ago

Quick update: Disabling ESET's Web Access Protection fixed this issue.

HERO-YDX commented 4 days ago

Quick update: Disabling ESET's Web Access Protection fixed this issue.

I think I encountered the same issue as you. Are you using the EEA version? Could you please briefly tell me how to disable ESET's Web Access Protection using commands?

isaacgerg commented 4 days ago

I am unable to provide you this information.

HERO-YDX commented 4 days ago

Sorry to bother you. Thank you.

NVIDIA / nccl-tests

all_reduce_perf hangs; using a single GPU on a 4GPU machine #217