minghungchen opened this issue 1 year ago
I saw a similar issue in the 2.19.3 code when testing AllReduce with dual-port CX-7 NICs and NCCL_ALGO=NVLSTree. It looks like NCCL does not support dual-port NICs well. Please refer to issue #1305.
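For context, a two-node nccl-tests run that forces this algorithm would typically look something like the sketch below; the exact command used in this issue is not quoted in the thread, so the host names, slot counts, and binary path are placeholders.
$ # Hypothetical launch: 6 GPUs per node, NVLSTree forced, debug logging on
$ mpirun -np 12 -H hostA:6,hostB:6 \
    -x NCCL_DEBUG=INFO -x NCCL_ALGO=NVLSTree \
    ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1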
NCCL should run fine on dual-port NIC platforms. The issue is that NCCL currently cannot fully utilize all ports for the best performance (we are fixing that). The hang looks odd, though. Can you provide the backtrace of all threads in rank 0 and the output of nvidia-smi topo -m? @minghungchen
@KaimingOuyang You can download the full NCCL logs, including the backtrace, from the issue. Let me know if you are looking for something else. Here is the output of nvidia-smi topo -m:
$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX NODE NODE NODE SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE PIX NODE NODE SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE PIX NODE SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE NODE PIX SYS SYS SYS SYS 0-47,96-143 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS PIX NODE NODE NODE 48-95,144-191 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS NODE PIX NODE NODE 48-95,144-191 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS NODE NODE PIX NODE 48-95,144-191 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS NODE NODE NODE PIX 48-95,144-191 1 N/A
NIC0 PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE SYS SYS SYS SYS
NIC1 NODE PIX NODE NODE SYS SYS SYS SYS NODE X NODE NODE SYS SYS SYS SYS
NIC2 NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE X NODE SYS SYS SYS SYS
NIC3 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE X SYS SYS SYS SYS
NIC4 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE
NIC5 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS NODE X NODE NODE
NIC6 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE X NODE
NIC7 SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
I don't see a backtrace in the 2.19.3 log. For both 2.18.5 and 2.19.3, could you please provide the backtrace from gdb?
Also, is there any reason you must use NICs 2,3,4,5,6,7? Can you try NICs 1,2,4,5,6,7?
I don't have a backtrace from gdb, but if you can suggest what steps I should take to get it, I can give it a try.
This validation case requires using both ports of the dual-port NICs, so we do not use NIC 1 (or 2) without NIC 0 (or 3) in this test case.
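(For reference, restricting NCCL to a specific set of HCAs is usually done with NCCL_IB_HCA; a minimal sketch, with a device list that is only illustrative and based on the mlx5_* names in the topology above:)
$ # Illustrative: limit NCCL's IB/RoCE traffic to mlx5_2 .. mlx5_7
$ export NCCL_IB_HCA=mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7
$ # A single port of a device can also be selected with the device:port form, e.g. mlx5_2:1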
OK. Actually, I want to make sure whether you are using 2.18.5 or 2.18.6; I see the log shows 2.18.6. For 2.18.6, could you please provide me the libnccl.so binary? For 2.19.3, when the process hangs, you can get the process PID from top, attach with gdb -p [pid], and then print the backtrace with thread apply all bt.
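A non-interactive way to capture the same information (a sketch; it assumes gdb is installed and that the test binary is named all_reduce_perf):
$ # Dump all thread backtraces of the hung rank without an interactive gdb session
$ pid=$(pgrep -f all_reduce_perf | head -n 1)
$ sudo gdb -batch -ex "thread apply all bt" -p "$pid" > backtrace_rank0.txt 2>&1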
This validation case requires using both ports of the dual-port NICs, so we do not use NIC 1 (or 2) without NIC 0 (or 3) in this test case.
If so, can you use GPUs 2,3,4,5,6,7?
I tried the same command with GPUs 2,3,4,5,6,7 by setting CUDA_VISIBLE_DEVICES=2,3,4,5,6,7. It did not help; all_reduce_perf still got stuck with NCCL 2.19.3.
The libnccl.so.2.18.6 binary is around 290 MB, and GitHub does not allow uploading such a large file. https://github.com/NVIDIA/nccl/issues/1043#issue-1966021425 has the git hash for each version I used. The 2.18.6 I built is from this branch: https://github.com/NVIDIA/nccl/tree/4365458757e4107ecbf629b2fd6e0e19a5d237c2 Let me know if you need the binary file, and I will find somewhere else to upload it.
After all_reduce_perf with NCCL 2.19.3 had been stuck for some time, it appears to have put the system in a weird state. Now the same command fails at a different point; here is the new debug log: nccl-allreduce-6gpu-3nic-6lnk-2.19.3-after-long-stuck.log
I will post the gdb backtrace when it is available.
Here is the gdb backtrace from the PID of rank 0:
$ sudo gdb -p 6042
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 6042
[New LWP 6058]
[New LWP 6060]
[New LWP 6085]
[New LWP 6096]
[New LWP 6108]
[New LWP 6109]
[New LWP 6111]
[New LWP 6115]
[New LWP 6120]
[New LWP 6122]
[New LWP 6351]
[New LWP 6357]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f59364f8c9b in sched_yield () at ../sysdeps/unix/syscall-template.S:120
120 ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) thread apply all bt
Thread 13 (Thread 0x7f54ccbfe000 (LWP 6357) "all_reduce_perf"):
#0 0x00007f59364f8c9b in sched_yield () at ../sysdeps/unix/syscall-template.S:120
#1 0x00007f593698919d in ncclProxyProgress (proxyState_=<optimized out>) at proxy.cc:889
#2 0x00007f5936484ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#3 0x00007f5936516a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Thread 12 (Thread 0x7f592d161000 (LWP 6351) "all_reduce_perf"):
#0 0x00007f5936508dbf in __GI___poll (fds=fds@entry=0x7f592d158ac0, nfds=nfds@entry=65, timeout=500) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x00007f593698c1fc in poll (__timeout=<optimized out>, __nfds=65, __fds=0x7f592d158ac0) at /usr/include/x86_64-linux-gnu/bits/poll2.h:39
#2 ncclProxyService (_args=0x7f54d06c7230) at proxy.cc:1475
#3 0x00007f5936484ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#4 0x00007f5936516a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Thread 11 (Thread 0x7f54ce7fc000 (LWP 6122) "all_reduce_perf"):
#0 __GI___libc_read (nbytes=16, buf=0x7f54ce7f5880, fd=78) at ../sysdeps/unix/sysv/linux/read.c:26
#1 __GI___libc_read (fd=78, buf=buf@entry=0x7f54ce7f5880, nbytes=nbytes@entry=16) at ../sysdeps/unix/sysv/linux/read.c:24
#2 0x00007f593402f4c4 in read (__nbytes=16, __buf=0x7f54ce7f5880, __fd=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/unistd.h:38
#3 __ibv_get_async_event_1_1 (context=0x7f54d0672720, event=0x7f54ce7f58e0) at ./libibverbs/device.c:459
#4 0x00007f59369b7676 in wrap_ibv_get_async_event (context=context@entry=0x7f54d0672720, event=event@entry=0x7f54ce7f58e0) at misc/ibvwrap.cc:121
#5 0x00007f59369cc18b in ncclIbAsyncThreadMain (args=0x7f54d0672720) at transport/net_ib.cc:91
#6 0x00007f5936484ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#7 0x00007f5936516a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Thread 10 (Thread 0x7f54ceffd000 (LWP 6120) "all_reduce_perf"):
#0 __GI___libc_read (nbytes=16, buf=0x7f54ceff6880, fd=76) at ../sysdeps/unix/sysv/linux/read.c:26
#1 __GI___libc_read (fd=76, buf=buf@entry=0x7f54ceff6880, nbytes=nbytes@entry=16) at ../sysdeps/unix/sysv/linux/read.c:24
#2 0x00007f593402f4c4 in read (__nbytes=16, __buf=0x7f54ceff6880, __fd=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/unistd.h:38
#3 __ibv_get_async_event_1_1 (context=0x7f54d0630ca0, event=0x7f54ceff68e0) at ./libibverbs/device.c:459
#4 0x00007f59369b7676 in wrap_ibv_get_async_event (context=context@entry=0x7f54d0630ca0, event=event@entry=0x7f54ceff68e0) at misc/ibvwrap.cc:121
#5 0x00007f59369cc18b in ncclIbAsyncThreadMain (args=0x7f54d0630ca0) at transport/net_ib.cc:91
#6 0x00007f5936484ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#7 0x00007f5936516a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Thread 9 (Thread 0x7f54cf7fe000 (LWP 6115) "all_reduce_perf"):
#0 __GI___libc_read (nbytes=16, buf=0x7f54cf7f7880, fd=74) at ../sysdeps/unix/sysv/linux/read.c:26
#1 __GI___libc_read (fd=74, buf=buf@entry=0x7f54cf7f7880, nbytes=nbytes@entry=16) at ../sysdeps/unix/sysv/linux/read.c:24
#2 0x00007f593402f4c4 in read (__nbytes=16, __buf=0x7f54cf7f7880, __fd=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/unistd.h:38
#3 __ibv_get_async_event_1_1 (context=0x7f54d05ef610, event=0x7f54cf7f78e0) at ./libibverbs/device.c:459
#4 0x00007f59369b7676 in wrap_ibv_get_async_event (context=context@entry=0x7f54d05ef610, event=event@entry=0x7f54cf7f78e0) at misc/ibvwrap.cc:121
#5 0x00007f59369cc18b in ncclIbAsyncThreadMain (args=0x7f54d05ef610) at transport/net_ib.cc:91
#6 0x00007f5936484ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#7 0x00007f5936516a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Thread 8 (Thread 0x7f54cffff000 (LWP 6111) "all_reduce_perf"):
#0 __GI___libc_read (nbytes=16, buf=0x7f54cfff8880, fd=72) at ../sysdeps/unix/sysv/linux/read.c:26
#1 __GI___libc_read (fd=72, buf=buf@entry=0x7f54cfff8880, nbytes=nbytes@entry=16) at ../sysdeps/unix/sysv/linux/read.c:24
--Type <RET> for more, q to quit, c to continue without paging--c
#2 0x00007f593402f4c4 in read (__nbytes=16, __buf=0x7f54cfff8880, __fd=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/unistd.h:38
#3 __ibv_get_async_event_1_1 (context=0x7f54d05adf80, event=0x7f54cfff88e0) at ./libibverbs/device.c:459
#4 0x00007f59369b7676 in wrap_ibv_get_async_event (context=context@entry=0x7f54d05adf80, event=event@entry=0x7f54cfff88e0) at misc/ibvwrap.cc:121
#5 0x00007f59369cc18b in ncclIbAsyncThreadMain (args=0x7f54d05adf80) at transport/net_ib.cc:91
#6 0x00007f5936484ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#7 0x00007f5936516a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Thread 7 (Thread 0x7f5706b5c000 (LWP 6109) "all_reduce_perf"):
#0 __GI___libc_read (nbytes=16, buf=0x7f5706b55880, fd=70) at ../sysdeps/unix/sysv/linux/read.c:26
#1 __GI___libc_read (fd=70, buf=buf@entry=0x7f5706b55880, nbytes=nbytes@entry=16) at ../sysdeps/unix/sysv/linux/read.c:24
#2 0x00007f593402f4c4 in read (__nbytes=16, __buf=0x7f5706b55880, __fd=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/unistd.h:38
#3 __ibv_get_async_event_1_1 (context=0x7f54d056c8f0, event=0x7f5706b558e0) at ./libibverbs/device.c:459
#4 0x00007f59369b7676 in wrap_ibv_get_async_event (context=context@entry=0x7f54d056c8f0, event=event@entry=0x7f5706b558e0) at misc/ibvwrap.cc:121
#5 0x00007f59369cc18b in ncclIbAsyncThreadMain (args=0x7f54d056c8f0) at transport/net_ib.cc:91
#6 0x00007f5936484ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#7 0x00007f5936516a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Thread 6 (Thread 0x7f570735d000 (LWP 6108) "all_reduce_perf"):
#0 __GI___libc_read (nbytes=16, buf=0x7f5707356880, fd=68) at ../sysdeps/unix/sysv/linux/read.c:26
#1 __GI___libc_read (fd=68, buf=buf@entry=0x7f5707356880, nbytes=nbytes@entry=16) at ../sysdeps/unix/sysv/linux/read.c:24
#2 0x00007f593402f4c4 in read (__nbytes=16, __buf=0x7f5707356880, __fd=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/unistd.h:38
#3 __ibv_get_async_event_1_1 (context=0x7f54d052b260, event=0x7f57073568e0) at ./libibverbs/device.c:459
#4 0x00007f59369b7676 in wrap_ibv_get_async_event (context=context@entry=0x7f54d052b260, event=event@entry=0x7f57073568e0) at misc/ibvwrap.cc:121
#5 0x00007f59369cc18b in ncclIbAsyncThreadMain (args=0x7f54d052b260) at transport/net_ib.cc:91
#6 0x00007f5936484ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#7 0x00007f5936516a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Thread 5 (Thread 0x7f592c960000 (LWP 6096) "cuda-EvtHandlr"):
#0 0x00007f5936508dbf in __GI___poll (fds=0x7f570c000c20, nfds=11, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x00007f592e68fd09 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007f592e74bebb in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007f592e6891a8 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007f5936484ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#5 0x00007f5936516a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Thread 4 (Thread 0x7f592d962000 (LWP 6085) "cuda-EvtHandlr"):
#0 0x00007f5936508dbf in __GI___poll (fds=0x5605fc6136e0, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x00007f592e68fd09 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007f592e74bebb in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007f592e6891a8 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#4 0x00007f5936484ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#5 0x00007f5936516a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Thread 3 (Thread 0x7f5935831000 (LWP 6060) "all_reduce_perf"):
#0 0x00007f593651601e in epoll_wait (epfd=10, events=events@entry=0x5605fc336a50, maxevents=32, timeout=timeout@entry=119852) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00007f593629f950 in epoll_dispatch (base=0x5605fc3367a0, tv=<optimized out>) at epoll.c:407
#2 0x00007f59362a29c5 in opal_libevent2022_event_base_loop (base=0x5605fc3367a0, flags=1) at event.c:1630
#3 0x00007f59359958c6 in progress_engine () from /net/storage149/mnt/md0/mhchen/openmpi/lib/openmpi/mca_pmix_pmix3x.so
#4 0x00007f5936484ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#5 0x00007f5936516a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Thread 2 (Thread 0x7f593621d000 (LWP 6058) "all_reduce_perf"):
#0 0x00007f5936508dbf in __GI___poll (fds=fds@entry=0x7f5930000b70, nfds=nfds@entry=1, timeout=timeout@entry=3599997) at ../sysdeps/unix/sysv/linux/poll.c:29
#1 0x00007f59365266e2 in __poll_chk (fds=fds@entry=0x7f5930000b70, nfds=nfds@entry=1, timeout=timeout@entry=3599997, fdslen=fdslen@entry=18446744073709551615) at ./debug/poll_chk.c:27
#2 0x00007f59362aa8e9 in poll (__timeout=<optimized out>, __nfds=1, __fds=0x7f5930000b70) at /usr/include/x86_64-linux-gnu/bits/poll2.h:39
#3 poll_dispatch (base=0x5605fc3121a0, tv=<optimized out>) at poll.c:165
#4 0x00007f59362a29c5 in opal_libevent2022_event_base_loop (base=0x5605fc3121a0, flags=1) at event.c:1630
#5 0x00007f593625e636 in progress_engine () from /net/storage149/mnt/md0/mhchen/openmpi/lib/libopen-pal.so.40
#6 0x00007f5936484ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#7 0x00007f5936516a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Thread 1 (Thread 0x7f59434f5000 (LWP 6042) "all_reduce_perf"):
#0 0x00007f59364f8c9b in sched_yield () at ../sysdeps/unix/syscall-template.S:120
#1 0x00005605fa2fa3ad in testStreamSynchronize (ngpus=<optimized out>, streams=0x7ffdc27dc330, comms=0x5605fd15d7a0) at /home/mhchen/nccl-tests/src/common.cu:323
#2 0x00005605fa2feff5 in completeColl (args=0x7ffdc27dc180) at /home/mhchen/nccl-tests/src/common.cu:401
#3 completeColl (args=0x7ffdc27dc180) at /home/mhchen/nccl-tests/src/common.cu:398
#4 TimeTest (args=args@entry=0x7ffdc27dc180, type=ncclFloat32, typeName=0x5605fa33632e "float", op=ncclSum, opName=0x5605fa336311 "sum", root=root@entry=-1) at /home/mhchen/nccl-tests/src/common.cu:588
#5 0x00005605fa2f8ba4 in AllReduceRunTest (args=0x7ffdc27dc180, root=<optimized out>, type=<optimized out>, typeName=<optimized out>, op=<optimized out>, opName=<optimized out>) at /home/mhchen/nccl-tests/src/all_reduce.cu:90
#6 0x00005605fa2f9360 in threadRunTests (args=0x7ffdc27dc180) at /home/mhchen/nccl-tests/src/common.cu:615
#7 0x00005605fa2fd048 in run () at /home/mhchen/nccl-tests/src/common.cu:1019
#8 0x00005605fa2f60d4 in main (argc=<optimized out>, argv=<optimized out>) at /home/mhchen/nccl-tests/src/common.cu:844
(gdb)
For the css-host-182 and 183 nodes, can you check whether your fabric manager is operating properly? It seems the NVSwitch is not in a good state.
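(A quick way to check the fabric manager state, as a sketch:)
$ # Check the fabric manager service and its recent logs
$ systemctl status nvidia-fabricmanager
$ sudo journalctl -u nvidia-fabricmanager --since "1 hour ago"
$ # Per-GPU fabric registration state
$ nvidia-smi -q | grep -A 4 Fabric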
I saw the following fabric manager log on one of the nodes. I am not sure how to reproduce it, though; the issue was gone after a reboot. The gdb backtrace above was collected after the reboot.
...
[Oct 31 2023 20:45:27] [INFO] [tid 526845] Successfully configured all the available NVSwitches to route GPU NVLink traffic. NVLink Peer-to-Peer support will be enabled once the GPUs are successfully registered with the NVLink fabric.
[Oct 31 2023 20:59:37] [INFO] [tid 527113] Received an inband message: Message header details: magic Id:adbc request Id:37f37b2d0f6aa49b status:0 type:4 length:14
Message payload details:Team release request: Team Handle:ed8c03d28a369634 Flags:0
[Oct 31 2023 20:59:37] [ERROR] [tid 527118] failed to release multicast team with handle 17117060486525261364, cannot find the team
[Oct 31 2023 21:04:55] [INFO] [tid 527113] Received an inband message: Message header details: magic Id:adbc request Id:ba9c9cdd87cbe5f status:0 type:2 length:46
Message payload details:Team setup request: Allocation Size:60000000 Flags:0 Number of GPUs:6 GPU Handles:b12d2e84230d90a9 699cdb27be093689 8fe31ad077adb013 41692f9f9d6aa4c6 c52da87fccd138d6 5ee309238856a86b
[Oct 31 2023 21:04:55] [ERROR] [tid 527118] failed to find the GPU handle 12766911663723876521 in the multicast team request setup 840424691418840671.
[Oct 31 2023 21:04:55] [ERROR] [tid 527118] Handle: 0 Request ID: 840424691418840671 Request Memory: 1610612736 Group ID: 0 GPUs: 4713350847607252166 6837318707494430827 7610198434087777929 10368160249800405011 12766911663723876521 14208197666274359510
[Oct 31 2023 21:04:55] [INFO] [tid 526859] Sending inband response message: Message header details: magic Id:adbc request Id:ba9c9cdd87cbe5f status:57 type:3 length:24
Message payload details:Team setup response: Team Handle:0 Flags:0 Address Base:0 Address Size:0
[Oct 31 2023 21:06:01] [INFO] [tid 527113] Received an inband message: Message header details: magic Id:adbc request Id:3f5e2ee47c72b3e5 status:0 type:2 length:46
Message payload details:Team setup request: Allocation Size:60000000 Flags:0 Number of GPUs:6 GPU Handles:b12d2e84230d90a9 c52da87fccd138d6 699cdb27be093689 8fe31ad077adb013 5ee309238856a86b 41692f9f9d6aa4c6
[Oct 31 2023 21:06:01] [ERROR] [tid 527118] failed to find the GPU handle 12766911663723876521 in the multicast team request setup 4566138631075574757.
[Oct 31 2023 21:06:01] [ERROR] [tid 527118] Handle: 0 Request ID: 4566138631075574757 Request Memory: 1610612736 Group ID: 0 GPUs: 4713350847607252166 6837318707494430827 7610198434087777929 10368160249800405011 12766911663723876521 14208197666274359510
[Oct 31 2023 21:06:01] [INFO] [tid 526859] Sending inband response message: Message header details: magic Id:adbc request Id:3f5e2ee47c72b3e5 status:57 type:3 length:24
Message payload details:Team setup response: Team Handle:0 Flags:0 Address Base:0 Address Size:0
[Oct 31 2023 21:07:47] [INFO] [tid 527113] Received an inband message: Message header details: magic Id:adbc request Id:b83f17aff8e8c562 status:0 type:2 length:46
Message payload details:Team setup request: Allocation Size:60000000 Flags:0 Number of GPUs:6 GPU Handles:bdaa8119d3ffbf7b b12d2e84230d90a9 699cdb27be093689 41692f9f9d6aa4c6 5ee309238856a86b c52da87fccd138d6
[Oct 31 2023 21:07:47] [ERROR] [tid 527118] failed to find the GPU handle 13666877967140110203 in the multicast team request setup 13276356271074231650.
[Oct 31 2023 21:07:47] [ERROR] [tid 527118] Handle: 0 Request ID: 13276356271074231650 Request Memory: 1610612736 Group ID: 0 GPUs: 4713350847607252166 6837318707494430827 7610198434087777929 12766911663723876521 13666877967140110203 14208197666274359510
[Oct 31 2023 21:07:47] [INFO] [tid 526859] Sending inband response message: Message header details: magic Id:adbc request Id:b83f17aff8e8c562 status:57 type:3 length:24
Message payload details:Team setup response: Team Handle:0 Flags:0 Address Base:0 Address Size:0
[Oct 31 2023 21:10:07] [INFO] [tid 527113] Received an inband message: Message header details: magic Id:adbc request Id:79743a616996a3b9 status:0 type:2 length:46
Message payload details:Team setup request: Allocation Size:60000000 Flags:0 Number of GPUs:6 GPU Handles:8fe31ad077adb013 b12d2e84230d90a9 41692f9f9d6aa4c6 5ee309238856a86b 699cdb27be093689 c52da87fccd138d6
[Oct 31 2023 21:10:07] [ERROR] [tid 527118] failed to find the GPU handle 10368160249800405011 in the multicast team request setup 8751684165945435065.
[Oct 31 2023 21:10:07] [ERROR] [tid 527118] Handle: 0 Request ID: 8751684165945435065 Request Memory: 1610612736 Group ID: 0 GPUs: 4713350847607252166 6837318707494430827 7610198434087777929 10368160249800405011 12766911663723876521 14208197666274359510
[Oct 31 2023 21:10:07] [INFO] [tid 526859] Sending inband response message: Message header details: magic Id:adbc request Id:79743a616996a3b9 status:57 type:3 length:24
Message payload details:Team setup response: Team Handle:0 Flags:0 Address Base:0 Address Size:0
[Oct 31 2023 21:11:20] [INFO] [tid 527113] Received an inband message: Message header details: magic Id:adbc request Id:3b794938a9edb4d3 status:0 type:2 length:46
Message payload details:Team setup request: Allocation Size:60000000 Flags:0 Number of GPUs:6 GPU Handles:b12d2e84230d90a9 8fe31ad077adb013 41692f9f9d6aa4c6 699cdb27be093689 5ee309238856a86b c52da87fccd138d6
[Oct 31 2023 21:11:20] [ERROR] [tid 527118] failed to find the GPU handle 12766911663723876521 in the multicast team request setup 4285537028137661651.
[Oct 31 2023 21:11:20] [ERROR] [tid 527118] Handle: 0 Request ID: 4285537028137661651 Request Memory: 1610612736 Group ID: 0 GPUs: 4713350847607252166 6837318707494430827 7610198434087777929 10368160249800405011 12766911663723876521 14208197666274359510
[Oct 31 2023 21:11:20] [INFO] [tid 526859] Sending inband response message: Message header details: magic Id:adbc request Id:3b794938a9edb4d3 status:57 type:3 length:24
Message payload details:Team setup response: Team Handle:0 Flags:0 Address Base:0 Address Size:0
[Oct 31 2023 21:12:11] [INFO] [tid 527113] Received an inband message: Message header details: magic Id:adbc request Id:d8db96bba13216be status:0 type:2 length:46
Message payload details:Team setup request: Allocation Size:60000000 Flags:0 Number of GPUs:6 GPU Handles:8fe31ad077adb013 b12d2e84230d90a9 c52da87fccd138d6 699cdb27be093689 41692f9f9d6aa4c6 5ee309238856a86b
[Oct 31 2023 21:12:11] [ERROR] [tid 527118] failed to find the GPU handle 10368160249800405011 in the multicast team request setup 15626249064699532990.
[Oct 31 2023 21:12:11] [ERROR] [tid 527118] Handle: 0 Request ID: 15626249064699532990 Request Memory: 1610612736 Group ID: 0 GPUs: 4713350847607252166 6837318707494430827 7610198434087777929 10368160249800405011 12766911663723876521 14208197666274359510
[Oct 31 2023 21:12:11] [INFO] [tid 526859] Sending inband response message: Message header details: magic Id:adbc request Id:d8db96bba13216be status:57 type:3 length:24
Message payload details:Team setup response: Team Handle:0 Flags:0 Address Base:0 Address Size:0
Can you provide me the output of nvidia-smi -q | grep -A 4 Fabric?
Let's rule out the possible reasons one by one. Could you please run the following tests:
Sure. The nvidia-smi -q | grep -A 4 Fabric output is the same on both nodes.
$ nvidia-smi -q | grep -A 4 Fabric
Fabric
State : Completed
Status : Success
Processes : None
--
Fabric
State : Completed
Status : Success
Processes : None
--
Fabric
State : Completed
Status : Success
Processes : None
--
Fabric
State : Completed
Status : Success
Processes : None
--
Fabric
State : Completed
Status : Success
Processes : None
--
Fabric
State : Completed
Status : Success
Processes : None
--
Fabric
State : Completed
Status : Success
Processes : None
--
Fabric
State : Completed
Status : Success
Processes : None
For the test cases you mentioned, 1 and 2 run fine, but 3 failed. I ran 2 and 3 with all GPUs and mlx5 interfaces. The symptom of test 3 is the same as what I reported in https://github.com/NVIDIA/nccl/issues/1043#issue-1966021425 with NCCL 2.19.3: it just gets stuck.
I think this is a fabric issue. Could you please run the following commands on both nodes:
nvidia-smi -pm 0
nvidia-smi --gpu-reset
systemctl restart nvidia-fabricmanager
Then wait until all GPUs report a "Completed, status Success" fabric status and test NCCL again. (A scripted version of this sequence is sketched below.)
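A minimal scripted version of the sequence above, run on each node (the polling loop is only a rough illustration, not an official procedure):
$ sudo nvidia-smi -pm 0
$ sudo nvidia-smi --gpu-reset
$ sudo systemctl restart nvidia-fabricmanager
$ # Rough wait: poll until the fabric query starts reporting Success, then inspect
$ # the full output to confirm every GPU shows "State : Completed" / "Status : Success"
$ until nvidia-smi -q | grep -A 4 Fabric | grep -q Success; do sleep 5; done
$ nvidia-smi -q | grep -A 4 Fabric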
Tried, but resetting the GPUs did not help.
@KaimingOuyang We also tried power cycling the two nodes, but that did not help either. Please let me know if you need any other information.
I feel this could be related to the current NVLS/NVLSTree implementation in NCCL 2.19.3.
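(One way to test that suspicion, sketched with standard NCCL environment knobs; the wrapper script name is hypothetical and just stands in for the same all_reduce_perf launch:)
$ # Compare behavior with NVLS disabled vs. the NVLS/NVLSTree algorithms forced
$ NCCL_NVLS_ENABLE=0 ./run_allreduce.sh   # hypothetical wrapper around the mpirun command
$ NCCL_ALGO=NVLS ./run_allreduce.sh
$ NCCL_ALGO=NVLSTree ./run_allreduce.sh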
I am not sure that is the reason, since I can run 2.19.3 without the problem on a DGX H100.
Could you please provide me the output of ifconfig from both nodes? By the way, are the ens110f0np0, ens110f1np1, ens112f0np0, ens112f1np1, ens114f0np0, and ens114f1np1 interfaces all working?
Can you also try setting NCCL_SOCKET_IFNAME=eth0?
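(In practice that amounts to pinning NCCL's out-of-band socket traffic to eth0, e.g.:)
$ # Force NCCL's bootstrap/socket traffic onto eth0 (can also be passed via mpirun -x)
$ export NCCL_SOCKET_IFNAME=eth0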
Sure. Please see below for the info you requested. I tried setting NCCL_SOCKET_IFNAME=eth0, but it did not help; all_reduce_perf got stuck at the same point.
Here is the ifconfig output from css-host-182:
$ ifconfig
docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 172.17.0.1 netmask 255.255.0.0 broadcast 172.17.255.255
ether 02:42:d8:2f:c1:de txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens108f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.10.1.182 netmask 255.255.255.0 broadcast 10.10.1.255
inet6 fe80::ba3f:d2ff:febe:f0da prefixlen 64 scopeid 0x20<link>
ether b8:3f:d2:be:f0:da txqueuelen 1000 (Ethernet)
RX packets 48335 bytes 2900100 (2.9 MB)
RX errors 0 dropped 7 overruns 0 frame 0
TX packets 41 bytes 2966 (2.9 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens108f1np1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.10.2.182 netmask 255.255.255.0 broadcast 10.10.2.255
inet6 fe80::ba3f:d2ff:febe:f0db prefixlen 64 scopeid 0x20<link>
ether b8:3f:d2:be:f0:db txqueuelen 1000 (Ethernet)
RX packets 48335 bytes 2900100 (2.9 MB)
RX errors 0 dropped 7 overruns 0 frame 0
TX packets 41 bytes 2966 (2.9 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens110f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.10.3.182 netmask 255.255.255.0 broadcast 10.10.3.255
inet6 fe80::ba3f:d2ff:febe:fca2 prefixlen 64 scopeid 0x20<link>
ether b8:3f:d2:be:fc:a2 txqueuelen 1000 (Ethernet)
RX packets 48335 bytes 2900100 (2.9 MB)
RX errors 0 dropped 7 overruns 0 frame 0
TX packets 41 bytes 2966 (2.9 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens110f1np1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.10.4.182 netmask 255.255.255.0 broadcast 10.10.4.255
inet6 fe80::ba3f:d2ff:febe:fca3 prefixlen 64 scopeid 0x20<link>
ether b8:3f:d2:be:fc:a3 txqueuelen 1000 (Ethernet)
RX packets 48335 bytes 2900100 (2.9 MB)
RX errors 0 dropped 7 overruns 0 frame 0
TX packets 42 bytes 3036 (3.0 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens112f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.10.5.182 netmask 255.255.255.0 broadcast 10.10.5.255
inet6 fe80::ba3f:d2ff:febe:f99e prefixlen 64 scopeid 0x20<link>
ether b8:3f:d2:be:f9:9e txqueuelen 1000 (Ethernet)
RX packets 48335 bytes 2900100 (2.9 MB)
RX errors 0 dropped 7 overruns 0 frame 0
TX packets 41 bytes 2966 (2.9 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens112f1np1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.10.6.182 netmask 255.255.255.0 broadcast 10.10.6.255
inet6 fe80::ba3f:d2ff:febe:f99f prefixlen 64 scopeid 0x20<link>
ether b8:3f:d2:be:f9:9f txqueuelen 1000 (Ethernet)
RX packets 48335 bytes 2900100 (2.9 MB)
RX errors 0 dropped 7 overruns 0 frame 0
TX packets 41 bytes 2966 (2.9 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens114f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.10.7.182 netmask 255.255.255.0 broadcast 10.10.7.255
inet6 fe80::ba3f:d2ff:fedf:82b0 prefixlen 64 scopeid 0x20<link>
ether b8:3f:d2:df:82:b0 txqueuelen 1000 (Ethernet)
RX packets 48335 bytes 2900100 (2.9 MB)
RX errors 0 dropped 7 overruns 0 frame 0
TX packets 41 bytes 2966 (2.9 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens114f1np1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.10.8.182 netmask 255.255.255.0 broadcast 10.10.8.255
inet6 fe80::ba3f:d2ff:fedf:82b1 prefixlen 64 scopeid 0x20<link>
ether b8:3f:d2:df:82:b1 txqueuelen 1000 (Ethernet)
RX packets 48335 bytes 2900100 (2.9 MB)
RX errors 0 dropped 7 overruns 0 frame 0
TX packets 41 bytes 2966 (2.9 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 9.2.131.182 netmask 255.255.254.0 broadcast 9.2.131.255
inet6 fe80::b696:91ff:fea9:3a50 prefixlen 64 scopeid 0x20<link>
ether b4:96:91:a9:3a:50 txqueuelen 1000 (Ethernet)
RX packets 2875646 bytes 2859114911 (2.8 GB)
RX errors 0 dropped 7 overruns 0 frame 0
TX packets 192736 bytes 20126466 (20.1 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 152748 bytes 23411480 (23.4 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 152748 bytes 23411480 (23.4 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
virbr0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 192.168.122.1 netmask 255.255.255.0 broadcast 192.168.122.255
ether 52:54:00:f9:80:2c txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Here is the ifconfig output from css-host-183:
$ ifconfig
cni0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet 10.42.0.1 netmask 255.255.255.0 broadcast 10.42.0.255
inet6 fe80::401a:b5ff:fefd:e206 prefixlen 64 scopeid 0x20<link>
ether 42:1a:b5:fd:e2:06 txqueuelen 1000 (Ethernet)
RX packets 1151701 bytes 240597787 (240.5 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1439582 bytes 169338127 (169.3 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 172.17.0.1 netmask 255.255.0.0 broadcast 172.17.255.255
ether 02:42:65:04:33:50 txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens108f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.10.1.183 netmask 255.255.255.0 broadcast 10.10.1.255
inet6 fe80::ba3f:d2ff:febe:fbda prefixlen 64 scopeid 0x20<link>
ether b8:3f:d2:be:fb:da txqueuelen 1000 (Ethernet)
RX packets 48305 bytes 2898300 (2.8 MB)
RX errors 0 dropped 7 overruns 0 frame 0
TX packets 41 bytes 2966 (2.9 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens108f1np1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.10.2.183 netmask 255.255.255.0 broadcast 10.10.2.255
inet6 fe80::ba3f:d2ff:febe:fbdb prefixlen 64 scopeid 0x20<link>
ether b8:3f:d2:be:fb:db txqueuelen 1000 (Ethernet)
RX packets 48305 bytes 2898300 (2.8 MB)
RX errors 0 dropped 7 overruns 0 frame 0
TX packets 42 bytes 3036 (3.0 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens110f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.10.3.183 netmask 255.255.255.0 broadcast 10.10.3.255
inet6 fe80::ba3f:d2ff:febe:fcea prefixlen 64 scopeid 0x20<link>
ether b8:3f:d2:be:fc:ea txqueuelen 1000 (Ethernet)
RX packets 48305 bytes 2898300 (2.8 MB)
RX errors 0 dropped 7 overruns 0 frame 0
TX packets 41 bytes 2966 (2.9 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens110f1np1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.10.4.183 netmask 255.255.255.0 broadcast 10.10.4.255
inet6 fe80::ba3f:d2ff:febe:fceb prefixlen 64 scopeid 0x20<link>
ether b8:3f:d2:be:fc:eb txqueuelen 1000 (Ethernet)
RX packets 48305 bytes 2898300 (2.8 MB)
RX errors 0 dropped 7 overruns 0 frame 0
TX packets 42 bytes 3036 (3.0 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens112f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.10.5.183 netmask 255.255.255.0 broadcast 10.10.5.255
inet6 fe80::ba3f:d2ff:fedf:8238 prefixlen 64 scopeid 0x20<link>
ether b8:3f:d2:df:82:38 txqueuelen 1000 (Ethernet)
RX packets 48305 bytes 2898300 (2.8 MB)
RX errors 0 dropped 7 overruns 0 frame 0
TX packets 41 bytes 2966 (2.9 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens112f1np1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.10.6.183 netmask 255.255.255.0 broadcast 10.10.6.255
inet6 fe80::ba3f:d2ff:fedf:8239 prefixlen 64 scopeid 0x20<link>
ether b8:3f:d2:df:82:39 txqueuelen 1000 (Ethernet)
RX packets 48305 bytes 2898300 (2.8 MB)
RX errors 0 dropped 7 overruns 0 frame 0
TX packets 42 bytes 3036 (3.0 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens114f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.10.7.183 netmask 255.255.255.0 broadcast 10.10.7.255
inet6 fe80::ba3f:d2ff:febe:f002 prefixlen 64 scopeid 0x20<link>
ether b8:3f:d2:be:f0:02 txqueuelen 1000 (Ethernet)
RX packets 48305 bytes 2898300 (2.8 MB)
RX errors 0 dropped 7 overruns 0 frame 0
TX packets 41 bytes 2966 (2.9 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens114f1np1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 9000
inet 10.10.8.183 netmask 255.255.255.0 broadcast 10.10.8.255
inet6 fe80::ba3f:d2ff:febe:f003 prefixlen 64 scopeid 0x20<link>
ether b8:3f:d2:be:f0:03 txqueuelen 1000 (Ethernet)
RX packets 48305 bytes 2898300 (2.8 MB)
RX errors 0 dropped 7 overruns 0 frame 0
TX packets 41 bytes 2966 (2.9 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 9.2.131.183 netmask 255.255.254.0 broadcast 9.2.131.255
inet6 fe80::b696:91ff:fea9:39d4 prefixlen 64 scopeid 0x20<link>
ether b4:96:91:a9:39:d4 txqueuelen 1000 (Ethernet)
RX packets 97310440 bytes 144824884803 (144.8 GB)
RX errors 0 dropped 7 overruns 0 frame 0
TX packets 8682844 bytes 951133437 (951.1 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet 10.42.0.0 netmask 255.255.255.255 broadcast 0.0.0.0
inet6 fe80::b8ce:29ff:fed2:8001 prefixlen 64 scopeid 0x20<link>
ether ba:ce:29:d2:80:01 txqueuelen 0 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 5 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 1355611 bytes 521520189 (521.5 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1355611 bytes 521520189 (521.5 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
veth0587b345: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet6 fe80::64b5:65ff:fea1:41c8 prefixlen 64 scopeid 0x20<link>
ether 66:b5:65:a1:41:c8 txqueuelen 0 (Ethernet)
RX packets 691346 bytes 214424786 (214.4 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 910357 bytes 123082563 (123.0 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
veth62ba3cb6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet6 fe80::3486:faff:fe24:917a prefixlen 64 scopeid 0x20<link>
ether 36:86:fa:24:91:7a txqueuelen 0 (Ethernet)
RX packets 8781 bytes 639475 (639.4 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 8288 bytes 603922 (603.9 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
vethc9428062: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet6 fe80::40f3:c3ff:feab:3786 prefixlen 64 scopeid 0x20<link>
ether 42:f3:c3:ab:37:86 txqueuelen 0 (Ethernet)
RX packets 350900 bytes 31410481 (31.4 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 399416 bytes 34662510 (34.6 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
vethc727a4d3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet6 fe80::2cf6:3aff:fe9c:53b9 prefixlen 64 scopeid 0x20<link>
ether 2e:f6:3a:9c:53:b9 txqueuelen 0 (Ethernet)
RX packets 100759 bytes 10263785 (10.2 MB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 122518 bytes 11070450 (11.0 MB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
vetheee50ffd: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1450
inet6 fe80::44e1:c2ff:fed3:b0d2 prefixlen 64 scopeid 0x20<link>
ether 46:e1:c2:d3:b0:d2 txqueuelen 0 (Ethernet)
RX packets 41 bytes 2922 (2.9 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 200 bytes 13872 (13.8 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
All interfaces work:
$ for i in $(seq 1 1 8); do ping -c 1 10.10.$i.182; done
PING 10.10.1.182 (10.10.1.182) 56(84) bytes of data.
64 bytes from 10.10.1.182: icmp_seq=1 ttl=64 time=0.259 ms
--- 10.10.1.182 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.259/0.259/0.259/0.000 ms
PING 10.10.2.182 (10.10.2.182) 56(84) bytes of data.
64 bytes from 10.10.2.182: icmp_seq=1 ttl=64 time=0.253 ms
--- 10.10.2.182 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.253/0.253/0.253/0.000 ms
PING 10.10.3.182 (10.10.3.182) 56(84) bytes of data.
64 bytes from 10.10.3.182: icmp_seq=1 ttl=64 time=0.207 ms
--- 10.10.3.182 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.207/0.207/0.207/0.000 ms
PING 10.10.4.182 (10.10.4.182) 56(84) bytes of data.
64 bytes from 10.10.4.182: icmp_seq=1 ttl=64 time=0.251 ms
--- 10.10.4.182 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.251/0.251/0.251/0.000 ms
PING 10.10.5.182 (10.10.5.182) 56(84) bytes of data.
64 bytes from 10.10.5.182: icmp_seq=1 ttl=64 time=0.236 ms
--- 10.10.5.182 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.236/0.236/0.236/0.000 ms
PING 10.10.6.182 (10.10.6.182) 56(84) bytes of data.
64 bytes from 10.10.6.182: icmp_seq=1 ttl=64 time=0.223 ms
--- 10.10.6.182 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.223/0.223/0.223/0.000 ms
PING 10.10.7.182 (10.10.7.182) 56(84) bytes of data.
64 bytes from 10.10.7.182: icmp_seq=1 ttl=64 time=0.214 ms
--- 10.10.7.182 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.214/0.214/0.214/0.000 ms
PING 10.10.8.182 (10.10.8.182) 56(84) bytes of data.
64 bytes from 10.10.8.182: icmp_seq=1 ttl=64 time=0.205 ms
--- 10.10.8.182 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.205/0.205/0.205/0.000 ms
It is odd that the nvidia-smi topo -m output is different from what I get from the NCCL log. I think I might have found the reason your machine hits this issue.
I am attaching a patch here. Can you apply it to 2.19.3 and see if it works?
Sorry, I updated the patch a bit. Please check this one: 0001-Fix-duplicate-NVLS-head-rank.patch
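(For anyone following along, applying such a patch and rebuilding NCCL is typically along these lines; the tag name and the use of LD_LIBRARY_PATH are assumptions, not steps prescribed in this thread:)
$ # Illustrative: apply the patch to a 2.19.3 checkout and rebuild
$ git clone https://github.com/NVIDIA/nccl.git && cd nccl
$ git checkout v2.19.3-1            # assumed release tag for 2.19.3
$ git apply ../0001-Fix-duplicate-NVLS-head-rank.patch
$ make -j src.build
$ export LD_LIBRARY_PATH=$PWD/build/lib:$LD_LIBRARY_PATH   # point the test at the rebuilt library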
OK. Never mind; I found a more fundamental issue (so the patch I gave you might not work). I would like to discuss it with my teammates and see whether we can solve it quickly. Please bear with me for now.
Thanks for the update. I tested the patch a bit. It appears to fix some NVLS/NVLSTree issues on 2.19.3, but the performance does not look right, and it does not help NVLS/NVLSTree on 2.18.6 even though the patched region looks the same. I will wait for the new patch; feel free to let me know if you need anything else.
@minghungchen Can you apply the patch I provide here on top of master and check whether it solves the issue? 0001-Support-NVLS-dual-port-NIC-transmission.patch
@KaimingOuyang Thanks for the new patch. I ran some quick tests, and it appears the patch resolves the issue I observed.
Will this patch be backported to older NCCL releases?
Thanks. Do you want to stick with the older NCCL version? Usually we do not backport patches.
I think it should be OK, because eventually we will migrate to newer NCCL versions. Some colleagues are using PyTorch with affected NCCL versions, but that should be fine as long as they do not run those jobs on H100.
I hope the NGC PyTorch release will include the patched NCCL soon. Thanks again for your help.
HW Environment:
SW Environment:
NCCL Parameters and test command:
Description:
When running the two-node nccl-tests all_reduce_perf with 6 GPUs and 3 NICs (6 ports) used on each node, the run printed
Tuner: plugin load '(null)' returned error (11 : (null))
and all_reduce_perf got stuck; in another run it printed
free(): invalid next size (fast)
and the job aborted. The issue is reproducible on HGX-H100 systems from two different vendors.
Debug logs, with the NCCL parameters and command info at the beginning:
nccl-allreduce-6gpu-3nic-6lnk-2.18.5.log nccl-allreduce-6gpu-3nic-6lnk-2.19.3.log