NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

nccl-tests get stuck and free(): invalid next size (fast) error with 2.19.3 and 2.18.5, but no error with 2.16.5 #1043

Open minghungchen opened 1 year ago

minghungchen commented 1 year ago

HW Environment:

2x HGX-H100 systems
4x Mellanox CX6-DX Dual-port NICs on each HGX-H100
Flat network topology with one Mellanox SN2700

SW Environment:

Ubuntu 22.04
Driver Version: 535.104.05
CUDA Version: 12.2
NCCL: 2.16.5(v2.16.5-1), 2.18.5(4365458), 2.19.3(0e35f5d)

NCCL Parameters and test command:

CUDA_VISIBLE_DEVICES=1,2,4,5,6,7 
NCCL_IGNORE_CPU_AFFINITY=1
NCCL_IB_DISABLE=0
NCCL_DEBUG=INFO
NCCL_DEBUG_SUBSYS=INIT,GRAPH,ENV,TUNING
NCCL_SOCKET_IFNAME=ens110f0np0,ens110f1np1,ens112f0np0,ens112f1np1,ens114f0np0,ens114f1np1
NCCL_IB_HCA=mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7

mpirun -np 12 -host 10.2.131.182:6,10.2.131.183:6 -x CUDA_VISIBLE_DEVICES=1,2,4,5,6,7 -x NCCL_IGNORE_CPU_AFFINITY=1 -x NCCL_IB_DISABLE=0 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,GRAPH,ENV,TUNING -x NCCL_SOCKET_IFNAME=ens110f0np0,ens110f1np1,ens112f0np0,ens112f1np1,ens114f0np0,ens114f1np1 -x NCCL_IB_HCA=mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7 --mca btl tcp,self --mca btl_tcp_if_include eth0 all_reduce_perf -c 0 -b 2M -e 4G -f 2 -g 1

Description:

When running the two-node nccl-tests all_reduce_perf with 6 GPUs and 3 dual-port NICs (6 ports) per node, the test gets stuck or aborts with a free(): invalid next size (fast) error on NCCL 2.19.3 and 2.18.5, but runs without errors on 2.16.5.

The issue is reproducible on HGX-H100 systems from two different vendors.

Debug logs with NCCL parameters and cmd info at beginning:

nccl-allreduce-6gpu-3nic-6lnk-2.18.5.log nccl-allreduce-6gpu-3nic-6lnk-2.19.3.log

yanminjia commented 1 year ago

I saw a similar issue with the 2.19.3 code when testing AllReduce with dual-port CX-7 NICs and NCCL_ALGO=NVLSTree. It looks like NCCL doesn't support dual-port NICs well. Please refer to issue #1305.

KaimingOuyang commented 1 year ago

NCCL should run fine on dual-port NIC platforms. The issue is that NCCL currently cannot fully utilize all ports for the best performance (we are fixing it). The hang looks weird. Can you provide the backtrace of all threads in rank 0 and the output of nvidia-smi topo -m? @minghungchen

minghungchen commented 1 year ago

@KaimingOuyang You can download the full NCCL logs, including the backtrace etc., from the issue. Let me know if you are looking for something else. Here is the output of nvidia-smi topo -m

$ nvidia-smi topo -m
    GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  NV18    NV18    NV18    NV18    NV18    NV18    NV18    PIX NODE    NODE    NODE    SYS SYS SYS SYS 0-47,96-143 0       N/A
GPU1    NV18     X  NV18    NV18    NV18    NV18    NV18    NV18    NODE    PIX NODE    NODE    SYS SYS SYS SYS 0-47,96-143 0       N/A
GPU2    NV18    NV18     X  NV18    NV18    NV18    NV18    NV18    NODE    NODE    PIX NODE    SYS SYS SYS SYS 0-47,96-143 0       N/A
GPU3    NV18    NV18    NV18     X  NV18    NV18    NV18    NV18    NODE    NODE    NODE    PIX SYS SYS SYS SYS 0-47,96-143 0       N/A
GPU4    NV18    NV18    NV18    NV18     X  NV18    NV18    NV18    SYS SYS SYS SYS PIX NODE    NODE    NODE    48-95,144-191   1       N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X  NV18    NV18    SYS SYS SYS SYS NODE    PIX NODE    NODE    48-95,144-191   1       N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X  NV18    SYS SYS SYS SYS NODE    NODE    PIX NODE    48-95,144-191   1       N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X  SYS SYS SYS SYS NODE    NODE    NODE    PIX 48-95,144-191   1       N/A
NIC0    PIX NODE    NODE    NODE    SYS SYS SYS SYS  X  NODE    NODE    NODE    SYS SYS SYS SYS
NIC1    NODE    PIX NODE    NODE    SYS SYS SYS SYS NODE     X  NODE    NODE    SYS SYS SYS SYS
NIC2    NODE    NODE    PIX NODE    SYS SYS SYS SYS NODE    NODE     X  NODE    SYS SYS SYS SYS
NIC3    NODE    NODE    NODE    PIX SYS SYS SYS SYS NODE    NODE    NODE     X  SYS SYS SYS SYS
NIC4    SYS SYS SYS SYS PIX NODE    NODE    NODE    SYS SYS SYS SYS  X  NODE    NODE    NODE
NIC5    SYS SYS SYS SYS NODE    PIX NODE    NODE    SYS SYS SYS SYS NODE     X  NODE    NODE
NIC6    SYS SYS SYS SYS NODE    NODE    PIX NODE    SYS SYS SYS SYS NODE    NODE     X  NODE
NIC7    SYS SYS SYS SYS NODE    NODE    NODE    PIX SYS SYS SYS SYS NODE    NODE    NODE     X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7

KaimingOuyang commented 1 year ago

I don't see the backtrace in the 2.19.3 log. For both 2.18.5 and 2.19.3, could you please provide the backtrace from gdb?

Any reason you must use NIC 2,3,4,5,6,7? Can you try NICs 1,2,4,5,6,7?

minghungchen commented 1 year ago

I don't have the backtrace from gdb, but if you can suggest what steps I should take to get it, I can give it a try.

This validation case requires using both ports on the dual-port NICs, so we do not use NIC 1 (or 2) without NIC 0 (or 3) in this test case.

KaimingOuyang commented 1 year ago

OK. Actually, I want to make sure whether you are using 2.18.5 or 2.18.6; I see the log shows 2.18.6. For 2.18.6, could you please provide me the libnccl.so binary? For 2.19.3, when the process hangs, you can get the process PID from top and attach to the process with gdb -p [pid], then print the backtrace with thread apply all bt.
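
For example, something like this (where <pid> is a placeholder for the PID of the hung rank-0 process):

$ top                      # note the PID of the hung all_reduce_perf rank
$ sudo gdb -p <pid>
(gdb) thread apply all bt
(gdb) detach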

This validation case requires using both ports on the dual-port NICs, so we do not use NIC 1 (or 2) without NIC 0 (or 3) in this test case.

If so, can you use GPU 2,3,4,5,6,7?

minghungchen commented 1 year ago

I tried the same command with GPUs 2,3,4,5,6,7 by setting CUDA_VISIBLE_DEVICES=2,3,4,5,6,7. It did not help; all_reduce_perf still got stuck with NCCL 2.19.3.

The libnccl.so.2.18.6 binary is around 290 MB, and GitHub does not allow uploading such a big file. https://github.com/NVIDIA/nccl/issues/1043#issue-1966021425 has the git hash for each version I used. The 2.18.6 I built is from this branch: https://github.com/NVIDIA/nccl/tree/4365458757e4107ecbf629b2fd6e0e19a5d237c2. Let me know if you need the binary file, and I will find somewhere else to upload it.
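
For reference, building libnccl.so from that commit looks roughly like the standard NCCL build (a sketch, not necessarily the exact commands used):

$ git clone https://github.com/NVIDIA/nccl.git && cd nccl
$ git checkout 4365458757e4107ecbf629b2fd6e0e19a5d237c2
$ make -j src.build            # produces build/lib/libnccl.so
$ export LD_LIBRARY_PATH=$PWD/build/lib:$LD_LIBRARY_PATH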

After all_reduce_perf with NCCL 2.19.3 had been stuck for some time, it appears to have left the system in a strange state. Now the same command fails at a different point; here is the new debug log: nccl-allreduce-6gpu-3nic-6lnk-2.19.3-after-long-stuck.log

I will post the gdb backtrace when available.

minghungchen commented 1 year ago

Here is the gdb backtrace from the PID of rank 0:

$ sudo gdb -p 6042
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 6042
[New LWP 6058]
[New LWP 6060]
[New LWP 6085]
[New LWP 6096]
[New LWP 6108]
[New LWP 6109]
[New LWP 6111]
[New LWP 6115]
[New LWP 6120]
[New LWP 6122]
[New LWP 6351]
[New LWP 6357]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f59364f8c9b in sched_yield () at ../sysdeps/unix/syscall-template.S:120
120 ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) thread apply all bt

Thread 13 (Thread 0x7f54ccbfe000 (LWP 6357) "all_reduce_perf"):
#0  0x00007f59364f8c9b in sched_yield () at ../sysdeps/unix/syscall-template.S:120
#1  0x00007f593698919d in ncclProxyProgress (proxyState_=<optimized out>) at proxy.cc:889
#2  0x00007f5936484ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#3  0x00007f5936516a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 12 (Thread 0x7f592d161000 (LWP 6351) "all_reduce_perf"):
#0  0x00007f5936508dbf in __GI___poll (fds=fds@entry=0x7f592d158ac0, nfds=nfds@entry=65, timeout=500) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f593698c1fc in poll (__timeout=<optimized out>, __nfds=65, __fds=0x7f592d158ac0) at /usr/include/x86_64-linux-gnu/bits/poll2.h:39
#2  ncclProxyService (_args=0x7f54d06c7230) at proxy.cc:1475
#3  0x00007f5936484ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#4  0x00007f5936516a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 11 (Thread 0x7f54ce7fc000 (LWP 6122) "all_reduce_perf"):
#0  __GI___libc_read (nbytes=16, buf=0x7f54ce7f5880, fd=78) at ../sysdeps/unix/sysv/linux/read.c:26
#1  __GI___libc_read (fd=78, buf=buf@entry=0x7f54ce7f5880, nbytes=nbytes@entry=16) at ../sysdeps/unix/sysv/linux/read.c:24
#2  0x00007f593402f4c4 in read (__nbytes=16, __buf=0x7f54ce7f5880, __fd=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/unistd.h:38
#3  __ibv_get_async_event_1_1 (context=0x7f54d0672720, event=0x7f54ce7f58e0) at ./libibverbs/device.c:459
#4  0x00007f59369b7676 in wrap_ibv_get_async_event (context=context@entry=0x7f54d0672720, event=event@entry=0x7f54ce7f58e0) at misc/ibvwrap.cc:121
#5  0x00007f59369cc18b in ncclIbAsyncThreadMain (args=0x7f54d0672720) at transport/net_ib.cc:91
#6  0x00007f5936484ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#7  0x00007f5936516a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 10 (Thread 0x7f54ceffd000 (LWP 6120) "all_reduce_perf"):
#0  __GI___libc_read (nbytes=16, buf=0x7f54ceff6880, fd=76) at ../sysdeps/unix/sysv/linux/read.c:26
#1  __GI___libc_read (fd=76, buf=buf@entry=0x7f54ceff6880, nbytes=nbytes@entry=16) at ../sysdeps/unix/sysv/linux/read.c:24
#2  0x00007f593402f4c4 in read (__nbytes=16, __buf=0x7f54ceff6880, __fd=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/unistd.h:38
#3  __ibv_get_async_event_1_1 (context=0x7f54d0630ca0, event=0x7f54ceff68e0) at ./libibverbs/device.c:459
#4  0x00007f59369b7676 in wrap_ibv_get_async_event (context=context@entry=0x7f54d0630ca0, event=event@entry=0x7f54ceff68e0) at misc/ibvwrap.cc:121
#5  0x00007f59369cc18b in ncclIbAsyncThreadMain (args=0x7f54d0630ca0) at transport/net_ib.cc:91
#6  0x00007f5936484ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#7  0x00007f5936516a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 9 (Thread 0x7f54cf7fe000 (LWP 6115) "all_reduce_perf"):
#0  __GI___libc_read (nbytes=16, buf=0x7f54cf7f7880, fd=74) at ../sysdeps/unix/sysv/linux/read.c:26
#1  __GI___libc_read (fd=74, buf=buf@entry=0x7f54cf7f7880, nbytes=nbytes@entry=16) at ../sysdeps/unix/sysv/linux/read.c:24
#2  0x00007f593402f4c4 in read (__nbytes=16, __buf=0x7f54cf7f7880, __fd=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/unistd.h:38
#3  __ibv_get_async_event_1_1 (context=0x7f54d05ef610, event=0x7f54cf7f78e0) at ./libibverbs/device.c:459
#4  0x00007f59369b7676 in wrap_ibv_get_async_event (context=context@entry=0x7f54d05ef610, event=event@entry=0x7f54cf7f78e0) at misc/ibvwrap.cc:121
#5  0x00007f59369cc18b in ncclIbAsyncThreadMain (args=0x7f54d05ef610) at transport/net_ib.cc:91
#6  0x00007f5936484ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#7  0x00007f5936516a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 8 (Thread 0x7f54cffff000 (LWP 6111) "all_reduce_perf"):
#0  __GI___libc_read (nbytes=16, buf=0x7f54cfff8880, fd=72) at ../sysdeps/unix/sysv/linux/read.c:26
#1  __GI___libc_read (fd=72, buf=buf@entry=0x7f54cfff8880, nbytes=nbytes@entry=16) at ../sysdeps/unix/sysv/linux/read.c:24
--Type <RET> for more, q to quit, c to continue without paging--c
#2  0x00007f593402f4c4 in read (__nbytes=16, __buf=0x7f54cfff8880, __fd=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/unistd.h:38
#3  __ibv_get_async_event_1_1 (context=0x7f54d05adf80, event=0x7f54cfff88e0) at ./libibverbs/device.c:459
#4  0x00007f59369b7676 in wrap_ibv_get_async_event (context=context@entry=0x7f54d05adf80, event=event@entry=0x7f54cfff88e0) at misc/ibvwrap.cc:121
#5  0x00007f59369cc18b in ncclIbAsyncThreadMain (args=0x7f54d05adf80) at transport/net_ib.cc:91
#6  0x00007f5936484ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#7  0x00007f5936516a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 7 (Thread 0x7f5706b5c000 (LWP 6109) "all_reduce_perf"):
#0  __GI___libc_read (nbytes=16, buf=0x7f5706b55880, fd=70) at ../sysdeps/unix/sysv/linux/read.c:26
#1  __GI___libc_read (fd=70, buf=buf@entry=0x7f5706b55880, nbytes=nbytes@entry=16) at ../sysdeps/unix/sysv/linux/read.c:24
#2  0x00007f593402f4c4 in read (__nbytes=16, __buf=0x7f5706b55880, __fd=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/unistd.h:38
#3  __ibv_get_async_event_1_1 (context=0x7f54d056c8f0, event=0x7f5706b558e0) at ./libibverbs/device.c:459
#4  0x00007f59369b7676 in wrap_ibv_get_async_event (context=context@entry=0x7f54d056c8f0, event=event@entry=0x7f5706b558e0) at misc/ibvwrap.cc:121
#5  0x00007f59369cc18b in ncclIbAsyncThreadMain (args=0x7f54d056c8f0) at transport/net_ib.cc:91
#6  0x00007f5936484ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#7  0x00007f5936516a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 6 (Thread 0x7f570735d000 (LWP 6108) "all_reduce_perf"):
#0  __GI___libc_read (nbytes=16, buf=0x7f5707356880, fd=68) at ../sysdeps/unix/sysv/linux/read.c:26
#1  __GI___libc_read (fd=68, buf=buf@entry=0x7f5707356880, nbytes=nbytes@entry=16) at ../sysdeps/unix/sysv/linux/read.c:24
#2  0x00007f593402f4c4 in read (__nbytes=16, __buf=0x7f5707356880, __fd=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/unistd.h:38
#3  __ibv_get_async_event_1_1 (context=0x7f54d052b260, event=0x7f57073568e0) at ./libibverbs/device.c:459
#4  0x00007f59369b7676 in wrap_ibv_get_async_event (context=context@entry=0x7f54d052b260, event=event@entry=0x7f57073568e0) at misc/ibvwrap.cc:121
#5  0x00007f59369cc18b in ncclIbAsyncThreadMain (args=0x7f54d052b260) at transport/net_ib.cc:91
#6  0x00007f5936484ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#7  0x00007f5936516a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 5 (Thread 0x7f592c960000 (LWP 6096) "cuda-EvtHandlr"):
#0  0x00007f5936508dbf in __GI___poll (fds=0x7f570c000c20, nfds=11, timeout=100) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f592e68fd09 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007f592e74bebb in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007f592e6891a8 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007f5936484ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#5  0x00007f5936516a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 4 (Thread 0x7f592d962000 (LWP 6085) "cuda-EvtHandlr"):
#0  0x00007f5936508dbf in __GI___poll (fds=0x5605fc6136e0, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f592e68fd09 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007f592e74bebb in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007f592e6891a8 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#4  0x00007f5936484ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#5  0x00007f5936516a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 3 (Thread 0x7f5935831000 (LWP 6060) "all_reduce_perf"):
#0  0x00007f593651601e in epoll_wait (epfd=10, events=events@entry=0x5605fc336a50, maxevents=32, timeout=timeout@entry=119852) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f593629f950 in epoll_dispatch (base=0x5605fc3367a0, tv=<optimized out>) at epoll.c:407
#2  0x00007f59362a29c5 in opal_libevent2022_event_base_loop (base=0x5605fc3367a0, flags=1) at event.c:1630
#3  0x00007f59359958c6 in progress_engine () from /net/storage149/mnt/md0/mhchen/openmpi/lib/openmpi/mca_pmix_pmix3x.so
#4  0x00007f5936484ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#5  0x00007f5936516a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 2 (Thread 0x7f593621d000 (LWP 6058) "all_reduce_perf"):
#0  0x00007f5936508dbf in __GI___poll (fds=fds@entry=0x7f5930000b70, nfds=nfds@entry=1, timeout=timeout@entry=3599997) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f59365266e2 in __poll_chk (fds=fds@entry=0x7f5930000b70, nfds=nfds@entry=1, timeout=timeout@entry=3599997, fdslen=fdslen@entry=18446744073709551615) at ./debug/poll_chk.c:27
#2  0x00007f59362aa8e9 in poll (__timeout=<optimized out>, __nfds=1, __fds=0x7f5930000b70) at /usr/include/x86_64-linux-gnu/bits/poll2.h:39
#3  poll_dispatch (base=0x5605fc3121a0, tv=<optimized out>) at poll.c:165
#4  0x00007f59362a29c5 in opal_libevent2022_event_base_loop (base=0x5605fc3121a0, flags=1) at event.c:1630
#5  0x00007f593625e636 in progress_engine () from /net/storage149/mnt/md0/mhchen/openmpi/lib/libopen-pal.so.40
#6  0x00007f5936484ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#7  0x00007f5936516a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 1 (Thread 0x7f59434f5000 (LWP 6042) "all_reduce_perf"):
#0  0x00007f59364f8c9b in sched_yield () at ../sysdeps/unix/syscall-template.S:120
#1  0x00005605fa2fa3ad in testStreamSynchronize (ngpus=<optimized out>, streams=0x7ffdc27dc330, comms=0x5605fd15d7a0) at /home/mhchen/nccl-tests/src/common.cu:323
#2  0x00005605fa2feff5 in completeColl (args=0x7ffdc27dc180) at /home/mhchen/nccl-tests/src/common.cu:401
#3  completeColl (args=0x7ffdc27dc180) at /home/mhchen/nccl-tests/src/common.cu:398
#4  TimeTest (args=args@entry=0x7ffdc27dc180, type=ncclFloat32, typeName=0x5605fa33632e "float", op=ncclSum, opName=0x5605fa336311 "sum", root=root@entry=-1) at /home/mhchen/nccl-tests/src/common.cu:588
#5  0x00005605fa2f8ba4 in AllReduceRunTest (args=0x7ffdc27dc180, root=<optimized out>, type=<optimized out>, typeName=<optimized out>, op=<optimized out>, opName=<optimized out>) at /home/mhchen/nccl-tests/src/all_reduce.cu:90
#6  0x00005605fa2f9360 in threadRunTests (args=0x7ffdc27dc180) at /home/mhchen/nccl-tests/src/common.cu:615
#7  0x00005605fa2fd048 in run () at /home/mhchen/nccl-tests/src/common.cu:1019
#8  0x00005605fa2f60d4 in main (argc=<optimized out>, argv=<optimized out>) at /home/mhchen/nccl-tests/src/common.cu:844
(gdb)

KaimingOuyang commented 1 year ago

For the css-host-182 and 183 nodes, can you check whether your fabric manager is operating properly? It seems the NVSwitch is not in a good state.

minghungchen commented 1 year ago

I saw the following fabric manager log on one of the nodes. I am not sure how to reproduce it, though. The issue was gone after a reboot, and the gdb backtraces above were collected after the reboot.

...
[Oct 31 2023 20:45:27] [INFO] [tid 526845] Successfully configured all the available NVSwitches to route GPU NVLink traffic. NVLink Peer-to-Peer support will be enabled once the GPUs are successfu
lly registered with the NVLink fabric.
[Oct 31 2023 20:59:37] [INFO] [tid 527113] Received an inband message:  Message header details: magic Id:adbc request Id:37f37b2d0f6aa49b status:0 type:4 length:14
Message payload details:Team release request: Team Handle:ed8c03d28a369634 Flags:0

[Oct 31 2023 20:59:37] [ERROR] [tid 527118] failed to release multicast team with handle 17117060486525261364, cannot find the team
[Oct 31 2023 21:04:55] [INFO] [tid 527113] Received an inband message:  Message header details: magic Id:adbc request Id:ba9c9cdd87cbe5f status:0 type:2 length:46
Message payload details:Team setup request: Allocation Size:60000000 Flags:0 Number of GPUs:6 GPU Handles:b12d2e84230d90a9 699cdb27be093689 8fe31ad077adb013 41692f9f9d6aa4c6 c52da87fccd138d6 5ee30
9238856a86b

[Oct 31 2023 21:04:55] [ERROR] [tid 527118] failed to find the GPU handle 12766911663723876521 in the multicast team request setup 840424691418840671.
[Oct 31 2023 21:04:55] [ERROR] [tid 527118]   Handle: 0  Request ID: 840424691418840671  Request Memory: 1610612736  Group ID: 0  GPUs: 4713350847607252166 6837318707494430827 7610198434087777929
10368160249800405011 12766911663723876521 14208197666274359510

[Oct 31 2023 21:04:55] [INFO] [tid 526859] Sending inband response message:  Message header details: magic Id:adbc request Id:ba9c9cdd87cbe5f status:57 type:3 length:24
Message payload details:Team setup response: Team Handle:0 Flags:0 Address Base:0 Address Size:0

[Oct 31 2023 21:06:01] [INFO] [tid 527113] Received an inband message:  Message header details: magic Id:adbc request Id:3f5e2ee47c72b3e5 status:0 type:2 length:46
Message payload details:Team setup request: Allocation Size:60000000 Flags:0 Number of GPUs:6 GPU Handles:b12d2e84230d90a9 c52da87fccd138d6 699cdb27be093689 8fe31ad077adb013 5ee309238856a86b 41692
f9f9d6aa4c6

[Oct 31 2023 21:06:01] [ERROR] [tid 527118] failed to find the GPU handle 12766911663723876521 in the multicast team request setup 4566138631075574757.
[Oct 31 2023 21:06:01] [ERROR] [tid 527118]   Handle: 0  Request ID: 4566138631075574757  Request Memory: 1610612736  Group ID: 0  GPUs: 4713350847607252166 6837318707494430827 7610198434087777929
 10368160249800405011 12766911663723876521 14208197666274359510

[Oct 31 2023 21:06:01] [INFO] [tid 526859] Sending inband response message:  Message header details: magic Id:adbc request Id:3f5e2ee47c72b3e5 status:57 type:3 length:24
Message payload details:Team setup response: Team Handle:0 Flags:0 Address Base:0 Address Size:0

[Oct 31 2023 21:07:47] [INFO] [tid 527113] Received an inband message:  Message header details: magic Id:adbc request Id:b83f17aff8e8c562 status:0 type:2 length:46
Message payload details:Team setup request: Allocation Size:60000000 Flags:0 Number of GPUs:6 GPU Handles:bdaa8119d3ffbf7b b12d2e84230d90a9 699cdb27be093689 41692f9f9d6aa4c6 5ee309238856a86b c52da
87fccd138d6

[Oct 31 2023 21:07:47] [ERROR] [tid 527118] failed to find the GPU handle 13666877967140110203 in the multicast team request setup 13276356271074231650.
[Oct 31 2023 21:07:47] [ERROR] [tid 527118]   Handle: 0  Request ID: 13276356271074231650  Request Memory: 1610612736  Group ID: 0  GPUs: 4713350847607252166 6837318707494430827 761019843408777792
9 12766911663723876521 13666877967140110203 14208197666274359510

[Oct 31 2023 21:07:47] [INFO] [tid 526859] Sending inband response message:  Message header details: magic Id:adbc request Id:b83f17aff8e8c562 status:57 type:3 length:24
Message payload details:Team setup response: Team Handle:0 Flags:0 Address Base:0 Address Size:0

[Oct 31 2023 21:10:07] [INFO] [tid 527113] Received an inband message:  Message header details: magic Id:adbc request Id:79743a616996a3b9 status:0 type:2 length:46
Message payload details:Team setup request: Allocation Size:60000000 Flags:0 Number of GPUs:6 GPU Handles:8fe31ad077adb013 b12d2e84230d90a9 41692f9f9d6aa4c6 5ee309238856a86b 699cdb27be093689 c52da
87fccd138d6

[Oct 31 2023 21:10:07] [ERROR] [tid 527118] failed to find the GPU handle 10368160249800405011 in the multicast team request setup 8751684165945435065.
[Oct 31 2023 21:10:07] [ERROR] [tid 527118]   Handle: 0  Request ID: 8751684165945435065  Request Memory: 1610612736  Group ID: 0  GPUs: 4713350847607252166 6837318707494430827 7610198434087777929
 10368160249800405011 12766911663723876521 14208197666274359510

 [Oct 31 2023 21:07:47] [INFO] [tid 526859] Sending inband response message:  Message header details: magic Id:adbc request Id:b83f17aff8e8c562 status:57 type:3 length:24
Message payload details:Team setup response: Team Handle:0 Flags:0 Address Base:0 Address Size:0

[Oct 31 2023 21:10:07] [INFO] [tid 527113] Received an inband message:  Message header details: magic Id:adbc request Id:79743a616996a3b9 status:0 type:2 length:46
Message payload details:Team setup request: Allocation Size:60000000 Flags:0 Number of GPUs:6 GPU Handles:8fe31ad077adb013 b12d2e84230d90a9 41692f9f9d6aa4c6 5ee309238856a86b 699cdb27be093689 c52da
87fccd138d6

[Oct 31 2023 21:10:07] [ERROR] [tid 527118] failed to find the GPU handle 10368160249800405011 in the multicast team request setup 8751684165945435065.
[Oct 31 2023 21:10:07] [ERROR] [tid 527118]   Handle: 0  Request ID: 8751684165945435065  Request Memory: 1610612736  Group ID: 0  GPUs: 4713350847607252166 6837318707494430827 7610198434087777929
 10368160249800405011 12766911663723876521 14208197666274359510

[Oct 31 2023 21:10:07] [INFO] [tid 526859] Sending inband response message:  Message header details: magic Id:adbc request Id:79743a616996a3b9 status:57 type:3 length:24
Message payload details:Team setup response: Team Handle:0 Flags:0 Address Base:0 Address Size:0

[Oct 31 2023 21:11:20] [INFO] [tid 527113] Received an inband message:  Message header details: magic Id:adbc request Id:3b794938a9edb4d3 status:0 type:2 length:46
Message payload details:Team setup request: Allocation Size:60000000 Flags:0 Number of GPUs:6 GPU Handles:b12d2e84230d90a9 8fe31ad077adb013 41692f9f9d6aa4c6 699cdb27be093689 5ee309238856a86b c52da
87fccd138d6

[Oct 31 2023 21:11:20] [ERROR] [tid 527118] failed to find the GPU handle 12766911663723876521 in the multicast team request setup 4285537028137661651.
[Oct 31 2023 21:11:20] [ERROR] [tid 527118]   Handle: 0  Request ID: 4285537028137661651  Request Memory: 1610612736  Group ID: 0  GPUs: 4713350847607252166 6837318707494430827 7610198434087777929
 10368160249800405011 12766911663723876521 14208197666274359510

[Oct 31 2023 21:11:20] [INFO] [tid 526859] Sending inband response message:  Message header details: magic Id:adbc request Id:3b794938a9edb4d3 status:57 type:3 length:24
Message payload details:Team setup response: Team Handle:0 Flags:0 Address Base:0 Address Size:0

[Oct 31 2023 21:12:11] [INFO] [tid 527113] Received an inband message:  Message header details: magic Id:adbc request Id:d8db96bba13216be status:0 type:2 length:46
Message payload details:Team setup request: Allocation Size:60000000 Flags:0 Number of GPUs:6 GPU Handles:8fe31ad077adb013 b12d2e84230d90a9 c52da87fccd138d6 699cdb27be093689 41692f9f9d6aa4c6 5ee30
9238856a86b

[Oct 31 2023 21:12:11] [ERROR] [tid 527118] failed to find the GPU handle 10368160249800405011 in the multicast team request setup 15626249064699532990.
[Oct 31 2023 21:12:11] [ERROR] [tid 527118]   Handle: 0  Request ID: 15626249064699532990  Request Memory: 1610612736  Group ID: 0  GPUs: 4713350847607252166 6837318707494430827 761019843408777792
9 10368160249800405011 12766911663723876521 14208197666274359510

[Oct 31 2023 21:12:11] [INFO] [tid 526859] Sending inband response message:  Message header details: magic Id:adbc request Id:d8db96bba13216be status:57 type:3 length:24
Message payload details:Team setup response: Team Handle:0 Flags:0 Address Base:0 Address Size:0

KaimingOuyang commented 1 year ago

Can you provide me the output of nvidia-smi -q | grep -A 4 Fabric?

Let's rule out the possible reasons one by one. Could you please run the following tests:

  1. Single node with 8 gpus
  2. Two nodes with 16 gpus with NCCL_NVLS_ENABLE=0
  3. Two nodes with 16 gpus with NCCL_NVLS_ENABLE=1
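
For example, tests 2 and 3 could be launched with roughly the same mpirun command as before, only changing the NVLS setting (a sketch; adjust the hosts, NIC variables, and binary path to your setup):

mpirun -np 16 -host 10.2.131.182:8,10.2.131.183:8 -x NCCL_NVLS_ENABLE=0 -x NCCL_DEBUG=INFO --mca btl tcp,self --mca btl_tcp_if_include eth0 all_reduce_perf -b 2M -e 4G -f 2 -g 1

Test 3 is the same command with -x NCCL_NVLS_ENABLE=1, and test 1 drops the second host and uses -np 8 on a single node.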

minghungchen commented 1 year ago

Sure. The nvidia-smi -q | grep -A 4 Fabric output is the same on both nodes.

$ nvidia-smi -q | grep -A 4 Fabric
    Fabric
        State                             : Completed
        Status                            : Success
    Processes                             : None

--
    Fabric
        State                             : Completed
        Status                            : Success
    Processes                             : None

--
    Fabric
        State                             : Completed
        Status                            : Success
    Processes                             : None

--
    Fabric
        State                             : Completed
        Status                            : Success
    Processes                             : None

--
    Fabric
        State                             : Completed
        Status                            : Success
    Processes                             : None

--
    Fabric
        State                             : Completed
        Status                            : Success
    Processes                             : None

--
    Fabric
        State                             : Completed
        Status                            : Success
    Processes                             : None

--
    Fabric
        State                             : Completed
        Status                            : Success
    Processes                             : None

For the test cases you mentioned, 1 and 2 run fine, but 3 failed. I ran 2 and 3 with all GPUs and mlx5 interfaces. The symptoms of 3 are the same as what I reported in https://github.com/NVIDIA/nccl/issues/1043#issue-1966021425 with NCCL 2.19.3: it just got stuck.

KaimingOuyang commented 1 year ago

I think this is a fabric issue. Could you please run the following commands on both nodes:

nvidia-smi -pm 0
nvidia-smi --gpu-reset
systemctl restart nvidia-fabricmanager
then wait until all GPUs report a Fabric State of "Completed" with Status "Success".

Then test the NCCL again.
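
A simple way to wait for the fabric status is to poll nvidia-smi, for example (a rough sketch assuming 8 GPUs per node):

$ until [ "$(nvidia-smi -q | grep -A 2 Fabric | grep -c Success)" -eq 8 ]; do sleep 5; done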

minghungchen commented 1 year ago

Tried, but resetting the GPUs did not help.

minghungchen commented 1 year ago

@KaimingOuyang We also tried power cycling the two nodes, but that did not help either. Please let me know if you need any other information.

I feel this could be related to the current NVLS/NVLSTree implementation in NCCL 2.19.3.

KaimingOuyang commented 1 year ago

I am not sure that's the reason since I can run 2.19.3 without the problem on DGX H100.

Could you please provide the output of ifconfig from both nodes? BTW, are the ens110f0np0, ens110f1np1, ens112f0np0, ens112f1np1, ens114f0np0, and ens114f1np1 interfaces all working?

Can you also try to set NCCL_SOCKET_IFNAME=eth0?

minghungchen commented 1 year ago

Sure. Please see below for the info you requested. I tried setting NCCL_SOCKET_IFNAME=eth0, but it did not help; all_reduce_perf got stuck at the same point.

Here is the ifconfig from 182

$ ifconfig
docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:d8:2f:c1:de  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens108f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 10.10.1.182  netmask 255.255.255.0  broadcast 10.10.1.255
        inet6 fe80::ba3f:d2ff:febe:f0da  prefixlen 64  scopeid 0x20<link>
        ether b8:3f:d2:be:f0:da  txqueuelen 1000  (Ethernet)
        RX packets 48335  bytes 2900100 (2.9 MB)
        RX errors 0  dropped 7  overruns 0  frame 0
        TX packets 41  bytes 2966 (2.9 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens108f1np1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 10.10.2.182  netmask 255.255.255.0  broadcast 10.10.2.255
        inet6 fe80::ba3f:d2ff:febe:f0db  prefixlen 64  scopeid 0x20<link>
        ether b8:3f:d2:be:f0:db  txqueuelen 1000  (Ethernet)
        RX packets 48335  bytes 2900100 (2.9 MB)
        RX errors 0  dropped 7  overruns 0  frame 0
        TX packets 41  bytes 2966 (2.9 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens110f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 10.10.3.182  netmask 255.255.255.0  broadcast 10.10.3.255
        inet6 fe80::ba3f:d2ff:febe:fca2  prefixlen 64  scopeid 0x20<link>
        ether b8:3f:d2:be:fc:a2  txqueuelen 1000  (Ethernet)
        RX packets 48335  bytes 2900100 (2.9 MB)
        RX errors 0  dropped 7  overruns 0  frame 0
        TX packets 41  bytes 2966 (2.9 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens110f1np1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 10.10.4.182  netmask 255.255.255.0  broadcast 10.10.4.255
        inet6 fe80::ba3f:d2ff:febe:fca3  prefixlen 64  scopeid 0x20<link>
        ether b8:3f:d2:be:fc:a3  txqueuelen 1000  (Ethernet)
        RX packets 48335  bytes 2900100 (2.9 MB)
        RX errors 0  dropped 7  overruns 0  frame 0
        TX packets 42  bytes 3036 (3.0 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens112f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 10.10.5.182  netmask 255.255.255.0  broadcast 10.10.5.255
        inet6 fe80::ba3f:d2ff:febe:f99e  prefixlen 64  scopeid 0x20<link>
        ether b8:3f:d2:be:f9:9e  txqueuelen 1000  (Ethernet)
        RX packets 48335  bytes 2900100 (2.9 MB)
        RX errors 0  dropped 7  overruns 0  frame 0
        TX packets 41  bytes 2966 (2.9 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens112f1np1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 10.10.6.182  netmask 255.255.255.0  broadcast 10.10.6.255
        inet6 fe80::ba3f:d2ff:febe:f99f  prefixlen 64  scopeid 0x20<link>
        ether b8:3f:d2:be:f9:9f  txqueuelen 1000  (Ethernet)
        RX packets 48335  bytes 2900100 (2.9 MB)
        RX errors 0  dropped 7  overruns 0  frame 0
        TX packets 41  bytes 2966 (2.9 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens114f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 10.10.7.182  netmask 255.255.255.0  broadcast 10.10.7.255
        inet6 fe80::ba3f:d2ff:fedf:82b0  prefixlen 64  scopeid 0x20<link>
        ether b8:3f:d2:df:82:b0  txqueuelen 1000  (Ethernet)
        RX packets 48335  bytes 2900100 (2.9 MB)
        RX errors 0  dropped 7  overruns 0  frame 0
        TX packets 41  bytes 2966 (2.9 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens114f1np1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 10.10.8.182  netmask 255.255.255.0  broadcast 10.10.8.255
        inet6 fe80::ba3f:d2ff:fedf:82b1  prefixlen 64  scopeid 0x20<link>
        ether b8:3f:d2:df:82:b1  txqueuelen 1000  (Ethernet)
        RX packets 48335  bytes 2900100 (2.9 MB)
        RX errors 0  dropped 7  overruns 0  frame 0
        TX packets 41  bytes 2966 (2.9 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 9.2.131.182  netmask 255.255.254.0  broadcast 9.2.131.255
        inet6 fe80::b696:91ff:fea9:3a50  prefixlen 64  scopeid 0x20<link>
        ether b4:96:91:a9:3a:50  txqueuelen 1000  (Ethernet)
        RX packets 2875646  bytes 2859114911 (2.8 GB)
        RX errors 0  dropped 7  overruns 0  frame 0
        TX packets 192736  bytes 20126466 (20.1 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 152748  bytes 23411480 (23.4 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 152748  bytes 23411480 (23.4 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

virbr0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 192.168.122.1  netmask 255.255.255.0  broadcast 192.168.122.255
        ether 52:54:00:f9:80:2c  txqueuelen 1000  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ifconfig from css-host-183

$ ifconfig
cni0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 10.42.0.1  netmask 255.255.255.0  broadcast 10.42.0.255
        inet6 fe80::401a:b5ff:fefd:e206  prefixlen 64  scopeid 0x20<link>
        ether 42:1a:b5:fd:e2:06  txqueuelen 1000  (Ethernet)
        RX packets 1151701  bytes 240597787 (240.5 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1439582  bytes 169338127 (169.3 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1500
        inet 172.17.0.1  netmask 255.255.0.0  broadcast 172.17.255.255
        ether 02:42:65:04:33:50  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens108f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 10.10.1.183  netmask 255.255.255.0  broadcast 10.10.1.255
        inet6 fe80::ba3f:d2ff:febe:fbda  prefixlen 64  scopeid 0x20<link>
        ether b8:3f:d2:be:fb:da  txqueuelen 1000  (Ethernet)
        RX packets 48305  bytes 2898300 (2.8 MB)
        RX errors 0  dropped 7  overruns 0  frame 0
        TX packets 41  bytes 2966 (2.9 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens108f1np1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 10.10.2.183  netmask 255.255.255.0  broadcast 10.10.2.255
        inet6 fe80::ba3f:d2ff:febe:fbdb  prefixlen 64  scopeid 0x20<link>
        ether b8:3f:d2:be:fb:db  txqueuelen 1000  (Ethernet)
        RX packets 48305  bytes 2898300 (2.8 MB)
        RX errors 0  dropped 7  overruns 0  frame 0
        TX packets 42  bytes 3036 (3.0 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens110f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 10.10.3.183  netmask 255.255.255.0  broadcast 10.10.3.255
        inet6 fe80::ba3f:d2ff:febe:fcea  prefixlen 64  scopeid 0x20<link>
        ether b8:3f:d2:be:fc:ea  txqueuelen 1000  (Ethernet)
        RX packets 48305  bytes 2898300 (2.8 MB)
        RX errors 0  dropped 7  overruns 0  frame 0
        TX packets 41  bytes 2966 (2.9 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens110f1np1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 10.10.4.183  netmask 255.255.255.0  broadcast 10.10.4.255
        inet6 fe80::ba3f:d2ff:febe:fceb  prefixlen 64  scopeid 0x20<link>
        ether b8:3f:d2:be:fc:eb  txqueuelen 1000  (Ethernet)
        RX packets 48305  bytes 2898300 (2.8 MB)
        RX errors 0  dropped 7  overruns 0  frame 0
        TX packets 42  bytes 3036 (3.0 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens112f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 10.10.5.183  netmask 255.255.255.0  broadcast 10.10.5.255
        inet6 fe80::ba3f:d2ff:fedf:8238  prefixlen 64  scopeid 0x20<link>
        ether b8:3f:d2:df:82:38  txqueuelen 1000  (Ethernet)
        RX packets 48305  bytes 2898300 (2.8 MB)
        RX errors 0  dropped 7  overruns 0  frame 0
        TX packets 41  bytes 2966 (2.9 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens112f1np1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 10.10.6.183  netmask 255.255.255.0  broadcast 10.10.6.255
        inet6 fe80::ba3f:d2ff:fedf:8239  prefixlen 64  scopeid 0x20<link>
        ether b8:3f:d2:df:82:39  txqueuelen 1000  (Ethernet)
        RX packets 48305  bytes 2898300 (2.8 MB)
        RX errors 0  dropped 7  overruns 0  frame 0
        TX packets 42  bytes 3036 (3.0 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens114f0np0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 10.10.7.183  netmask 255.255.255.0  broadcast 10.10.7.255
        inet6 fe80::ba3f:d2ff:febe:f002  prefixlen 64  scopeid 0x20<link>
        ether b8:3f:d2:be:f0:02  txqueuelen 1000  (Ethernet)
        RX packets 48305  bytes 2898300 (2.8 MB)
        RX errors 0  dropped 7  overruns 0  frame 0
        TX packets 41  bytes 2966 (2.9 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ens114f1np1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 10.10.8.183  netmask 255.255.255.0  broadcast 10.10.8.255
        inet6 fe80::ba3f:d2ff:febe:f003  prefixlen 64  scopeid 0x20<link>
        ether b8:3f:d2:be:f0:03  txqueuelen 1000  (Ethernet)
        RX packets 48305  bytes 2898300 (2.8 MB)
        RX errors 0  dropped 7  overruns 0  frame 0
        TX packets 41  bytes 2966 (2.9 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 9.2.131.183  netmask 255.255.254.0  broadcast 9.2.131.255
        inet6 fe80::b696:91ff:fea9:39d4  prefixlen 64  scopeid 0x20<link>
        ether b4:96:91:a9:39:d4  txqueuelen 1000  (Ethernet)
        RX packets 97310440  bytes 144824884803 (144.8 GB)
        RX errors 0  dropped 7  overruns 0  frame 0
        TX packets 8682844  bytes 951133437 (951.1 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

flannel.1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 10.42.0.0  netmask 255.255.255.255  broadcast 0.0.0.0
        inet6 fe80::b8ce:29ff:fed2:8001  prefixlen 64  scopeid 0x20<link>
        ether ba:ce:29:d2:80:01  txqueuelen 0  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 5 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 1355611  bytes 521520189 (521.5 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1355611  bytes 521520189 (521.5 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth0587b345: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet6 fe80::64b5:65ff:fea1:41c8  prefixlen 64  scopeid 0x20<link>
        ether 66:b5:65:a1:41:c8  txqueuelen 0  (Ethernet)
        RX packets 691346  bytes 214424786 (214.4 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 910357  bytes 123082563 (123.0 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

veth62ba3cb6: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet6 fe80::3486:faff:fe24:917a  prefixlen 64  scopeid 0x20<link>
        ether 36:86:fa:24:91:7a  txqueuelen 0  (Ethernet)
        RX packets 8781  bytes 639475 (639.4 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 8288  bytes 603922 (603.9 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

vethc9428062: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet6 fe80::40f3:c3ff:feab:3786  prefixlen 64  scopeid 0x20<link>
        ether 42:f3:c3:ab:37:86  txqueuelen 0  (Ethernet)
        RX packets 350900  bytes 31410481 (31.4 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 399416  bytes 34662510 (34.6 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

vethc727a4d3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet6 fe80::2cf6:3aff:fe9c:53b9  prefixlen 64  scopeid 0x20<link>
        ether 2e:f6:3a:9c:53:b9  txqueuelen 0  (Ethernet)
        RX packets 100759  bytes 10263785 (10.2 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 122518  bytes 11070450 (11.0 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

vetheee50ffd: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet6 fe80::44e1:c2ff:fed3:b0d2  prefixlen 64  scopeid 0x20<link>
        ether 46:e1:c2:d3:b0:d2  txqueuelen 0  (Ethernet)
        RX packets 41  bytes 2922 (2.9 KB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 200  bytes 13872 (13.8 KB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

All the interfaces work:

$ for i in $(seq 1 1 8); do ping -c 1 10.10.$i.182; done
PING 10.10.1.182 (10.10.1.182) 56(84) bytes of data.
64 bytes from 10.10.1.182: icmp_seq=1 ttl=64 time=0.259 ms

--- 10.10.1.182 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.259/0.259/0.259/0.000 ms
PING 10.10.2.182 (10.10.2.182) 56(84) bytes of data.
64 bytes from 10.10.2.182: icmp_seq=1 ttl=64 time=0.253 ms

--- 10.10.2.182 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.253/0.253/0.253/0.000 ms
PING 10.10.3.182 (10.10.3.182) 56(84) bytes of data.
64 bytes from 10.10.3.182: icmp_seq=1 ttl=64 time=0.207 ms

--- 10.10.3.182 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.207/0.207/0.207/0.000 ms
PING 10.10.4.182 (10.10.4.182) 56(84) bytes of data.
64 bytes from 10.10.4.182: icmp_seq=1 ttl=64 time=0.251 ms

--- 10.10.4.182 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.251/0.251/0.251/0.000 ms
PING 10.10.5.182 (10.10.5.182) 56(84) bytes of data.
64 bytes from 10.10.5.182: icmp_seq=1 ttl=64 time=0.236 ms

--- 10.10.5.182 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.236/0.236/0.236/0.000 ms
PING 10.10.6.182 (10.10.6.182) 56(84) bytes of data.
64 bytes from 10.10.6.182: icmp_seq=1 ttl=64 time=0.223 ms

--- 10.10.6.182 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.223/0.223/0.223/0.000 ms
PING 10.10.7.182 (10.10.7.182) 56(84) bytes of data.
64 bytes from 10.10.7.182: icmp_seq=1 ttl=64 time=0.214 ms

--- 10.10.7.182 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.214/0.214/0.214/0.000 ms
PING 10.10.8.182 (10.10.8.182) 56(84) bytes of data.
64 bytes from 10.10.8.182: icmp_seq=1 ttl=64 time=0.205 ms

--- 10.10.8.182 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.205/0.205/0.205/0.000 ms

KaimingOuyang commented 1 year ago

It is weird that the nvidia-smi topo -m output is different from what I get from the NCCL log. I think I might have found the reason why your machine hits the issue.

I am attaching a patch here. Can you apply it to 2.19.3 and see if it works?

KaimingOuyang commented 1 year ago

Sorry, I updated the patch a bit. Please check this one: 0001-Fix-duplicate-NVLS-head-rank.patch

KaimingOuyang commented 1 year ago

OK, never mind; I found a more fundamental issue (so the patch I gave you might not work). I would like to discuss it with my teammates and see whether we can solve it quickly. Please bear with me for now.

minghungchen commented 1 year ago

Thanks for the update. I just tested the patch a bit. It appears to fix some NVLS/NVLSTree issues on 2.19.3, but the performance does not look right, and it does not help NVLS/NVLSTree on 2.18.6 even though the patched region looks the same. I will wait for the new patch; feel free to let me know if you need anything else.

KaimingOuyang commented 11 months ago

@minghungchen Can you apply the patch I am providing here on top of master and check whether it solves the issue? 0001-Support-NVLS-dual-port-NIC-transmission.patch
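
Applying it would look roughly like this (a sketch; assuming a clean checkout of master, rebuilt the same way as before):

$ cd nccl && git checkout master
$ git apply 0001-Support-NVLS-dual-port-NIC-transmission.patch
$ make -j src.build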

minghungchen commented 11 months ago

@KaimingOuyang Thanks for the new patch. I did some quick tests, and it appears the patch can resolve the issue observed.

Will this patch be backported to older NCCL releases?

KaimingOuyang commented 11 months ago

Thanks. Do you need to stick with an older NCCL version? Usually, we don't backport patches.

minghungchen commented 11 months ago

I think it should be OK because we will eventually migrate to newer NCCL versions. Some colleagues are using PyTorch with affected NCCL versions, but it should be fine as long as they do not run their jobs on H100.

I hope the NGC PyTorch release will include the patched NCCL soon. Thanks again for your help.