NVIDIA / nccl-tests

NCCL Tests

Test NCCL failure common.cu:958 'internal error - please report this issue to the NCCL developers / ' #166

kylematoba closed this issue 9 months ago

kylematoba commented 11 months ago

Running the following command between two nodes with 8x A100 each gives the output below (not the full log; let me know if more is needed):

mpirun --allow-run-as-root --mca btl_tcp_if_include bond0 -x LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_COLLNET_ENABLE=1 -x NCCL_ALGO=Tree,NVLS -x NCCL_IB_HCA=mlx5 -x NCCL_IB_GID_INDEX=3 -H gpu004,gpu009 nccl-tests/build/all_reduce_perf -b 1G -e 16G -f 2 -g 1 -t 8

gpu004:2200689:2200706 [3] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer gpu004.ds-a4-r02.abc.com<55693>
gpu004:2200689:2200706 [3] NCCL INFO misc/socket.cc:749 -> 6

gpu004:2200689:2200706 [3] proxy.cc:1143 NCCL WARN Socket recv failed while polling for opId=0x7f4774cc9620
gpu004:2200689:2200706 [3] NCCL INFO transport/net.cc:288 -> 3
gpu004:2200689:2200706 [3] NCCL INFO transport.cc:148 -> 3
gpu004:2200689:2200706 [3] NCCL INFO init.cc:1079 -> 3
gpu004:2200689:2200706 [3] NCCL INFO init.cc:1358 -> 3
gpu004:2200689:2200706 [3] NCCL INFO group.cc:65 -> 3 [Async thread]
gpu004:2200689:2200704 [1] NCCL INFO bootstrap.cc:546 -> 3
gpu004:2200689:2200704 [1] NCCL INFO bootstrap.cc:434 -> 3
gpu004:2200689:2200704 [1] NCCL INFO init.cc:1220 -> 3
gpu004:2200689:2200704 [1] NCCL INFO init.cc:1358 -> 3
gpu004:2200689:2200704 [1] NCCL INFO group.cc:65 -> 3 [Async thread]
gpu004:2200689:2200705 [2] NCCL INFO misc/socket.cc:46 -> 3
gpu004:2200689:2200705 [2] NCCL INFO misc/socket.cc:749 -> 3

gpu004:2200689:2200705 [2] proxy.cc:1143 NCCL WARN Socket recv failed while polling for opId=0x7f4778cdaae8
gpu004:2200689:2200705 [2] NCCL INFO transport/net.cc:362 -> 3
gpu004:2200689:2200705 [2] NCCL INFO transport.cc:168 -> 3
gpu004:2200689:2200705 [2] NCCL INFO init.cc:1079 -> 3
gpu004:2200689:2200705 [2] NCCL INFO init.cc:1358 -> 3
gpu004:2200689:2200705 [2] NCCL INFO group.cc:65 -> 3 [Async thread]

gpu004:2200689:2200715 [2] proxy.cc:1485 NCCL WARN [Service thread] Error encountered progressing operation=Connect, res=3, closing connection

gpu004:2200689:2200715 [2] proxy.cc:1519 NCCL WARN [Proxy Service 10] Failed to execute operation Connect from rank 10, retcode 3
gpu004:2200689:2200710 [7] NCCL INFO misc/socket.cc:46 -> 3
gpu004:2200689:2200710 [7] NCCL INFO misc/socket.cc:57 -> 3
gpu004:2200689:2200710 [7] NCCL INFO misc/socket.cc:772 -> 3
gpu004:2200689:2200710 [7] NCCL INFO proxy.cc:1107 -> 3
gpu004:2200689:2200710 [7] NCCL INFO proxy.cc:1193 -> 3
gpu004:2200689:2200710 [7] NCCL INFO proxy.cc:1047 -> 3
gpu004:2200689:2200710 [7] NCCL INFO transport/p2p.cc:437 -> 3
gpu004:2200689:2200710 [7] NCCL INFO transport.cc:33 -> 3
gpu004:2200689:2200710 [7] NCCL INFO transport.cc:97 -> 3
gpu004:2200689:2200710 [7] NCCL INFO init.cc:1089 -> 3
gpu004:2200689:2200710 [7] NCCL INFO init.cc:1358 -> 3
gpu004:2200689:2200710 [7] NCCL INFO group.cc:65 -> 3 [Async thread]
gpu004:2200689:2200721 [7] NCCL INFO misc/socket.cc:46 -> 3
gpu004:2200689:2200721 [7] NCCL INFO misc/socket.cc:57 -> 3
gpu004:2200689:2200721 [7] NCCL INFO misc/socket.cc:786 -> 3
gpu004:2200689:2200721 [7] NCCL INFO proxy.cc:1360 -> 3

gpu004:2200689:2200721 [7] proxy.cc:1519 NCCL WARN [Proxy Service 15] Failed to execute operation Init from rank 15, retcode 3
gpu004:2200689:2200707 [4] NCCL INFO misc/socket.cc:46 -> 3
gpu004:2200689:2200707 [4] NCCL INFO misc/socket.cc:749 -> 3

gpu004:2200689:2200707 [4] proxy.cc:1143 NCCL WARN Socket recv failed while polling for opId=0x559cf7e85540
gpu004:2200689:2200707 [4] NCCL INFO proxy.cc:1047 -> 3
gpu004:2200689:2200707 [4] NCCL INFO transport/p2p.cc:437 -> 3
gpu004:2200689:2200707 [4] NCCL INFO transport.cc:33 -> 3
gpu004:2200689:2200707 [4] NCCL INFO transport.cc:97 -> 3
gpu004:2200689:2200707 [4] NCCL INFO init.cc:1089 -> 3
gpu004:2200689:2200707 [4] NCCL INFO init.cc:1358 -> 3
gpu004:2200689:2200707 [4] NCCL INFO group.cc:65 -> 3 [Async thread]
gpu004:2200689:2200718 [4] NCCL INFO misc/socket.cc:46 -> 3
gpu004:2200689:2200718 [4] NCCL INFO misc/socket.cc:57 -> 3
gpu004:2200689:2200718 [4] NCCL INFO misc/socket.cc:786 -> 3
gpu004:2200689:2200718 [4] NCCL INFO proxy.cc:1360 -> 3

gpu004:2200689:2200718 [4] proxy.cc:1519 NCCL WARN [Proxy Service 12] Failed to execute operation Init from rank 12, retcode 3
gpu004:2200689:2200703 [0] NCCL INFO misc/socket.cc:46 -> 3
gpu004:2200689:2200703 [0] NCCL INFO misc/socket.cc:749 -> 3

gpu004:2200689:2200703 [0] proxy.cc:1143 NCCL WARN Socket recv failed while polling for opId=0x7f477c6ac760
gpu004:2200689:2200703 [0] NCCL INFO proxy.cc:1047 -> 3
gpu004:2200689:2200703 [0] NCCL INFO transport/p2p.cc:437 -> 3
gpu004:2200689:2200703 [0] NCCL INFO transport.cc:33 -> 3
gpu004:2200689:2200703 [0] NCCL INFO transport.cc:97 -> 3
gpu004:2200689:2200703 [0] NCCL INFO init.cc:1089 -> 3
gpu004:2200689:2200703 [0] NCCL INFO init.cc:1358 -> 3
gpu004:2200689:2200703 [0] NCCL INFO group.cc:65 -> 3 [Async thread]
gpu004:2200689:2200708 [5] NCCL INFO Connected all rings
gpu004:2200689:2200708 [5] NCCL INFO misc/socket.cc:46 -> 3
gpu004:2200689:2200708 [5] NCCL INFO misc/socket.cc:57 -> 3
gpu004:2200689:2200708 [5] NCCL INFO misc/socket.cc:772 -> 3
gpu004:2200689:2200708 [5] NCCL INFO proxy.cc:1107 -> 3
gpu004:2200689:2200708 [5] NCCL INFO proxy.cc:1193 -> 3
gpu004:2200689:2200708 [5] NCCL INFO proxy.cc:1047 -> 3
gpu004:2200689:2200708 [5] NCCL INFO transport/p2p.cc:437 -> 3
gpu004:2200689:2200708 [5] NCCL INFO transport.cc:33 -> 3
gpu004:2200689:2200708 [5] NCCL INFO transport.cc:97 -> 3
gpu004:2200689:2200708 [5] NCCL INFO init.cc:1089 -> 3
gpu004:2200689:2200708 [5] NCCL INFO init.cc:1358 -> 3
gpu004:2200689:2200708 [5] NCCL INFO group.cc:65 -> 3 [Async thread]
gpu004:2200689:2200719 [5] NCCL INFO misc/socket.cc:805 -> 3

gpu004:2200689:2200719 [5] proxy.cc:1495 NCCL WARN [Service thread] Could not receive type from localRank 5, res=3, closed=0

gpu004:2200689:2200719 [5] proxy.cc:1519 NCCL WARN [Proxy Service 13] Failed to execute operation Init from rank 13, retcode 3
gpu004:2200689:2200709 [6] NCCL INFO Connected all rings
gpu004:2200689:2200720 [6] NCCL INFO misc/socket.cc:805 -> 3

gpu004:2200689:2200720 [6] proxy.cc:1495 NCCL WARN [Service thread] Could not receive type from localRank 6, res=3, closed=0

gpu004:2200689:2200720 [6] proxy.cc:1519 NCCL WARN [Proxy Service 14] Failed to execute operation Init from rank 14, retcode 3
gpu004:2200689:2200709 [6] NCCL INFO misc/socket.cc:46 -> 3
gpu004:2200689:2200709 [6] NCCL INFO misc/socket.cc:57 -> 3
gpu004:2200689:2200709 [6] NCCL INFO misc/socket.cc:772 -> 3
gpu004:2200689:2200709 [6] NCCL INFO proxy.cc:1107 -> 3
gpu004:2200689:2200709 [6] NCCL INFO proxy.cc:1193 -> 3
gpu004:2200689:2200709 [6] NCCL INFO proxy.cc:1047 -> 3
gpu004:2200689:2200709 [6] NCCL INFO transport/p2p.cc:437 -> 3
gpu004:2200689:2200709 [6] NCCL INFO transport.cc:33 -> 3
gpu004:2200689:2200709 [6] NCCL INFO transport.cc:97 -> 3
gpu004:2200689:2200709 [6] NCCL INFO init.cc:1089 -> 3
gpu004:2200689:2200709 [6] NCCL INFO init.cc:1358 -> 3
gpu004:2200689:2200709 [6] NCCL INFO group.cc:65 -> 3 [Async thread]
gpu004:2200689:2200689 [7] NCCL INFO group.cc:406 -> 3
gpu004:2200689:2200689 [7] NCCL INFO group.cc:96 -> 3
gpu004: Test NCCL failure common.cu:958 'internal error - please report this issue to the NCCL developers / '
 .. gpu004 pid 2200689: Test failure common.cu:842

gpu004:2200689:2200715 [0] include/alloc.h:250 NCCL WARN Cuda failure 'driver shutting down'

gpu004:2200689:2200716 [0] include/alloc.h:250 NCCL WARN Cuda failure 'driver shutting down'

gpu004:2200689:2200716 [0] include/alloc.h:243 NCCL WARN Cuda failure 'driver shutting down'
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[8059,1],0]
  Exit code:    3
--------------------------------------------------------------------------
sjeaugey commented 11 months ago

I don't see much in the log. There was an issue in some versions where aborting NCCL would propagate up as an internal error, so this could be a red herring: the job may be hanging, and the abort is what causes those errors.

Is this happening right away, or could it be that it gets stuck for some time and then, after a timeout, tries to abort and you get those errors?
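For reading the trailing `-> N` values in the log above, a small decoder can help. This is a minimal sketch, assuming the `ncclResult_t` numbering declared in `nccl.h` for recent NCCL releases (3 is the internal error, 6 the remote error); the function name `decode_trace` and the regular expression are illustrative and not part of NCCL or nccl-tests.

```python
import re

# Assumed mapping, taken from the ncclResult_t enum in nccl.h (recent NCCL 2.x).
NCCL_RESULTS = {
    0: "ncclSuccess",
    1: "ncclUnhandledCudaError",
    2: "ncclSystemError",
    3: "ncclInternalError",
    4: "ncclInvalidArgument",
    5: "ncclInvalidUsage",
    6: "ncclRemoteError",
    7: "ncclInProgress",
}

# Matches trace lines such as "NCCL INFO misc/socket.cc:749 -> 6".
TRACE_RE = re.compile(r"NCCL INFO (?P<site>\S+:\d+) -> (?P<code>\d+)")


def decode_trace(line):
    """Return (call site, result name) for a '-> N' trace line, else None."""
    match = TRACE_RE.search(line)
    if match is None:
        return None
    code = int(match.group("code"))
    return match.group("site"), NCCL_RESULTS.get(code, f"unknown({code})")
```

Feeding it the first trace line of the log separates the one remote-error entry from the internal errors that follow it.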

SweeneyJun commented 11 months ago

I'm encountering this issue in nccl-tests too, and it has been bothering me for over a month. It seems to occur randomly: it persists even after restarting the machine, reinstalling CUDA/OpenMPI with the same or different versions, and updating GPU drivers, yet sometimes I can run nccl-tests without the error if I simply wait a couple of days without making any changes. The error appears almost immediately, within 20 seconds of starting the tests.

Environment:

If you require more detailed logs or information, please let me know, and I'll be happy to provide them.

Thank you for your assistance in resolving this issue.

The command I use is:

mpirun --allow-run-as-root -np 2 --hostfile ./hostMPI -x SHELL=/bin/bash -x LD_LIBRARY_PATH=/usr/local/openmpi/lib:/usr/local/cuda/lib64:/usr/lib -x PATH=/usr/local/openmpi/bin:/root/anaconda3/bin:/root/anaconda3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/cuda/bin:/snap/bin -x MASTER_ADDR=10.0.0.1 /root/nccl-tests/build/alltoall_perf -b 8 -e 128M -f 2 -g 8

and the hostfile contains text like:

10.0.0.1 max-slots=1
10.0.0.2 max-slots=1

kylematoba commented 11 months ago

It also happens pretty soon for me after starting, without a noticeable period of being stuck.

sjeaugey commented 11 months ago

OK, thanks. The errors in the log above are all supposed to be side effects. Usually there is another NCCL WARN indicating the source error that caused a process to exit and close its connections to the others, which in turn makes the others report Connection closed by ... and then Socket recv failed while polling for opId ....
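One way to look for that source error is to collect every NCCL WARN line from all ranks and set aside the two side-effect messages named above. The following is a minimal sketch under that assumption; `split_warnings` and `SIDE_EFFECT_MARKERS` are hypothetical names, and the marker list covers only those two messages, so the remaining lines are candidates to read, not confirmed root causes.

```python
# Messages described above as consequences of a peer exiting.
SIDE_EFFECT_MARKERS = (
    "Connection closed by",
    "Socket recv failed while polling for opId",
)


def split_warnings(log_lines):
    """Split NCCL WARN lines into (candidates, side_effects)."""
    candidates, side_effects = [], []
    for line in log_lines:
        if "NCCL WARN" not in line:
            continue  # INFO traces and mpirun output are ignored here
        if any(marker in line for marker in SIDE_EFFECT_MARKERS):
            side_effects.append(line)
        else:
            candidates.append(line)
    return candidates, side_effects
```

Running it over the combined output of every node matters, since the first failing process may be on a different host than the one whose log was pasted.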

It is important to make sure NCCL_DEBUG=INFO, or at least NCCL_DEBUG=WARN, is properly propagated to all ranks, e.g. by using mpirun -x NCCL_DEBUG=WARN instead of just setting it in the environment (in which case it would only affect the local node).
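To make the propagation explicit, the launch line can be generated so that every variable is passed as `-x NAME=value`. This is a minimal sketch; `mpirun_command` is a hypothetical helper, and it relies only on the `-H` and `-x` options already used in this thread.

```python
import shlex


def mpirun_command(hosts, env, program_args):
    """Build an mpirun argv that exports each variable to all ranks."""
    cmd = ["mpirun", "-H", ",".join(hosts)]
    for name, value in sorted(env.items()):
        # "-x NAME=value" sets the variable on every rank,
        # independent of what is set in the local shell.
        cmd += ["-x", f"{name}={value}"]
    cmd += list(program_args)
    return cmd


cmd = mpirun_command(
    hosts=["gpu004", "gpu009"],
    env={"NCCL_DEBUG": "WARN"},
    program_args=["nccl-tests/build/all_reduce_perf", "-b", "1G", "-e", "16G",
                  "-f", "2", "-g", "1", "-t", "8"],
)
print(shlex.join(cmd))
```

Passing the value explicitly avoids depending on the variable being exported in the shell that launches mpirun.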

SweeneyJun commented 11 months ago

Issue Description: I have successfully used the -x parameter to make environment variables effective on both remote nodes and the local node (in my case, the remote node is 10.0.0.2, and the local node is 10.0.0.1). Based on your previous advice, "This issue may be caused by hanging or aborting, and the error message might be a red herring", I have conducted further tests over the past few days and made some observations:

  1. In most cases, this issue occurs immediately after execution. Even before nvidia-smi and ps aux | grep all_reduce_perf show any running processes, I can see the error message (which matches my previous observation that it happens within 20 seconds). Notably, I have been using some unconventional NCCL environment variables to test performance, and I suspect that specific combinations of these variables (such as NCCL_SOCKET_NTHREADS=2, NCCL_COMM_BLOCKING=0) might be causing this error within 20 seconds (refer to the first log detail below).
    Details

mpirun --allow-run-as-root -np 2 --hostfile ./hostMPI -x SHELL -x LD_LIBRARY_PATH -x PATH -x LD_PRELOAD -x NCCL_ALGO -x NCCL_BUFFSIZE -x NCCL_CHECK_POINTERS -x NCCL_COMM_BLOCKING -x NCCL_CROSS_NIC -x NCCL_DEBUG -x NCCL_DMABUF_ENABLE -x NCCL_GDR_READ -x NCCL_GRAPH_MIXING_SUPPORT -x NCCL_GRAPH_REGISTER -x NCCL_IGNORE_CPU_AFFINITY -x NCCL_LAUNCH_MODE -x NCCL_MAX_NCHANNELS -x NCCL_MIN_NCHANNELS -x NCCL_NET_GDR_LEVEL -x NCCL_NET_SHARED_BUFFERS -x NCCL_NET_SHARED_COMMS -x NCCL_NSOCKS_PERTHREAD /root/cxwang/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  39313 on     ubuntu device  0 [0x00] Tesla V100-SXM2-32GB
#  Rank  1 Group  0 Pid  39313 on     ubuntu device  1 [0x00] Tesla V100-SXM2-32GB
#  Rank  2 Group  0 Pid  39313 on     ubuntu device  2 [0x00] Tesla V100-SXM2-32GB
#  Rank  3 Group  0 Pid  39313 on     ubuntu device  3 [0x00] Tesla V100-SXM2-32GB
#  Rank  4 Group  0 Pid  39313 on     ubuntu device  4 [0x00] Tesla V100-SXM2-32GB
#  Rank  5 Group  0 Pid  39313 on     ubuntu device  5 [0x00] Tesla V100-SXM2-32GB
#  Rank  6 Group  0 Pid  39313 on     ubuntu device  6 [0x00] Tesla V100-SXM2-32GB
#  Rank  7 Group  0 Pid  39313 on     ubuntu device  7 [0x00] Tesla V100-SXM2-32GB
#  Rank  8 Group  0 Pid   8484 on     ubuntu device  0 [0x00] Tesla V100-SXM2-32GB
#  Rank  9 Group  0 Pid   8484 on     ubuntu device  1 [0x00] Tesla V100-SXM2-32GB
#  Rank 10 Group  0 Pid   8484 on     ubuntu device  2 [0x00] Tesla V100-SXM2-32GB
#  Rank 11 Group  0 Pid   8484 on     ubuntu device  3 [0x00] Tesla V100-SXM2-32GB
#  Rank 12 Group  0 Pid   8484 on     ubuntu device  4 [0x00] Tesla V100-SXM2-32GB
#  Rank 13 Group  0 Pid   8484 on     ubuntu device  5 [0x00] Tesla V100-SXM2-32GB
#  Rank 14 Group  0 Pid   8484 on     ubuntu device  6 [0x00] Tesla V100-SXM2-32GB
#  Rank 15 Group  0 Pid   8484 on     ubuntu device  7 [0x00] Tesla V100-SXM2-32GB
ubuntu:39313:39313 [0] NCCL INFO Bootstrap : Using ens4:10.0.0.1<0>
ubuntu:39313:39313 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ubuntu:39313:39313 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation

ubuntu:8484:8484 [0] misc/cudawrap.cc:179 NCCL WARN Failed to find CUDA library libcuda.so (NCCL_CUDA_PATH='') : 
ubuntu:8484:8484 [0] NCCL INFO Bootstrap : Using ens4:10.0.0.2<0>
ubuntu:8484:8484 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ubuntu:8484:8484 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
ubuntu:8484:8484 [0] NCCL INFO NCCL_COMM_BLOCKING set by environment to 0.
ubuntu:39313:39313 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.18.3+cuda12.2
ubuntu:39313:39313 [0] NCCL INFO NCCL_COMM_BLOCKING set by environment to 0.
ubuntu: Test NCCL failure common.cu:958 'NCCL operation in progress / '
 .. ubuntu pid 8484: Test failure common.cu:842
ubuntu: Test NCCL failure common.cu:958 'NCCL operation in progress / '
 .. ubuntu pid 39313: Test failure common.cu:842

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[39714,1],1]
  Exit code:    3
--------------------------------------------------------------------------

  2. In rare cases, all_reduce_perf starts and runs for a while before triggering this error. The details of this scenario are in the second and third log detail below. I still suspect that it might be due to incorrect configurations of my NCCL environment variables, but the frequency of these errors is quite high, with many combinations causing them. This leads me to wonder if there might be a bug in the nccl-tests code itself.
Details

```
mpirun --allow-run-as-root -np 2 --hostfile ./hostMPI -x SHELL -x LD_LIBRARY_PATH -x PATH -x LD_PRELOAD -x NCCL_ALGO -x NCCL_BUFFSIZE -x NCCL_CHECK_POINTERS -x NCCL_COMM_BLOCKING -x NCCL_CROSS_NIC -x NCCL_DEBUG -x NCCL_DMABUF_ENABLE -x NCCL_GDR_READ -x NCCL_GRAPH_MIXING_SUPPORT -x NCCL_GRAPH_REGISTER -x NCCL_IGNORE_CPU_AFFINITY -x NCCL_LAUNCH_MODE -x NCCL_MAX_NCHANNELS -x NCCL_MIN_NCHANNELS -x NCCL_NET_GDR_LEVEL -x NCCL_NET_SHARED_BUFFERS -x NCCL_NET_SHARED_COMMS -x NCCL_NSOCKS_PERTHREAD -x NCCL_NTHREADS -x NCCL_NVB_DISABLE -x NCCL_P2P_DIRECT_DISABLE -x NCCL_P2P_DISABLE /root/cxwang/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 39964 on ubuntu device 0 [0x00] Tesla V100-SXM2-32GB
# Rank 1 Group 0 Pid 39964 on ubuntu device 1 [0x00] Tesla V100-SXM2-32GB
# Rank 2 Group 0 Pid 39964 on ubuntu device 2 [0x00] Tesla V100-SXM2-32GB
# Rank 3 Group 0 Pid 39964 on ubuntu device 3 [0x00] Tesla V100-SXM2-32GB
# Rank 4 Group 0 Pid 39964 on ubuntu device 4 [0x00] Tesla V100-SXM2-32GB
# Rank 5 Group 0 Pid 39964 on ubuntu device 5 [0x00] Tesla V100-SXM2-32GB
# Rank 6 Group 0 Pid 39964 on ubuntu device 6 [0x00] Tesla V100-SXM2-32GB
# Rank 7 Group 0 Pid 39964 on ubuntu device 7 [0x00] Tesla V100-SXM2-32GB
# Rank 8 Group 0 Pid 9621 on ubuntu device 0 [0x00] Tesla V100-SXM2-32GB
# Rank 9 Group 0 Pid 9621 on ubuntu device 1 [0x00] Tesla V100-SXM2-32GB
# Rank 10 Group 0 Pid 9621 on ubuntu device 2 [0x00] Tesla V100-SXM2-32GB
# Rank 11 Group 0 Pid 9621 on ubuntu device 3 [0x00] Tesla V100-SXM2-32GB
# Rank 12 Group 0 Pid 9621 on ubuntu device 4 [0x00] Tesla V100-SXM2-32GB
# Rank 13 Group 0 Pid 9621 on ubuntu device 5 [0x00] Tesla V100-SXM2-32GB
# Rank 14 Group 0 Pid 9621 on ubuntu device 6 [0x00] Tesla V100-SXM2-32GB
# Rank 15 Group 0 Pid 9621 on ubuntu device 7 [0x00] Tesla V100-SXM2-32GB
ubuntu:39964:39964 [0] NCCL INFO Bootstrap : Using ens4:10.0.0.1<0>
ubuntu:39964:39964 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ubuntu:39964:39964 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
ubuntu:9621:9621 [0] misc/cudawrap.cc:179 NCCL WARN Failed to find CUDA library libcuda.so (NCCL_CUDA_PATH='') :
ubuntu:9621:9621 [0] NCCL INFO Bootstrap : Using ens4:10.0.0.2<0>
ubuntu:9621:9621 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ubuntu:9621:9621 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
ubuntu:9621:9621 [0] NCCL INFO NCCL_COMM_BLOCKING set by environment to 1.
ubuntu:39964:39964 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.18.3+cuda12.2
ubuntu:39964:39964 [0] NCCL INFO NCCL_COMM_BLOCKING set by environment to 1.
ubuntu:39964:40050 [6] NCCL INFO Failed to open libibverbs.so[.1]
ubuntu:39964:40050 [6] NCCL INFO NET/Socket : Using [0]ens4:10.0.0.1<0>
ubuntu:39964:40050 [6] NCCL INFO Using network Socket
ubuntu:39964:40050 [6] NCCL INFO NCCL_CHECK_POINTERS set by environment to 1.
ubuntu:39964:40050 [6] NCCL INFO NCCL_DMABUF_ENABLE set by environment to 0.
ubuntu:39964:40046 [2] NCCL INFO Using network Socket
ubuntu:9621:9691 [6] NCCL INFO Failed to open libibverbs.so[.1]
ubuntu:9621:9691 [6] NCCL INFO NET/Socket : Using [0]ens4:10.0.0.2<0>
ubuntu:9621:9691 [6] NCCL INFO Using network Socket
ubuntu:9621:9691 [6] NCCL INFO NCCL_CHECK_POINTERS set by environment to 1.
ubuntu:9621:9691 [6] NCCL INFO NCCL_DMABUF_ENABLE set by environment to 0.
ubuntu:9621:9692 [7] NCCL INFO Using network Socket
ubuntu:39964:40044 [0] NCCL INFO Using network Socket
ubuntu:9621:9689 [4] NCCL INFO Using network Socket
ubuntu:39964:40049 [5] NCCL INFO Using network Socket
ubuntu:9621:9687 [2] NCCL INFO Using network Socket
ubuntu:39964:40047 [3] NCCL INFO Using network Socket
ubuntu:9621:9690 [5] NCCL INFO Using network Socket
ubuntu:9621:9686 [1] NCCL INFO Using network Socket
ubuntu:39964:40045 [1] NCCL INFO Using network Socket
ubuntu:39964:40048 [4] NCCL INFO Using network Socket
ubuntu:39964:40051 [7] NCCL INFO Using network Socket
ubuntu:9621:9688 [3] NCCL INFO Using network Socket
ubuntu:9621:9685 [0] NCCL INFO Using network Socket
ubuntu:9621:9689 [4] NCCL INFO comm 0x55fb7cc1d700 rank 12 nranks 16 cudaDev 4 nvmlDev 4 busId 90 commId 0xdcdce334deb797e8 - Init START
ubuntu:39964:40048 [4] NCCL INFO comm 0x561ce245f140 rank 4 nranks 16 cudaDev 4 nvmlDev 4 busId 90 commId 0xdcdce334deb797e8 - Init START
ubuntu:9621:9687 [2] NCCL INFO comm 0x55fb7cbf5000 rank 10 nranks 16 cudaDev 2 nvmlDev 2 busId 70 commId 0xdcdce334deb797e8 - Init START
ubuntu:39964:40051 [7] NCCL INFO comm 0x561ce249bda0 rank 7 nranks 16 cudaDev 7 nvmlDev 7 busId c0 commId 0xdcdce334deb797e8 - Init START
ubuntu:39964:40046 [2] NCCL INFO comm 0x561ce2436a40 rank 2 nranks 16 cudaDev 2 nvmlDev 2 busId 70 commId 0xdcdce334deb797e8 - Init START
ubuntu:9621:9686 [1] NCCL INFO comm 0x55fb7cbe0be0 rank 9 nranks 16 cudaDev 1 nvmlDev 1 busId 60 commId 0xdcdce334deb797e8 - Init START
ubuntu:39964:40049 [5] NCCL INFO comm 0x561ce2473560 rank 5 nranks 16 cudaDev 5 nvmlDev 5 busId a0 commId 0xdcdce334deb797e8 - Init START
ubuntu:9621:9692 [7] NCCL INFO comm 0x55fb7cc5a360 rank 15 nranks 16 cudaDev 7 nvmlDev 7 busId c0 commId 0xdcdce334deb797e8 - Init START
ubuntu:39964:40044 [0] NCCL INFO comm 0x561ce240e260 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 50 commId 0xdcdce334deb797e8 - Init START
ubuntu:9621:9690 [5] NCCL INFO comm 0x55fb7cc31b20 rank 13 nranks 16 cudaDev 5 nvmlDev 5 busId a0 commId 0xdcdce334deb797e8 - Init START
ubuntu:39964:40045 [1] NCCL INFO comm 0x561ce2422620 rank 1 nranks 16 cudaDev 1 nvmlDev 1 busId 60 commId 0xdcdce334deb797e8 - Init START
ubuntu:9621:9685 [0] NCCL INFO comm 0x55fb7cbcc860 rank 8 nranks 16 cudaDev 0 nvmlDev 0 busId 50 commId 0xdcdce334deb797e8 - Init START
ubuntu:39964:40050 [6] NCCL INFO comm 0x561ce2487980 rank 6 nranks 16 cudaDev 6 nvmlDev 6 busId b0 commId 0xdcdce334deb797e8 - Init START
ubuntu:9621:9688 [3] NCCL INFO comm 0x55fb7cc092e0 rank 11 nranks 16 cudaDev 3 nvmlDev 3 busId 80 commId 0xdcdce334deb797e8 - Init START
ubuntu:39964:40047 [3] NCCL INFO comm 0x561ce244ad20 rank 3 nranks 16 cudaDev 3 nvmlDev 3 busId 80 commId 0xdcdce334deb797e8 - Init START
ubuntu:9621:9691 [6] NCCL INFO comm 0x55fb7cc45f40 rank 14 nranks 16 cudaDev 6 nvmlDev 6 busId b0 commId 0xdcdce334deb797e8 - Init START
ubuntu:9621:9686 [1] NCCL INFO NCCL_NVB_DISABLE set by environment to 1.
ubuntu:9621:9686 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
ubuntu:9621:9686 [1] NCCL INFO NCCL_IGNORE_CPU_AFFINITY set by environment to 1.
ubuntu:9621:9686 [1] NCCL INFO NCCL_CROSS_NIC set by environment to 2.
ubuntu:39964:40047 [3] NCCL INFO NCCL_NVB_DISABLE set by environment to 1.
ubuntu:39964:40047 [3] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
ubuntu:39964:40047 [3] NCCL INFO NCCL_IGNORE_CPU_AFFINITY set by environment to 1.
ubuntu:39964:40047 [3] NCCL INFO NVLS multicast support is not available on dev 3
ubuntu:39964:40047 [3] NCCL INFO NCCL_CROSS_NIC set by environment to 2.
ubuntu:39964:40050 [6] NCCL INFO NVLS multicast support is not available on dev 6
ubuntu:39964:40048 [4] NCCL INFO NVLS multicast support is not available on dev 4
ubuntu:39964:40051 [7] NCCL INFO NVLS multicast support is not available on dev 7
ubuntu:39964:40046 [2] NCCL INFO NVLS multicast support is not available on dev 2
ubuntu:39964:40049 [5] NCCL INFO NVLS multicast support is not available on dev 5
ubuntu:39964:40044 [0] NCCL INFO NVLS multicast support is not available on dev 0
ubuntu:39964:40045 [1] NCCL INFO NVLS multicast support is not available on dev 1
ubuntu:39964:40050 [6] NCCL INFO NCCL_MAX_NCHANNELS set by environment to 63.
ubuntu:39964:40050 [6] NCCL INFO NCCL_MIN_NCHANNELS set by environment to 22.
ubuntu:39964:40050 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 [21] 7/-1/-1->6->5
ubuntu:39964:40050 [6] NCCL INFO NCCL_BUFFSIZE set by environment to 524288.
ubuntu:39964:40050 [6] NCCL INFO P2P Chunksize set to 131072
ubuntu:39964:40050 [6] NCCL INFO NCCL_GRAPH_MIXING_SUPPORT set by environment to 1.
ubuntu:39964:40049 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 [21] 6/-1/-1->5->4
ubuntu:39964:40049 [5] NCCL INFO P2P Chunksize set to 131072
ubuntu:9621:9687 [2] NCCL INFO NCCL_MAX_NCHANNELS set by environment to 63.
ubuntu:9621:9687 [2] NCCL INFO NCCL_MIN_NCHANNELS set by environment to 22.
ubuntu:9621:9687 [2] NCCL INFO Trees [0] 11/-1/-1->10->9 [1] 11/-1/-1->10->9 [2] 11/-1/-1->10->9 [3] 11/-1/-1->10->9 [4] 11/-1/-1->10->9 [5] 11/-1/-1->10->9 [6] 11/-1/-1->10->9 [7] 11/-1/-1->10->9 [8] 11/-1/-1->10->9 [9] 11/-1/-1->10->9 [10] 11/-1/-1->10->9 [11] 11/-1/-1->10->9 [12] 11/-1/-1->10->9 [13] 11/-1/-1->10->9 [14] 11/-1/-1->10->9 [15] 11/-1/-1->10->9 [16] 11/-1/-1->10->9 [17] 11/-1/-1->10->9 [18] 11/-1/-1->10->9 [19] 11/-1/-1->10->9 [20] 11/-1/-1->10->9 [21] 11/-1/-1->10->9
ubuntu:9621:9687 [2] NCCL INFO NCCL_BUFFSIZE set by environment to 524288.
ubuntu:9621:9687 [2] NCCL INFO P2P Chunksize set to 131072
ubuntu:9621:9687 [2] NCCL INFO NCCL_GRAPH_MIXING_SUPPORT set by environment to 1.
ubuntu:39964:40051 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 [21] -1/-1/-1->7->6
ubuntu:39964:40051 [7] NCCL INFO P2P Chunksize set to 131072
ubuntu:39964:40046 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 [21] 3/-1/-1->2->1
ubuntu:39964:40046 [2] NCCL INFO P2P Chunksize set to 131072
ubuntu:39964:40047 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 [21] 4/-1/-1->3->2
ubuntu:39964:40047 [3] NCCL INFO P2P Chunksize set to 131072
ubuntu:9621:9685 [0] NCCL INFO Trees [0] 9/-1/-1->8->0 [1] 9/0/-1->8->-1 [2] 9/-1/-1->8->0 [3] 9/0/-1->8->-1 [4] 9/-1/-1->8->0 [5] 9/0/-1->8->-1 [6] 9/-1/-1->8->0 [7] 9/0/-1->8->-1 [8] 9/-1/-1->8->0 [9] 9/0/-1->8->-1 [10] 9/-1/-1->8->0 [11] 9/0/-1->8->-1 [12] 9/-1/-1->8->0 [13] 9/0/-1->8->-1 [14] 9/-1/-1->8->0 [15] 9/0/-1->8->-1 [16] 9/-1/-1->8->0 [17] 9/0/-1->8->-1 [18] 9/-1/-1->8->0 [19] 9/0/-1->8->-1 [20] 9/-1/-1->8->0 [21] 9/0/-1->8->-1
ubuntu:9621:9685 [0] NCCL INFO P2P Chunksize set to 131072
ubuntu:9621:9688 [3] NCCL INFO Trees [0] 12/-1/-1->11->10 [1] 12/-1/-1->11->10 [2] 12/-1/-1->11->10 [3] 12/-1/-1->11->10 [4] 12/-1/-1->11->10 [5] 12/-1/-1->11->10 [6] 12/-1/-1->11->10 [7] 12/-1/-1->11->10 [8] 12/-1/-1->11->10 [9] 12/-1/-1->11->10 [10] 12/-1/-1->11->10 [11] 12/-1/-1->11->10 [12] 12/-1/-1->11->10 [13] 12/-1/-1->11->10 [14] 12/-1/-1->11->10 [15] 12/-1/-1->11->10 [16] 12/-1/-1->11->10 [17] 12/-1/-1->11->10 [18] 12/-1/-1->11->10 [19] 12/-1/-1->11->10 [20] 12/-1/-1->11->10 [21] 12/-1/-1->11->10
ubuntu:9621:9688 [3] NCCL INFO P2P Chunksize set to 131072
ubuntu:9621:9689 [4] NCCL INFO Trees [0] 13/-1/-1->12->11 [1] 13/-1/-1->12->11 [2] 13/-1/-1->12->11 [3] 13/-1/-1->12->11 [4] 13/-1/-1->12->11 [5] 13/-1/-1->12->11 [6] 13/-1/-1->12->11 [7] 13/-1/-1->12->11 [8] 13/-1/-1->12->11 [9] 13/-1/-1->12->11 [10] 13/-1/-1->12->11 [11] 13/-1/-1->12->11 [12] 13/-1/-1->12->11 [13] 13/-1/-1->12->11 [14] 13/-1/-1->12->11 [15] 13/-1/-1->12->11 [16] 13/-1/-1->12->11 [17] 13/-1/-1->12->11 [18] 13/-1/-1->12->11 [19] 13/-1/-1->12->11 [20] 13/-1/-1->12->11 [21] 13/-1/-1->12->11
ubuntu:9621:9689 [4] NCCL INFO P2P Chunksize set to 131072
ubuntu:9621:9686 [1] NCCL INFO Trees [0] 10/-1/-1->9->8 [1] 10/-1/-1->9->8 [2] 10/-1/-1->9->8 [3] 10/-1/-1->9->8 [4] 10/-1/-1->9->8 [5] 10/-1/-1->9->8 [6] 10/-1/-1->9->8 [7] 10/-1/-1->9->8 [8] 10/-1/-1->9->8 [9] 10/-1/-1->9->8 [10] 10/-1/-1->9->8 [11] 10/-1/-1->9->8 [12] 10/-1/-1->9->8 [13] 10/-1/-1->9->8 [14] 10/-1/-1->9->8 [15] 10/-1/-1->9->8 [16] 10/-1/-1->9->8 [17] 10/-1/-1->9->8 [18] 10/-1/-1->9->8 [19] 10/-1/-1->9->8 [20] 10/-1/-1->9->8 [21] 10/-1/-1->9->8
ubuntu:9621:9686 [1] NCCL INFO P2P Chunksize set to 131072
ubuntu:9621:9690 [5] NCCL INFO Trees [0] 14/-1/-1->13->12 [1] 14/-1/-1->13->12 [2] 14/-1/-1->13->12 [3] 14/-1/-1->13->12 [4] 14/-1/-1->13->12 [5] 14/-1/-1->13->12 [6] 14/-1/-1->13->12 [7] 14/-1/-1->13->12 [8] 14/-1/-1->13->12 [9] 14/-1/-1->13->12 [10] 14/-1/-1->13->12 [11] 14/-1/-1->13->12 [12] 14/-1/-1->13->12 [13] 14/-1/-1->13->12 [14] 14/-1/-1->13->12 [15] 14/-1/-1->13->12 [16] 14/-1/-1->13->12 [17] 14/-1/-1->13->12 [18] 14/-1/-1->13->12 [19] 14/-1/-1->13->12 [20] 14/-1/-1->13->12 [21] 14/-1/-1->13->12
ubuntu:9621:9690 [5] NCCL INFO P2P Chunksize set to 131072
ubuntu:9621:9692 [7] NCCL INFO Trees [0] -1/-1/-1->15->14 [1] -1/-1/-1->15->14 [2] -1/-1/-1->15->14 [3] -1/-1/-1->15->14 [4] -1/-1/-1->15->14 [5] -1/-1/-1->15->14 [6] -1/-1/-1->15->14 [7] -1/-1/-1->15->14 [8] -1/-1/-1->15->14 [9] -1/-1/-1->15->14 [10] -1/-1/-1->15->14 [11] -1/-1/-1->15->14 [12] -1/-1/-1->15->14 [13] -1/-1/-1->15->14 [14] -1/-1/-1->15->14 [15] -1/-1/-1->15->14 [16] -1/-1/-1->15->14 [17] -1/-1/-1->15->14 [18] -1/-1/-1->15->14 [19] -1/-1/-1->15->14 [20] -1/-1/-1->15->14 [21] -1/-1/-1->15->14
ubuntu:9621:9692 [7] NCCL INFO P2P Chunksize set to 131072
ubuntu:9621:9691 [6] NCCL INFO Trees [0] 15/-1/-1->14->13 [1] 15/-1/-1->14->13 [2] 15/-1/-1->14->13 [3] 15/-1/-1->14->13 [4] 15/-1/-1->14->13 [5] 15/-1/-1->14->13 [6] 15/-1/-1->14->13 [7] 15/-1/-1->14->13 [8] 15/-1/-1->14->13 [9] 15/-1/-1->14->13 [10] 15/-1/-1->14->13 [11] 15/-1/-1->14->13 [12] 15/-1/-1->14->13 [13] 15/-1/-1->14->13 [14] 15/-1/-1->14->13 [15] 15/-1/-1->14->13 [16] 15/-1/-1->14->13 [17] 15/-1/-1->14->13 [18] 15/-1/-1->14->13 [19] 15/-1/-1->14->13 [20] 15/-1/-1->14->13 [21] 15/-1/-1->14->13
ubuntu:9621:9691 [6] NCCL INFO P2P Chunksize set to 131072
ubuntu:39964:40048 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 [21] 5/-1/-1->4->3
ubuntu:39964:40048 [4] NCCL INFO P2P Chunksize set to 131072
ubuntu:39964:40045 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 [21] 2/-1/-1->1->0
ubuntu:39964:40045 [1] NCCL INFO P2P Chunksize set to 131072
ubuntu:39964:40044 [0] NCCL INFO Channel 00/22 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ubuntu:39964:40044 [0] NCCL INFO Channel 01/22 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ubuntu:39964:40044 [0] NCCL INFO Channel 02/22 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ubuntu:39964:40044 [0] NCCL INFO Channel 03/22 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ubuntu:39964:40044 [0] NCCL INFO Channel 04/22 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ubuntu:39964:40044 [0] NCCL INFO Channel 05/22 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ubuntu:39964:40044 [0] NCCL INFO Channel 06/22 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ubuntu:39964:40044 [0] NCCL INFO Channel 07/22 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ubuntu:39964:40044 [0] NCCL INFO Channel 08/22 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ubuntu:39964:40044 [0] NCCL INFO Channel 09/22 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ubuntu:39964:40044 [0] NCCL INFO Channel 10/22 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ubuntu:39964:40044 [0] NCCL INFO Channel 11/22 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ubuntu:39964:40044 [0] NCCL INFO Channel 12/22 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ubuntu:39964:40044 [0] NCCL INFO Channel 13/22 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ubuntu:39964:40044 [0] NCCL INFO Channel 14/22 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ubuntu:39964:40044 [0] NCCL INFO Channel 15/22 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ubuntu:39964:40044 [0] NCCL INFO Channel 16/22 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ubuntu:39964:40044 [0] NCCL INFO Channel 17/22 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ubuntu:39964:40044 [0] NCCL INFO Channel 18/22 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ubuntu:39964:40044 [0] NCCL INFO Channel 19/22 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ubuntu:39964:40044 [0] NCCL INFO Channel 20/22 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ubuntu:39964:40044 [0] NCCL INFO Channel 21/22 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ubuntu:39964:40044 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] 1/-1/-1->0->8 [2] 1/8/-1->0->-1 [3] 1/-1/-1->0->8 [4] 1/8/-1->0->-1 [5] 1/-1/-1->0->8 [6] 1/8/-1->0->-1 [7] 1/-1/-1->0->8 [8] 1/8/-1->0->-1 [9] 1/-1/-1->0->8 [10] 1/8/-1->0->-1 [11] 1/-1/-1->0->8 [12] 1/8/-1->0->-1 [13] 1/-1/-1->0->8 [14] 1/8/-1->0->-1 [15] 1/-1/-1->0->8 [16] 1/8/-1->0->-1 [17] 1/-1/-1->0->8 [18] 1/8/-1->0->-1 [19] 1/-1/-1->0->8 [20] 1/8/-1->0->-1 [21] 1/-1/-1->0->8
ubuntu:39964:40044 [0] NCCL INFO P2P Chunksize set to 131072
ubuntu:9621:9694 [0] NCCL INFO NCCL_NSOCKS_PERTHREAD set by environment to 10.
ubuntu:9621:9685 [0] NCCL INFO Channel 00/0 : 7[7] -> 8[0] [receive] via NET/Socket/0
ubuntu:9621:9685 [0] NCCL INFO Channel 01/0 : 7[7] -> 8[0] [receive] via NET/Socket/0
ubuntu:9621:9685 [0] NCCL INFO Channel 02/0 : 7[7] -> 8[0] [receive] via NET/Socket/0
ubuntu:9621:9685 [0] NCCL INFO Channel 03/0 : 7[7] -> 8[0] [receive] via NET/Socket/0
ubuntu:9621:9685 [0] NCCL INFO Channel 04/0 : 7[7] -> 8[0] [receive] via NET/Socket/0
ubuntu:9621:9685 [0] NCCL INFO Channel 05/0 : 7[7] -> 8[0] [receive] via NET/Socket/0
ubuntu:9621:9685 [0] NCCL INFO Channel 06/0 : 7[7] -> 8[0] [receive] via NET/Socket/0
ubuntu:9621:9685 [0] NCCL INFO Channel 07/0 : 7[7] -> 8[0] [receive] via NET/Socket/0
ubuntu:9621:9685 [0] NCCL INFO Channel 08/0 : 7[7] -> 8[0] [receive] via NET/Socket/0
ubuntu:9621:9685 [0] NCCL INFO Channel 09/0 : 7[7] -> 8[0] [receive] via NET/Socket/0
ubuntu:9621:9685 [0] NCCL INFO Channel 10/0 : 7[7] -> 8[0] [receive] via NET/Socket/0
ubuntu:9621:9685 [0] NCCL INFO Channel 11/0 : 7[7] -> 8[0] [receive] via NET/Socket/0
ubuntu:9621:9685 [0] NCCL INFO Channel 12/0 : 7[7] -> 8[0] [receive] via NET/Socket/0
ubuntu:9621:9685 [0] NCCL INFO Channel 13/0 : 7[7] -> 8[0] [receive] via NET/Socket/0
ubuntu:9621:9685 [0] NCCL INFO Channel 14/0 : 7[7] -> 8[0] [receive] via NET/Socket/0
ubuntu:9621:9685 [0] NCCL INFO Channel 15/0 : 7[7] -> 8[0] [receive] via NET/Socket/0
ubuntu:9621:9685 [0] NCCL INFO Channel 16/0 : 7[7] -> 8[0] [receive] via NET/Socket/0
ubuntu:9621:9685 [0] NCCL INFO Channel 17/0 : 7[7] -> 8[0] [receive] via NET/Socket/0
ubuntu:9621:9685 [0] NCCL INFO Channel 18/0 : 7[7] -> 8[0] [receive] via NET/Socket/0
ubuntu:9621:9685 [0] NCCL INFO Channel 19/0 : 7[7] -> 8[0] [receive] via NET/Socket/0
ubuntu:39964:40075 [0] NCCL INFO NCCL_NSOCKS_PERTHREAD set by environment to 10.
ubuntu:39964:40044 [0] NCCL INFO Channel 00/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 20/0 : 7[7] -> 8[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 21/0 : 7[7] -> 8[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 01/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 02/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 03/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 04/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 05/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 06/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 07/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 08/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 00 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 09/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 01 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 10/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 11/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 12/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 13/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 14/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 15/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 16/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 17/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 02 : 8[0] -> 9[1] via 
SHM/direct/direct ubuntu:9621:9685 [0] NCCL INFO Channel 03 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:9621:9685 [0] NCCL INFO Channel 04 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 18/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 19/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 20/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 21/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 05 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:9621:9685 [0] NCCL INFO Channel 06 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:9621:9685 [0] NCCL INFO Channel 07 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 02 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 03 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:9621:9685 [0] NCCL INFO Channel 08 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:9621:9685 [0] NCCL INFO Channel 09 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:9621:9685 [0] NCCL INFO Channel 10 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 04 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:9621:9685 [0] NCCL INFO Channel 11 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 05 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 06 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 07 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 08 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:9621:9685 [0] NCCL INFO Channel 12 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:9621:9685 [0] NCCL INFO Channel 13 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:9621:9685 [0] NCCL 
INFO Channel 14 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:9621:9685 [0] NCCL INFO Channel 15 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:9621:9685 [0] NCCL INFO Channel 16 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 09 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:9621:9685 [0] NCCL INFO Channel 17 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 10 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 11 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 12 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 13 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:9621:9685 [0] NCCL INFO Channel 18 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:9621:9685 [0] NCCL INFO Channel 19 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:9621:9685 [0] NCCL INFO Channel 20 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 14 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:9621:9685 [0] NCCL INFO Channel 21 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 15 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 16 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 17 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 18 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 19 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 20 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 21 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 00 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 01 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:9621:9687 [2] NCCL INFO Channel 00 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:9621:9687 [2] NCCL INFO Channel 01 : 10[2] -> 11[3] via SHM/direct/direct 
ubuntu:39964:40049 [5] NCCL INFO Channel 00 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 01 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 02 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:9621:9686 [1] NCCL INFO Channel 00 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 00 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:9621:9686 [1] NCCL INFO Channel 01 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 02 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:9621:9689 [4] NCCL INFO Channel 00 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:9621:9687 [2] NCCL INFO Channel 02 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:9621:9688 [3] NCCL INFO Channel 00 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:9621:9690 [5] NCCL INFO Channel 00 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:9621:9691 [6] NCCL INFO Channel 00 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 03 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 00 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 01 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 02 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:39964:40051 [7] NCCL INFO Channel 00/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40051 [7] NCCL INFO Channel 01/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40051 [7] NCCL INFO Channel 02/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40051 [7] NCCL INFO Channel 03/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40051 [7] NCCL INFO Channel 04/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40051 [7] NCCL INFO Channel 05/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40051 [7] NCCL INFO Channel 06/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40051 [7] NCCL INFO Channel 07/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40051 [7] 
NCCL INFO Channel 08/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40051 [7] NCCL INFO Channel 09/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40051 [7] NCCL INFO Channel 10/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40051 [7] NCCL INFO Channel 11/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40051 [7] NCCL INFO Channel 12/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40051 [7] NCCL INFO Channel 13/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40051 [7] NCCL INFO Channel 14/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40051 [7] NCCL INFO Channel 15/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40051 [7] NCCL INFO Channel 16/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40051 [7] NCCL INFO Channel 17/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40051 [7] NCCL INFO Channel 18/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40051 [7] NCCL INFO Channel 19/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40051 [7] NCCL INFO Channel 20/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40046 [2] NCCL INFO Channel 03 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:9621:9691 [6] NCCL INFO Channel 01 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 04 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:9621:9691 [6] NCCL INFO Channel 02 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 00 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:39964:40051 [7] NCCL INFO Channel 21/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:9621:9690 [5] NCCL INFO Channel 01 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:9621:9690 [5] NCCL INFO Channel 02 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 03 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:9621:9690 [5] NCCL INFO Channel 03 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 04 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:39964:40050 [6] 
NCCL INFO Channel 05 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 00 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:9621:9687 [2] NCCL INFO Channel 03 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 01 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:9621:9687 [2] NCCL INFO Channel 04 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:9621:9687 [2] NCCL INFO Channel 05 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 02 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:9621:9687 [2] NCCL INFO Channel 06 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 05 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:9621:9690 [5] NCCL INFO Channel 04 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 04 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:9621:9691 [6] NCCL INFO Channel 03 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 01 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:9621:9688 [3] NCCL INFO Channel 01 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 02 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:9621:9688 [3] NCCL INFO Channel 02 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:9621:9686 [1] NCCL INFO Channel 02 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 01 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:9621:9686 [1] NCCL INFO Channel 03 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 06 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:9621:9686 [1] NCCL INFO Channel 04 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:9621:9688 [3] NCCL INFO Channel 03 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:9621:9686 [1] NCCL INFO Channel 05 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 06 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:9621:9686 [1] NCCL INFO Channel 06 : 9[1] -> 10[2] via SHM/direct/direct 
ubuntu:39964:40048 [4] NCCL INFO Channel 03 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 05 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 03 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 02 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:9621:9689 [4] NCCL INFO Channel 01 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:9621:9689 [4] NCCL INFO Channel 02 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 07 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 07 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:9621:9689 [4] NCCL INFO Channel 03 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:9621:9689 [4] NCCL INFO Channel 04 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 04 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 06 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:9621:9689 [4] NCCL INFO Channel 05 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 04 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 03 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:9621:9686 [1] NCCL INFO Channel 07 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:9621:9689 [4] NCCL INFO Channel 06 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:9621:9692 [7] NCCL INFO Channel 00/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9692 [7] NCCL INFO Channel 01/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9692 [7] NCCL INFO Channel 02/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:39964:40050 [6] NCCL INFO Channel 08 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:9621:9687 [2] NCCL INFO Channel 07 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 08 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 05 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:9621:9690 [5] NCCL INFO Channel 05 : 
13[5] -> 14[6] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 07 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:9621:9688 [3] NCCL INFO Channel 04 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:9621:9691 [6] NCCL INFO Channel 04 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:9621:9690 [5] NCCL INFO Channel 06 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:9621:9692 [7] NCCL INFO Channel 03/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9692 [7] NCCL INFO Channel 04/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9692 [7] NCCL INFO Channel 05/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:39964:40047 [3] NCCL INFO Channel 05 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:9621:9692 [7] NCCL INFO Channel 06/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9692 [7] NCCL INFO Channel 07/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9692 [7] NCCL INFO Channel 08/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9692 [7] NCCL INFO Channel 09/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9692 [7] NCCL INFO Channel 10/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9692 [7] NCCL INFO Channel 11/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:39964:40045 [1] NCCL INFO Channel 04 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:9621:9692 [7] NCCL INFO Channel 12/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9692 [7] NCCL INFO Channel 13/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9692 [7] NCCL INFO Channel 14/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9692 [7] NCCL INFO Channel 15/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9692 [7] NCCL INFO Channel 16/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9692 [7] NCCL INFO Channel 17/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:39964:40050 [6] NCCL INFO Channel 09 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:9621:9688 [3] NCCL INFO Channel 05 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:9621:9688 [3] NCCL INFO Channel 06 : 11[3] -> 
12[4] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 06 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:9621:9688 [3] NCCL INFO Channel 07 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:9621:9692 [7] NCCL INFO Channel 18/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9690 [5] NCCL INFO Channel 07 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:9621:9692 [7] NCCL INFO Channel 19/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9692 [7] NCCL INFO Channel 20/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9692 [7] NCCL INFO Channel 21/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:39964:40049 [5] NCCL INFO Channel 09 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 08 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 06 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 05 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:9621:9687 [2] NCCL INFO Channel 08 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 10 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:9621:9687 [2] NCCL INFO Channel 09 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:9621:9689 [4] NCCL INFO Channel 07 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:9621:9691 [6] NCCL INFO Channel 05 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:9621:9691 [6] NCCL INFO Channel 06 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:9621:9691 [6] NCCL INFO Channel 07 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 10 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 07 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 09 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:9621:9691 [6] NCCL INFO Channel 08 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:9621:9688 [3] NCCL INFO Channel 08 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 07 : 3[3] -> 4[4] via SHM/direct/direct 
ubuntu:9621:9686 [1] NCCL INFO Channel 08 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 06 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 11 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:9621:9687 [2] NCCL INFO Channel 10 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:9621:9690 [5] NCCL INFO Channel 08 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:9621:9686 [1] NCCL INFO Channel 09 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 11 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 10 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:9621:9688 [3] NCCL INFO Channel 09 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:9621:9691 [6] NCCL INFO Channel 09 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 08 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 08 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:9621:9689 [4] NCCL INFO Channel 08 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 07 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:9621:9687 [2] NCCL INFO Channel 11 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 12 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:9621:9686 [1] NCCL INFO Channel 10 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:9621:9690 [5] NCCL INFO Channel 09 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 12 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 09 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:9621:9688 [3] NCCL INFO Channel 10 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:9621:9691 [6] NCCL INFO Channel 10 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 08 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 09 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:9621:9689 [4] NCCL INFO Channel 09 : 12[4] -> 13[5] 
via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 11 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:9621:9687 [2] NCCL INFO Channel 12 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:9621:9686 [1] NCCL INFO Channel 11 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 13 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:9621:9690 [5] NCCL INFO Channel 10 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 10 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 09 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:9621:9688 [3] NCCL INFO Channel 11 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 12 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:9621:9691 [6] NCCL INFO Channel 11 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 13 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:9621:9689 [4] NCCL INFO Channel 10 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 10 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:9621:9686 [1] NCCL INFO Channel 12 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 14 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:9621:9687 [2] NCCL INFO Channel 13 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:9621:9690 [5] NCCL INFO Channel 11 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 11 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 10 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 13 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:9621:9689 [4] NCCL INFO Channel 11 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:9621:9688 [3] NCCL INFO Channel 12 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 14 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:9621:9691 [6] NCCL INFO Channel 12 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO 
Channel 11 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:9621:9686 [1] NCCL INFO Channel 13 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 15 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:9621:9687 [2] NCCL INFO Channel 14 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:9621:9690 [5] NCCL INFO Channel 12 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:9621:9689 [4] NCCL INFO Channel 12 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 12 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 14 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 11 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:9621:9688 [3] NCCL INFO Channel 13 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 15 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 12 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 16 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:9621:9686 [1] NCCL INFO Channel 14 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:9621:9691 [6] NCCL INFO Channel 13 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:9621:9687 [2] NCCL INFO Channel 15 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:9621:9690 [5] NCCL INFO Channel 13 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:9621:9689 [4] NCCL INFO Channel 13 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 13 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 15 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 12 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:9621:9687 [2] NCCL INFO Channel 16 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 13 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:9621:9690 [5] NCCL INFO Channel 14 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 16 : 5[5] -> 6[6] via SHM/direct/direct 
ubuntu:9621:9688 [3] NCCL INFO Channel 14 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 17 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:9621:9691 [6] NCCL INFO Channel 14 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:9621:9686 [1] NCCL INFO Channel 15 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:9621:9689 [4] NCCL INFO Channel 14 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 14 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 16 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 13 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:9621:9687 [2] NCCL INFO Channel 17 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 14 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 17 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:9621:9690 [5] NCCL INFO Channel 15 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:9621:9688 [3] NCCL INFO Channel 15 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:9621:9691 [6] NCCL INFO Channel 15 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 18 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:9621:9686 [1] NCCL INFO Channel 16 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:9621:9689 [4] NCCL INFO Channel 15 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 15 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 17 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:9621:9687 [2] NCCL INFO Channel 18 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 15 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 18 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 14 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 19 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:9621:9691 [6] NCCL INFO Channel 16 : 14[6] -> 
15[7] via SHM/direct/direct ubuntu:9621:9686 [1] NCCL INFO Channel 17 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:9621:9690 [5] NCCL INFO Channel 16 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:9621:9688 [3] NCCL INFO Channel 16 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:9621:9689 [4] NCCL INFO Channel 16 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 16 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 18 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 15 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:9621:9691 [6] NCCL INFO Channel 17 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:9621:9686 [1] NCCL INFO Channel 18 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:9621:9690 [5] NCCL INFO Channel 17 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 19 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 16 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:9621:9687 [2] NCCL INFO Channel 19 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 20 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:9621:9688 [3] NCCL INFO Channel 17 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:9621:9689 [4] NCCL INFO Channel 17 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 17 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 19 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 16 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 17 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 20 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:9621:9686 [1] NCCL INFO Channel 19 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:9621:9691 [6] NCCL INFO Channel 18 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:9621:9687 [2] NCCL INFO Channel 20 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:9621:9688 [3] NCCL INFO 
Channel 18 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 21 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:9621:9690 [5] NCCL INFO Channel 18 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:9621:9689 [4] NCCL INFO Channel 18 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:9621:9690 [5] NCCL INFO Channel 19 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:9621:9686 [1] NCCL INFO Channel 20 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:9621:9691 [6] NCCL INFO Channel 19 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:9621:9688 [3] NCCL INFO Channel 19 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:9621:9687 [2] NCCL INFO Channel 21 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:9621:9689 [4] NCCL INFO Channel 19 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 20 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 18 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 18 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 17 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 21 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:9621:9688 [3] NCCL INFO Channel 20 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:9621:9688 [3] NCCL INFO Channel 21 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:9621:9690 [5] NCCL INFO Channel 20 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 21 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:9621:9690 [5] NCCL INFO Channel 21 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:9621:9689 [4] NCCL INFO Channel 20 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:9621:9689 [4] NCCL INFO Channel 21 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:9621:9691 [6] NCCL INFO Channel 20 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 19 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 19 : 4[4] -> 5[5] via SHM/direct/direct 
ubuntu:39964:40045 [1] NCCL INFO Channel 18 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:9621:9686 [1] NCCL INFO Channel 21 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 20 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 20 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:9621:9691 [6] NCCL INFO Channel 21 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 19 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 20 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 21 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 21 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 21 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:39964:40051 [7] NCCL INFO Connected all rings ubuntu:39964:40051 [7] NCCL INFO Channel 00 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:39964:40051 [7] NCCL INFO Channel 01 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:39964:40051 [7] NCCL INFO Channel 02 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:39964:40051 [7] NCCL INFO Channel 03 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:9621:9685 [0] NCCL INFO Connected all rings ubuntu:9621:9685 [0] NCCL INFO Channel 00/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 01/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 02/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 03/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 04/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:9621:9692 [7] NCCL INFO Connected all rings ubuntu:9621:9685 [0] NCCL INFO Channel 05/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 06/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 07/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] 
NCCL INFO Channel 08/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 09/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 10/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 11/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 12/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 13/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 14/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 15/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:39964:40051 [7] NCCL INFO Channel 04 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:9621:9685 [0] NCCL INFO Channel 16/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 17/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 18/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 19/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 20/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 21/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 00/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 01/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 02/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Connected all rings ubuntu:39964:40044 [0] NCCL INFO Channel 00/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:9621:9692 [7] NCCL INFO Channel 00 : 15[7] -> 14[6] via SHM/direct/direct ubuntu:9621:9685 [0] NCCL INFO Channel 03/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 04/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 05/0 : 8[0] -> 0[0] [send] via NET/Socket/0 
ubuntu:9621:9685 [0] NCCL INFO Channel 06/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 07/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 08/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:39964:40051 [7] NCCL INFO Channel 05 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:9621:9685 [0] NCCL INFO Channel 09/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 10/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 11/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 12/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 13/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 14/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 01/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 02/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 03/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 04/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 15/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 16/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 17/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 18/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 05/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 06/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 07/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 19/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 20/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 08/0 : 8[0] -> 
0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 09/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:9621:9685 [0] NCCL INFO Channel 21/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 10/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 11/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 12/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 13/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 14/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 15/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 16/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 17/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40051 [7] NCCL INFO Channel 06 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:39964:40051 [7] NCCL INFO Channel 07 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:9621:9692 [7] NCCL INFO Channel 01 : 15[7] -> 14[6] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 18/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 19/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 20/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 21/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 00/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 01/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 02/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 03/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 04/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:9621:9692 [7] NCCL INFO Channel 02 : 15[7] -> 14[6] via 
SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 05/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 06/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 07/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 08/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 09/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 10/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 11/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 12/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:9621:9692 [7] NCCL INFO Channel 03 : 15[7] -> 14[6] via SHM/direct/direct ubuntu:39964:40044 [0] NCCL INFO Channel 13/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 14/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 15/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 16/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 17/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 18/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 19/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 20/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:39964:40044 [0] NCCL INFO Channel 21/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:9621:9692 [7] NCCL INFO Channel 04 : 15[7] -> 14[6] via SHM/direct/direct ubuntu:39964:40051 [7] NCCL INFO Channel 08 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:39964:40051 [7] NCCL INFO Channel 09 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:9621:9692 [7] NCCL INFO Channel 05 : 15[7] -> 14[6] via SHM/direct/direct ubuntu:39964:40051 [7] NCCL INFO Channel 10 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:39964:40051 [7] NCCL INFO Channel 11 : 7[7] 
-> 6[6] via SHM/direct/direct
ubuntu:9621:9692 [7] NCCL INFO Channel 06 : 15[7] -> 14[6] via SHM/direct/direct
ubuntu:9621:9692 [7] NCCL INFO Channel 07 : 15[7] -> 14[6] via SHM/direct/direct
ubuntu:9621:9686 [1] NCCL INFO Connected all rings
ubuntu:39964:40050 [6] NCCL INFO Connected all rings
ubuntu:9621:9688 [3] NCCL INFO Connected all rings
ubuntu:39964:40051 [7] NCCL INFO Channel 12 : 7[7] -> 6[6] via SHM/direct/direct
ubuntu:9621:9691 [6] NCCL INFO Connected all rings
ubuntu:9621:9689 [4] NCCL INFO Connected all rings
ubuntu:9621:9692 [7] NCCL INFO Channel 08 : 15[7] -> 14[6] via SHM/direct/direct
ubuntu:9621:9692 [7] NCCL INFO Channel 09 : 15[7] -> 14[6] via SHM/direct/direct
ubuntu:9621:9692 [7] NCCL INFO Channel 10 : 15[7] -> 14[6] via SHM/direct/direct
ubuntu:39964:40051 [7] NCCL INFO Channel 13 : 7[7] -> 6[6] via SHM/direct/direct
ubuntu:39964:40045 [1] NCCL INFO Connected all rings
ubuntu:9621:9687 [2] NCCL INFO Connected all rings
ubuntu:39964:40046 [2] NCCL INFO Connected all rings
ubuntu:39964:40048 [4] NCCL INFO Connected all rings
ubuntu:9621:9690 [5] NCCL INFO Connected all rings
ubuntu:39964:40051 [7] NCCL INFO Channel 14 : 7[7] -> 6[6] via SHM/direct/direct
ubuntu:39964:40051 [7] NCCL INFO Channel 15 : 7[7] -> 6[6] via SHM/direct/direct
ubuntu:39964:40047 [3] NCCL INFO Connected all rings
ubuntu:39964:40051 [7] NCCL INFO Channel 16 : 7[7] -> 6[6] via SHM/direct/direct
ubuntu:39964:40049 [5] NCCL INFO Connected all rings
ubuntu:9621:9692 [7] NCCL INFO Channel 11 : 15[7] -> 14[6] via SHM/direct/direct
ubuntu:9621:9692 [7] NCCL INFO Channel 12 : 15[7] -> 14[6] via SHM/direct/direct
ubuntu:39964:40051 [7] NCCL INFO Channel 17 : 7[7] -> 6[6] via SHM/direct/direct
ubuntu:9621:9687 [2] misc/shmutils.cc:71 NCCL WARN Error: failed to extend /dev/shm/nccl-mEeIEW to 5967876 bytes
ubuntu:9621:9687 [2] misc/shmutils.cc:112 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-mEeIEW (size 5967872)
ubuntu:9621:9687 [2] NCCL INFO transport/shm.cc:114 -> 2
ubuntu:9621:9687 [2] NCCL INFO transport.cc:33 -> 2
ubuntu:9621:9687 [2] NCCL INFO transport.cc:97 -> 2
ubuntu:9621:9687 [2] NCCL INFO init.cc:1089 -> 2
ubuntu:9621:9687 [2] NCCL INFO init.cc:1358 -> 2
ubuntu:9621:9687 [2] NCCL INFO group.cc:65 -> 2 [Async thread]
ubuntu:9621:9690 [5] misc/shmutils.cc:71 NCCL WARN Error: failed to extend /dev/shm/nccl-xhD4R4 to 5967876 bytes
ubuntu:9621:9690 [5] misc/shmutils.cc:112 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-xhD4R4 (size 5967872)
ubuntu:9621:9690 [5] NCCL INFO transport/shm.cc:114 -> 2
ubuntu:9621:9690 [5] NCCL INFO transport.cc:33 -> 2
ubuntu:9621:9690 [5] NCCL INFO transport.cc:97 -> 2
ubuntu:9621:9690 [5] NCCL INFO init.cc:1089 -> 2
ubuntu:9621:9690 [5] NCCL INFO init.cc:1358 -> 2
ubuntu:9621:9690 [5] NCCL INFO group.cc:65 -> 2 [Async thread]
ubuntu:9621:9692 [7] NCCL INFO Channel 13 : 15[7] -> 14[6] via SHM/direct/direct
ubuntu:9621:9692 [7] misc/shmutils.cc:71 NCCL WARN Error: failed to extend /dev/shm/nccl-KDzU5c to 4100 bytes
ubuntu:9621:9692 [7] misc/shmutils.cc:112 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-KDzU5c (size 4096)
ubuntu:9621:9692 [7] NCCL INFO transport/shm.cc:91 -> 2
ubuntu:9621:9692 [7] NCCL INFO transport.cc:33 -> 2
ubuntu:9621:9692 [7] NCCL INFO transport.cc:106 -> 2
ubuntu:9621:9692 [7] NCCL INFO init.cc:1089 -> 2
ubuntu:9621:9692 [7] NCCL INFO init.cc:1358 -> 2
ubuntu:9621:9692 [7] NCCL INFO group.cc:65 -> 2 [Async thread]
ubuntu:9621:9686 [1] misc/shmutils.cc:71 NCCL WARN Error: failed to extend /dev/shm/nccl-TqOjkl to 5967876 bytes
ubuntu:9621:9686 [1] misc/shmutils.cc:112 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-TqOjkl (size 5967872)
ubuntu:9621:9686 [1] NCCL INFO transport/shm.cc:114 -> 2
ubuntu:9621:9686 [1] NCCL INFO transport.cc:33 -> 2
ubuntu:9621:9686 [1] NCCL INFO transport.cc:97 -> 2
ubuntu:9621:9686 [1] NCCL INFO init.cc:1089 -> 2
ubuntu:9621:9686 [1] NCCL INFO init.cc:1358 -> 2
ubuntu:9621:9686 [1] NCCL INFO group.cc:65 -> 2 [Async thread]
ubuntu:9621:9688 [3] misc/shmutils.cc:71 NCCL WARN Error: failed to extend /dev/shm/nccl-KWfVyt to 5967876 bytes
ubuntu:9621:9688 [3] misc/shmutils.cc:112 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-KWfVyt (size 5967872)
ubuntu:9621:9688 [3] NCCL INFO transport/shm.cc:114 -> 2
ubuntu:9621:9688 [3] NCCL INFO transport.cc:33 -> 2
ubuntu:9621:9688 [3] NCCL INFO transport.cc:97 -> 2
ubuntu:9621:9688 [3] NCCL INFO init.cc:1089 -> 2
ubuntu:9621:9688 [3] NCCL INFO init.cc:1358 -> 2
ubuntu:9621:9688 [3] NCCL INFO group.cc:65 -> 2 [Async thread]
ubuntu:9621:9691 [6] misc/shmutils.cc:71 NCCL WARN Error: failed to extend /dev/shm/nccl-fjwOOB to 5967876 bytes
ubuntu:9621:9691 [6] misc/shmutils.cc:112 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-fjwOOB (size 5967872)
ubuntu:9621:9691 [6] NCCL INFO transport/shm.cc:114 -> 2
ubuntu:9621:9691 [6] NCCL INFO transport.cc:33 -> 2
ubuntu:9621:9691 [6] NCCL INFO transport.cc:97 -> 2
ubuntu:9621:9691 [6] NCCL INFO init.cc:1089 -> 2
ubuntu:9621:9691 [6] NCCL INFO init.cc:1358 -> 2
ubuntu:9621:9691 [6] NCCL INFO group.cc:65 -> 2 [Async thread]
ubuntu:9621:9689 [4] misc/shmutils.cc:71 NCCL WARN Error: failed to extend /dev/shm/nccl-oA7T4J to 5967876 bytes
ubuntu:9621:9689 [4] misc/shmutils.cc:112 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-oA7T4J (size 5967872)
ubuntu:9621:9689 [4] NCCL INFO transport/shm.cc:114 -> 2
ubuntu:9621:9689 [4] NCCL INFO transport.cc:33 -> 2
ubuntu:9621:9689 [4] NCCL INFO transport.cc:97 -> 2
ubuntu:9621:9689 [4] NCCL INFO init.cc:1089 -> 2
ubuntu:9621:9689 [4] NCCL INFO init.cc:1358 -> 2
ubuntu:9621:9689 [4] NCCL INFO group.cc:65 -> 2 [Async thread]
ubuntu:9621:9685 [0] misc/shmutils.cc:71 NCCL WARN Error: failed to extend /dev/shm/nccl-zTYolS to 5967876 bytes
ubuntu:9621:9685 [0] misc/shmutils.cc:112 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-zTYolS (size 5967872)
ubuntu:9621:9685 [0] NCCL INFO transport/shm.cc:114 -> 2
ubuntu:9621:9685 [0] NCCL INFO transport.cc:33 -> 2
ubuntu:9621:9685 [0] NCCL INFO transport.cc:97 -> 2
ubuntu:9621:9685 [0] NCCL INFO init.cc:1089 -> 2
ubuntu:9621:9685 [0] NCCL INFO init.cc:1358 -> 2
ubuntu:9621:9685 [0] NCCL INFO group.cc:65 -> 2 [Async thread]
ubuntu:9621:9621 [7] NCCL INFO group.cc:406 -> 2
ubuntu:9621:9621 [7] NCCL INFO group.cc:96 -> 2
ubuntu: Test NCCL failure common.cu:958 'unhandled system error (run with NCCL_DEBUG=INFO for details) / '
 .. ubuntu pid 9621: Test failure common.cu:842
ubuntu:39964:40051 [7] NCCL INFO Channel 18 : 7[7] -> 6[6] via SHM/direct/direct
ubuntu:39964:40051 [7] NCCL INFO Channel 19 : 7[7] -> 6[6] via SHM/direct/direct
ubuntu:39964:40051 [7] NCCL INFO Channel 20 : 7[7] -> 6[6] via SHM/direct/direct
ubuntu:39964:40051 [7] NCCL INFO Channel 21 : 7[7] -> 6[6] via SHM/direct/direct
ubuntu:39964:40050 [6] NCCL INFO Channel 00 : 6[6] -> 5[5] via SHM/direct/direct
ubuntu:39964:40050 [6] NCCL INFO Channel 01 : 6[6] -> 5[5] via SHM/direct/direct
ubuntu:39964:40050 [6] NCCL INFO Channel 02 : 6[6] -> 5[5] via SHM/direct/direct
ubuntu:39964:40050 [6] NCCL INFO Channel 03 : 6[6] -> 5[5] via SHM/direct/direct
ubuntu:39964:40046 [2] NCCL INFO Channel 00 : 2[2] -> 1[1] via SHM/direct/direct
ubuntu:39964:40045 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
ubuntu:39964:40046 [2] NCCL INFO Channel 01 : 2[2] -> 1[1] via SHM/direct/direct
ubuntu:39964:40048 [4] NCCL INFO Channel 00 : 4[4] -> 3[3] via SHM/direct/direct
ubuntu:39964:40050 [6] NCCL INFO Channel 04 : 6[6] -> 5[5] via SHM/direct/direct
ubuntu:39964:40047 [3] NCCL INFO Channel 00 : 3[3] -> 2[2] via SHM/direct/direct
ubuntu:39964:40045 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
ubuntu:39964:40049 [5] NCCL INFO Channel 00 : 5[5] -> 4[4] via SHM/direct/direct
ubuntu:39964:40048 [4] NCCL INFO Channel 01 : 4[4] -> 3[3] via SHM/direct/direct
ubuntu:39964:40050 [6] NCCL INFO Channel
05 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 01 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 02 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 02 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 02 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 06 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 01 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 02 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 03 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 03 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 03 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 02 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 07 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 03 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 04 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 04 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 04 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 08 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 03 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 04 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 05 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 05 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 05 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 09 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 04 : 5[5] -> 4[4] via SHM/direct/direct 
ubuntu:39964:40047 [3] NCCL INFO Channel 05 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 06 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 06 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 06 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 10 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 05 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 06 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 07 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 07 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 07 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 06 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 11 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 07 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 08 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 08 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 08 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 07 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 12 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 08 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 09 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 09 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 09 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 08 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 13 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 09 : 3[3] -> 
2[2] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 10 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 10 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 10 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 09 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 10 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 14 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 11 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 11 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 11 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 10 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 15 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 11 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 12 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 12 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 12 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 11 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 16 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 12 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 13 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 13 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 13 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 12 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 17 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 13 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL 
INFO Channel 14 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 14 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 14 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 18 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 14 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 13 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 15 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 15 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 19 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 15 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 15 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 14 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 16 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 16 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 20 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 16 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 15 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 16 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 17 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 17 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:39964:40050 [6] NCCL INFO Channel 21 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 17 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 17 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 16 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 18 : 2[2] -> 1[1] via SHM/direct/direct 
ubuntu:39964:40045 [1] NCCL INFO Channel 18 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 18 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 18 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 17 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 19 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 19 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 18 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 19 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 19 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 20 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 20 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 20 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 20 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 19 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:39964:40046 [2] NCCL INFO Channel 21 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:39964:40045 [1] NCCL INFO Channel 21 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:39964:40047 [3] NCCL INFO Channel 21 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:39964:40048 [4] NCCL INFO Channel 21 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:39964:40075 [0] misc/socket.cc:483 NCCL WARN socketStartConnect: Connect to 10.0.0.2<45763> failed : Software caused connection abort ubuntu:39964:40075 [0] NCCL INFO misc/socket.cc:564 -> 2 ubuntu:39964:40075 [0] NCCL INFO misc/socket.cc:586 -> 2 ubuntu:39964:40075 [0] NCCL INFO transport/net_socket.cc:336 -> 2 ubuntu:39964:40075 [0] NCCL INFO transport/net.cc:592 -> 2 ubuntu:39964:40075 [0] NCCL INFO proxy.cc:1306 -> 2 ubuntu:39964:40075 [0] NCCL INFO proxy.cc:1377 -> 2 ubuntu:39964:40075 [0] proxy.cc:1519 NCCL WARN [Proxy 
Service 0] Failed to execute operation Connect from rank 0, retcode 2 ubuntu:39964:40049 [5] NCCL INFO Channel 20 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:39964:40049 [5] NCCL INFO Channel 21 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:39964:40044 [0] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer node1<57557> ubuntu:39964:40044 [0] NCCL INFO misc/socket.cc:749 -> 6 ubuntu:39964:40044 [0] proxy.cc:1143 NCCL WARN Socket recv failed while polling for opId=0x7f6f18cb7f40 ubuntu:39964:40044 [0] NCCL INFO transport/net.cc:288 -> 3 ubuntu:39964:40044 [0] NCCL INFO transport.cc:148 -> 3 ubuntu:39964:40044 [0] NCCL INFO init.cc:1089 -> 3 ubuntu:39964:40044 [0] NCCL INFO init.cc:1358 -> 3 ubuntu:39964:40044 [0] NCCL INFO group.cc:65 -> 3 [Async thread] ubuntu:39964:40051 [7] NCCL INFO Connected all trees ubuntu:39964:40051 [7] NCCL INFO NCCL_NTHREADS set by environment to 128. ubuntu:39964:40051 [7] NCCL INFO NCCL_ALGO set by environment to Tree ubuntu:39964:40051 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 ubuntu:39964:40051 [7] NCCL INFO 22 coll channels, 0 nvls channels, 32 p2p channels, 2 p2p channels per peer ubuntu:39964:40071 [7] NCCL INFO misc/socket.cc:805 -> 3 ubuntu:39964:40071 [7] proxy.cc:1495 NCCL WARN [Service thread] Could not receive type from localRank 7, res=3, closed=0 ubuntu:39964:40071 [7] proxy.cc:1519 NCCL WARN [Proxy Service 7] Failed to execute operation Init from rank 7, retcode 3 ubuntu:39964:40051 [7] NCCL INFO misc/socket.cc:46 -> 3 ubuntu:39964:40051 [7] NCCL INFO misc/socket.cc:57 -> 3 ubuntu:39964:40051 [7] NCCL INFO misc/socket.cc:772 -> 3 ubuntu:39964:40051 [7] NCCL INFO proxy.cc:1107 -> 3 ubuntu:39964:40051 [7] NCCL INFO proxy.cc:1193 -> 3 ubuntu:39964:40051 [7] NCCL INFO proxy.cc:1047 -> 3 ubuntu:39964:40051 [7] NCCL INFO init.cc:1185 -> 3 ubuntu:39964:40051 [7] NCCL INFO init.cc:1358 -> 3 ubuntu:39964:40051 [7] NCCL INFO group.cc:65 -> 3 [Async thread] ubuntu:39964:40047 [3] NCCL INFO 
Connected all trees
ubuntu:39964:40047 [3] NCCL INFO NCCL_ALGO set by environment to Tree
ubuntu:39964:40047 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
ubuntu:39964:40047 [3] NCCL INFO 22 coll channels, 0 nvls channels, 32 p2p channels, 2 p2p channels per peer
ubuntu:39964:40072 [3] NCCL INFO misc/socket.cc:46 -> 3
ubuntu:39964:40072 [3] NCCL INFO misc/socket.cc:749 -> 3
ubuntu:39964:40072 [3] NCCL INFO misc/socket.cc:427 -> 3
ubuntu:39964:40072 [3] NCCL INFO misc/socket.cc:561 -> 3
ubuntu:39964:40072 [3] NCCL INFO misc/socket.cc:665 -> 3
ubuntu:39964:40072 [3] proxy.cc:1456 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
ubuntu:39964:40047 [3] NCCL INFO misc/socket.cc:46 -> 3
ubuntu:39964:40047 [3] NCCL INFO misc/socket.cc:547 -> 3
ubuntu:39964:40047 [3] NCCL INFO misc/socket.cc:570 -> 3
ubuntu:39964:40047 [3] NCCL INFO misc/socket.cc:618 -> 3
ubuntu:39964:40047 [3] NCCL INFO proxy.cc:1034 -> 3
ubuntu:39964:40047 [3] NCCL INFO init.cc:1185 -> 3
ubuntu:39964:40047 [3] NCCL INFO init.cc:1358 -> 3
ubuntu:39964:40047 [3] NCCL INFO group.cc:65 -> 3 [Async thread]
ubuntu:39964:40049 [5] NCCL INFO Connected all trees
ubuntu:39964:40049 [5] NCCL INFO NCCL_ALGO set by environment to Tree
ubuntu:39964:40049 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
ubuntu:39964:40049 [5] NCCL INFO 22 coll channels, 0 nvls channels, 32 p2p channels, 2 p2p channels per peer
ubuntu:39964:40069 [5] NCCL INFO misc/socket.cc:46 -> 3
ubuntu:39964:40069 [5] NCCL INFO misc/socket.cc:749 -> 3
ubuntu:39964:40069 [5] NCCL INFO misc/socket.cc:427 -> 3
ubuntu:39964:40069 [5] NCCL INFO misc/socket.cc:561 -> 3
ubuntu:39964:40069 [5] NCCL INFO misc/socket.cc:665 -> 3
ubuntu:39964:40069 [5] proxy.cc:1456 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
ubuntu:39964:40049 [5] NCCL INFO misc/socket.cc:46 -> 3
ubuntu:39964:40049 [5] NCCL INFO misc/socket.cc:547 -> 3
ubuntu:39964:40049 [5] NCCL INFO misc/socket.cc:570 -> 3
ubuntu:39964:40049 [5] NCCL INFO misc/socket.cc:618 -> 3
ubuntu:39964:40049 [5] NCCL INFO proxy.cc:1034 -> 3
ubuntu:39964:40049 [5] NCCL INFO init.cc:1185 -> 3
ubuntu:39964:40049 [5] NCCL INFO init.cc:1358 -> 3
ubuntu:39964:40049 [5] NCCL INFO group.cc:65 -> 3 [Async thread]
ubuntu:39964:40045 [1] NCCL INFO Connected all trees
ubuntu:39964:40045 [1] NCCL INFO NCCL_ALGO set by environment to Tree
ubuntu:39964:40045 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
ubuntu:39964:40045 [1] NCCL INFO 22 coll channels, 0 nvls channels, 32 p2p channels, 2 p2p channels per peer
ubuntu:39964:40073 [1] NCCL INFO misc/socket.cc:46 -> 3
ubuntu:39964:40073 [1] NCCL INFO misc/socket.cc:749 -> 3
ubuntu:39964:40073 [1] NCCL INFO misc/socket.cc:427 -> 3
ubuntu:39964:40073 [1] NCCL INFO misc/socket.cc:561 -> 3
ubuntu:39964:40073 [1] NCCL INFO misc/socket.cc:665 -> 3
ubuntu:39964:40073 [1] proxy.cc:1456 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
ubuntu:39964:40045 [1] NCCL INFO misc/socket.cc:46 -> 3
ubuntu:39964:40045 [1] NCCL INFO misc/socket.cc:547 -> 3
ubuntu:39964:40045 [1] NCCL INFO misc/socket.cc:570 -> 3
ubuntu:39964:40045 [1] NCCL INFO misc/socket.cc:618 -> 3
ubuntu:39964:40045 [1] NCCL INFO proxy.cc:1034 -> 3
ubuntu:39964:40045 [1] NCCL INFO init.cc:1185 -> 3
ubuntu:39964:40045 [1] NCCL INFO init.cc:1358 -> 3
ubuntu:39964:40045 [1] NCCL INFO group.cc:65 -> 3 [Async thread]
ubuntu:39964:40046 [2] NCCL INFO Connected all trees
ubuntu:39964:40046 [2] NCCL INFO NCCL_ALGO set by environment to Tree
ubuntu:39964:40046 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
ubuntu:39964:40046 [2] NCCL INFO 22 coll channels, 0 nvls channels, 32 p2p channels, 2 p2p channels per peer
ubuntu:39964:40070 [2] NCCL INFO misc/socket.cc:46 -> 3
ubuntu:39964:40070 [2] NCCL INFO misc/socket.cc:749 -> 3
ubuntu:39964:40070 [2] NCCL INFO misc/socket.cc:427 -> 3
ubuntu:39964:40070 [2] NCCL INFO misc/socket.cc:561 -> 3
ubuntu:39964:40070 [2] NCCL INFO misc/socket.cc:665 -> 3
ubuntu:39964:40070 [2] proxy.cc:1456 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
ubuntu:39964:40046 [2] NCCL INFO misc/socket.cc:46 -> 3
ubuntu:39964:40046 [2] NCCL INFO misc/socket.cc:547 -> 3
ubuntu:39964:40046 [2] NCCL INFO misc/socket.cc:570 -> 3
ubuntu:39964:40046 [2] NCCL INFO misc/socket.cc:618 -> 3
ubuntu:39964:40046 [2] NCCL INFO proxy.cc:1034 -> 3
ubuntu:39964:40046 [2] NCCL INFO init.cc:1185 -> 3
ubuntu:39964:40046 [2] NCCL INFO init.cc:1358 -> 3
ubuntu:39964:40046 [2] NCCL INFO group.cc:65 -> 3 [Async thread]
ubuntu:39964:40050 [6] NCCL INFO Connected all trees
ubuntu:39964:40050 [6] NCCL INFO NCCL_ALGO set by environment to Tree
ubuntu:39964:40050 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
ubuntu:39964:40050 [6] NCCL INFO 22 coll channels, 0 nvls channels, 32 p2p channels, 2 p2p channels per peer
ubuntu:39964:40068 [6] NCCL INFO misc/socket.cc:46 -> 3
ubuntu:39964:40068 [6] NCCL INFO misc/socket.cc:749 -> 3
ubuntu:39964:40068 [6] NCCL INFO misc/socket.cc:427 -> 3
ubuntu:39964:40068 [6] NCCL INFO misc/socket.cc:561 -> 3
ubuntu:39964:40068 [6] NCCL INFO misc/socket.cc:665 -> 3
ubuntu:39964:40068 [6] proxy.cc:1456 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
ubuntu:39964:40050 [6] NCCL INFO misc/socket.cc:46 -> 3
ubuntu:39964:40050 [6] NCCL INFO misc/socket.cc:547 -> 3
ubuntu:39964:40050 [6] NCCL INFO misc/socket.cc:570 -> 3
ubuntu:39964:40050 [6] NCCL INFO misc/socket.cc:618 -> 3
ubuntu:39964:40050 [6] NCCL INFO proxy.cc:1034 -> 3
ubuntu:39964:40050 [6] NCCL INFO init.cc:1185 -> 3
ubuntu:39964:40050 [6] NCCL INFO init.cc:1358 -> 3
ubuntu:39964:40050 [6] NCCL INFO group.cc:65 -> 3 [Async thread]
ubuntu:39964:40048 [4] NCCL INFO Connected all trees
ubuntu:39964:40048 [4] NCCL INFO NCCL_ALGO set by environment to Tree
ubuntu:39964:40048 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
ubuntu:39964:40048 [4] NCCL INFO 22 coll channels, 0 nvls channels, 32 p2p channels, 2 p2p channels per peer
ubuntu:39964:40074 [4] NCCL INFO misc/socket.cc:46 -> 3
ubuntu:39964:40074 [4] NCCL INFO misc/socket.cc:749 -> 3
ubuntu:39964:40074 [4] NCCL INFO misc/socket.cc:427 -> 3
ubuntu:39964:40074 [4] NCCL INFO misc/socket.cc:561 -> 3
ubuntu:39964:40074 [4] NCCL INFO misc/socket.cc:665 -> 3
ubuntu:39964:40074 [4] proxy.cc:1456 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
ubuntu:39964:40048 [4] NCCL INFO misc/socket.cc:46 -> 3
ubuntu:39964:40048 [4] NCCL INFO misc/socket.cc:547 -> 3
ubuntu:39964:40048 [4] NCCL INFO misc/socket.cc:570 -> 3
ubuntu:39964:40048 [4] NCCL INFO misc/socket.cc:618 -> 3
ubuntu:39964:40048 [4] NCCL INFO proxy.cc:1034 -> 3
ubuntu:39964:40048 [4] NCCL INFO init.cc:1185 -> 3
ubuntu:39964:40048 [4] NCCL INFO init.cc:1358 -> 3
ubuntu:39964:40048 [4] NCCL INFO group.cc:65 -> 3 [Async thread]
ubuntu:39964:39964 [7] NCCL INFO group.cc:406 -> 3
ubuntu:39964:39964 [7] NCCL INFO group.cc:96 -> 3
ubuntu: Test NCCL failure common.cu:958 'internal error - please report this issue to the NCCL developers / '
 .. ubuntu pid 39964: Test failure common.cu:842
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[39336,1],1]
  Exit code: 3
--------------------------------------------------------------------------
```
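
For anyone triaging a log like the one above, here is a minimal sketch of a filter that pulls out the `NCCL WARN` lines and the `file:line -> code` chains per device. It assumes the line layout seen in this thread (`host:pid:tid [dev] [file:line] NCCL LEVEL message`), and `summarize`, `LINE_RE`, `TRACE_RE` and `SAMPLE` are illustrative names, not part of NCCL or nccl-tests. In `ncclResult_t`, the value 3 is `ncclInternalError`, which is consistent with the "internal error" text in the final failure message.

```python
import re
from collections import defaultdict

# Assumed layout of an NCCL_DEBUG=INFO line, inferred from the log in this thread:
#   host:pid:tid [dev] [file:line] NCCL LEVEL message
LINE_RE = re.compile(
    r"^(?P<host>\S+?):(?P<pid>\d+):(?P<tid>\d+) \[(?P<dev>\d+)\] "
    r"(?:(?P<src>\S+:\d+) )?NCCL (?P<level>INFO|WARN) (?P<msg>.*)$"
)
# Error-propagation entries look like "misc/socket.cc:46 -> 3".
TRACE_RE = re.compile(r"^(?P<src>\S+:\d+) -> (?P<code>\d+)")


def summarize(log_text):
    """Return (warnings, traces) extracted from NCCL debug output.

    warnings: list of (device, source location, message) for each WARN line.
    traces:   dict mapping device -> list of (source location, return code).
    """
    warnings = []
    traces = defaultdict(list)
    for raw in log_text.splitlines():
        m = LINE_RE.match(raw.strip())
        if m is None:
            continue  # not an NCCL log line, e.g. mpirun output
        if m["level"] == "WARN":
            warnings.append((int(m["dev"]), m["src"], m["msg"]))
            continue
        t = TRACE_RE.match(m["msg"])
        if t:
            traces[int(m["dev"])].append((t["src"], int(t["code"])))
    return warnings, dict(traces)


# A few lines copied from the log above, one entry per line.
SAMPLE = """\
ubuntu:39964:40072 [3] NCCL INFO misc/socket.cc:665 -> 3
ubuntu:39964:40072 [3] proxy.cc:1456 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable
ubuntu:39964:40047 [3] NCCL INFO proxy.cc:1034 -> 3
ubuntu:39964:40047 [3] NCCL INFO group.cc:65 -> 3 [Async thread]
"""

if __name__ == "__main__":
    warns, chains = summarize(SAMPLE)
    for dev, src, msg in warns:
        print(f"dev {dev}: WARN at {src}: {msg}")
    for dev, chain in chains.items():
        print(f"dev {dev}: " + " <- ".join(src for src, _ in chain))
```

The filter expects one log entry per line, so a log pasted as a single run of text would need to be split first.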


```
mpirun --allow-run-as-root -np 2 --hostfile ./hostMPI -x SHELL -x LD_LIBRARY_PATH -x PATH -x LD_PRELOAD -x NCCL_ALGO -x NCCL_BUFFSIZE -x NCCL_CHECK_POINTERS -x NCCL_COMM_BLOCKING -x NCCL_CROSS_NIC -x NCCL_DEBUG -x NCCL_DMABUF_ENABLE -x NCCL_GDR_READ -x NCCL_GRAPH_MIXING_SUPPORT -x NCCL_GRAPH_REGISTER -x NCCL_IGNORE_CPU_AFFINITY -x NCCL_LAUNCH_MODE -x NCCL_MAX_NCHANNELS -x NCCL_MIN_NCHANNELS -x NCCL_NET_GDR_LEVEL -x NCCL_NET_SHARED_BUFFERS -x NCCL_NET_SHARED_COMMS -x NCCL_NSOCKS_PERTHREAD -x NCCL_NTHREADS -x NCCL_NVB_DISABLE -x NCCL_P2P_DIRECT_DISABLE -x NCCL_P2P_DISABLE -x NCCL_P2P_LEVEL -x NCCL_P2P_LL_THRESHOLD /root/cxwang/nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
# nThread 1 nGpus 8 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 40273 on ubuntu device 0 [0x00] Tesla V100-SXM2-32GB
# Rank 1 Group 0 Pid 40273 on ubuntu device 1 [0x00] Tesla V100-SXM2-32GB
# Rank 2 Group 0 Pid 40273 on ubuntu device 2 [0x00] Tesla V100-SXM2-32GB
# Rank 3 Group 0 Pid 40273 on ubuntu device 3 [0x00] Tesla V100-SXM2-32GB
# Rank 4 Group 0 Pid 40273 on ubuntu device 4 [0x00] Tesla V100-SXM2-32GB
# Rank 5 Group 0 Pid 40273 on ubuntu device 5 [0x00] Tesla V100-SXM2-32GB
# Rank 6 Group 0 Pid 40273 on ubuntu device 6 [0x00] Tesla V100-SXM2-32GB
# Rank 7 Group 0 Pid 40273 on ubuntu device 7 [0x00] Tesla V100-SXM2-32GB
# Rank 8 Group 0 Pid 10196 on ubuntu device 0 [0x00] Tesla V100-SXM2-32GB
# Rank 9 Group 0 Pid 10196 on ubuntu device 1 [0x00] Tesla V100-SXM2-32GB
# Rank 10 Group 0 Pid 10196 on ubuntu device 2 [0x00] Tesla V100-SXM2-32GB
# Rank 11 Group 0 Pid 10196 on ubuntu device 3 [0x00] Tesla V100-SXM2-32GB
# Rank 12 Group 0 Pid 10196 on ubuntu device 4 [0x00] Tesla V100-SXM2-32GB
# Rank 13 Group 0 Pid 10196 on ubuntu device 5 [0x00] Tesla V100-SXM2-32GB
# Rank 14 Group 0 Pid 10196 on ubuntu device 6 [0x00] Tesla V100-SXM2-32GB
# Rank 15 Group 0 Pid 10196 on ubuntu device 7 [0x00] Tesla V100-SXM2-32GB
ubuntu:40273:40273 [0] NCCL INFO Bootstrap : Using ens4:10.0.0.1<0>
ubuntu:40273:40273 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ubuntu:40273:40273 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
ubuntu:40273:40273 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.18.3+cuda12.2
ubuntu:40273:40273 [0] NCCL INFO NCCL_COMM_BLOCKING set by environment to 1.
ubuntu:10196:10196 [0] misc/cudawrap.cc:179 NCCL WARN Failed to find CUDA library libcuda.so (NCCL_CUDA_PATH='') :
ubuntu:10196:10196 [0] NCCL INFO Bootstrap : Using ens4:10.0.0.2<0>
ubuntu:10196:10196 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
ubuntu:10196:10196 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
ubuntu:10196:10196 [0] NCCL INFO NCCL_COMM_BLOCKING set by environment to 1.
ubuntu:10196:10266 [6] NCCL INFO Failed to open libibverbs.so[.1]
ubuntu:10196:10266 [6] NCCL INFO NET/Socket : Using [0]ens4:10.0.0.2<0>
ubuntu:10196:10266 [6] NCCL INFO Using network Socket
ubuntu:10196:10266 [6] NCCL INFO NCCL_CHECK_POINTERS set by environment to 0.
ubuntu:10196:10266 [6] NCCL INFO NCCL_DMABUF_ENABLE set by environment to 1.
ubuntu:40273:40374 [6] NCCL INFO Failed to open libibverbs.so[.1]
ubuntu:40273:40374 [6] NCCL INFO NET/Socket : Using [0]ens4:10.0.0.1<0>
ubuntu:40273:40374 [6] NCCL INFO Using network Socket
ubuntu:40273:40374 [6] NCCL INFO NCCL_CHECK_POINTERS set by environment to 0.
ubuntu:40273:40374 [6] NCCL INFO NCCL_DMABUF_ENABLE set by environment to 1.
ubuntu:10196:10260 [0] NCCL INFO Using network Socket ubuntu:40273:40370 [2] NCCL INFO Using network Socket ubuntu:10196:10262 [2] NCCL INFO Using network Socket ubuntu:40273:40368 [0] NCCL INFO Using network Socket ubuntu:10196:10261 [1] NCCL INFO Using network Socket ubuntu:40273:40373 [5] NCCL INFO Using network Socket ubuntu:10196:10263 [3] NCCL INFO Using network Socket ubuntu:10196:10267 [7] NCCL INFO Using network Socket ubuntu:40273:40371 [3] NCCL INFO Using network Socket ubuntu:40273:40375 [7] NCCL INFO Using network Socket ubuntu:10196:10265 [5] NCCL INFO Using network Socket ubuntu:10196:10264 [4] NCCL INFO Using network Socket ubuntu:40273:40369 [1] NCCL INFO Using network Socket ubuntu:40273:40372 [4] NCCL INFO Using network Socket ubuntu:10196:10267 [7] NCCL INFO comm 0x5566b1d234f0 rank 15 nranks 16 cudaDev 7 nvmlDev 7 busId c0 commId 0x2bd9ec6b9b271c70 - Init START ubuntu:40273:40369 [1] NCCL INFO comm 0x5579e1f83680 rank 1 nranks 16 cudaDev 1 nvmlDev 1 busId 60 commId 0x2bd9ec6b9b271c70 - Init START ubuntu:10196:10264 [4] NCCL INFO comm 0x5566b1ce6890 rank 12 nranks 16 cudaDev 4 nvmlDev 4 busId 90 commId 0x2bd9ec6b9b271c70 - Init START ubuntu:40273:40370 [2] NCCL INFO comm 0x5579e1f97aa0 rank 2 nranks 16 cudaDev 2 nvmlDev 2 busId 70 commId 0x2bd9ec6b9b271c70 - Init START ubuntu:10196:10265 [5] NCCL INFO comm 0x5566b1cfacb0 rank 13 nranks 16 cudaDev 5 nvmlDev 5 busId a0 commId 0x2bd9ec6b9b271c70 - Init START ubuntu:10196:10266 [6] NCCL INFO comm 0x5566b1d0f0d0 rank 14 nranks 16 cudaDev 6 nvmlDev 6 busId b0 commId 0x2bd9ec6b9b271c70 - Init START ubuntu:40273:40368 [0] NCCL INFO comm 0x5579e1f6f2c0 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 50 commId 0x2bd9ec6b9b271c70 - Init START ubuntu:40273:40371 [3] NCCL INFO comm 0x5579e1fabd80 rank 3 nranks 16 cudaDev 3 nvmlDev 3 busId 80 commId 0x2bd9ec6b9b271c70 - Init START ubuntu:40273:40372 [4] NCCL INFO comm 0x5579e1fc01a0 rank 4 nranks 16 cudaDev 4 nvmlDev 4 busId 90 commId 0x2bd9ec6b9b271c70 - Init 
START ubuntu:40273:40373 [5] NCCL INFO comm 0x5579e1fd45c0 rank 5 nranks 16 cudaDev 5 nvmlDev 5 busId a0 commId 0x2bd9ec6b9b271c70 - Init START ubuntu:40273:40374 [6] NCCL INFO comm 0x5579e1fe89e0 rank 6 nranks 16 cudaDev 6 nvmlDev 6 busId b0 commId 0x2bd9ec6b9b271c70 - Init START ubuntu:40273:40375 [7] NCCL INFO comm 0x5579e1ffce00 rank 7 nranks 16 cudaDev 7 nvmlDev 7 busId c0 commId 0x2bd9ec6b9b271c70 - Init START ubuntu:10196:10261 [1] NCCL INFO comm 0x5566b1ca9d70 rank 9 nranks 16 cudaDev 1 nvmlDev 1 busId 60 commId 0x2bd9ec6b9b271c70 - Init START ubuntu:10196:10262 [2] NCCL INFO comm 0x5566b1cbe190 rank 10 nranks 16 cudaDev 2 nvmlDev 2 busId 70 commId 0x2bd9ec6b9b271c70 - Init START ubuntu:10196:10263 [3] NCCL INFO comm 0x5566b1cd2470 rank 11 nranks 16 cudaDev 3 nvmlDev 3 busId 80 commId 0x2bd9ec6b9b271c70 - Init START ubuntu:10196:10260 [0] NCCL INFO comm 0x5566b1c959f0 rank 8 nranks 16 cudaDev 0 nvmlDev 0 busId 50 commId 0x2bd9ec6b9b271c70 - Init START ubuntu:40273:40369 [1] NCCL INFO NCCL_NVB_DISABLE set by environment to 1. ubuntu:40273:40369 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC ubuntu:40273:40369 [1] NCCL INFO NCCL_IGNORE_CPU_AFFINITY set by environment to 0. ubuntu:40273:40369 [1] NCCL INFO NVLS multicast support is not available on dev 1 ubuntu:40273:40369 [1] NCCL INFO NCCL_CROSS_NIC set by environment to 2. ubuntu:10196:10266 [6] NCCL INFO NCCL_NVB_DISABLE set by environment to 1. ubuntu:10196:10266 [6] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC ubuntu:10196:10266 [6] NCCL INFO NCCL_IGNORE_CPU_AFFINITY set by environment to 0. ubuntu:10196:10266 [6] NCCL INFO NCCL_CROSS_NIC set by environment to 2. 
ubuntu:40273:40370 [2] NCCL INFO NVLS multicast support is not available on dev 2 ubuntu:40273:40371 [3] NCCL INFO NVLS multicast support is not available on dev 3 ubuntu:40273:40375 [7] NCCL INFO NVLS multicast support is not available on dev 7 ubuntu:40273:40368 [0] NCCL INFO NVLS multicast support is not available on dev 0 ubuntu:40273:40374 [6] NCCL INFO NVLS multicast support is not available on dev 6 ubuntu:40273:40373 [5] NCCL INFO NVLS multicast support is not available on dev 5 ubuntu:40273:40372 [4] NCCL INFO NVLS multicast support is not available on dev 4 ubuntu:40273:40368 [0] NCCL INFO NCCL_MAX_NCHANNELS set by environment to 55. ubuntu:40273:40368 [0] NCCL INFO NCCL_MIN_NCHANNELS set by environment to 21. ubuntu:40273:40368 [0] NCCL INFO Channel 00/21 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ubuntu:40273:40368 [0] NCCL INFO Channel 01/21 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ubuntu:40273:40368 [0] NCCL INFO Channel 02/21 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ubuntu:40273:40368 [0] NCCL INFO Channel 03/21 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ubuntu:40273:40368 [0] NCCL INFO Channel 04/21 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ubuntu:40273:40368 [0] NCCL INFO Channel 05/21 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ubuntu:40273:40368 [0] NCCL INFO Channel 06/21 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ubuntu:40273:40368 [0] NCCL INFO Channel 07/21 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ubuntu:40273:40368 [0] NCCL INFO Channel 08/21 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ubuntu:40273:40368 [0] NCCL INFO Channel 09/21 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ubuntu:40273:40368 [0] NCCL INFO Channel 10/21 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ubuntu:40273:40368 [0] NCCL INFO Channel 11/21 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ubuntu:40273:40368 [0] NCCL INFO Channel 12/21 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ubuntu:40273:40368 [0] NCCL INFO Channel 13/21 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ubuntu:40273:40368 [0] NCCL INFO Channel 14/21 : 0 
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ubuntu:40273:40368 [0] NCCL INFO Channel 15/21 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ubuntu:40273:40368 [0] NCCL INFO Channel 16/21 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ubuntu:40273:40368 [0] NCCL INFO Channel 17/21 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ubuntu:40273:40368 [0] NCCL INFO Channel 18/21 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ubuntu:40273:40368 [0] NCCL INFO Channel 19/21 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ubuntu:40273:40368 [0] NCCL INFO Channel 20/21 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ubuntu:40273:40368 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] 1/-1/-1->0->8 [2] 1/8/-1->0->-1 [3] 1/-1/-1->0->8 [4] 1/8/-1->0->-1 [5] 1/-1/-1->0->8 [6] 1/8/-1->0->-1 [7] 1/-1/-1->0->8 [8] 1/8/-1->0->-1 [9] 1/-1/-1->0->8 [10] 1/8/-1->0->-1 [11] 1/-1/-1->0->8 [12] 1/8/-1->0->-1 [13] 1/-1/-1->0->8 [14] 1/8/-1->0->-1 [15] 1/-1/-1->0->8 [16] 1/8/-1->0->-1 [17] 1/-1/-1->0->8 [18] 1/8/-1->0->-1 [19] 1/-1/-1->0->8 [20] 1/8/-1->0->-1 ubuntu:40273:40368 [0] NCCL INFO NCCL_BUFFSIZE set by environment to 4194304. ubuntu:40273:40368 [0] NCCL INFO P2P Chunksize set to 131072 ubuntu:40273:40368 [0] NCCL INFO NCCL_GRAPH_MIXING_SUPPORT set by environment to 1. ubuntu:10196:10263 [3] NCCL INFO NCCL_MAX_NCHANNELS set by environment to 55. ubuntu:10196:10263 [3] NCCL INFO NCCL_MIN_NCHANNELS set by environment to 21. ubuntu:10196:10263 [3] NCCL INFO Trees [0] 12/-1/-1->11->10 [1] 12/-1/-1->11->10 [2] 12/-1/-1->11->10 [3] 12/-1/-1->11->10 [4] 12/-1/-1->11->10 [5] 12/-1/-1->11->10 [6] 12/-1/-1->11->10 [7] 12/-1/-1->11->10 [8] 12/-1/-1->11->10 [9] 12/-1/-1->11->10 [10] 12/-1/-1->11->10 [11] 12/-1/-1->11->10 [12] 12/-1/-1->11->10 [13] 12/-1/-1->11->10 [14] 12/-1/-1->11->10 [15] 12/-1/-1->11->10 [16] 12/-1/-1->11->10 [17] 12/-1/-1->11->10 [18] 12/-1/-1->11->10 [19] 12/-1/-1->11->10 [20] 12/-1/-1->11->10 ubuntu:10196:10263 [3] NCCL INFO NCCL_BUFFSIZE set by environment to 4194304. 
ubuntu:10196:10263 [3] NCCL INFO P2P Chunksize set to 131072 ubuntu:10196:10263 [3] NCCL INFO NCCL_GRAPH_MIXING_SUPPORT set by environment to 1. ubuntu:40273:40369 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0 [16] 2/-1/-1->1->0 [17] 2/-1/-1->1->0 [18] 2/-1/-1->1->0 [19] 2/-1/-1->1->0 [20] 2/-1/-1->1->0 ubuntu:40273:40369 [1] NCCL INFO P2P Chunksize set to 131072 ubuntu:10196:10262 [2] NCCL INFO Trees [0] 11/-1/-1->10->9 [1] 11/-1/-1->10->9 [2] 11/-1/-1->10->9 [3] 11/-1/-1->10->9 [4] 11/-1/-1->10->9 [5] 11/-1/-1->10->9 [6] 11/-1/-1->10->9 [7] 11/-1/-1->10->9 [8] 11/-1/-1->10->9 [9] 11/-1/-1->10->9 [10] 11/-1/-1->10->9 [11] 11/-1/-1->10->9 [12] 11/-1/-1->10->9 [13] 11/-1/-1->10->9 [14] 11/-1/-1->10->9 [15] 11/-1/-1->10->9 [16] 11/-1/-1->10->9 [17] 11/-1/-1->10->9 [18] 11/-1/-1->10->9 [19] 11/-1/-1->10->9 [20] 11/-1/-1->10->9 ubuntu:10196:10262 [2] NCCL INFO P2P Chunksize set to 131072 ubuntu:10196:10261 [1] NCCL INFO Trees [0] 10/-1/-1->9->8 [1] 10/-1/-1->9->8 [2] 10/-1/-1->9->8 [3] 10/-1/-1->9->8 [4] 10/-1/-1->9->8 [5] 10/-1/-1->9->8 [6] 10/-1/-1->9->8 [7] 10/-1/-1->9->8 [8] 10/-1/-1->9->8 [9] 10/-1/-1->9->8 [10] 10/-1/-1->9->8 [11] 10/-1/-1->9->8 [12] 10/-1/-1->9->8 [13] 10/-1/-1->9->8 [14] 10/-1/-1->9->8 [15] 10/-1/-1->9->8 [16] 10/-1/-1->9->8 [17] 10/-1/-1->9->8 [18] 10/-1/-1->9->8 [19] 10/-1/-1->9->8 [20] 10/-1/-1->9->8 ubuntu:10196:10261 [1] NCCL INFO P2P Chunksize set to 131072 ubuntu:10196:10264 [4] NCCL INFO Trees [0] 13/-1/-1->12->11 [1] 13/-1/-1->12->11 [2] 13/-1/-1->12->11 [3] 13/-1/-1->12->11 [4] 13/-1/-1->12->11 [5] 13/-1/-1->12->11 [6] 13/-1/-1->12->11 [7] 13/-1/-1->12->11 [8] 13/-1/-1->12->11 [9] 13/-1/-1->12->11 [10] 13/-1/-1->12->11 [11] 13/-1/-1->12->11 [12] 13/-1/-1->12->11 [13] 
13/-1/-1->12->11 [14] 13/-1/-1->12->11 [15] 13/-1/-1->12->11 [16] 13/-1/-1->12->11 [17] 13/-1/-1->12->11 [18] 13/-1/-1->12->11 [19] 13/-1/-1->12->11 [20] 13/-1/-1->12->11 ubuntu:10196:10264 [4] NCCL INFO P2P Chunksize set to 131072 ubuntu:10196:10265 [5] NCCL INFO Trees [0] 14/-1/-1->13->12 [1] 14/-1/-1->13->12 [2] 14/-1/-1->13->12 [3] 14/-1/-1->13->12 [4] 14/-1/-1->13->12 [5] 14/-1/-1->13->12 [6] 14/-1/-1->13->12 [7] 14/-1/-1->13->12 [8] 14/-1/-1->13->12 [9] 14/-1/-1->13->12 [10] 14/-1/-1->13->12 [11] 14/-1/-1->13->12 [12] 14/-1/-1->13->12 [13] 14/-1/-1->13->12 [14] 14/-1/-1->13->12 [15] 14/-1/-1->13->12 [16] 14/-1/-1->13->12 [17] 14/-1/-1->13->12 [18] 14/-1/-1->13->12 [19] 14/-1/-1->13->12 [20] 14/-1/-1->13->12 ubuntu:10196:10265 [5] NCCL INFO P2P Chunksize set to 131072 ubuntu:10196:10267 [7] NCCL INFO Trees [0] -1/-1/-1->15->14 [1] -1/-1/-1->15->14 [2] -1/-1/-1->15->14 [3] -1/-1/-1->15->14 [4] -1/-1/-1->15->14 [5] -1/-1/-1->15->14 [6] -1/-1/-1->15->14 [7] -1/-1/-1->15->14 [8] -1/-1/-1->15->14 [9] -1/-1/-1->15->14 [10] -1/-1/-1->15->14 [11] -1/-1/-1->15->14 [12] -1/-1/-1->15->14 [13] -1/-1/-1->15->14 [14] -1/-1/-1->15->14 [15] -1/-1/-1->15->14 [16] -1/-1/-1->15->14 [17] -1/-1/-1->15->14 [18] -1/-1/-1->15->14 [19] -1/-1/-1->15->14 [20] -1/-1/-1->15->14 ubuntu:10196:10267 [7] NCCL INFO P2P Chunksize set to 131072 ubuntu:10196:10266 [6] NCCL INFO Trees [0] 15/-1/-1->14->13 [1] 15/-1/-1->14->13 [2] 15/-1/-1->14->13 [3] 15/-1/-1->14->13 [4] 15/-1/-1->14->13 [5] 15/-1/-1->14->13 [6] 15/-1/-1->14->13 [7] 15/-1/-1->14->13 [8] 15/-1/-1->14->13 [9] 15/-1/-1->14->13 [10] 15/-1/-1->14->13 [11] 15/-1/-1->14->13 [12] 15/-1/-1->14->13 [13] 15/-1/-1->14->13 [14] 15/-1/-1->14->13 [15] 15/-1/-1->14->13 [16] 15/-1/-1->14->13 [17] 15/-1/-1->14->13 [18] 15/-1/-1->14->13 [19] 15/-1/-1->14->13 [20] 15/-1/-1->14->13 ubuntu:10196:10266 [6] NCCL INFO P2P Chunksize set to 131072 ubuntu:40273:40371 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 
[4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2 [16] 4/-1/-1->3->2 [17] 4/-1/-1->3->2 [18] 4/-1/-1->3->2 [19] 4/-1/-1->3->2 [20] 4/-1/-1->3->2 ubuntu:40273:40371 [3] NCCL INFO P2P Chunksize set to 131072 ubuntu:40273:40375 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6 [16] -1/-1/-1->7->6 [17] -1/-1/-1->7->6 [18] -1/-1/-1->7->6 [19] -1/-1/-1->7->6 [20] -1/-1/-1->7->6 ubuntu:40273:40375 [7] NCCL INFO P2P Chunksize set to 131072 ubuntu:40273:40370 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1 [16] 3/-1/-1->2->1 [17] 3/-1/-1->2->1 [18] 3/-1/-1->2->1 [19] 3/-1/-1->2->1 [20] 3/-1/-1->2->1 ubuntu:40273:40370 [2] NCCL INFO P2P Chunksize set to 131072 ubuntu:40273:40372 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3 [16] 5/-1/-1->4->3 [17] 5/-1/-1->4->3 [18] 5/-1/-1->4->3 [19] 5/-1/-1->4->3 [20] 5/-1/-1->4->3 ubuntu:40273:40372 [4] NCCL INFO P2P Chunksize set to 131072 ubuntu:40273:40373 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 
6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4 [16] 6/-1/-1->5->4 [17] 6/-1/-1->5->4 [18] 6/-1/-1->5->4 [19] 6/-1/-1->5->4 [20] 6/-1/-1->5->4 ubuntu:40273:40373 [5] NCCL INFO P2P Chunksize set to 131072 ubuntu:10196:10260 [0] NCCL INFO Trees [0] 9/-1/-1->8->0 [1] 9/0/-1->8->-1 [2] 9/-1/-1->8->0 [3] 9/0/-1->8->-1 [4] 9/-1/-1->8->0 [5] 9/0/-1->8->-1 [6] 9/-1/-1->8->0 [7] 9/0/-1->8->-1 [8] 9/-1/-1->8->0 [9] 9/0/-1->8->-1 [10] 9/-1/-1->8->0 [11] 9/0/-1->8->-1 [12] 9/-1/-1->8->0 [13] 9/0/-1->8->-1 [14] 9/-1/-1->8->0 [15] 9/0/-1->8->-1 [16] 9/-1/-1->8->0 [17] 9/0/-1->8->-1 [18] 9/-1/-1->8->0 [19] 9/0/-1->8->-1 [20] 9/-1/-1->8->0 ubuntu:10196:10260 [0] NCCL INFO P2P Chunksize set to 131072 ubuntu:40273:40374 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5 [16] 7/-1/-1->6->5 [17] 7/-1/-1->6->5 [18] 7/-1/-1->6->5 [19] 7/-1/-1->6->5 [20] 7/-1/-1->6->5 ubuntu:40273:40374 [6] NCCL INFO P2P Chunksize set to 131072 ubuntu:40273:40376 [0] NCCL INFO NCCL_NSOCKS_PERTHREAD set by environment to 1. 
ubuntu:40273:40368 [0] NCCL INFO Channel 00/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 01/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 02/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 03/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 04/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 05/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 06/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 07/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 08/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:10196:10275 [0] NCCL INFO NCCL_NSOCKS_PERTHREAD set by environment to 1. ubuntu:10196:10260 [0] NCCL INFO Channel 00/0 : 7[7] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 01/0 : 7[7] -> 8[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 09/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 10/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 02/0 : 7[7] -> 8[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 11/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 12/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 13/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 03/0 : 7[7] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 04/0 : 7[7] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 05/0 : 7[7] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 06/0 : 7[7] -> 8[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 14/0 : 15[7] -> 
0[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 07/0 : 7[7] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 08/0 : 7[7] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 09/0 : 7[7] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 10/0 : 7[7] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 11/0 : 7[7] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 12/0 : 7[7] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 13/0 : 7[7] -> 8[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 15/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 16/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 17/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 18/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 19/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 20/0 : 15[7] -> 0[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 14/0 : 7[7] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 15/0 : 7[7] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 16/0 : 7[7] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 17/0 : 7[7] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 18/0 : 7[7] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 19/0 : 7[7] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 20/0 : 7[7] -> 8[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO Channel 00 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO 
Channel 01 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:40273:40368 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:40273:40368 [0] NCCL INFO Channel 02 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:40273:40368 [0] NCCL INFO Channel 03 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:40273:40368 [0] NCCL INFO Channel 04 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO Channel 02 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO Channel 03 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:40273:40368 [0] NCCL INFO Channel 05 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO Channel 04 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO Channel 05 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:40273:40368 [0] NCCL INFO Channel 06 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO Channel 06 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:40273:40368 [0] NCCL INFO Channel 07 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:40273:40368 [0] NCCL INFO Channel 08 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:40273:40368 [0] NCCL INFO Channel 09 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO Channel 07 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO Channel 08 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO Channel 09 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:40273:40368 [0] NCCL INFO Channel 10 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:40273:40368 [0] NCCL INFO Channel 11 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO Channel 10 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO Channel 11 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO Channel 12 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:40273:40368 [0] NCCL INFO Channel 12 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:40273:40368 [0] NCCL INFO Channel 13 : 0[0] -> 1[1] via SHM/direct/direct 
ubuntu:40273:40368 [0] NCCL INFO Channel 14 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO Channel 13 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO Channel 14 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:40273:40368 [0] NCCL INFO Channel 15 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:40273:40368 [0] NCCL INFO Channel 16 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:40273:40368 [0] NCCL INFO Channel 17 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:40273:40368 [0] NCCL INFO Channel 18 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO Channel 15 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO Channel 16 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO Channel 17 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:40273:40368 [0] NCCL INFO Channel 19 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO Channel 18 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO Channel 19 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:40273:40368 [0] NCCL INFO Channel 20 : 0[0] -> 1[1] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO Channel 20 : 8[0] -> 9[1] via SHM/direct/direct ubuntu:10196:10263 [3] NCCL INFO Channel 00 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:10196:10263 [3] NCCL INFO Channel 01 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:10196:10261 [1] NCCL INFO Channel 00 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:10196:10262 [2] NCCL INFO Channel 00 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:10196:10264 [4] NCCL INFO Channel 00 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 00 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 01 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:10196:10264 [4] NCCL INFO Channel 01 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:10196:10261 [1] NCCL INFO Channel 01 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:10196:10262 [2] NCCL INFO Channel 01 
: 10[2] -> 11[3] via SHM/direct/direct ubuntu:10196:10263 [3] NCCL INFO Channel 02 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:10196:10265 [5] NCCL INFO Channel 00 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:10196:10264 [4] NCCL INFO Channel 02 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:10196:10261 [1] NCCL INFO Channel 02 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:10196:10266 [6] NCCL INFO Channel 00 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:10196:10262 [2] NCCL INFO Channel 02 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:10196:10263 [3] NCCL INFO Channel 03 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:10196:10265 [5] NCCL INFO Channel 01 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 00 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:10196:10264 [4] NCCL INFO Channel 03 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 01 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 02 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:40273:40375 [7] NCCL INFO Channel 00/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40375 [7] NCCL INFO Channel 01/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40375 [7] NCCL INFO Channel 02/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40369 [1] NCCL INFO Channel 00 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 01 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 00 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:10196:10261 [1] NCCL INFO Channel 03 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 01 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 00 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 02 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:40273:40375 [7] NCCL INFO Channel 03/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40371 [3] NCCL INFO Channel 02 : 3[3] -> 4[4] via 
SHM/direct/direct ubuntu:10196:10267 [7] NCCL INFO Channel 00/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10267 [7] NCCL INFO Channel 01/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10267 [7] NCCL INFO Channel 02/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10267 [7] NCCL INFO Channel 03/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10267 [7] NCCL INFO Channel 04/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10267 [7] NCCL INFO Channel 05/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10267 [7] NCCL INFO Channel 06/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10267 [7] NCCL INFO Channel 07/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10267 [7] NCCL INFO Channel 08/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10267 [7] NCCL INFO Channel 09/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10267 [7] NCCL INFO Channel 10/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10267 [7] NCCL INFO Channel 11/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10267 [7] NCCL INFO Channel 12/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10267 [7] NCCL INFO Channel 13/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10267 [7] NCCL INFO Channel 14/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10267 [7] NCCL INFO Channel 15/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10267 [7] NCCL INFO Channel 16/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10267 [7] NCCL INFO Channel 17/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10267 [7] NCCL INFO Channel 18/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10267 [7] NCCL INFO Channel 19/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10267 [7] NCCL INFO Channel 20/0 : 15[7] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10266 [6] NCCL INFO Channel 01 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 00 : 4[4] -> 5[5] via SHM/direct/direct 
ubuntu:10196:10266 [6] NCCL INFO Channel 02 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 03 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 02 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:10196:10261 [1] NCCL INFO Channel 04 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:10196:10263 [3] NCCL INFO Channel 04 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:10196:10263 [3] NCCL INFO Channel 05 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:10196:10266 [6] NCCL INFO Channel 03 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:10196:10262 [2] NCCL INFO Channel 03 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 01 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:10196:10262 [2] NCCL INFO Channel 04 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:10196:10264 [4] NCCL INFO Channel 04 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:10196:10265 [5] NCCL INFO Channel 02 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 02 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:10196:10265 [5] NCCL INFO Channel 03 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:10196:10265 [5] NCCL INFO Channel 04 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:10196:10263 [3] NCCL INFO Channel 06 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:10196:10262 [2] NCCL INFO Channel 05 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 03 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 04 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:10196:10261 [1] NCCL INFO Channel 05 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 05 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:10196:10264 [4] NCCL INFO Channel 05 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 03 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:10196:10266 [6] NCCL INFO Channel 04 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:40273:40370 [2] 
NCCL INFO Channel 04 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 05 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 06 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 06 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:10196:10263 [3] NCCL INFO Channel 07 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:10196:10261 [1] NCCL INFO Channel 06 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:10196:10262 [2] NCCL INFO Channel 06 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:10196:10264 [4] NCCL INFO Channel 06 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:10196:10265 [5] NCCL INFO Channel 05 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 03 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 04 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:10196:10266 [6] NCCL INFO Channel 05 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 05 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 07 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 06 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 01 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:10196:10261 [1] NCCL INFO Channel 07 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:10196:10262 [2] NCCL INFO Channel 07 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 02 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:10196:10263 [3] NCCL INFO Channel 08 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 03 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 04 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:10196:10265 [5] NCCL INFO Channel 06 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:10196:10264 [4] NCCL INFO Channel 07 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 03 : 6[6] -> 7[7] 
via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 04 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:10196:10266 [6] NCCL INFO Channel 06 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 05 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 06 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:10196:10261 [1] NCCL INFO Channel 08 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 07 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:10196:10262 [2] NCCL INFO Channel 08 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 07 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 07 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:10196:10263 [3] NCCL INFO Channel 09 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 05 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:10196:10265 [5] NCCL INFO Channel 07 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:10196:10264 [4] NCCL INFO Channel 08 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 04 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 05 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 08 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:10196:10266 [6] NCCL INFO Channel 07 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:40273:40375 [7] NCCL INFO Channel 04/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:10196:10262 [2] NCCL INFO Channel 09 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:10196:10261 [1] NCCL INFO Channel 09 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:10196:10263 [3] NCCL INFO Channel 10 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:10196:10264 [4] NCCL INFO Channel 09 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:10196:10265 [5] NCCL INFO Channel 08 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 08 : 6[6] -> 7[7] via SHM/direct/direct 
ubuntu:40273:40371 [3] NCCL INFO Channel 08 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:40273:40375 [7] NCCL INFO Channel 05/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40375 [7] NCCL INFO Channel 06/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40375 [7] NCCL INFO Channel 07/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:10196:10266 [6] NCCL INFO Channel 08 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 08 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 06 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:40273:40375 [7] NCCL INFO Channel 08/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40375 [7] NCCL INFO Channel 09/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:10196:10261 [1] NCCL INFO Channel 10 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:10196:10262 [2] NCCL INFO Channel 10 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 06 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:40273:40375 [7] NCCL INFO Channel 10/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40375 [7] NCCL INFO Channel 11/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40370 [2] NCCL INFO Channel 09 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:40273:40375 [7] NCCL INFO Channel 12/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40375 [7] NCCL INFO Channel 13/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40375 [7] NCCL INFO Channel 14/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40375 [7] NCCL INFO Channel 15/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40375 [7] NCCL INFO Channel 16/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40375 [7] NCCL INFO Channel 17/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40375 [7] NCCL INFO Channel 18/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40375 [7] NCCL INFO Channel 19/0 : 7[7] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40375 [7] NCCL INFO Channel 20/0 : 7[7] -> 8[0] [send] via 
NET/Socket/0 ubuntu:10196:10264 [4] NCCL INFO Channel 10 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:10196:10265 [5] NCCL INFO Channel 09 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:10196:10263 [3] NCCL INFO Channel 11 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 09 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 09 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 09 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:10196:10266 [6] NCCL INFO Channel 09 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:10196:10261 [1] NCCL INFO Channel 11 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 07 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:10196:10262 [2] NCCL INFO Channel 11 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 07 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 10 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:10196:10263 [3] NCCL INFO Channel 12 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:10196:10264 [4] NCCL INFO Channel 11 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:10196:10265 [5] NCCL INFO Channel 10 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:10196:10266 [6] NCCL INFO Channel 10 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 10 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 10 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 10 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 08 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 08 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:10196:10261 [1] NCCL INFO Channel 12 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:10196:10262 [2] NCCL INFO Channel 12 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 11 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:10196:10264 
[4] NCCL INFO Channel 12 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:10196:10263 [3] NCCL INFO Channel 13 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:10196:10265 [5] NCCL INFO Channel 11 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 11 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 11 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:10196:10266 [6] NCCL INFO Channel 11 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:10196:10262 [2] NCCL INFO Channel 13 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:10196:10261 [1] NCCL INFO Channel 13 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 09 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 11 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 09 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:10196:10264 [4] NCCL INFO Channel 13 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 12 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:10196:10263 [3] NCCL INFO Channel 14 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:10196:10265 [5] NCCL INFO Channel 12 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:10196:10266 [6] NCCL INFO Channel 12 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 12 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 10 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 12 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:10196:10262 [2] NCCL INFO Channel 14 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 12 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:10196:10261 [1] NCCL INFO Channel 14 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 10 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:10196:10265 [5] NCCL INFO Channel 13 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 13 : 2[2] 
-> 3[3] via SHM/direct/direct ubuntu:10196:10264 [4] NCCL INFO Channel 14 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:10196:10263 [3] NCCL INFO Channel 15 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 13 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 11 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 13 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:10196:10266 [6] NCCL INFO Channel 13 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:10196:10262 [2] NCCL INFO Channel 15 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 13 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 11 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:10196:10261 [1] NCCL INFO Channel 15 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 14 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:10196:10263 [3] NCCL INFO Channel 16 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:10196:10265 [5] NCCL INFO Channel 14 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:10196:10264 [4] NCCL INFO Channel 15 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 14 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 12 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 14 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:10196:10266 [6] NCCL INFO Channel 14 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:10196:10262 [2] NCCL INFO Channel 16 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 14 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:10196:10261 [1] NCCL INFO Channel 16 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 12 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:10196:10265 [5] NCCL INFO Channel 15 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 15 : 2[2] -> 3[3] via SHM/direct/direct 
ubuntu:10196:10264 [4] NCCL INFO Channel 16 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:10196:10263 [3] NCCL INFO Channel 17 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 13 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:10196:10266 [6] NCCL INFO Channel 15 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 15 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 15 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:10196:10262 [2] NCCL INFO Channel 17 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 15 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 13 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:10196:10261 [1] NCCL INFO Channel 17 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 16 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:10196:10263 [3] NCCL INFO Channel 18 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:10196:10265 [5] NCCL INFO Channel 16 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:10196:10264 [4] NCCL INFO Channel 17 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 14 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 16 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 16 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 14 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:10196:10266 [6] NCCL INFO Channel 16 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:10196:10262 [2] NCCL INFO Channel 18 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 16 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:10196:10261 [1] NCCL INFO Channel 18 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 17 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:10196:10265 [5] NCCL INFO Channel 17 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:10196:10264 [4] NCCL INFO 
Channel 18 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:10196:10263 [3] NCCL INFO Channel 19 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 17 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 15 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 17 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 15 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 17 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:10196:10261 [1] NCCL INFO Channel 19 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:10196:10262 [2] NCCL INFO Channel 19 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:10196:10266 [6] NCCL INFO Channel 17 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 18 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:10196:10265 [5] NCCL INFO Channel 18 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:10196:10264 [4] NCCL INFO Channel 19 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:10196:10263 [3] NCCL INFO Channel 20 : 11[3] -> 12[4] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 16 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 18 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 16 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:10196:10262 [2] NCCL INFO Channel 20 : 10[2] -> 11[3] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 18 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:10196:10261 [1] NCCL INFO Channel 20 : 9[1] -> 10[2] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 18 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 19 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:10196:10265 [5] NCCL INFO Channel 19 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:10196:10265 [5] NCCL INFO Channel 20 : 13[5] -> 14[6] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 17 : 5[5] -> 6[6] via 
SHM/direct/direct ubuntu:10196:10264 [4] NCCL INFO Channel 20 : 12[4] -> 13[5] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 17 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 19 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 19 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 19 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:10196:10266 [6] NCCL INFO Channel 18 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 20 : 2[2] -> 3[3] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 20 : 3[3] -> 4[4] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 20 : 6[6] -> 7[7] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 20 : 1[1] -> 2[2] via SHM/direct/direct ubuntu:10196:10266 [6] NCCL INFO Channel 19 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:10196:10266 [6] NCCL INFO Channel 20 : 14[6] -> 15[7] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 18 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 18 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 19 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 19 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 20 : 4[4] -> 5[5] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 20 : 5[5] -> 6[6] via SHM/direct/direct ubuntu:10196:10263 [3] NCCL INFO Connected all rings ubuntu:10196:10262 [2] NCCL INFO Connected all rings ubuntu:10196:10264 [4] NCCL INFO Connected all rings ubuntu:40273:40370 [2] NCCL INFO Connected all rings ubuntu:40273:40369 [1] NCCL INFO Connected all rings ubuntu:10196:10261 [1] NCCL INFO Connected all rings ubuntu:40273:40372 [4] NCCL INFO Connected all rings ubuntu:10196:10266 [6] NCCL INFO Connected all rings ubuntu:10196:10265 [5] NCCL INFO Connected all rings ubuntu:40273:40375 [7] NCCL INFO 
Connected all rings ubuntu:40273:40375 [7] NCCL INFO Channel 00 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:40273:40375 [7] NCCL INFO Channel 01 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Connected all rings ubuntu:40273:40374 [6] NCCL INFO Connected all rings ubuntu:40273:40373 [5] NCCL INFO Connected all rings ubuntu:10196:10260 [0] NCCL INFO Connected all rings ubuntu:10196:10267 [7] NCCL INFO Connected all rings ubuntu:40273:40375 [7] NCCL INFO Channel 02 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:40273:40375 [7] NCCL INFO Channel 03 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:40273:40375 [7] NCCL INFO Channel 04 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO Channel 00/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 01/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 02/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 03/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 04/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 05/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Connected all rings ubuntu:10196:10267 [7] NCCL INFO Channel 00 : 15[7] -> 14[6] via SHM/direct/direct ubuntu:10196:10267 [7] NCCL INFO Channel 01 : 15[7] -> 14[6] via SHM/direct/direct ubuntu:40273:40368 [0] NCCL INFO Channel 00/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 06/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 01/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 02/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 07/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 08/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO 
Channel 09/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 10/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 11/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 12/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 13/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:40273:40375 [7] NCCL INFO Channel 05 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:40273:40375 [7] NCCL INFO Channel 06 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:40273:40368 [0] NCCL INFO Channel 03/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 14/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 04/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 05/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 06/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 07/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40375 [7] NCCL INFO Channel 07 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO Channel 15/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 16/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 17/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 18/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10267 [7] NCCL INFO Channel 02 : 15[7] -> 14[6] via SHM/direct/direct ubuntu:10196:10267 [7] NCCL INFO Channel 03 : 15[7] -> 14[6] via SHM/direct/direct ubuntu:10196:10267 [7] NCCL INFO Channel 04 : 15[7] -> 14[6] via SHM/direct/direct ubuntu:40273:40368 [0] NCCL INFO Channel 08/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 19/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 
20/0 : 0[0] -> 8[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 00/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 01/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 02/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10267 [7] NCCL INFO Channel 05 : 15[7] -> 14[6] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO Channel 03/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 04/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 09/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 05/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 06/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 07/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 08/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 10/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 09/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 11/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 10/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 12/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 13/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 14/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40375 [7] NCCL INFO Channel 08 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:40273:40375 [7] NCCL INFO Channel 09 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:40273:40375 [7] NCCL INFO Channel 10 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:40273:40368 [0] NCCL INFO Channel 15/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 16/0 : 8[0] -> 0[0] [receive] via 
NET/Socket/0 ubuntu:10196:10267 [7] NCCL INFO Channel 06 : 15[7] -> 14[6] via SHM/direct/direct ubuntu:10196:10267 [7] NCCL INFO Channel 07 : 15[7] -> 14[6] via SHM/direct/direct ubuntu:40273:40368 [0] NCCL INFO Channel 17/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 18/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 19/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 20/0 : 8[0] -> 0[0] [receive] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 00/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40375 [7] NCCL INFO Channel 11 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:10196:10267 [7] NCCL INFO Channel 08 : 15[7] -> 14[6] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO Channel 11/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 01/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 12/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 02/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 13/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 03/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 04/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 05/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 06/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 14/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 15/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:40273:40375 [7] NCCL INFO Channel 12 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:40273:40375 [7] NCCL INFO Channel 13 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO Channel 16/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:40273:40375 [7] NCCL INFO 
Channel 14 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:10196:10260 [0] NCCL INFO Channel 17/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 18/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 19/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:10196:10267 [7] NCCL INFO Channel 09 : 15[7] -> 14[6] via SHM/direct/direct ubuntu:40273:40368 [0] NCCL INFO Channel 07/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:10196:10267 [7] NCCL INFO Channel 10 : 15[7] -> 14[6] via SHM/direct/direct ubuntu:40273:40368 [0] NCCL INFO Channel 08/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 09/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 10/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 11/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 12/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:10196:10260 [0] NCCL INFO Channel 20/0 : 8[0] -> 0[0] [send] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 13/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 14/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 15/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 16/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 17/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 18/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40368 [0] NCCL INFO Channel 19/0 : 0[0] -> 8[0] [send] via NET/Socket/0 ubuntu:40273:40375 [7] NCCL INFO Channel 15 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:40273:40375 [7] NCCL INFO Channel 16 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:40273:40375 [7] NCCL INFO Channel 17 : 7[7] -> 6[6] via SHM/direct/direct ubuntu:10196:10267 [7] NCCL INFO Channel 11 : 15[7] -> 14[6] via SHM/direct/direct 
ubuntu:10196:10267 [7] NCCL INFO Channel 12 : 15[7] -> 14[6] via SHM/direct/direct
ubuntu:40273:40368 [0] NCCL INFO Channel 20/0 : 0[0] -> 8[0] [send] via NET/Socket/0
ubuntu:10196:10267 [7] NCCL INFO Channel 13 : 15[7] -> 14[6] via SHM/direct/direct
ubuntu:10196:10267 [7] NCCL INFO Channel 14 : 15[7] -> 14[6] via SHM/direct/direct
ubuntu:40273:40375 [7] NCCL INFO Channel 18 : 7[7] -> 6[6] via SHM/direct/direct
ubuntu:40273:40375 [7] NCCL INFO Channel 19 : 7[7] -> 6[6] via SHM/direct/direct
ubuntu:40273:40375 [7] NCCL INFO Channel 20 : 7[7] -> 6[6] via SHM/direct/direct
ubuntu:10196:10267 [7] NCCL INFO Channel 15 : 15[7] -> 14[6] via SHM/direct/direct
ubuntu:10196:10267 [7] NCCL INFO Channel 16 : 15[7] -> 14[6] via SHM/direct/direct
ubuntu:10196:10260 [0] misc/shmutils.cc:71 NCCL WARN Error: failed to extend /dev/shm/nccl-8Wgjd7 to 9637892 bytes
ubuntu:10196:10260 [0] misc/shmutils.cc:112 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-8Wgjd7 (size 9637888)
ubuntu:10196:10260 [0] NCCL INFO transport/shm.cc:114 -> 2
ubuntu:10196:10260 [0] NCCL INFO transport.cc:33 -> 2
ubuntu:10196:10260 [0] NCCL INFO transport.cc:97 -> 2
ubuntu:10196:10260 [0] NCCL INFO init.cc:1089 -> 2
ubuntu:10196:10260 [0] NCCL INFO init.cc:1358 -> 2
ubuntu:10196:10260 [0] NCCL INFO group.cc:65 -> 2 [Async thread]
ubuntu:10196:10267 [7] NCCL INFO Channel 17 : 15[7] -> 14[6] via SHM/direct/direct
ubuntu:10196:10267 [7] misc/shmutils.cc:71 NCCL WARN Error: failed to extend /dev/shm/nccl-SVxRqZ to 4100 bytes
ubuntu:10196:10267 [7] misc/shmutils.cc:112 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-SVxRqZ (size 4096)
ubuntu:10196:10267 [7] NCCL INFO transport/shm.cc:91 -> 2
ubuntu:10196:10267 [7] NCCL INFO transport.cc:33 -> 2
ubuntu:10196:10267 [7] NCCL INFO transport.cc:106 -> 2
ubuntu:10196:10267 [7] NCCL INFO init.cc:1089 -> 2
ubuntu:10196:10267 [7] NCCL INFO init.cc:1358 -> 2
ubuntu:10196:10267 [7] NCCL INFO group.cc:65 -> 2 [Async thread]
ubuntu:10196:10265 [5] misc/shmutils.cc:71 NCCL WARN Error: failed to extend /dev/shm/nccl-ICzGER to 9637892 bytes
ubuntu:10196:10265 [5] misc/shmutils.cc:112 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-ICzGER (size 9637888)
ubuntu:10196:10265 [5] NCCL INFO transport/shm.cc:114 -> 2
ubuntu:10196:10265 [5] NCCL INFO transport.cc:33 -> 2
ubuntu:10196:10265 [5] NCCL INFO transport.cc:97 -> 2
ubuntu:10196:10265 [5] NCCL INFO init.cc:1089 -> 2
ubuntu:10196:10265 [5] NCCL INFO init.cc:1358 -> 2
ubuntu:10196:10265 [5] NCCL INFO group.cc:65 -> 2 [Async thread]
ubuntu:10196:10263 [3] misc/shmutils.cc:71 NCCL WARN Error: failed to extend /dev/shm/nccl-OxqhTJ to 9637892 bytes
ubuntu:10196:10263 [3] misc/shmutils.cc:112 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-OxqhTJ (size 9637888)
ubuntu:10196:10263 [3] NCCL INFO transport/shm.cc:114 -> 2
ubuntu:10196:10263 [3] NCCL INFO transport.cc:33 -> 2
ubuntu:10196:10263 [3] NCCL INFO transport.cc:97 -> 2
ubuntu:10196:10263 [3] NCCL INFO init.cc:1089 -> 2
ubuntu:10196:10263 [3] NCCL INFO init.cc:1358 -> 2
ubuntu:10196:10263 [3] NCCL INFO group.cc:65 -> 2 [Async thread]
ubuntu:10196:10264 [4] misc/shmutils.cc:71 NCCL WARN Error: failed to extend /dev/shm/nccl-G2B37B to 9637892 bytes
ubuntu:10196:10264 [4] misc/shmutils.cc:112 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-G2B37B (size 9637888)
ubuntu:10196:10264 [4] NCCL INFO transport/shm.cc:114 -> 2
ubuntu:10196:10264 [4] NCCL INFO transport.cc:33 -> 2
ubuntu:10196:10264 [4] NCCL INFO transport.cc:97 -> 2
ubuntu:10196:10264 [4] NCCL INFO init.cc:1089 -> 2
ubuntu:10196:10264 [4] NCCL INFO init.cc:1358 -> 2
ubuntu:10196:10264 [4] NCCL INFO group.cc:65 -> 2 [Async thread]
ubuntu:10196:10262 [2] misc/shmutils.cc:71 NCCL WARN Error: failed to extend /dev/shm/nccl-823Ymu to 9637892 bytes
ubuntu:10196:10262 [2] misc/shmutils.cc:112 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-823Ymu (size 9637888)
ubuntu:10196:10262 [2] NCCL INFO transport/shm.cc:114 -> 2
ubuntu:10196:10262 [2] NCCL INFO transport.cc:33 -> 2
ubuntu:10196:10262 [2] NCCL INFO transport.cc:97 -> 2
ubuntu:10196:10262 [2] NCCL INFO init.cc:1089 -> 2
ubuntu:10196:10262 [2] NCCL INFO init.cc:1358 -> 2
ubuntu:10196:10262 [2] NCCL INFO group.cc:65 -> 2 [Async thread]
ubuntu:10196:10261 [1] misc/shmutils.cc:71 NCCL WARN Error: failed to extend /dev/shm/nccl-iy0QTe to 9637892 bytes
ubuntu:10196:10261 [1] misc/shmutils.cc:112 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-iy0QTe (size 9637888)
ubuntu:10196:10261 [1] NCCL INFO transport/shm.cc:114 -> 2
ubuntu:10196:10261 [1] NCCL INFO transport.cc:33 -> 2
ubuntu:10196:10261 [1] NCCL INFO transport.cc:97 -> 2
ubuntu:10196:10261 [1] NCCL INFO init.cc:1089 -> 2
ubuntu:10196:10261 [1] NCCL INFO init.cc:1358 -> 2
ubuntu:10196:10261 [1] NCCL INFO group.cc:65 -> 2 [Async thread]
ubuntu:10196:10266 [6] misc/shmutils.cc:71 NCCL WARN Error: failed to extend /dev/shm/nccl-OkgqDm to 9637892 bytes
ubuntu:10196:10266 [6] misc/shmutils.cc:112 NCCL WARN Error while creating shared memory segment /dev/shm/nccl-OkgqDm (size 9637888)
ubuntu:10196:10266 [6] NCCL INFO transport/shm.cc:114 -> 2
ubuntu:10196:10266 [6] NCCL INFO transport.cc:33 -> 2
ubuntu:10196:10266 [6] NCCL INFO transport.cc:97 -> 2
ubuntu:10196:10266 [6] NCCL INFO init.cc:1089 -> 2
ubuntu:10196:10266 [6] NCCL INFO init.cc:1358 -> 2
ubuntu:10196:10266 [6] NCCL INFO group.cc:65 -> 2 [Async thread]
ubuntu:10196:10196 [7] NCCL INFO group.cc:406 -> 2
ubuntu:10196:10196 [7] NCCL INFO group.cc:96 -> 2
ubuntu: Test NCCL failure common.cu:958 'unhandled system error (run with NCCL_DEBUG=INFO for details) / ' ..
ubuntu pid 10196: Test failure common.cu:842 ubuntu:40273:40370 [2] NCCL INFO Channel 00 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 01 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 02 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 00 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 02 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 03 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 04 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 03 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 04 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 00 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 05 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 05 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 01 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 02 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 01 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 03 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 00 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 00 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 06 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 06 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 02 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 04 : 4[4] -> 3[3] via SHM/direct/direct 
ubuntu:40273:40374 [6] NCCL INFO Channel 01 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 01 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 07 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 07 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 08 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 05 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 03 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 02 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 02 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 08 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 06 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 04 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 03 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 03 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 09 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 09 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 05 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 07 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 04 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 04 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 10 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 10 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 06 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 08 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 05 : 5[5] -> 
4[4] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 05 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 11 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 11 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 07 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 09 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 06 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 06 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 12 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 12 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 08 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 10 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 07 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 07 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 13 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 13 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 09 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 11 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 08 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 08 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 14 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 14 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 10 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 09 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 12 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL 
INFO Channel 09 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 15 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 15 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 11 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 13 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 10 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 10 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 16 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 16 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 12 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 14 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 11 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 11 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 17 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 17 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 13 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 15 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 12 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 12 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 18 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 18 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 14 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 16 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 13 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 13 : 5[5] -> 4[4] via SHM/direct/direct 
ubuntu:40273:40370 [2] NCCL INFO Channel 19 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 19 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 15 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 17 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 14 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 14 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:40273:40370 [2] NCCL INFO Channel 20 : 2[2] -> 1[1] via SHM/direct/direct ubuntu:40273:40369 [1] NCCL INFO Channel 20 : 1[1] -> 0[0] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 16 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:40273:40376 [0] misc/socket.cc:483 NCCL WARN socketStartConnect: Connect to 10.0.0.2<41409> failed : Software caused connection abort ubuntu:40273:40376 [0] NCCL INFO misc/socket.cc:564 -> 2 ubuntu:40273:40376 [0] NCCL INFO misc/socket.cc:586 -> 2 ubuntu:40273:40376 [0] NCCL INFO transport/net_socket.cc:336 -> 2 ubuntu:40273:40376 [0] NCCL INFO transport/net.cc:592 -> 2 ubuntu:40273:40376 [0] NCCL INFO proxy.cc:1306 -> 2 ubuntu:40273:40376 [0] NCCL INFO proxy.cc:1377 -> 2 ubuntu:40273:40376 [0] proxy.cc:1519 NCCL WARN [Proxy Service 0] Failed to execute operation Connect from rank 0, retcode 2 ubuntu:40273:40372 [4] NCCL INFO Channel 18 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 15 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 15 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 17 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 19 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:40273:40372 [4] NCCL INFO Channel 20 : 4[4] -> 3[3] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 16 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 17 : 5[5] -> 4[4] via SHM/direct/direct 
ubuntu:40273:40374 [6] NCCL INFO Channel 16 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 18 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 19 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 18 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 17 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 18 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 19 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:40273:40374 [6] NCCL INFO Channel 20 : 6[6] -> 5[5] via SHM/direct/direct ubuntu:40273:40371 [3] NCCL INFO Channel 20 : 3[3] -> 2[2] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 19 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:40273:40373 [5] NCCL INFO Channel 20 : 5[5] -> 4[4] via SHM/direct/direct ubuntu:40273:40368 [0] misc/socket.cc:49 NCCL WARN socketProgress: Connection closed by remote peer node1<50091> ubuntu:40273:40368 [0] NCCL INFO misc/socket.cc:749 -> 6 ubuntu:40273:40368 [0] proxy.cc:1143 NCCL WARN Socket recv failed while polling for opId=0x7f82d4cbc0a0 ubuntu:40273:40368 [0] NCCL INFO transport/net.cc:288 -> 3 ubuntu:40273:40368 [0] NCCL INFO transport.cc:148 -> 3 ubuntu:40273:40368 [0] NCCL INFO init.cc:1089 -> 3 ubuntu:40273:40368 [0] NCCL INFO init.cc:1358 -> 3 ubuntu:40273:40368 [0] NCCL INFO group.cc:65 -> 3 [Async thread] ubuntu:40273:40375 [7] NCCL INFO Connected all trees ubuntu:40273:40375 [7] NCCL INFO NCCL_NTHREADS set by environment to 256. 
ubuntu:40273:40375 [7] NCCL INFO NCCL_ALGO set by environment to Tree,Ring ubuntu:40273:40375 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 ubuntu:40273:40375 [7] NCCL INFO 21 coll channels, 0 nvls channels, 32 p2p channels, 2 p2p channels per peer ubuntu:40273:40380 [7] NCCL INFO misc/socket.cc:805 -> 3 ubuntu:40273:40380 [7] proxy.cc:1495 NCCL WARN [Service thread] Could not receive type from localRank 7, res=3, closed=0 ubuntu:40273:40380 [7] proxy.cc:1519 NCCL WARN [Proxy Service 7] Failed to execute operation Init from rank 7, retcode 3 ubuntu:40273:40375 [7] NCCL INFO misc/socket.cc:46 -> 3 ubuntu:40273:40375 [7] NCCL INFO misc/socket.cc:57 -> 3 ubuntu:40273:40375 [7] NCCL INFO misc/socket.cc:772 -> 3 ubuntu:40273:40375 [7] NCCL INFO proxy.cc:1107 -> 3 ubuntu:40273:40375 [7] NCCL INFO proxy.cc:1193 -> 3 ubuntu:40273:40375 [7] NCCL INFO proxy.cc:1047 -> 3 ubuntu:40273:40375 [7] NCCL INFO init.cc:1185 -> 3 ubuntu:40273:40375 [7] NCCL INFO init.cc:1358 -> 3 ubuntu:40273:40375 [7] NCCL INFO group.cc:65 -> 3 [Async thread] ubuntu:40273:40369 [1] NCCL INFO Connected all trees ubuntu:40273:40369 [1] NCCL INFO NCCL_ALGO set by environment to Tree,Ring ubuntu:40273:40369 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 ubuntu:40273:40369 [1] NCCL INFO 21 coll channels, 0 nvls channels, 32 p2p channels, 2 p2p channels per peer ubuntu:40273:40377 [1] NCCL INFO misc/socket.cc:46 -> 3 ubuntu:40273:40377 [1] NCCL INFO misc/socket.cc:749 -> 3 ubuntu:40273:40377 [1] NCCL INFO misc/socket.cc:427 -> 3 ubuntu:40273:40377 [1] NCCL INFO misc/socket.cc:561 -> 3 ubuntu:40273:40377 [1] NCCL INFO misc/socket.cc:665 -> 3 ubuntu:40273:40377 [1] proxy.cc:1456 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable ubuntu:40273:40369 [1] NCCL INFO misc/socket.cc:46 -> 3 ubuntu:40273:40369 [1] NCCL INFO misc/socket.cc:547 -> 3 ubuntu:40273:40369 [1] NCCL INFO misc/socket.cc:570 -> 3 ubuntu:40273:40369 [1] NCCL INFO misc/socket.cc:618 -> 3 
ubuntu:40273:40369 [1] NCCL INFO proxy.cc:1034 -> 3 ubuntu:40273:40369 [1] NCCL INFO init.cc:1185 -> 3 ubuntu:40273:40369 [1] NCCL INFO init.cc:1358 -> 3 ubuntu:40273:40369 [1] NCCL INFO group.cc:65 -> 3 [Async thread] ubuntu:40273:40371 [3] NCCL INFO Connected all trees ubuntu:40273:40371 [3] NCCL INFO NCCL_ALGO set by environment to Tree,Ring ubuntu:40273:40371 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 ubuntu:40273:40371 [3] NCCL INFO 21 coll channels, 0 nvls channels, 32 p2p channels, 2 p2p channels per peer ubuntu:40273:40379 [3] NCCL INFO misc/socket.cc:46 -> 3 ubuntu:40273:40379 [3] NCCL INFO misc/socket.cc:749 -> 3 ubuntu:40273:40379 [3] NCCL INFO misc/socket.cc:427 -> 3 ubuntu:40273:40379 [3] NCCL INFO misc/socket.cc:561 -> 3 ubuntu:40273:40379 [3] NCCL INFO misc/socket.cc:665 -> 3 ubuntu:40273:40379 [3] proxy.cc:1456 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable ubuntu:40273:40371 [3] NCCL INFO misc/socket.cc:46 -> 3 ubuntu:40273:40371 [3] NCCL INFO misc/socket.cc:547 -> 3 ubuntu:40273:40371 [3] NCCL INFO misc/socket.cc:570 -> 3 ubuntu:40273:40371 [3] NCCL INFO misc/socket.cc:618 -> 3 ubuntu:40273:40371 [3] NCCL INFO proxy.cc:1034 -> 3 ubuntu:40273:40371 [3] NCCL INFO init.cc:1185 -> 3 ubuntu:40273:40371 [3] NCCL INFO init.cc:1358 -> 3 ubuntu:40273:40371 [3] NCCL INFO group.cc:65 -> 3 [Async thread] ubuntu:40273:40373 [5] NCCL INFO Connected all trees ubuntu:40273:40373 [5] NCCL INFO NCCL_ALGO set by environment to Tree,Ring ubuntu:40273:40373 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 ubuntu:40273:40373 [5] NCCL INFO 21 coll channels, 0 nvls channels, 32 p2p channels, 2 p2p channels per peer ubuntu:40273:40383 [5] NCCL INFO misc/socket.cc:46 -> 3 ubuntu:40273:40383 [5] NCCL INFO misc/socket.cc:749 -> 3 ubuntu:40273:40383 [5] NCCL INFO misc/socket.cc:427 -> 3 ubuntu:40273:40383 [5] NCCL INFO misc/socket.cc:561 -> 3 ubuntu:40273:40383 [5] NCCL INFO misc/socket.cc:665 -> 3 ubuntu:40273:40383 
[5] proxy.cc:1456 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable ubuntu:40273:40373 [5] NCCL INFO misc/socket.cc:46 -> 3 ubuntu:40273:40373 [5] NCCL INFO misc/socket.cc:547 -> 3 ubuntu:40273:40373 [5] NCCL INFO misc/socket.cc:570 -> 3 ubuntu:40273:40373 [5] NCCL INFO misc/socket.cc:618 -> 3 ubuntu:40273:40373 [5] NCCL INFO proxy.cc:1034 -> 3 ubuntu:40273:40373 [5] NCCL INFO init.cc:1185 -> 3 ubuntu:40273:40373 [5] NCCL INFO init.cc:1358 -> 3 ubuntu:40273:40373 [5] NCCL INFO group.cc:65 -> 3 [Async thread] ubuntu:40273:40370 [2] NCCL INFO Connected all trees ubuntu:40273:40370 [2] NCCL INFO NCCL_ALGO set by environment to Tree,Ring ubuntu:40273:40370 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 ubuntu:40273:40370 [2] NCCL INFO 21 coll channels, 0 nvls channels, 32 p2p channels, 2 p2p channels per peer ubuntu:40273:40381 [2] NCCL INFO misc/socket.cc:46 -> 3 ubuntu:40273:40381 [2] NCCL INFO misc/socket.cc:749 -> 3 ubuntu:40273:40381 [2] NCCL INFO misc/socket.cc:427 -> 3 ubuntu:40273:40381 [2] NCCL INFO misc/socket.cc:561 -> 3 ubuntu:40273:40381 [2] NCCL INFO misc/socket.cc:665 -> 3 ubuntu:40273:40381 [2] proxy.cc:1456 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable ubuntu:40273:40370 [2] NCCL INFO misc/socket.cc:46 -> 3 ubuntu:40273:40370 [2] NCCL INFO misc/socket.cc:547 -> 3 ubuntu:40273:40370 [2] NCCL INFO misc/socket.cc:570 -> 3 ubuntu:40273:40370 [2] NCCL INFO misc/socket.cc:618 -> 3 ubuntu:40273:40370 [2] NCCL INFO proxy.cc:1034 -> 3 ubuntu:40273:40370 [2] NCCL INFO init.cc:1185 -> 3 ubuntu:40273:40370 [2] NCCL INFO init.cc:1358 -> 3 ubuntu:40273:40370 [2] NCCL INFO group.cc:65 -> 3 [Async thread] ubuntu:40273:40374 [6] NCCL INFO Connected all trees ubuntu:40273:40374 [6] NCCL INFO NCCL_ALGO set by environment to Tree,Ring ubuntu:40273:40374 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 ubuntu:40273:40374 [6] NCCL INFO 21 coll channels, 0 nvls channels, 32 p2p channels, 2 p2p 
channels per peer ubuntu:40273:40384 [6] NCCL INFO misc/socket.cc:46 -> 3 ubuntu:40273:40384 [6] NCCL INFO misc/socket.cc:749 -> 3 ubuntu:40273:40384 [6] NCCL INFO misc/socket.cc:427 -> 3 ubuntu:40273:40384 [6] NCCL INFO misc/socket.cc:561 -> 3 ubuntu:40273:40384 [6] NCCL INFO misc/socket.cc:665 -> 3 ubuntu:40273:40384 [6] proxy.cc:1456 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable ubuntu:40273:40374 [6] NCCL INFO misc/socket.cc:46 -> 3 ubuntu:40273:40374 [6] NCCL INFO misc/socket.cc:547 -> 3 ubuntu:40273:40374 [6] NCCL INFO misc/socket.cc:570 -> 3 ubuntu:40273:40374 [6] NCCL INFO misc/socket.cc:618 -> 3 ubuntu:40273:40374 [6] NCCL INFO proxy.cc:1034 -> 3 ubuntu:40273:40374 [6] NCCL INFO init.cc:1185 -> 3 ubuntu:40273:40374 [6] NCCL INFO init.cc:1358 -> 3 ubuntu:40273:40374 [6] NCCL INFO group.cc:65 -> 3 [Async thread] ubuntu:40273:40372 [4] NCCL INFO Connected all trees ubuntu:40273:40372 [4] NCCL INFO NCCL_ALGO set by environment to Tree,Ring ubuntu:40273:40372 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512 ubuntu:40273:40372 [4] NCCL INFO 21 coll channels, 0 nvls channels, 32 p2p channels, 2 p2p channels per peer ubuntu:40273:40382 [4] NCCL INFO misc/socket.cc:46 -> 3 ubuntu:40273:40382 [4] NCCL INFO misc/socket.cc:749 -> 3 ubuntu:40273:40382 [4] NCCL INFO misc/socket.cc:427 -> 3 ubuntu:40273:40382 [4] NCCL INFO misc/socket.cc:561 -> 3 ubuntu:40273:40382 [4] NCCL INFO misc/socket.cc:665 -> 3 ubuntu:40273:40382 [4] proxy.cc:1456 NCCL WARN [Service thread] Accept failed Resource temporarily unavailable ubuntu:40273:40372 [4] NCCL INFO misc/socket.cc:46 -> 3 ubuntu:40273:40372 [4] NCCL INFO misc/socket.cc:547 -> 3 ubuntu:40273:40372 [4] NCCL INFO misc/socket.cc:570 -> 3 ubuntu:40273:40372 [4] NCCL INFO misc/socket.cc:618 -> 3 ubuntu:40273:40372 [4] NCCL INFO proxy.cc:1034 -> 3 ubuntu:40273:40372 [4] NCCL INFO init.cc:1185 -> 3 ubuntu:40273:40372 [4] NCCL INFO init.cc:1358 -> 3 ubuntu:40273:40372 [4] NCCL INFO group.cc:65 -> 
3 [Async thread] ubuntu:40273:40273 [7] NCCL INFO group.cc:406 -> 3 ubuntu:40273:40273 [7] NCCL INFO group.cc:96 -> 3 ubuntu: Test NCCL failure common.cu:958 'internal error - please report this issue to the NCCL developers / ' .. ubuntu pid 40273: Test failure common.cu:842 -------------------------------------------------------------------------- Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted. -------------------------------------------------------------------------- -------------------------------------------------------------------------- mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was: Process name: [[40735,1],1] Exit code: 3 -------------------------------------------------------------------------- ```

sjeaugey commented 11 months ago

The error is right at the beginning:

ubuntu:9621:9687 [2] misc/shmutils.cc:71 NCCL WARN Error: failed to extend /dev/shm/nccl-mEeIEW to 5967876 bytes

It means /dev/shm is too small. Your system seems to be a PCI system, where communication across the CPU uses shared memory. Since you increased MIN_NCHANNELS to 22, NCCL has to create a LOT of shared memory buffers, and that exceeds the shared memory size. Note that on systems without NVLink you should not need more than 2 channels to reach peak bandwidth, so forcing NCCL to increase the number of channels is detrimental to memory usage and GPU usage as well.
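A quick way to confirm that /dev/shm is the bottleneck is to compare its free space with the segment size NCCL reports in the warning. The snippet below is only a rough sketch: the helper names are hypothetical, the 9637892-byte figure is taken from the log above, and the segment count is a placeholder assumption, since the log does not say how many segments NCCL ends up creating.

```python
import shutil


def shm_headroom(required_bytes, path="/dev/shm"):
    """Free bytes left at `path` after reserving `required_bytes`.

    A negative result means the request would not fit.
    """
    usage = shutil.disk_usage(path)
    return usage.free - required_bytes


def fits_in_shm(segment_bytes, segment_count, path="/dev/shm"):
    """True if `segment_count` segments of `segment_bytes` each fit at `path`."""
    return shm_headroom(segment_bytes * segment_count, path) >= 0


if __name__ == "__main__":
    # Segment size copied from the shmutils.cc warning in the log.
    segment_bytes = 9637892
    # Placeholder guess, NOT a number reported by NCCL.
    assumed_segments = 64
    print("free bytes in /dev/shm:", shutil.disk_usage("/dev/shm").free)
    print("fits:", fits_in_shm(segment_bytes, assumed_segments))
```

Run it on each node (or inside each VM/container) before launching the test. If the result is negative or close to zero, either enlarge /dev/shm or reduce the number of channels.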

Edit: actually it seems you disabled NVLink with NCCL_P2P_LEVEL=0 -- not sure why you'd want to do that? More generally, you are setting a lot of environment variables. This is usually not advised. Some env vars can be ok to set, but the vast majority of them should not be used unless you have a very good reason to do so.

SweeneyJun commented 11 months ago

I appreciate your guidance, especially as I was experimenting with environment variables that I'm not very familiar with (trying to test the performance). Thank you for your explanation! 🙌

BTW, I'm using virtual machines, and I'm not quite sure about the real physical topology of the two VMs I'm using. How did you determine that my system is PCI-based and without NVLink? Also, could you provide a rough range for the /dev/shm size that would avoid exhausting shared memory?

sjeaugey commented 11 months ago

I thought that your system was PCI-based because you were using shared memory to communicate between GPUs. But your GPUs are SXM GPUs so they do have NVLink. The reason you were using shared memory is that you disabled P2P by setting NCCL_P2P_LEVEL. Just unset that environment variable (and all the others as well unless you really need them) and the problem will go away.
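To illustrate the "unset everything you don't really need" advice, here is a minimal sketch that strips NCCL_* overrides from an environment before launching a job. The function names and the choice to keep only NCCL_DEBUG are assumptions made for this example, not an NCCL recommendation. Variables passed explicitly on the mpirun command line with -x would also have to be removed there.

```python
import os


def strip_nccl_overrides(env, keep=("NCCL_DEBUG",)):
    """Return a copy of `env` without NCCL_* variables, except those in `keep`."""
    return {
        name: value
        for name, value in env.items()
        if not name.startswith("NCCL_") or name in keep
    }


def removed_overrides(env, keep=("NCCL_DEBUG",)):
    """List the NCCL_* variables that strip_nccl_overrides would drop."""
    return sorted(
        name for name in env if name.startswith("NCCL_") and name not in keep
    )


if __name__ == "__main__":
    # Show which overrides are set in the current shell and would be dropped.
    print("would unset:", removed_overrides(os.environ))
    clean_env = strip_nccl_overrides(os.environ)
    # clean_env can then be passed as env= to subprocess.run(...) for the launch.
```

Printing the list before launching makes it easy to spot a leftover setting such as NCCL_P2P_LEVEL that silently changes which transport NCCL picks.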

kylematoba commented 9 months ago

@sjeaugey I'm good to close this issue if all the subsequent discussion is resolved.

sjeaugey commented 9 months ago

Ok, closing.