NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL Tree allreduce test cannot reach the theoretical bus bandwidth on 2 nodes with 4 nics #1357

Open ProHuper opened 2 months ago

ProHuper commented 2 months ago
$ nvidia-smi topo -m

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    PIX     SYS     SYS     SYS     SYS     SYS     0-47,96-143     0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     0-47,96-143     0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    SYS     PIX     SYS     SYS     SYS     SYS     0-47,96-143     0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     0-47,96-143     0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     PIX     SYS     SYS     SYS     48-95,144-191   1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     48-95,144-191   1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     SYS     PIX     SYS     SYS     48-95,144-191   1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     PIX     SYS     48-95,144-191   1               N/A
NIC0    PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS     SYS
NIC1    SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS
NIC2    SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS
NIC3    SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS      X      SYS     SYS
NIC4    SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS      X      SYS
NIC5    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_4
  NIC3: mlx5_5
  NIC4: mlx5_6
  NIC5: mlx5_bond_0
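
To make the GPU-to-NIC locality in the matrix above easier to read, here is a minimal Python sketch that lists the NIC sitting behind a single PCIe bridge (`PIX`) for each GPU. The link types are transcribed by hand from the matrix. The names `GPU_TO_NIC`, `NIC_NAMES`, `SELECTED` and `pix_nics` are made up for this illustration and are not part of any NVIDIA tool. `SELECTED` mirrors the `NCCL_IB_HCA` list used in this report.

```python
# Link type from each GPU to NIC0..NIC5, transcribed from the topology matrix.
GPU_TO_NIC = {
    "GPU0": ["PIX", "SYS", "SYS", "SYS", "SYS", "SYS"],
    "GPU1": ["SYS", "SYS", "SYS", "SYS", "SYS", "SYS"],
    "GPU2": ["SYS", "PIX", "SYS", "SYS", "SYS", "SYS"],
    "GPU3": ["SYS", "SYS", "SYS", "SYS", "SYS", "SYS"],
    "GPU4": ["SYS", "SYS", "PIX", "SYS", "SYS", "SYS"],
    "GPU5": ["SYS", "SYS", "SYS", "SYS", "SYS", "SYS"],
    "GPU6": ["SYS", "SYS", "SYS", "PIX", "SYS", "SYS"],
    "GPU7": ["SYS", "SYS", "SYS", "SYS", "PIX", "SYS"],
}

# NIC index -> device name, from the NIC legend.
NIC_NAMES = ["mlx5_0", "mlx5_1", "mlx5_4", "mlx5_5", "mlx5_6", "mlx5_bond_0"]

# Devices passed via NCCL_IB_HCA in this report.
SELECTED = {"mlx5_0", "mlx5_1", "mlx5_4", "mlx5_5"}


def pix_nics(matrix, nic_names):
    """Map each GPU to the NIC device names it reaches over a PIX link."""
    return {
        gpu: [nic_names[i] for i, link in enumerate(links) if link == "PIX"]
        for gpu, links in matrix.items()
    }


local = pix_nics(GPU_TO_NIC, NIC_NAMES)
for gpu, nics in local.items():
    usable = [n for n in nics if n in SELECTED]
    print(f"{gpu}: PIX NICs={nics} selected={usable}")
```

On this topology only GPU0, GPU2, GPU4 and GPU6 have a selected NIC at `PIX` distance. GPU7's nearest NIC, mlx5_6, is not in the `NCCL_IB_HCA` list, and the odd-numbered GPUs other than GPU7 reach every NIC only over `SYS`.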

Two-node allreduce test with 8 H100 GPUs per node, using 4 NICs. The measured busbw is about 309 GB/s, but the theoretical busbw should be 360 GB/s.

$ mpirun --allow-run-as-root --hostfile hosts.txt  --oversubscribe  -x  NCCL_ALGO=Tree -x NCCL_DEBUG=INFO -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_QPS_PER_CONNECTION=2 -x LD_LIBRARY_PATH -x NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_4,mlx5_5 -np 16 ./all_reduce_perf -b 2M -e 16G -f 2 -n 10 -g 1 -w 10

#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s) 
     2097152        524288     float     sum      -1    118.1   17.75   33.29      0    92.65   22.64   42.44      0
     4194304       1048576     float     sum      -1    104.8   40.01   75.03      0    105.4   39.78   74.59      0
     8388608       2097152     float     sum      -1    140.7   59.60  111.75      0    142.9   58.72  110.10      0
    16777216       4194304     float     sum      -1    231.9   72.33  135.62      0    237.8   70.56  132.29      0
    33554432       8388608     float     sum      -1    412.3   81.39  152.60      0    417.3   80.40  150.75      0
    67108864      16777216     float     sum      -1    663.5  101.14  189.64      0    672.7   99.76  187.05      0
   134217728      33554432     float     sum      -1   1168.2  114.89  215.42      0   1311.3  102.35  191.91      0
   268435456      67108864     float     sum      -1   2130.3  126.01  236.27      0   2130.6  125.99  236.23      0
   536870912     134217728     float     sum      -1   3611.0  148.68  278.77      0   3603.2  149.00  279.37      0
  1073741824     268435456     float     sum      -1   6793.3  158.06  296.36      0   6781.1  158.34  296.89      0
  2147483648     536870912     float     sum      -1    13184  162.89  305.41      0    13129  163.56  306.68      0
  4294967296    1073741824     float     sum      -1    25986  165.28  309.90      0    25893  165.87  311.01      0
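
For reference, the busbw column above can be reproduced from the algbw column. This sketch uses the allreduce correction factor documented for nccl-tests, busbw = algbw × 2(n−1)/n, with n = 16 ranks as in this run. The function names are made up for this illustration. The last step only inverts the formula to show what aggregate algbw the 360 GB/s target implies, and assumes the four NICs would share it equally.

```python
def allreduce_busbw(algbw_gbs, nranks):
    """Bus bandwidth for allreduce: algbw * 2 * (n - 1) / n."""
    return algbw_gbs * 2 * (nranks - 1) / nranks


def allreduce_algbw(busbw_gbs, nranks):
    """Inverse of allreduce_busbw."""
    return busbw_gbs * nranks / (2 * (nranks - 1))


NRANKS = 16  # 2 nodes x 8 GPUs

# (algbw, reported busbw) pairs copied from the out-of-place columns above.
rows = [(17.75, 33.29), (101.14, 189.64), (165.28, 309.90)]
for algbw, reported in rows:
    computed = allreduce_busbw(algbw, NRANKS)
    print(f"algbw={algbw:7.2f}  computed busbw={computed:7.2f}  reported={reported:7.2f}")

# Aggregate algbw implied by the 360 GB/s target, and the equal per-NIC share.
target_algbw = allreduce_algbw(360.0, NRANKS)
print(f"target algbw={target_algbw:.1f} GB/s, per NIC={target_algbw / 4:.1f} GB/s")
```

The computed values match the reported column to within rounding, so the shortfall is in the measured algbw itself (about 165 GB/s against the 192 GB/s that a 360 GB/s busbw would require), not in how busbw is derived.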

NCCL log (NCCL_DEBUG=INFO):

qh100-gpu20:39630:39685 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ibp25s0:10.20.0.120<0>
qh100-gpu20:39630:39685 [1] NCCL INFO Using network IBext
qh100-gpu19:49354:49411 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ibp25s0:10.20.0.119<0>
qh100-gpu19:49354:49411 [0] NCCL INFO Using network IBext
qh100-gpu19:49361:49412 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ibp25s0:10.20.0.119<0>
qh100-gpu19:49361:49412 [7] NCCL INFO Using network IBext
qh100-gpu19:49355:49414 [1] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ibp25s0:10.20.0.119<0>
qh100-gpu19:49355:49414 [1] NCCL INFO Using network IBext
qh100-gpu19:49357:49418 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ibp25s0:10.20.0.119<0>
qh100-gpu19:49360:49416 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ibp25s0:10.20.0.119<0>
qh100-gpu19:49356:49417 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ibp25s0:10.20.0.119<0>
qh100-gpu19:49357:49418 [3] NCCL INFO Using network IBext
qh100-gpu19:49360:49416 [6] NCCL INFO Using network IBext
qh100-gpu19:49356:49417 [2] NCCL INFO Using network IBext
qh100-gpu20:39629:39686 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v8 symbol.
qh100-gpu20:39629:39686 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
qh100-gpu20:39629:39686 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
qh100-gpu20:39629:39686 [0] NCCL INFO NET/Plugin: Loaded collnet plugin SHARP (v6)
qh100-gpu20:39629:39686 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
qh100-gpu20:39629:39686 [0] NCCL INFO P2P plugin IBext
qh100-gpu20:39632:39687 [3] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v8 symbol.
qh100-gpu20:39632:39687 [3] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
qh100-gpu20:39632:39687 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
qh100-gpu20:39632:39687 [3] NCCL INFO NET/Plugin: Loaded collnet plugin SHARP (v6)
qh100-gpu20:39632:39687 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
qh100-gpu20:39632:39687 [3] NCCL INFO P2P plugin IBext
qh100-gpu20:39635:39691 [6] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v8 symbol.
qh100-gpu20:39635:39691 [6] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
qh100-gpu20:39635:39691 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
qh100-gpu20:39635:39691 [6] NCCL INFO NET/Plugin: Loaded collnet plugin SHARP (v6)
qh100-gpu20:39635:39691 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
qh100-gpu20:39635:39691 [6] NCCL INFO P2P plugin IBext
qh100-gpu20:39633:39689 [4] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v8 symbol.
qh100-gpu20:39633:39689 [4] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
qh100-gpu20:39633:39689 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
qh100-gpu20:39633:39689 [4] NCCL INFO NET/Plugin: Loaded collnet plugin SHARP (v6)
qh100-gpu20:39633:39689 [4] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
qh100-gpu20:39633:39689 [4] NCCL INFO P2P plugin IBext
qh100-gpu19:49358:49415 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ibp25s0:10.20.0.119<0>
qh100-gpu19:49358:49415 [4] NCCL INFO Using network IBext
qh100-gpu20:39636:39690 [7] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v8 symbol.
qh100-gpu20:39636:39690 [7] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
qh100-gpu20:39636:39690 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
qh100-gpu20:39636:39690 [7] NCCL INFO NET/Plugin: Loaded collnet plugin SHARP (v6)
qh100-gpu20:39636:39690 [7] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
qh100-gpu20:39636:39690 [7] NCCL INFO P2P plugin IBext
qh100-gpu19:49359:49413 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ibp25s0:10.20.0.119<0>
qh100-gpu19:49359:49413 [5] NCCL INFO Using network IBext
qh100-gpu20:39631:39688 [2] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v8 symbol.
qh100-gpu20:39631:39688 [2] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
qh100-gpu20:39631:39688 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
qh100-gpu20:39631:39688 [2] NCCL INFO NET/Plugin: Loaded collnet plugin SHARP (v6)
qh100-gpu20:39631:39688 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
qh100-gpu20:39631:39688 [2] NCCL INFO P2P plugin IBext
qh100-gpu20:39634:39692 [5] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v8 symbol.
qh100-gpu20:39634:39692 [5] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
qh100-gpu20:39634:39692 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
qh100-gpu20:39634:39692 [5] NCCL INFO NET/Plugin: Loaded collnet plugin SHARP (v6)
qh100-gpu20:39634:39692 [5] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
qh100-gpu20:39634:39692 [5] NCCL INFO P2P plugin IBext
qh100-gpu20:39632:39687 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ibp25s0:10.20.0.120<0>
qh100-gpu20:39632:39687 [3] NCCL INFO Using network IBext
qh100-gpu20:39629:39686 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ibp25s0:10.20.0.120<0>
qh100-gpu20:39629:39686 [0] NCCL INFO Using network IBext
qh100-gpu20:39635:39691 [6] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ibp25s0:10.20.0.120<0>
qh100-gpu20:39635:39691 [6] NCCL INFO Using network IBext
qh100-gpu20:39631:39688 [2] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ibp25s0:10.20.0.120<0>
qh100-gpu20:39631:39688 [2] NCCL INFO Using network IBext
qh100-gpu20:39636:39690 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ibp25s0:10.20.0.120<0>
qh100-gpu20:39636:39690 [7] NCCL INFO Using network IBext
qh100-gpu20:39633:39689 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ibp25s0:10.20.0.120<0>
qh100-gpu20:39633:39689 [4] NCCL INFO Using network IBext
qh100-gpu20:39634:39692 [5] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB/SHARP [1]mlx5_1:1/IB/SHARP [2]mlx5_4:1/IB/SHARP [3]mlx5_5:1/IB/SHARP [RO]; OOB ibp25s0:10.20.0.120<0>
qh100-gpu20:39634:39692 [5] NCCL INFO Using network IBext
qh100-gpu19:49360:49416 [6] NCCL INFO ncclCommInitRank comm 0x55d900e4a6c0 rank 6 nranks 16 cudaDev 6 nvmlDev 6 busId ba000 commId 0xd8289d3e6e217d26 - Init START
qh100-gpu19:49358:49415 [4] NCCL INFO ncclCommInitRank comm 0x55f93813b890 rank 4 nranks 16 cudaDev 4 nvmlDev 4 busId 9a000 commId 0xd8289d3e6e217d26 - Init START
qh100-gpu19:49359:49413 [5] NCCL INFO ncclCommInitRank comm 0x55c18f112200 rank 5 nranks 16 cudaDev 5 nvmlDev 5 busId ab000 commId 0xd8289d3e6e217d26 - Init START
qh100-gpu19:49357:49418 [3] NCCL INFO ncclCommInitRank comm 0x5555e0b324c0 rank 3 nranks 16 cudaDev 3 nvmlDev 3 busId 5d000 commId 0xd8289d3e6e217d26 - Init START
qh100-gpu19:49361:49412 [7] NCCL INFO ncclCommInitRank comm 0x556bf9ef3fc0 rank 7 nranks 16 cudaDev 7 nvmlDev 7 busId db000 commId 0xd8289d3e6e217d26 - Init START
qh100-gpu20:39631:39688 [2] NCCL INFO ncclCommInitRank comm 0x560fe9299b80 rank 10 nranks 16 cudaDev 2 nvmlDev 2 busId 3a000 commId 0xd8289d3e6e217d26 - Init START
qh100-gpu20:39629:39686 [0] NCCL INFO ncclCommInitRank comm 0x55aa2afe7340 rank 8 nranks 16 cudaDev 0 nvmlDev 0 busId 18000 commId 0xd8289d3e6e217d26 - Init START
qh100-gpu20:39630:39685 [1] NCCL INFO ncclCommInitRank comm 0x556888e99580 rank 9 nranks 16 cudaDev 1 nvmlDev 1 busId 2a000 commId 0xd8289d3e6e217d26 - Init START
qh100-gpu19:49356:49417 [2] NCCL INFO ncclCommInitRank comm 0x55f13ded5890 rank 2 nranks 16 cudaDev 2 nvmlDev 2 busId 3a000 commId 0xd8289d3e6e217d26 - Init START
qh100-gpu19:49355:49414 [1] NCCL INFO ncclCommInitRank comm 0x55bbc86693d0 rank 1 nranks 16 cudaDev 1 nvmlDev 1 busId 2a000 commId 0xd8289d3e6e217d26 - Init START
qh100-gpu19:49354:49411 [0] NCCL INFO ncclCommInitRank comm 0x563b05b8d020 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 18000 commId 0xd8289d3e6e217d26 - Init START
qh100-gpu20:39632:39687 [3] NCCL INFO ncclCommInitRank comm 0x559295370c20 rank 11 nranks 16 cudaDev 3 nvmlDev 3 busId 5d000 commId 0xd8289d3e6e217d26 - Init START
qh100-gpu20:39633:39689 [4] NCCL INFO ncclCommInitRank comm 0x5619b9724520 rank 12 nranks 16 cudaDev 4 nvmlDev 4 busId 9a000 commId 0xd8289d3e6e217d26 - Init START
qh100-gpu20:39636:39690 [7] NCCL INFO ncclCommInitRank comm 0x55a80a710090 rank 15 nranks 16 cudaDev 7 nvmlDev 7 busId db000 commId 0xd8289d3e6e217d26 - Init START
qh100-gpu20:39634:39692 [5] NCCL INFO ncclCommInitRank comm 0x5558e3c10180 rank 13 nranks 16 cudaDev 5 nvmlDev 5 busId ab000 commId 0xd8289d3e6e217d26 - Init START
qh100-gpu20:39635:39691 [6] NCCL INFO ncclCommInitRank comm 0x56089796c4b0 rank 14 nranks 16 cudaDev 6 nvmlDev 6 busId ba000 commId 0xd8289d3e6e217d26 - Init START
qh100-gpu19:49361:49412 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff,ffff0000,00000000,ffffffff,ffff0000,00000000
qh100-gpu19:49361:49412 [7] NCCL INFO NVLS multicast support is available on dev 7
qh100-gpu19:49360:49416 [6] NCCL INFO Setting affinity for GPU 6 to ffffffff,ffff0000,00000000,ffffffff,ffff0000,00000000
qh100-gpu19:49360:49416 [6] NCCL INFO NVLS multicast support is available on dev 6
qh100-gpu19:49354:49411 [0] NCCL INFO Setting affinity for GPU 0 to ffff,ffffffff,00000000,0000ffff,ffffffff
qh100-gpu19:49354:49411 [0] NCCL INFO NVLS multicast support is available on dev 0
qh100-gpu19:49359:49413 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ffff0000,00000000,ffffffff,ffff0000,00000000
qh100-gpu19:49359:49413 [5] NCCL INFO NVLS multicast support is available on dev 5
qh100-gpu19:49358:49415 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ffff0000,00000000,ffffffff,ffff0000,00000000
qh100-gpu19:49358:49415 [4] NCCL INFO NVLS multicast support is available on dev 4
qh100-gpu19:49357:49418 [3] NCCL INFO Setting affinity for GPU 3 to ffff,ffffffff,00000000,0000ffff,ffffffff
qh100-gpu19:49357:49418 [3] NCCL INFO NVLS multicast support is available on dev 3
qh100-gpu19:49356:49417 [2] NCCL INFO Setting affinity for GPU 2 to ffff,ffffffff,00000000,0000ffff,ffffffff
qh100-gpu19:49356:49417 [2] NCCL INFO NVLS multicast support is available on dev 2
qh100-gpu19:49355:49414 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffffff,00000000,0000ffff,ffffffff
qh100-gpu19:49355:49414 [1] NCCL INFO NVLS multicast support is available on dev 1
qh100-gpu20:39629:39686 [0] NCCL INFO Setting affinity for GPU 0 to ffff,ffffffff,00000000,0000ffff,ffffffff
qh100-gpu20:39629:39686 [0] NCCL INFO NVLS multicast support is available on dev 0
qh100-gpu20:39636:39690 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff,ffff0000,00000000,ffffffff,ffff0000,00000000
qh100-gpu20:39636:39690 [7] NCCL INFO NVLS multicast support is available on dev 7
qh100-gpu20:39635:39691 [6] NCCL INFO Setting affinity for GPU 6 to ffffffff,ffff0000,00000000,ffffffff,ffff0000,00000000
qh100-gpu20:39635:39691 [6] NCCL INFO NVLS multicast support is available on dev 6
qh100-gpu20:39634:39692 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,ffff0000,00000000,ffffffff,ffff0000,00000000
qh100-gpu20:39634:39692 [5] NCCL INFO NVLS multicast support is available on dev 5
qh100-gpu20:39633:39689 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,ffff0000,00000000,ffffffff,ffff0000,00000000
qh100-gpu20:39633:39689 [4] NCCL INFO NVLS multicast support is available on dev 4
qh100-gpu20:39630:39685 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffffff,00000000,0000ffff,ffffffff
qh100-gpu20:39630:39685 [1] NCCL INFO NVLS multicast support is available on dev 1
qh100-gpu20:39632:39687 [3] NCCL INFO Setting affinity for GPU 3 to ffff,ffffffff,00000000,0000ffff,ffffffff
qh100-gpu20:39631:39688 [2] NCCL INFO Setting affinity for GPU 2 to ffff,ffffffff,00000000,0000ffff,ffffffff
qh100-gpu20:39631:39688 [2] NCCL INFO NVLS multicast support is available on dev 2
qh100-gpu20:39632:39687 [3] NCCL INFO NVLS multicast support is available on dev 3
qh100-gpu19:49360:49416 [6] NCCL INFO comm 0x55d900e4a6c0 rank 6 nRanks 16 nNodes 2 localRanks 8 localRank 6 MNNVL 0
qh100-gpu19:49360:49416 [6] NCCL INFO NVLS Head  0:  0  8
qh100-gpu19:49360:49416 [6] NCCL INFO NVLS Head  1:  2 10
qh100-gpu19:49360:49416 [6] NCCL INFO NVLS Head  2:  4 12
qh100-gpu19:49360:49416 [6] NCCL INFO NVLS Head  3:  6 14
qh100-gpu19:49360:49416 [6] NCCL INFO Trees [0] 5/-1/-1->6->4 [1] 5/-1/-1->6->4 [2] 5/-1/-1->6->4 [3] 5/14/-1->6->-1 [4] 5/-1/-1->6->4 [5] 5/-1/-1->6->4 [6] 5/-1/-1->6->4 [7] 5/14/-1->6->-1 [8] 5/-1/-1->6->4 [9] 5/-1/-1->6->4 [10] 5/-1/-1->6->4 [11] 5/-1/-1->6->14 [12] 5/-1/-1->6->4 [13] 5/-1/-1->6->4 [14] 5/-1/-1->6->4 [15] 5/-1/-1->6->14
qh100-gpu19:49360:49416 [6] NCCL INFO P2P Chunksize set to 131072
qh100-gpu19:49361:49412 [7] NCCL INFO comm 0x556bf9ef3fc0 rank 7 nRanks 16 nNodes 2 localRanks 8 localRank 7 MNNVL 0
qh100-gpu19:49361:49412 [7] NCCL INFO NVLS Head  0:  0  8
qh100-gpu19:49361:49412 [7] NCCL INFO NVLS Head  1:  2 10
qh100-gpu19:49361:49412 [7] NCCL INFO NVLS Head  2:  4 12
qh100-gpu19:49361:49412 [7] NCCL INFO NVLS Head  3:  6 14
qh100-gpu19:49361:49412 [7] NCCL INFO Trees [0] -1/-1/-1->7->5 [1] 0/-1/-1->7->5 [2] 0/-1/-1->7->5 [3] 0/-1/-1->7->5 [4] -1/-1/-1->7->5 [5] 0/-1/-1->7->5 [6] 0/-1/-1->7->5 [7] 0/-1/-1->7->5 [8] -1/-1/-1->7->5 [9] 0/-1/-1->7->5 [10] 0/-1/-1->7->5 [11] 0/-1/-1->7->5 [12] -1/-1/-1->7->5 [13] 0/-1/-1->7->5 [14] 0/-1/-1->7->5 [15] 0/-1/-1->7->5
qh100-gpu19:49361:49412 [7] NCCL INFO P2P Chunksize set to 131072
qh100-gpu19:49359:49413 [5] NCCL INFO comm 0x55c18f112200 rank 5 nRanks 16 nNodes 2 localRanks 8 localRank 5 MNNVL 0
qh100-gpu19:49358:49415 [4] NCCL INFO comm 0x55f93813b890 rank 4 nRanks 16 nNodes 2 localRanks 8 localRank 4 MNNVL 0
qh100-gpu19:49358:49415 [4] NCCL INFO NVLS Head  0:  0  8
qh100-gpu19:49358:49415 [4] NCCL INFO NVLS Head  1:  2 10
qh100-gpu19:49358:49415 [4] NCCL INFO NVLS Head  2:  4 12
qh100-gpu19:49358:49415 [4] NCCL INFO NVLS Head  3:  6 14
qh100-gpu19:49358:49415 [4] NCCL INFO Trees [0] 6/-1/-1->4->3 [1] 6/-1/-1->4->3 [2] 6/12/-1->4->-1 [3] -1/-1/-1->4->3 [4] 6/-1/-1->4->3 [5] 6/-1/-1->4->3 [6] 6/12/-1->4->-1 [7] -1/-1/-1->4->3 [8] 6/-1/-1->4->3 [9] 6/-1/-1->4->3 [10] 6/-1/-1->4->12 [11] -1/-1/-1->4->3 [12] 6/-1/-1->4->3 [13] 6/-1/-1->4->3 [14] 6/-1/-1->4->12 [15] -1/-1/-1->4->3
qh100-gpu19:49358:49415 [4] NCCL INFO P2P Chunksize set to 131072
qh100-gpu19:49354:49411 [0] NCCL INFO comm 0x563b05b8d020 rank 0 nRanks 16 nNodes 2 localRanks 8 localRank 0 MNNVL 0
qh100-gpu19:49354:49411 [0] NCCL INFO NVLS Head  0:  0  8
qh100-gpu19:49354:49411 [0] NCCL INFO NVLS Head  1:  2 10
qh100-gpu19:49354:49411 [0] NCCL INFO NVLS Head  2:  4 12
qh100-gpu19:49354:49411 [0] NCCL INFO NVLS Head  3:  6 14
qh100-gpu19:49354:49411 [0] NCCL INFO Channel 00/16 :    0   7   5   6   4   3   1   2   8  15  13  14  12  11   9  10
qh100-gpu19:49354:49411 [0] NCCL INFO Channel 01/16 :    0   7   5   6   4   3   1  10   8  15  13  14  12  11   9   2
qh100-gpu19:49354:49411 [0] NCCL INFO Channel 02/16 :    0   7   5   6  12  11   9  10   8  15  13  14   4   3   1   2
qh100-gpu19:49354:49411 [0] NCCL INFO Channel 03/16 :    0   7   5  14  12  11   9  10   8  15  13   6   4   3   1   2
qh100-gpu19:49354:49411 [0] NCCL INFO Channel 04/16 :    0   7   5   6   4   3   1   2   8  15  13  14  12  11   9  10
qh100-gpu19:49354:49411 [0] NCCL INFO Channel 05/16 :    0   7   5   6   4   3   1  10   8  15  13  14  12  11   9   2
qh100-gpu19:49354:49411 [0] NCCL INFO Channel 06/16 :    0   7   5   6  12  11   9  10   8  15  13  14   4   3   1   2
qh100-gpu19:49354:49411 [0] NCCL INFO Channel 07/16 :    0   7   5  14  12  11   9  10   8  15  13   6   4   3   1   2
qh100-gpu19:49354:49411 [0] NCCL INFO Channel 08/16 :    0   7   5   6   4   3   1   2   8  15  13  14  12  11   9  10
qh100-gpu19:49354:49411 [0] NCCL INFO Channel 09/16 :    0   7   5   6   4   3   1  10   8  15  13  14  12  11   9   2
qh100-gpu19:49354:49411 [0] NCCL INFO Channel 10/16 :    0   7   5   6  12  11   9  10   8  15  13  14   4   3   1   2
qh100-gpu19:49354:49411 [0] NCCL INFO Channel 11/16 :    0   7   5  14  12  11   9  10   8  15  13   6   4   3   1   2
qh100-gpu19:49354:49411 [0] NCCL INFO Channel 12/16 :    0   7   5   6   4   3   1   2   8  15  13  14  12  11   9  10
qh100-gpu19:49354:49411 [0] NCCL INFO Channel 13/16 :    0   7   5   6   4   3   1  10   8  15  13  14  12  11   9   2
qh100-gpu19:49354:49411 [0] NCCL INFO Channel 14/16 :    0   7   5   6  12  11   9  10   8  15  13  14   4   3   1   2
qh100-gpu19:49354:49411 [0] NCCL INFO Channel 15/16 :    0   7   5  14  12  11   9  10   8  15  13   6   4   3   1   2
qh100-gpu19:49354:49411 [0] NCCL INFO Trees [0] 2/8/-1->0->-1 [1] -1/-1/-1->0->7 [2] 2/-1/-1->0->7 [3] 2/-1/-1->0->7 [4] 2/8/-1->0->-1 [5] -1/-1/-1->0->7 [6] 2/-1/-1->0->7 [7] 2/-1/-1->0->7 [8] 2/-1/-1->0->8 [9] -1/-1/-1->0->7 [10] 2/-1/-1->0->7 [11] 2/-1/-1->0->7 [12] 2/-1/-1->0->8 [13] -1/-1/-1->0->7 [14] 2/-1/-1->0->7 [15] 2/-1/-1->0->7
qh100-gpu19:49354:49411 [0] NCCL INFO P2P Chunksize set to 131072
qh100-gpu19:49357:49418 [3] NCCL INFO comm 0x5555e0b324c0 rank 3 nRanks 16 nNodes 2 localRanks 8 localRank 3 MNNVL 0
qh100-gpu19:49357:49418 [3] NCCL INFO NVLS Head  0:  0  8
qh100-gpu19:49357:49418 [3] NCCL INFO NVLS Head  1:  2 10
qh100-gpu19:49357:49418 [3] NCCL INFO NVLS Head  2:  4 12
qh100-gpu19:49357:49418 [3] NCCL INFO NVLS Head  3:  6 14
qh100-gpu19:49357:49418 [3] NCCL INFO Trees [0] 4/-1/-1->3->1 [1] 4/-1/-1->3->1 [2] -1/-1/-1->3->1 [3] 4/-1/-1->3->1 [4] 4/-1/-1->3->1 [5] 4/-1/-1->3->1 [6] -1/-1/-1->3->1 [7] 4/-1/-1->3->1 [8] 4/-1/-1->3->1 [9] 4/-1/-1->3->1 [10] -1/-1/-1->3->1 [11] 4/-1/-1->3->1 [12] 4/-1/-1->3->1 [13] 4/-1/-1->3->1 [14] -1/-1/-1->3->1 [15] 4/-1/-1->3->1
qh100-gpu19:49357:49418 [3] NCCL INFO P2P Chunksize set to 131072
qh100-gpu19:49355:49414 [1] NCCL INFO comm 0x55bbc86693d0 rank 1 nRanks 16 nNodes 2 localRanks 8 localRank 1 MNNVL 0
qh100-gpu19:49355:49414 [1] NCCL INFO NVLS Head  0:  0  8
qh100-gpu19:49355:49414 [1] NCCL INFO NVLS Head  1:  2 10
qh100-gpu19:49355:49414 [1] NCCL INFO NVLS Head  2:  4 12
qh100-gpu19:49355:49414 [1] NCCL INFO NVLS Head  3:  6 14
qh100-gpu19:49355:49414 [1] NCCL INFO Trees [0] 3/-1/-1->1->2 [1] 3/-1/-1->1->2 [2] 3/-1/-1->1->2 [3] 3/-1/-1->1->2 [4] 3/-1/-1->1->2 [5] 3/-1/-1->1->2 [6] 3/-1/-1->1->2 [7] 3/-1/-1->1->2 [8] 3/-1/-1->1->2 [9] 3/-1/-1->1->2 [10] 3/-1/-1->1->2 [11] 3/-1/-1->1->2 [12] 3/-1/-1->1->2 [13] 3/-1/-1->1->2 [14] 3/-1/-1->1->2 [15] 3/-1/-1->1->2
qh100-gpu19:49355:49414 [1] NCCL INFO P2P Chunksize set to 131072
qh100-gpu19:49356:49417 [2] NCCL INFO comm 0x55f13ded5890 rank 2 nRanks 16 nNodes 2 localRanks 8 localRank 2 MNNVL 0
qh100-gpu19:49356:49417 [2] NCCL INFO NVLS Head  0:  0  8
qh100-gpu19:49356:49417 [2] NCCL INFO NVLS Head  1:  2 10
qh100-gpu19:49356:49417 [2] NCCL INFO NVLS Head  2:  4 12
qh100-gpu19:49356:49417 [2] NCCL INFO NVLS Head  3:  6 14
qh100-gpu19:49356:49417 [2] NCCL INFO Trees [0] 1/-1/-1->2->0 [1] 1/10/-1->2->-1 [2] 1/-1/-1->2->0 [3] 1/-1/-1->2->0 [4] 1/-1/-1->2->0 [5] 1/10/-1->2->-1 [6] 1/-1/-1->2->0 [7] 1/-1/-1->2->0 [8] 1/-1/-1->2->0 [9] 1/-1/-1->2->10 [10] 1/-1/-1->2->0 [11] 1/-1/-1->2->0 [12] 1/-1/-1->2->0 [13] 1/-1/-1->2->10 [14] 1/-1/-1->2->0 [15] 1/-1/-1->2->0
qh100-gpu19:49356:49417 [2] NCCL INFO P2P Chunksize set to 131072
qh100-gpu19:49359:49413 [5] NCCL INFO NVLS Head  0:  0  8
qh100-gpu19:49359:49413 [5] NCCL INFO NVLS Head  1:  2 10
qh100-gpu19:49359:49413 [5] NCCL INFO NVLS Head  2:  4 12
qh100-gpu19:49359:49413 [5] NCCL INFO NVLS Head  3:  6 14
qh100-gpu19:49359:49413 [5] NCCL INFO Trees [0] 7/-1/-1->5->6 [1] 7/-1/-1->5->6 [2] 7/-1/-1->5->6 [3] 7/-1/-1->5->6 [4] 7/-1/-1->5->6 [5] 7/-1/-1->5->6 [6] 7/-1/-1->5->6 [7] 7/-1/-1->5->6 [8] 7/-1/-1->5->6 [9] 7/-1/-1->5->6 [10] 7/-1/-1->5->6 [11] 7/-1/-1->5->6 [12] 7/-1/-1->5->6 [13] 7/-1/-1->5->6 [14] 7/-1/-1->5->6 [15] 7/-1/-1->5->6
qh100-gpu19:49359:49413 [5] NCCL INFO P2P Chunksize set to 131072
qh100-gpu20:39629:39686 [0] NCCL INFO comm 0x55aa2afe7340 rank 8 nRanks 16 nNodes 2 localRanks 8 localRank 0 MNNVL 0
qh100-gpu20:39629:39686 [0] NCCL INFO Trees [0] 10/-1/-1->8->0 [1] -1/-1/-1->8->15 [2] 10/-1/-1->8->15 [3] 10/-1/-1->8->15 [4] 10/-1/-1->8->0 [5] -1/-1/-1->8->15 [6] 10/-1/-1->8->15 [7] 10/-1/-1->8->15 [8] 10/0/-1->8->-1 [9] -1/-1/-1->8->15 [10] 10/-1/-1->8->15 [11] 10/-1/-1->8->15 [12] 10/0/-1->8->-1 [13] -1/-1/-1->8->15 [14] 10/-1/-1->8->15 [15] 10/-1/-1->8->15
qh100-gpu20:39629:39686 [0] NCCL INFO P2P Chunksize set to 131072
qh100-gpu20:39631:39688 [2] NCCL INFO comm 0x560fe9299b80 rank 10 nRanks 16 nNodes 2 localRanks 8 localRank 2 MNNVL 0
qh100-gpu20:39631:39688 [2] NCCL INFO Trees [0] 9/-1/-1->10->8 [1] 9/-1/-1->10->2 [2] 9/-1/-1->10->8 [3] 9/-1/-1->10->8 [4] 9/-1/-1->10->8 [5] 9/-1/-1->10->2 [6] 9/-1/-1->10->8 [7] 9/-1/-1->10->8 [8] 9/-1/-1->10->8 [9] 9/2/-1->10->-1 [10] 9/-1/-1->10->8 [11] 9/-1/-1->10->8 [12] 9/-1/-1->10->8 [13] 9/2/-1->10->-1 [14] 9/-1/-1->10->8 [15] 9/-1/-1->10->8
qh100-gpu20:39631:39688 [2] NCCL INFO P2P Chunksize set to 131072
qh100-gpu20:39632:39687 [3] NCCL INFO comm 0x559295370c20 rank 11 nRanks 16 nNodes 2 localRanks 8 localRank 3 MNNVL 0
qh100-gpu20:39632:39687 [3] NCCL INFO Trees [0] 12/-1/-1->11->9 [1] 12/-1/-1->11->9 [2] -1/-1/-1->11->9 [3] 12/-1/-1->11->9 [4] 12/-1/-1->11->9 [5] 12/-1/-1->11->9 [6] -1/-1/-1->11->9 [7] 12/-1/-1->11->9 [8] 12/-1/-1->11->9 [9] 12/-1/-1->11->9 [10] -1/-1/-1->11->9 [11] 12/-1/-1->11->9 [12] 12/-1/-1->11->9 [13] 12/-1/-1->11->9 [14] -1/-1/-1->11->9 [15] 12/-1/-1->11->9
qh100-gpu20:39632:39687 [3] NCCL INFO P2P Chunksize set to 131072
qh100-gpu20:39630:39685 [1] NCCL INFO comm 0x556888e99580 rank 9 nRanks 16 nNodes 2 localRanks 8 localRank 1 MNNVL 0
qh100-gpu20:39630:39685 [1] NCCL INFO Trees [0] 11/-1/-1->9->10 [1] 11/-1/-1->9->10 [2] 11/-1/-1->9->10 [3] 11/-1/-1->9->10 [4] 11/-1/-1->9->10 [5] 11/-1/-1->9->10 [6] 11/-1/-1->9->10 [7] 11/-1/-1->9->10 [8] 11/-1/-1->9->10 [9] 11/-1/-1->9->10 [10] 11/-1/-1->9->10 [11] 11/-1/-1->9->10 [12] 11/-1/-1->9->10 [13] 11/-1/-1->9->10 [14] 11/-1/-1->9->10 [15] 11/-1/-1->9->10
qh100-gpu20:39630:39685 [1] NCCL INFO P2P Chunksize set to 131072
qh100-gpu20:39636:39690 [7] NCCL INFO comm 0x55a80a710090 rank 15 nRanks 16 nNodes 2 localRanks 8 localRank 7 MNNVL 0
qh100-gpu20:39636:39690 [7] NCCL INFO Trees [0] -1/-1/-1->15->13 [1] 8/-1/-1->15->13 [2] 8/-1/-1->15->13 [3] 8/-1/-1->15->13 [4] -1/-1/-1->15->13 [5] 8/-1/-1->15->13 [6] 8/-1/-1->15->13 [7] 8/-1/-1->15->13 [8] -1/-1/-1->15->13 [9] 8/-1/-1->15->13 [10] 8/-1/-1->15->13 [11] 8/-1/-1->15->13 [12] -1/-1/-1->15->13 [13] 8/-1/-1->15->13 [14] 8/-1/-1->15->13 [15] 8/-1/-1->15->13
qh100-gpu20:39636:39690 [7] NCCL INFO P2P Chunksize set to 131072
qh100-gpu20:39633:39689 [4] NCCL INFO comm 0x5619b9724520 rank 12 nRanks 16 nNodes 2 localRanks 8 localRank 4 MNNVL 0
qh100-gpu20:39633:39689 [4] NCCL INFO Trees [0] 14/-1/-1->12->11 [1] 14/-1/-1->12->11 [2] 14/-1/-1->12->4 [3] -1/-1/-1->12->11 [4] 14/-1/-1->12->11 [5] 14/-1/-1->12->11 [6] 14/-1/-1->12->4 [7] -1/-1/-1->12->11 [8] 14/-1/-1->12->11 [9] 14/-1/-1->12->11 [10] 14/4/-1->12->-1 [11] -1/-1/-1->12->11 [12] 14/-1/-1->12->11 [13] 14/-1/-1->12->11 [14] 14/4/-1->12->-1 [15] -1/-1/-1->12->11
qh100-gpu20:39633:39689 [4] NCCL INFO P2P Chunksize set to 131072
qh100-gpu20:39634:39692 [5] NCCL INFO comm 0x5558e3c10180 rank 13 nRanks 16 nNodes 2 localRanks 8 localRank 5 MNNVL 0
qh100-gpu20:39634:39692 [5] NCCL INFO Trees [0] 15/-1/-1->13->14 [1] 15/-1/-1->13->14 [2] 15/-1/-1->13->14 [3] 15/-1/-1->13->14 [4] 15/-1/-1->13->14 [5] 15/-1/-1->13->14 [6] 15/-1/-1->13->14 [7] 15/-1/-1->13->14 [8] 15/-1/-1->13->14 [9] 15/-1/-1->13->14 [10] 15/-1/-1->13->14 [11] 15/-1/-1->13->14 [12] 15/-1/-1->13->14 [13] 15/-1/-1->13->14 [14] 15/-1/-1->13->14 [15] 15/-1/-1->13->14
qh100-gpu20:39634:39692 [5] NCCL INFO P2P Chunksize set to 131072
qh100-gpu20:39635:39691 [6] NCCL INFO comm 0x56089796c4b0 rank 14 nRanks 16 nNodes 2 localRanks 8 localRank 6 MNNVL 0
qh100-gpu20:39635:39691 [6] NCCL INFO Trees [0] 13/-1/-1->14->12 [1] 13/-1/-1->14->12 [2] 13/-1/-1->14->12 [3] 13/-1/-1->14->6 [4] 13/-1/-1->14->12 [5] 13/-1/-1->14->12 [6] 13/-1/-1->14->12 [7] 13/-1/-1->14->6 [8] 13/-1/-1->14->12 [9] 13/-1/-1->14->12 [10] 13/-1/-1->14->12 [11] 13/6/-1->14->-1 [12] 13/-1/-1->14->12 [13] 13/-1/-1->14->12 [14] 13/-1/-1->14->12 [15] 13/6/-1->14->-1
qh100-gpu20:39635:39691 [6] NCCL INFO P2P Chunksize set to 131072
qh100-gpu19:49358:49415 [4] NCCL INFO NCCL_ALGO set by environment to Tree
qh100-gpu19:49358:49415 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
qh100-gpu19:49358:49415 [4] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
qh100-gpu19:49357:49418 [3] NCCL INFO NCCL_ALGO set by environment to Tree
qh100-gpu19:49357:49418 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
qh100-gpu19:49357:49418 [3] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
qh100-gpu19:49359:49413 [5] NCCL INFO NCCL_ALGO set by environment to Tree
qh100-gpu19:49359:49413 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
qh100-gpu19:49359:49413 [5] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
qh100-gpu19:49354:49411 [0] NCCL INFO NCCL_ALGO set by environment to Tree
qh100-gpu19:49354:49411 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
qh100-gpu19:49354:49411 [0] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
qh100-gpu19:49354:49411 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
qh100-gpu19:49356:49417 [2] NCCL INFO NCCL_ALGO set by environment to Tree
qh100-gpu19:49356:49417 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
qh100-gpu19:49356:49417 [2] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
qh100-gpu19:49360:49416 [6] NCCL INFO NCCL_ALGO set by environment to Tree
qh100-gpu19:49360:49416 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
qh100-gpu19:49360:49416 [6] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
qh100-gpu19:49361:49412 [7] NCCL INFO NCCL_ALGO set by environment to Tree
qh100-gpu19:49361:49412 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
qh100-gpu19:49361:49412 [7] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
qh100-gpu19:49355:49414 [1] NCCL INFO NCCL_ALGO set by environment to Tree
qh100-gpu19:49355:49414 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
qh100-gpu19:49355:49414 [1] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
qh100-gpu19:49358:49415 [4] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
qh100-gpu19:49358:49415 [4] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
qh100-gpu19:49358:49415 [4] NCCL INFO ncclCommInitRank comm 0x55f93813b890 rank 4 nranks 16 cudaDev 4 nvmlDev 4 busId 9a000 commId 0xd8289d3e6e217d26 - Init COMPLETE
qh100-gpu19:49360:49416 [6] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
qh100-gpu19:49360:49416 [6] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
qh100-gpu19:49360:49416 [6] NCCL INFO ncclCommInitRank comm 0x55d900e4a6c0 rank 6 nranks 16 cudaDev 6 nvmlDev 6 busId ba000 commId 0xd8289d3e6e217d26 - Init COMPLETE
qh100-gpu19:49360:49416 [6] NCCL INFO Init timings: rank 6 nranks 16 total 4.02 (kernels 0.29, bootstrap 1.98, allgathers 0.29, topo 0.66, graphs 0.67, connections 0.14, rest 0.00)
qh100-gpu19:49358:49415 [4] NCCL INFO Init timings: rank 4 nranks 16 total 4.03 (kernels 0.29, bootstrap 1.99, allgathers 0.28, topo 0.66, graphs 0.67, connections 0.14, rest 0.01)
qh100-gpu19:49359:49413 [5] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
qh100-gpu19:49359:49413 [5] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
qh100-gpu19:49359:49413 [5] NCCL INFO ncclCommInitRank comm 0x55c18f112200 rank 5 nranks 16 cudaDev 5 nvmlDev 5 busId ab000 commId 0xd8289d3e6e217d26 - Init COMPLETE
qh100-gpu19:49359:49413 [5] NCCL INFO Init timings: rank 5 nranks 16 total 4.04 (kernels 0.29, bootstrap 1.99, allgathers 0.28, topo 0.66, graphs 0.67, connections 0.14, rest 0.01)
qh100-gpu19:49354:49411 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
qh100-gpu19:49354:49411 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
qh100-gpu19:49354:49411 [0] NCCL INFO ncclCommInitRank comm 0x563b05b8d020 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 18000 commId 0xd8289d3e6e217d26 - Init COMPLETE
qh100-gpu19:49354:49411 [0] NCCL INFO Init timings: rank 0 nranks 16 total 4.15 (kernels 0.34, bootstrap 2.06, allgathers 0.28, topo 0.66, graphs 0.67, connections 0.14, rest 0.01)
qh100-gpu19:49355:49414 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
qh100-gpu19:49355:49414 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
qh100-gpu19:49355:49414 [1] NCCL INFO ncclCommInitRank comm 0x55bbc86693d0 rank 1 nranks 16 cudaDev 1 nvmlDev 1 busId 2a000 commId 0xd8289d3e6e217d26 - Init COMPLETE
qh100-gpu19:49355:49414 [1] NCCL INFO Init timings: rank 1 nranks 16 total 4.03 (kernels 0.29, bootstrap 1.99, allgathers 0.26, topo 0.66, graphs 0.68, connections 0.14, rest 0.00)
qh100-gpu19:49361:49412 [7] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
qh100-gpu19:49361:49412 [7] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
qh100-gpu19:49361:49412 [7] NCCL INFO ncclCommInitRank comm 0x556bf9ef3fc0 rank 7 nranks 16 cudaDev 7 nvmlDev 7 busId db000 commId 0xd8289d3e6e217d26 - Init COMPLETE
qh100-gpu19:49361:49412 [7] NCCL INFO Init timings: rank 7 nranks 16 total 4.04 (kernels 0.30, bootstrap 1.98, allgathers 0.28, topo 0.66, graphs 0.67, connections 0.14, rest 0.00)
qh100-gpu19:49356:49417 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
qh100-gpu19:49356:49417 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
qh100-gpu19:49356:49417 [2] NCCL INFO ncclCommInitRank comm 0x55f13ded5890 rank 2 nranks 16 cudaDev 2 nvmlDev 2 busId 3a000 commId 0xd8289d3e6e217d26 - Init COMPLETE
qh100-gpu19:49356:49417 [2] NCCL INFO Init timings: rank 2 nranks 16 total 4.02 (kernels 0.29, bootstrap 1.98, allgathers 0.27, topo 0.66, graphs 0.68, connections 0.14, rest 0.00)
qh100-gpu19:49357:49418 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
qh100-gpu19:49357:49418 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
qh100-gpu19:49357:49418 [3] NCCL INFO ncclCommInitRank comm 0x5555e0b324c0 rank 3 nranks 16 cudaDev 3 nvmlDev 3 busId 5d000 commId 0xd8289d3e6e217d26 - Init COMPLETE
qh100-gpu19:49357:49418 [3] NCCL INFO Init timings: rank 3 nranks 16 total 4.02 (kernels 0.29, bootstrap 1.98, allgathers 0.27, topo 0.66, graphs 0.68, connections 0.14, rest 0.01)
qh100-gpu20:39630:39685 [1] NCCL INFO NCCL_ALGO set by environment to Tree
qh100-gpu20:39630:39685 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
qh100-gpu20:39630:39685 [1] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
qh100-gpu20:39631:39688 [2] NCCL INFO NCCL_ALGO set by environment to Tree
qh100-gpu20:39631:39688 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
qh100-gpu20:39631:39688 [2] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
qh100-gpu20:39633:39689 [4] NCCL INFO NCCL_ALGO set by environment to Tree
qh100-gpu20:39633:39689 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
qh100-gpu20:39633:39689 [4] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
qh100-gpu20:39634:39692 [5] NCCL INFO NCCL_ALGO set by environment to Tree
qh100-gpu20:39634:39692 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
qh100-gpu20:39634:39692 [5] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
qh100-gpu20:39635:39691 [6] NCCL INFO NCCL_ALGO set by environment to Tree
qh100-gpu20:39635:39691 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
qh100-gpu20:39635:39691 [6] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
qh100-gpu20:39629:39686 [0] NCCL INFO NCCL_ALGO set by environment to Tree
qh100-gpu20:39629:39686 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
qh100-gpu20:39629:39686 [0] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
qh100-gpu20:39636:39690 [7] NCCL INFO NCCL_ALGO set by environment to Tree
qh100-gpu20:39636:39690 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
qh100-gpu20:39636:39690 [7] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
qh100-gpu20:39632:39687 [3] NCCL INFO NCCL_ALGO set by environment to Tree
qh100-gpu20:39632:39687 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
qh100-gpu20:39632:39687 [3] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
qh100-gpu20:39631:39688 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
qh100-gpu20:39631:39688 [2] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
qh100-gpu20:39631:39688 [2] NCCL INFO ncclCommInitRank comm 0x560fe9299b80 rank 10 nranks 16 cudaDev 2 nvmlDev 2 busId 3a000 commId 0xd8289d3e6e217d26 - Init COMPLETE
qh100-gpu20:39631:39688 [2] NCCL INFO Init timings: rank 10 nranks 16 total 4.03 (kernels 0.32, bootstrap 1.89, allgathers 0.17, topo 0.77, graphs 0.67, connections 0.20, rest 0.00)
qh100-gpu20:39633:39689 [4] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
qh100-gpu20:39633:39689 [4] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
qh100-gpu20:39633:39689 [4] NCCL INFO ncclCommInitRank comm 0x5619b9724520 rank 12 nranks 16 cudaDev 4 nvmlDev 4 busId 9a000 commId 0xd8289d3e6e217d26 - Init COMPLETE
qh100-gpu20:39633:39689 [4] NCCL INFO Init timings: rank 12 nranks 16 total 4.03 (kernels 0.31, bootstrap 1.90, allgathers 0.16, topo 0.77, graphs 0.67, connections 0.21, rest 0.00)
qh100-gpu20:39634:39692 [5] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
qh100-gpu20:39634:39692 [5] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
qh100-gpu20:39634:39692 [5] NCCL INFO ncclCommInitRank comm 0x5558e3c10180 rank 13 nranks 16 cudaDev 5 nvmlDev 5 busId ab000 commId 0xd8289d3e6e217d26 - Init COMPLETE
qh100-gpu20:39634:39692 [5] NCCL INFO Init timings: rank 13 nranks 16 total 4.01 (kernels 0.31, bootstrap 1.88, allgathers 0.17, topo 0.77, graphs 0.67, connections 0.21, rest 0.00)
qh100-gpu20:39635:39691 [6] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
qh100-gpu20:39635:39691 [6] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
qh100-gpu20:39635:39691 [6] NCCL INFO ncclCommInitRank comm 0x56089796c4b0 rank 14 nranks 16 cudaDev 6 nvmlDev 6 busId ba000 commId 0xd8289d3e6e217d26 - Init COMPLETE
qh100-gpu20:39635:39691 [6] NCCL INFO Init timings: rank 14 nranks 16 total 4.02 (kernels 0.31, bootstrap 1.90, allgathers 0.17, topo 0.77, graphs 0.67, connections 0.21, rest 0.00)
qh100-gpu20:39636:39690 [7] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
qh100-gpu20:39636:39690 [7] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
qh100-gpu20:39636:39690 [7] NCCL INFO ncclCommInitRank comm 0x55a80a710090 rank 15 nranks 16 cudaDev 7 nvmlDev 7 busId db000 commId 0xd8289d3e6e217d26 - Init COMPLETE
qh100-gpu20:39636:39690 [7] NCCL INFO Init timings: rank 15 nranks 16 total 4.02 (kernels 0.31, bootstrap 1.89, allgathers 0.07, topo 0.77, graphs 0.77, connections 0.21, rest 0.00)
qh100-gpu20:39630:39685 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
qh100-gpu20:39630:39685 [1] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
qh100-gpu20:39630:39685 [1] NCCL INFO ncclCommInitRank comm 0x556888e99580 rank 9 nranks 16 cudaDev 1 nvmlDev 1 busId 2a000 commId 0xd8289d3e6e217d26 - Init COMPLETE
qh100-gpu20:39630:39685 [1] NCCL INFO Init timings: rank 9 nranks 16 total 4.12 (kernels 0.24, bootstrap 2.06, allgathers 0.02, topo 0.77, graphs 0.81, connections 0.20, rest 0.00)
qh100-gpu20:39629:39686 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
qh100-gpu20:39629:39686 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
qh100-gpu20:39629:39686 [0] NCCL INFO ncclCommInitRank comm 0x55aa2afe7340 rank 8 nranks 16 cudaDev 0 nvmlDev 0 busId 18000 commId 0xd8289d3e6e217d26 - Init COMPLETE
qh100-gpu20:39629:39686 [0] NCCL INFO Init timings: rank 8 nranks 16 total 4.07 (kernels 0.33, bootstrap 1.92, allgathers 0.16, topo 0.77, graphs 0.68, connections 0.21, rest 0.00)
qh100-gpu20:39632:39687 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
qh100-gpu20:39632:39687 [3] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
qh100-gpu20:39632:39687 [3] NCCL INFO ncclCommInitRank comm 0x559295370c20 rank 11 nranks 16 cudaDev 3 nvmlDev 3 busId 5d000 commId 0xd8289d3e6e217d26 - Init COMPLETE
qh100-gpu20:39632:39687 [3] NCCL INFO Init timings: rank 11 nranks 16 total 4.04 (kernels 0.31, bootstrap 1.91, allgathers 0.01, topo 0.77, graphs 0.82, connections 0.21, rest 0.00)
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 00/0 : 0[0] -> 2[2] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 00/0 : 8[0] -> 10[2] via P2P/CUMEM
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 00/0 : 11[3] -> 12[4] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 00/0 : 9[1] -> 10[2] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 00/0 : 13[5] -> 14[6] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 02/0 : 0[0] -> 2[2] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 03/0 : 0[0] -> 2[2] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 01/0 : 11[3] -> 12[4] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 01/0 : 13[5] -> 14[6] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 02/0 : 8[0] -> 10[2] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 01/0 : 9[1] -> 10[2] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 04/0 : 0[0] -> 2[2] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 03/0 : 11[3] -> 12[4] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 06/0 : 0[0] -> 2[2] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 02/0 : 13[5] -> 14[6] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 07/0 : 0[0] -> 2[2] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 03/0 : 8[0] -> 10[2] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 08/0 : 0[0] -> 2[2] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 02/0 : 9[1] -> 10[2] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 04/0 : 11[3] -> 12[4] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 10/0 : 0[0] -> 2[2] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 03/0 : 13[5] -> 14[6] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 04/0 : 8[0] -> 10[2] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 11/0 : 0[0] -> 2[2] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 03/0 : 9[1] -> 10[2] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 12/0 : 0[0] -> 2[2] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 05/0 : 11[3] -> 12[4] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 14/0 : 0[0] -> 2[2] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 04/0 : 13[5] -> 14[6] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 06/0 : 8[0] -> 10[2] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 04/0 : 9[1] -> 10[2] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 15/0 : 0[0] -> 2[2] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/CUMEM
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 07/0 : 11[3] -> 12[4] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 05/0 : 13[5] -> 14[6] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/CUMEM
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 07/0 : 8[0] -> 10[2] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 05/0 : 9[1] -> 10[2] via P2P/CUMEM
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 08/0 : 11[3] -> 12[4] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 06/0 : 13[5] -> 14[6] via P2P/CUMEM
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 08/0 : 8[0] -> 10[2] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 06/0 : 9[1] -> 10[2] via P2P/CUMEM
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/CUMEM
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 09/0 : 11[3] -> 12[4] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 07/0 : 13[5] -> 14[6] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 10/0 : 8[0] -> 10[2] via P2P/CUMEM
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/CUMEM
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 07/0 : 9[1] -> 10[2] via P2P/CUMEM
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 11/0 : 11[3] -> 12[4] via P2P/CUMEM
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 08/0 : 13[5] -> 14[6] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 11/0 : 8[0] -> 10[2] via P2P/CUMEM
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/CUMEM
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 08/0 : 9[1] -> 10[2] via P2P/CUMEM
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/CUMEM
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 12/0 : 11[3] -> 12[4] via P2P/CUMEM
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 09/0 : 13[5] -> 14[6] via P2P/CUMEM
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 12/0 : 8[0] -> 10[2] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 00/0 : 5[5] -> 7[7] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 00/0 : 1[1] -> 3[3] via P2P/CUMEM
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 00/0 : 4[4] -> 6[6] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 09/0 : 9[1] -> 10[2] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 01/0 : 5[5] -> 7[7] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 01/0 : 1[1] -> 3[3] via P2P/CUMEM
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 13/0 : 11[3] -> 12[4] via P2P/CUMEM
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 01/0 : 4[4] -> 6[6] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 10/0 : 13[5] -> 14[6] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 02/0 : 5[5] -> 7[7] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 14/0 : 8[0] -> 10[2] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 02/0 : 1[1] -> 3[3] via P2P/CUMEM
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 02/0 : 4[4] -> 6[6] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 03/0 : 5[5] -> 7[7] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 10/0 : 9[1] -> 10[2] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 03/0 : 1[1] -> 3[3] via P2P/CUMEM
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 15/0 : 11[3] -> 12[4] via P2P/CUMEM
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 04/0 : 4[4] -> 6[6] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 11/0 : 13[5] -> 14[6] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 04/0 : 5[5] -> 7[7] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 15/0 : 8[0] -> 10[2] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 04/0 : 1[1] -> 3[3] via P2P/CUMEM
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 05/0 : 4[4] -> 6[6] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 05/0 : 5[5] -> 7[7] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 11/0 : 9[1] -> 10[2] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 05/0 : 1[1] -> 3[3] via P2P/CUMEM
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 06/0 : 4[4] -> 6[6] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 06/0 : 5[5] -> 7[7] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 12/0 : 13[5] -> 14[6] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 06/0 : 1[1] -> 3[3] via P2P/CUMEM
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 00/0 : 12[4] -> 14[6] via P2P/CUMEM
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 08/0 : 4[4] -> 6[6] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 12/0 : 9[1] -> 10[2] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 07/0 : 5[5] -> 7[7] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 07/0 : 1[1] -> 3[3] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 13/0 : 13[5] -> 14[6] via P2P/CUMEM
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 09/0 : 4[4] -> 6[6] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 08/0 : 5[5] -> 7[7] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 08/0 : 1[1] -> 3[3] via P2P/CUMEM
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 01/0 : 12[4] -> 14[6] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 13/0 : 9[1] -> 10[2] via P2P/CUMEM
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 10/0 : 4[4] -> 6[6] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 09/0 : 5[5] -> 7[7] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 14/0 : 13[5] -> 14[6] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 09/0 : 1[1] -> 3[3] via P2P/CUMEM
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 12/0 : 4[4] -> 6[6] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 10/0 : 5[5] -> 7[7] via P2P/CUMEM
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 02/0 : 12[4] -> 14[6] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 10/0 : 1[1] -> 3[3] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 14/0 : 9[1] -> 10[2] via P2P/CUMEM
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 13/0 : 4[4] -> 6[6] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 11/0 : 5[5] -> 7[7] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 15/0 : 13[5] -> 14[6] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 11/0 : 1[1] -> 3[3] via P2P/CUMEM
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 14/0 : 4[4] -> 6[6] via P2P/CUMEM
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 04/0 : 12[4] -> 14[6] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 12/0 : 5[5] -> 7[7] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 15/0 : 9[1] -> 10[2] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 12/0 : 1[1] -> 3[3] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 01/0 : 0[0] -> 7[7] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 13/0 : 5[5] -> 7[7] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 13/0 : 1[1] -> 3[3] via P2P/CUMEM
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 05/0 : 12[4] -> 14[6] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 14/0 : 5[5] -> 7[7] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 14/0 : 1[1] -> 3[3] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 02/0 : 0[0] -> 7[7] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 00/0 : 13[5] -> 15[7] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 00/0 : 9[1] -> 11[3] via P2P/CUMEM
qh100-gpu19:49359:49471 [5] NCCL INFO Channel 15/0 : 5[5] -> 7[7] via P2P/CUMEM
qh100-gpu19:49355:49472 [1] NCCL INFO Channel 15/0 : 1[1] -> 3[3] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 03/0 : 0[0] -> 7[7] via P2P/CUMEM
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 06/0 : 12[4] -> 14[6] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 05/0 : 0[0] -> 7[7] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 01/0 : 13[5] -> 15[7] via P2P/CUMEM
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 00/0 : 3[3] -> 1[1] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 01/0 : 9[1] -> 11[3] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 06/0 : 0[0] -> 7[7] via P2P/CUMEM
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 08/0 : 12[4] -> 14[6] via P2P/CUMEM
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 01/0 : 3[3] -> 1[1] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 02/0 : 13[5] -> 15[7] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 07/0 : 0[0] -> 7[7] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 02/0 : 9[1] -> 11[3] via P2P/CUMEM
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 02/0 : 3[3] -> 1[1] via P2P/CUMEM
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 09/0 : 12[4] -> 14[6] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 09/0 : 0[0] -> 7[7] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 03/0 : 13[5] -> 15[7] via P2P/CUMEM
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 03/0 : 3[3] -> 1[1] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 10/0 : 0[0] -> 7[7] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 03/0 : 9[1] -> 11[3] via P2P/CUMEM
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 04/0 : 3[3] -> 1[1] via P2P/CUMEM
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 10/0 : 12[4] -> 14[6] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 11/0 : 0[0] -> 7[7] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 04/0 : 13[5] -> 15[7] via P2P/CUMEM
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 05/0 : 3[3] -> 1[1] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 04/0 : 9[1] -> 11[3] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 13/0 : 0[0] -> 7[7] via P2P/CUMEM
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 12/0 : 12[4] -> 14[6] via P2P/CUMEM
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 06/0 : 3[3] -> 1[1] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 14/0 : 0[0] -> 7[7] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 05/0 : 9[1] -> 11[3] via P2P/CUMEM
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 07/0 : 3[3] -> 1[1] via P2P/CUMEM
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 13/0 : 12[4] -> 14[6] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 15/0 : 0[0] -> 7[7] via P2P/CUMEM
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 08/0 : 3[3] -> 1[1] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 06/0 : 9[1] -> 11[3] via P2P/CUMEM
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 09/0 : 3[3] -> 1[1] via P2P/CUMEM
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 14/0 : 12[4] -> 14[6] via P2P/CUMEM
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 10/0 : 3[3] -> 1[1] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 07/0 : 9[1] -> 11[3] via P2P/CUMEM
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 11/0 : 3[3] -> 1[1] via P2P/CUMEM
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/CUMEM
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 12/0 : 3[3] -> 1[1] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 08/0 : 9[1] -> 11[3] via P2P/CUMEM
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/CUMEM
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 13/0 : 3[3] -> 1[1] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 01/0 : 10[2] -> 2[2] [receive] via NET/IBext/1/GDRDMA
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 05/0 : 10[2] -> 2[2] [receive] via NET/IBext/1/GDRDMA
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 09/0 : 10[2] -> 2[2] [receive] via NET/IBext/1/GDRDMA
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 09/0 : 9[1] -> 11[3] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 13/0 : 10[2] -> 2[2] [receive] via NET/IBext/1/GDRDMA
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 03/0 : 14[6] -> 6[6] [receive] via NET/IBext/3/GDRDMA
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 14/0 : 3[3] -> 1[1] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 01/0 : 2[2] -> 10[2] [send] via NET/IBext/1/GDRDMA
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 07/0 : 14[6] -> 6[6] [receive] via NET/IBext/3/GDRDMA
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 05/0 : 2[2] -> 10[2] [send] via NET/IBext/1/GDRDMA
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 02/0 : 12[4] -> 4[4] [receive] via NET/IBext/2/GDRDMA
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 11/0 : 14[6] -> 6[6] [receive] via NET/IBext/3/GDRDMA
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 09/0 : 2[2] -> 10[2] [send] via NET/IBext/1/GDRDMA
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 06/0 : 12[4] -> 4[4] [receive] via NET/IBext/2/GDRDMA
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 15/0 : 14[6] -> 6[6] [receive] via NET/IBext/3/GDRDMA
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 13/0 : 2[2] -> 10[2] [send] via NET/IBext/1/GDRDMA
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 10/0 : 12[4] -> 4[4] [receive] via NET/IBext/2/GDRDMA
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 03/0 : 6[6] -> 14[6] [send] via NET/IBext/3/GDRDMA
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 07/0 : 6[6] -> 14[6] [send] via NET/IBext/3/GDRDMA
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 14/0 : 12[4] -> 4[4] [receive] via NET/IBext/2/GDRDMA
qh100-gpu19:49357:49473 [3] NCCL INFO Channel 15/0 : 3[3] -> 1[1] via P2P/CUMEM
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 02/0 : 4[4] -> 12[4] [send] via NET/IBext/2/GDRDMA
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 11/0 : 6[6] -> 14[6] [send] via NET/IBext/3/GDRDMA
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 15/0 : 6[6] -> 14[6] [send] via NET/IBext/3/GDRDMA
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 06/0 : 4[4] -> 12[4] [send] via NET/IBext/2/GDRDMA
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 10/0 : 9[1] -> 11[3] via P2P/CUMEM
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 10/0 : 4[4] -> 12[4] [send] via NET/IBext/2/GDRDMA
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 14/0 : 4[4] -> 12[4] [send] via NET/IBext/2/GDRDMA
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/CUMEM
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 11/0 : 9[1] -> 11[3] via P2P/CUMEM
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 01/0 : 8[0] -> 15[7] via P2P/CUMEM
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 12/0 : 9[1] -> 11[3] via P2P/CUMEM
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 02/0 : 8[0] -> 15[7] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 05/0 : 13[5] -> 15[7] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 13/0 : 9[1] -> 11[3] via P2P/CUMEM
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 03/0 : 8[0] -> 15[7] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 06/0 : 13[5] -> 15[7] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 14/0 : 9[1] -> 11[3] via P2P/CUMEM
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 07/0 : 13[5] -> 15[7] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 05/0 : 8[0] -> 15[7] via P2P/CUMEM
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/CUMEM
qh100-gpu20:39630:39748 [1] NCCL INFO Channel 15/0 : 9[1] -> 11[3] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 08/0 : 13[5] -> 15[7] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 06/0 : 8[0] -> 15[7] via P2P/CUMEM
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 00/0 : 11[3] -> 9[1] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 07/0 : 8[0] -> 15[7] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 09/0 : 13[5] -> 15[7] via P2P/CUMEM
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 01/0 : 11[3] -> 9[1] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 10/0 : 13[5] -> 15[7] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 09/0 : 8[0] -> 15[7] via P2P/CUMEM
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 02/0 : 11[3] -> 9[1] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 11/0 : 13[5] -> 15[7] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 10/0 : 8[0] -> 15[7] via P2P/CUMEM
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 03/0 : 11[3] -> 9[1] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 12/0 : 13[5] -> 15[7] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 11/0 : 8[0] -> 15[7] via P2P/CUMEM
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 04/0 : 11[3] -> 9[1] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 13/0 : 13[5] -> 15[7] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 13/0 : 8[0] -> 15[7] via P2P/CUMEM
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 05/0 : 11[3] -> 9[1] via P2P/CUMEM
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 14/0 : 13[5] -> 15[7] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 14/0 : 8[0] -> 15[7] via P2P/CUMEM
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 06/0 : 11[3] -> 9[1] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 00/0 : 8[0] -> 0[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 04/0 : 8[0] -> 0[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 08/0 : 8[0] -> 0[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu20:39634:39746 [5] NCCL INFO Channel 15/0 : 13[5] -> 15[7] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 12/0 : 8[0] -> 0[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 15/0 : 8[0] -> 15[7] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 00/0 : 0[0] -> 8[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 04/0 : 0[0] -> 8[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 07/0 : 11[3] -> 9[1] via P2P/CUMEM
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 08/0 : 0[0] -> 8[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu19:49354:49468 [0] NCCL INFO Channel 12/0 : 0[0] -> 8[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 08/0 : 11[3] -> 9[1] via P2P/CUMEM
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 09/0 : 11[3] -> 9[1] via P2P/CUMEM
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 10/0 : 11[3] -> 9[1] via P2P/CUMEM
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 11/0 : 11[3] -> 9[1] via P2P/CUMEM
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 12/0 : 11[3] -> 9[1] via P2P/CUMEM
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 13/0 : 11[3] -> 9[1] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 01/0 : 2[2] -> 10[2] [receive] via NET/IBext/1/GDRDMA
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 03/0 : 6[6] -> 14[6] [receive] via NET/IBext/3/GDRDMA
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 02/0 : 4[4] -> 12[4] [receive] via NET/IBext/2/GDRDMA
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 07/0 : 6[6] -> 14[6] [receive] via NET/IBext/3/GDRDMA
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 05/0 : 2[2] -> 10[2] [receive] via NET/IBext/1/GDRDMA
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 09/0 : 2[2] -> 10[2] [receive] via NET/IBext/1/GDRDMA
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 06/0 : 4[4] -> 12[4] [receive] via NET/IBext/2/GDRDMA
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 10/0 : 4[4] -> 12[4] [receive] via NET/IBext/2/GDRDMA
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 11/0 : 6[6] -> 14[6] [receive] via NET/IBext/3/GDRDMA
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 15/0 : 6[6] -> 14[6] [receive] via NET/IBext/3/GDRDMA
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 13/0 : 2[2] -> 10[2] [receive] via NET/IBext/1/GDRDMA
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 14/0 : 4[4] -> 12[4] [receive] via NET/IBext/2/GDRDMA
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 03/0 : 14[6] -> 6[6] [send] via NET/IBext/3/GDRDMA
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 14/0 : 11[3] -> 9[1] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 01/0 : 10[2] -> 2[2] [send] via NET/IBext/1/GDRDMA
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 02/0 : 12[4] -> 4[4] [send] via NET/IBext/2/GDRDMA
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 07/0 : 14[6] -> 6[6] [send] via NET/IBext/3/GDRDMA
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 05/0 : 10[2] -> 2[2] [send] via NET/IBext/1/GDRDMA
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 06/0 : 12[4] -> 4[4] [send] via NET/IBext/2/GDRDMA
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 09/0 : 10[2] -> 2[2] [send] via NET/IBext/1/GDRDMA
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 10/0 : 12[4] -> 4[4] [send] via NET/IBext/2/GDRDMA
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 13/0 : 10[2] -> 2[2] [send] via NET/IBext/1/GDRDMA
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 14/0 : 12[4] -> 4[4] [send] via NET/IBext/2/GDRDMA
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 11/0 : 14[6] -> 6[6] [send] via NET/IBext/3/GDRDMA
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 15/0 : 14[6] -> 6[6] [send] via NET/IBext/3/GDRDMA
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 00/0 : 2[2] -> 0[0] via P2P/CUMEM
qh100-gpu20:39632:39747 [3] NCCL INFO Channel 15/0 : 11[3] -> 9[1] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 00/0 : 6[6] -> 4[4] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 00/0 : 10[2] -> 8[0] via P2P/CUMEM
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 00/0 : 14[6] -> 12[4] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 01/0 : 6[6] -> 4[4] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 02/0 : 2[2] -> 0[0] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 02/0 : 6[6] -> 4[4] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 03/0 : 2[2] -> 0[0] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 04/0 : 6[6] -> 4[4] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 02/0 : 10[2] -> 8[0] via P2P/CUMEM
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 01/0 : 14[6] -> 12[4] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 04/0 : 2[2] -> 0[0] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 05/0 : 6[6] -> 4[4] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 06/0 : 2[2] -> 0[0] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 06/0 : 6[6] -> 4[4] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 07/0 : 2[2] -> 0[0] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 03/0 : 10[2] -> 8[0] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 08/0 : 6[6] -> 4[4] via P2P/CUMEM
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 02/0 : 14[6] -> 12[4] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 08/0 : 2[2] -> 0[0] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 09/0 : 6[6] -> 4[4] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 10/0 : 2[2] -> 0[0] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 10/0 : 6[6] -> 4[4] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 04/0 : 10[2] -> 8[0] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 11/0 : 2[2] -> 0[0] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 12/0 : 6[6] -> 4[4] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 12/0 : 2[2] -> 0[0] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 13/0 : 6[6] -> 4[4] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 14/0 : 2[2] -> 0[0] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 14/0 : 6[6] -> 4[4] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 06/0 : 10[2] -> 8[0] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 15/0 : 2[2] -> 0[0] via P2P/CUMEM
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 07/0 : 10[2] -> 8[0] via P2P/CUMEM
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/CUMEM
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/CUMEM
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 04/0 : 4[4] -> 3[3] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 01/0 : 15[7] -> 8[0] via P2P/CUMEM
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 05/0 : 4[4] -> 3[3] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 08/0 : 10[2] -> 8[0] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 04/0 : 6[6] -> 5[5] via P2P/CUMEM
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 07/0 : 4[4] -> 3[3] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 05/0 : 6[6] -> 5[5] via P2P/CUMEM
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 08/0 : 4[4] -> 3[3] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 06/0 : 6[6] -> 5[5] via P2P/CUMEM
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 09/0 : 4[4] -> 3[3] via P2P/CUMEM
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 04/0 : 14[6] -> 12[4] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 07/0 : 6[6] -> 5[5] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 02/0 : 15[7] -> 8[0] via P2P/CUMEM
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 11/0 : 4[4] -> 3[3] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 08/0 : 6[6] -> 5[5] via P2P/CUMEM
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 12/0 : 4[4] -> 3[3] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 09/0 : 6[6] -> 5[5] via P2P/CUMEM
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 13/0 : 4[4] -> 3[3] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 10/0 : 6[6] -> 5[5] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 03/0 : 15[7] -> 8[0] via P2P/CUMEM
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 05/0 : 14[6] -> 12[4] via P2P/CUMEM
qh100-gpu19:49358:49469 [4] NCCL INFO Channel 15/0 : 4[4] -> 3[3] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 11/0 : 6[6] -> 5[5] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 12/0 : 6[6] -> 5[5] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 05/0 : 15[7] -> 8[0] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 13/0 : 6[6] -> 5[5] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 10/0 : 10[2] -> 8[0] via P2P/CUMEM
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 06/0 : 14[6] -> 12[4] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 14/0 : 6[6] -> 5[5] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 06/0 : 15[7] -> 8[0] via P2P/CUMEM
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 08/0 : 14[6] -> 12[4] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 07/0 : 15[7] -> 8[0] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 11/0 : 10[2] -> 8[0] via P2P/CUMEM
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 09/0 : 14[6] -> 12[4] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 12/0 : 10[2] -> 8[0] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 09/0 : 15[7] -> 8[0] via P2P/CUMEM
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 10/0 : 14[6] -> 12[4] via P2P/CUMEM
qh100-gpu19:49360:49467 [6] NCCL INFO Channel 15/0 : 6[6] -> 5[5] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 14/0 : 10[2] -> 8[0] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 10/0 : 15[7] -> 8[0] via P2P/CUMEM
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 12/0 : 14[6] -> 12[4] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 15/0 : 10[2] -> 8[0] via P2P/CUMEM
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 13/0 : 14[6] -> 12[4] via P2P/CUMEM
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 14/0 : 14[6] -> 12[4] via P2P/CUMEM
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 00/0 : 14[6] -> 13[5] via P2P/CUMEM
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 00/0 : 12[4] -> 11[3] via P2P/CUMEM
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 01/0 : 14[6] -> 13[5] via P2P/CUMEM
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 01/0 : 12[4] -> 11[3] via P2P/CUMEM
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 02/0 : 14[6] -> 13[5] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 00/0 : 0[0] -> 8[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 03/0 : 12[4] -> 11[3] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 04/0 : 0[0] -> 8[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 08/0 : 0[0] -> 8[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 12/0 : 0[0] -> 8[0] [receive] via NET/IBext/0/GDRDMA
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 03/0 : 14[6] -> 13[5] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 00/0 : 8[0] -> 0[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 04/0 : 12[4] -> 11[3] via P2P/CUMEM
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 04/0 : 8[0] -> 0[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 08/0 : 8[0] -> 0[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu20:39629:39744 [0] NCCL INFO Channel 12/0 : 8[0] -> 0[0] [send] via NET/IBext/0/GDRDMA
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 04/0 : 14[6] -> 13[5] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 11/0 : 15[7] -> 8[0] via P2P/CUMEM
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 05/0 : 12[4] -> 11[3] via P2P/CUMEM
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 05/0 : 14[6] -> 13[5] via P2P/CUMEM
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 07/0 : 12[4] -> 11[3] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 13/0 : 15[7] -> 8[0] via P2P/CUMEM
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 06/0 : 14[6] -> 13[5] via P2P/CUMEM
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 08/0 : 12[4] -> 11[3] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 14/0 : 15[7] -> 8[0] via P2P/CUMEM
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 07/0 : 14[6] -> 13[5] via P2P/CUMEM
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 09/0 : 12[4] -> 11[3] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 15/0 : 15[7] -> 8[0] via P2P/CUMEM
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 08/0 : 14[6] -> 13[5] via P2P/CUMEM
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 11/0 : 12[4] -> 11[3] via P2P/CUMEM
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 09/0 : 14[6] -> 13[5] via P2P/CUMEM
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 12/0 : 12[4] -> 11[3] via P2P/CUMEM
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 10/0 : 14[6] -> 13[5] via P2P/CUMEM
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 13/0 : 12[4] -> 11[3] via P2P/CUMEM
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 11/0 : 14[6] -> 13[5] via P2P/CUMEM
qh100-gpu20:39633:39743 [4] NCCL INFO Channel 15/0 : 12[4] -> 11[3] via P2P/CUMEM
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 12/0 : 14[6] -> 13[5] via P2P/CUMEM
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 13/0 : 14[6] -> 13[5] via P2P/CUMEM
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 14/0 : 14[6] -> 13[5] via P2P/CUMEM
qh100-gpu20:39635:39741 [6] NCCL INFO Channel 15/0 : 14[6] -> 13[5] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 00/0 : 15[7] -> 13[5] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 01/0 : 15[7] -> 13[5] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 02/0 : 15[7] -> 13[5] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 03/0 : 15[7] -> 13[5] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 04/0 : 15[7] -> 13[5] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 05/0 : 15[7] -> 13[5] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 06/0 : 15[7] -> 13[5] via P2P/CUMEM
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 00/0 : 7[7] -> 5[5] via P2P/CUMEM
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 01/0 : 7[7] -> 5[5] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 07/0 : 15[7] -> 13[5] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 08/0 : 15[7] -> 13[5] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 09/0 : 15[7] -> 13[5] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 10/0 : 15[7] -> 13[5] via P2P/CUMEM
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 02/0 : 7[7] -> 5[5] via P2P/CUMEM
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 03/0 : 7[7] -> 5[5] via P2P/CUMEM
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 04/0 : 7[7] -> 5[5] via P2P/CUMEM
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 05/0 : 7[7] -> 5[5] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 11/0 : 15[7] -> 13[5] via P2P/CUMEM
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 06/0 : 7[7] -> 5[5] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 12/0 : 15[7] -> 13[5] via P2P/CUMEM
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 07/0 : 7[7] -> 5[5] via P2P/CUMEM
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 08/0 : 7[7] -> 5[5] via P2P/CUMEM
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 09/0 : 7[7] -> 5[5] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 13/0 : 15[7] -> 13[5] via P2P/CUMEM
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 10/0 : 7[7] -> 5[5] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 00/0 : 10[2] -> 9[1] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 14/0 : 15[7] -> 13[5] via P2P/CUMEM
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 11/0 : 7[7] -> 5[5] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 01/0 : 10[2] -> 9[1] via P2P/CUMEM
qh100-gpu20:39636:39745 [7] NCCL INFO Channel 15/0 : 15[7] -> 13[5] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 12/0 : 7[7] -> 5[5] via P2P/CUMEM
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 13/0 : 7[7] -> 5[5] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 02/0 : 10[2] -> 9[1] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 14/0 : 7[7] -> 5[5] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 03/0 : 10[2] -> 9[1] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/CUMEM
qh100-gpu19:49361:49474 [7] NCCL INFO Channel 15/0 : 7[7] -> 5[5] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 04/0 : 10[2] -> 9[1] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 05/0 : 10[2] -> 9[1] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 06/0 : 10[2] -> 9[1] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 10/0 : 2[2] -> 1[1] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 07/0 : 10[2] -> 9[1] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 08/0 : 10[2] -> 9[1] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 09/0 : 10[2] -> 9[1] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/CUMEM
qh100-gpu19:49356:49470 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 10/0 : 10[2] -> 9[1] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 11/0 : 10[2] -> 9[1] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 12/0 : 10[2] -> 9[1] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 13/0 : 10[2] -> 9[1] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 14/0 : 10[2] -> 9[1] via P2P/CUMEM
qh100-gpu20:39631:39742 [2] NCCL INFO Channel 15/0 : 10[2] -> 9[1] via P2P/CUMEM
qh100-gpu19:49358:49455 [4] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 2.
qh100-gpu20:39633:39728 [4] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 2.
qh100-gpu20:39633:39728 [4] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
qh100-gpu19:49358:49455 [4] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
qh100-gpu19:49354:49453 [0] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 2.
qh100-gpu20:39629:39725 [0] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 2.
qh100-gpu19:49354:49453 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
qh100-gpu20:39629:39725 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
qh100-gpu20:39631:39726 [2] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 2.
qh100-gpu19:49356:49460 [2] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 2.
qh100-gpu20:39631:39726 [2] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
qh100-gpu19:49356:49460 [2] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
qh100-gpu19:49360:49451 [6] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 2.
qh100-gpu20:39635:39729 [6] NCCL INFO NCCL_IB_QPS_PER_CONNECTION set by environment to 2.
qh100-gpu19:49360:49451 [6] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
qh100-gpu20:39635:39729 [6] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
qh100-gpu19:49361:49474 [7] NCCL INFO Connected all trees
qh100-gpu19:49354:49468 [0] NCCL INFO Connected all trees
qh100-gpu19:49355:49472 [1] NCCL INFO Connected all trees
qh100-gpu19:49356:49470 [2] NCCL INFO Connected all trees
qh100-gpu20:39629:39744 [0] NCCL INFO Connected all trees
qh100-gpu19:49359:49471 [5] NCCL INFO Connected all trees
qh100-gpu19:49360:49467 [6] NCCL INFO Connected all trees
qh100-gpu19:49358:49469 [4] NCCL INFO Connected all trees
qh100-gpu19:49357:49473 [3] NCCL INFO Connected all trees
qh100-gpu20:39631:39742 [2] NCCL INFO Connected all trees
qh100-gpu20:39630:39748 [1] NCCL INFO Connected all trees
qh100-gpu20:39636:39745 [7] NCCL INFO Connected all trees
qh100-gpu20:39634:39746 [5] NCCL INFO Connected all trees
qh100-gpu20:39632:39747 [3] NCCL INFO Connected all trees
qh100-gpu20:39633:39743 [4] NCCL INFO Connected all trees
qh100-gpu20:39635:39741 [6] NCCL INFO Connected all trees
sjeaugey commented 2 months ago

4 x 49 GB/s = 196 GB/s. That's your network bandwidth, and it is what you should also see when setting NCCL_ALGO=RING. However, on 2 nodes, the Tree algorithm puts more traffic on NVLink and less on the network, so it can reach a bandwidth that is a mix of the NVLink and network bandwidths and can therefore exceed the network bandwidth.

ProHuper commented 2 months ago

4 x 49 GB/s = 196 GB/s. That's your network bandwidth, and it is what you should also see when setting NCCL_ALGO=RING. However, on 2 nodes, the Tree algorithm puts more traffic on NVLink and less on the network, so it can reach a bandwidth that is a mix of the NVLink and network bandwidths and can therefore exceed the network bandwidth.

Thanks for replying. As you mentioned, with NCCL_ALGO=RING the busbw I measured is very close to the theoretical peak (196 GB/s). However, with the Tree algorithm the busbw I measured is 309 GB/s, and I'm not sure whether this is close to the theoretical bandwidth. Is there a way to determine the theoretical busbw for the Tree algorithm on 2 nodes?

sjeaugey commented 2 months ago

I'm not quite sure if this is close to the theoretical bandwidth.

The rings are close to theoretical, so your network hardware is functioning perfectly. You can check that the intra-node NVLink performance is at 370 to ensure NVLink is functioning properly. If both are good, then the Tree performance is the best it can be.

ProHuper commented 2 months ago

I'm not quite sure if this is close to the theoretical bandwidth.

The rings are close to theoretical, so your network hardware is functioning perfectly. You can check that the intra-node NVLink performance is at 370 to ensure NVLink is functioning properly. If both are good, then the Tree performance is the best it can be.

If the communication in the Tree algorithm overlaps well, shouldn't the algbw be close to the network bandwidth? When I use one NIC or two NICs, it is:

#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s) 
 17179869184    4294967296     float     sum      -1   347989   49.37   92.57      0   348000   49.37   92.56      0
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s) 
 17179869184    4294967296     float     sum      -1   174239   98.60  184.87      0   174222   98.61  184.89      0

But when I use 4 NICs, it is not:

#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s) 
 17179869184    4294967296     float     sum      -1   104090  165.05  309.47      0   103759  165.57  310.45      
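
The comparison being made here can be written out as a quick calculation. This is only a sketch: it assumes the 49 GB/s per-NIC figure quoted earlier in the thread and takes the out-of-place algbw values from the outputs above. The name `scaling_efficiency` is made up for illustration.

```python
# Per-NIC bandwidth quoted earlier in the thread (4 x 49 GB/s = 196 GB/s).
PER_NIC_GB_PER_S = 49.0

# Out-of-place Tree algbw (GB/s) from the nccl-tests outputs above, keyed by NIC count.
measured_algbw = {1: 49.37, 2: 98.60, 4: 165.05}


def scaling_efficiency(nics: int, algbw: float, per_nic: float = PER_NIC_GB_PER_S) -> float:
    """Return the measured algbw as a fraction of the aggregate NIC bandwidth."""
    return algbw / (nics * per_nic)


for nics, algbw in measured_algbw.items():
    ratio = scaling_efficiency(nics, algbw)
    aggregate = nics * PER_NIC_GB_PER_S
    print(f"{nics} NIC(s): algbw {algbw:6.2f} GB/s, aggregate {aggregate:5.1f} GB/s, ratio {ratio:.2f}")
```

With one and two NICs the ratio is slightly above 1; with four NICs it drops to roughly 0.84, which is the gap in question.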

Also, if I set NCCL_MIN_NCHANNELS=24 (the default is 16 for the Tree algorithm on 2 nodes), the algbw increases, but it still does not meet expectations.

#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s) 
 17179869184    4294967296     float     sum      -1    96251  178.49  334.67      0    96260  178.47  334.64      0
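
For reference, the algbw and busbw columns in these outputs are related by the allreduce factor that nccl-tests documents, busbw = algbw x 2(n-1)/n. The sketch below assumes n = 16 ranks (2 nodes x 8 GPUs, matching ranks 0 to 15 in the log) and checks the out-of-place rows quoted above. `allreduce_busbw` is an illustrative helper, not part of nccl-tests.

```python
def allreduce_busbw(algbw: float, nranks: int) -> float:
    """Convert allreduce algorithm bandwidth to bus bandwidth (both in GB/s)."""
    return algbw * 2 * (nranks - 1) / nranks


NRANKS = 16  # assumption: 2 nodes x 8 GPUs

# (label, reported algbw, reported busbw) taken from the out-of-place columns above.
rows = [
    ("1 NIC", 49.37, 92.57),
    ("2 NICs", 98.60, 184.87),
    ("4 NICs", 165.05, 309.47),
    ("4 NICs, NCCL_MIN_NCHANNELS=24", 178.49, 334.67),
]

for label, algbw, reported in rows:
    computed = allreduce_busbw(algbw, NRANKS)
    print(f"{label}: computed busbw {computed:.2f} GB/s, reported {reported:.2f} GB/s")
```

With 16 ranks the factor is 1.875, so a busbw of 309 GB/s corresponds to an algbw of about 165 GB/s. When comparing against the 196 GB/s network figure, it matters which of the two columns is used.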
sjeaugey commented 1 month ago

The default number of channels is 16 because beyond that, even though performance would be a bit better, it would use too much GPU Compute resources, as well as too much memory for buffers. That would severely impact the training.

In other words, that's all good for benchmarks, but not a good compromise for real applications.

I can't confirm whether it's possible to get better performance with the Tree algorithm and 4 NICs. We rarely run in that configuration and the particular case of 2 nodes is not a case we spend a lot of time optimizing.

ProHuper commented 1 month ago

The default number of channels is 16 because beyond that, even though performance would be a bit better, it would use too much GPU Compute resources, as well as too much memory for buffers. That would severely impact the training.

In other words, that's all good for benchmarks, but not a good compromise for real applications.

I can't confirm whether it's possible to get better performance with the Tree algorithm and 4 NICs. We rarely run in that configuration and the particular case of 2 nodes is not a case we spend a lot of time optimizing.

Alright, thanks!

ProHuper commented 1 month ago

The default number of channels is 16 because beyond that, even though performance would be a bit better, it would use too much GPU Compute resources, as well as too much memory for buffers. That would severely impact the training.

In other words, that's all good for benchmarks, but not a good compromise for real applications.

I can't confirm whether it's possible to get better performance with the Tree algorithm and 4 NICs. We rarely run in that configuration and the particular case of 2 nodes is not a case we spend a lot of time optimizing.

Hello, I've done some further testing on this issue. It seems that an incorrect logical topology can cause NCCL to select the wrong NIC in certain scenarios. The physical topology of the nodes is shown in the diagram below. Each node has 8 GPUs and 4 NICs, and GPU0, GPU1, and NIC0 are under the same PCIe switch: [image: physical topology diagram]

So if I specify GPU1 on both nodes for allreduce, NCCL should select NIC0 because it is the closest one. But the log shows that NIC1 was actually selected:

qh100-gpu20:67697:67704 [0] NCCL INFO P2P Chunksize set to 131072
qh100-gpu19:87121:87129 [0] NCCL INFO Channel 00/0 : 1[1] -> 0[1] [receive] via NET/IB/1
qh100-gpu19:87121:87129 [0] NCCL INFO Channel 01/0 : 1[1] -> 0[1] [receive] via NET/IB/1
qh100-gpu19:87121:87129 [0] NCCL INFO Channel 02/0 : 1[1] -> 0[1] [receive] via NET/IB/1
qh100-gpu19:87121:87129 [0] NCCL INFO Channel 03/0 : 1[1] -> 0[1] [receive] via NET/IB/1
qh100-gpu19:87121:87129 [0] NCCL INFO Channel 00/0 : 0[1] -> 1[1] [send] via NET/IB/1
qh100-gpu19:87121:87129 [0] NCCL INFO Channel 01/0 : 0[1] -> 1[1] [send] via NET/IB/1
qh100-gpu19:87121:87129 [0] NCCL INFO Channel 02/0 : 0[1] -> 1[1] [send] via NET/IB/1
qh100-gpu19:87121:87129 [0] NCCL INFO Channel 03/0 : 0[1] -> 1[1] [send] via NET/IB/1
qh100-gpu20:67697:67704 [0] NCCL INFO Channel 00/0 : 0[1] -> 1[1] [receive] via NET/IB/1
qh100-gpu20:67697:67704 [0] NCCL INFO Channel 01/0 : 0[1] -> 1[1] [receive] via NET/IB/1
qh100-gpu20:67697:67704 [0] NCCL INFO Channel 02/0 : 0[1] -> 1[1] [receive] via NET/IB/1
qh100-gpu20:67697:67704 [0] NCCL INFO Channel 03/0 : 0[1] -> 1[1] [receive] via NET/IB/1
qh100-gpu20:67697:67704 [0] NCCL INFO Channel 00/0 : 1[1] -> 0[1] [send] via NET/IB/1
qh100-gpu20:67697:67704 [0] NCCL INFO Channel 01/0 : 1[1] -> 0[1] [send] via NET/IB/1
qh100-gpu20:67697:67704 [0] NCCL INFO Channel 02/0 : 1[1] -> 0[1] [send] via NET/IB/1
qh100-gpu20:67697:67704 [0] NCCL INFO Channel 03/0 : 1[1] -> 0[1] [send] via NET/IB/1
qh100-gpu19:87121:87137 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
qh100-gpu20:67697:67712 [0] NCCL INFO NCCL_IB_GID_INDEX set by environment to 3.
qh100-gpu19:87121:87129 [0] NCCL INFO Connected all rings
qh100-gpu19:87121:87129 [0] NCCL INFO Connected all trees
qh100-gpu19:87121:87129 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
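
A small script can tally which network device each channel uses instead of reading the log by eye. This is a sketch based only on the line format visible in the logs in this thread (`via NET/IB/1` and `via NET/IBext/3/GDRDMA`). The function name `nics_per_host` is hypothetical, and treating the trailing index as the NIC number is an assumption about how NCCL numbers its devices.

```python
import re
from collections import defaultdict

# Matches NCCL INFO lines for network channels, for example:
#   "... Channel 00/0 : 1[1] -> 0[1] [receive] via NET/IB/1"
#   "... Channel 03/0 : 14[6] -> 6[6] [receive] via NET/IBext/3/GDRDMA"
NET_LINE = re.compile(
    r"^(?P<host>\S+?):\d+:\d+ \[\d+\] NCCL INFO Channel (?P<channel>\d+)/\d+ : "
    r"(?P<src>\d+)\[\d+\] -> (?P<dst>\d+)\[\d+\] \[(?P<direction>send|receive)\] "
    r"via NET/(?P<plugin>\w+)/(?P<dev>\d+)"
)


def nics_per_host(log_text: str) -> dict:
    """Map each host to {network device index: sorted list of channels using it}."""
    usage = defaultdict(lambda: defaultdict(set))
    for line in log_text.splitlines():
        m = NET_LINE.match(line.strip())
        if m:  # P2P and other INFO lines do not match and are skipped
            usage[m["host"]][int(m["dev"])].add(int(m["channel"]))
    return {
        host: {dev: sorted(channels) for dev, channels in devs.items()}
        for host, devs in usage.items()
    }


# A few lines copied from the log above.
sample = """
qh100-gpu20:67697:67704 [0] NCCL INFO P2P Chunksize set to 131072
qh100-gpu19:87121:87129 [0] NCCL INFO Channel 00/0 : 1[1] -> 0[1] [receive] via NET/IB/1
qh100-gpu19:87121:87129 [0] NCCL INFO Channel 01/0 : 0[1] -> 1[1] [send] via NET/IB/1
qh100-gpu20:67697:67704 [0] NCCL INFO Channel 00/0 : 0[1] -> 1[1] [receive] via NET/IB/1
"""

print(nics_per_host(sample))
```

Run over the full log, this makes it easy to see that every channel on both hosts goes through device index 1.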

And here's the output of lspci -tv, in which the distance from GPU0 to NIC0 and to NIC1 is the same: [image: lspci -tv output]

It seems that NCCL builds its topology from this logical topology, which does not match the actual physical topology.
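
To illustrate why a logical topology like this leaves no preferred NIC, the sketch below ranks the GPU-to-NIC entries from the `nvidia-smi topo -m` output at the top of this issue, using the legend's ordering from nearest (PIX) to farthest (SYS). It is not NCCL's actual selection logic, which builds its own topology graph. `nearest_nics` and the `topo` dictionary are illustrative only.

```python
# Connection types from the `nvidia-smi topo -m` legend, ordered nearest to farthest.
CLOSENESS = ["PIX", "PXB", "PHB", "NODE", "SYS"]

# GPU-to-NIC entries copied from the `nvidia-smi topo -m` output at the top of this issue.
topo = {
    "GPU0": {"NIC0": "PIX", "NIC1": "SYS", "NIC2": "SYS",
             "NIC3": "SYS", "NIC4": "SYS", "NIC5": "SYS"},
    "GPU1": {"NIC0": "SYS", "NIC1": "SYS", "NIC2": "SYS",
             "NIC3": "SYS", "NIC4": "SYS", "NIC5": "SYS"},
}


def nearest_nics(gpu: str) -> list:
    """Return every NIC that ties for the closest connection type to the given GPU."""
    row = topo[gpu]
    best = min(CLOSENESS.index(link) for link in row.values())
    return sorted(nic for nic, link in row.items() if CLOSENESS.index(link) == best)


for gpu in topo:
    print(gpu, "->", nearest_nics(gpu))
```

In that reported matrix GPU0 has a single nearest NIC (NIC0, via PIX), while GPU1 sees every NIC at SYS distance. A selection based on this view has no reason to prefer NIC0 for GPU1, even if they physically share a PCIe switch.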