NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

NCCL test seems to run separately on multiple nodes #213

Closed jianh619 closed 2 months ago

jianh619 commented 2 months ago

I'm running the NCCL tests on two H800 nodes, but they seem to run separately on each node.

The tests run in a container, and I followed the guide to compile with MPI support:

$ make MPI=1 MPI_HOME=/path/to/mpi CUDA_HOME=/path/to/cuda NCCL_HOME=/path/to/nccl

Here's the command:

mpirun --allow-run-as-root -np 2 -H 10.0.0.15,10.0.0.16 -mca plm_rsh_args "-p 9001" -x NCCL_DEBUG=TRACE -x NCCL_DEBUG_FILE=debug.log -x NCCL_TOPO_DUMP_FILE=topo-file-1 -x NCCL_IB_HCA=mlx5_0,mlx5_5,mlx5_6,mlx5_7,mlx5_8,mlx5_10,mlx5_11,mlx5_12 -x NCCL_SOCKET_IFNAME=bond0 --bind-to numa /opt/nccl-tests/build/all_reduce_perf -b 1G -e 8G -f 2 -g 8

The output is below:

Warning: Permanently added '[10.0.0.15]:9001' (ED25519) to the list of known hosts.
# nThread 1 nGpus 8 minBytes 1073741824 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# nThread 1 nGpus 8 minBytes 1073741824 maxBytes 8589934592 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid    798 on       hw14 device  0 [0x0f] NVIDIA H800
#  Rank  1 Group  0 Pid    798 on       hw14 device  1 [0x34] NVIDIA H800
#  Rank  2 Group  0 Pid    798 on       hw14 device  2 [0x48] NVIDIA H800
#  Rank  3 Group  0 Pid    798 on       hw14 device  3 [0x5a] NVIDIA H800
#  Rank  4 Group  0 Pid    798 on       hw14 device  4 [0x87] NVIDIA H800
#  Rank  5 Group  0 Pid    798 on       hw14 device  5 [0xae] NVIDIA H800
#  Rank  6 Group  0 Pid    798 on       hw14 device  6 [0xc2] NVIDIA H800
#  Rank  7 Group  0 Pid    798 on       hw14 device  7 [0xd7] NVIDIA H800
#  Rank  0 Group  0 Pid    720 on       hw13 device  0 [0x0f] NVIDIA H800
#  Rank  1 Group  0 Pid    720 on       hw13 device  1 [0x34] NVIDIA H800
#  Rank  2 Group  0 Pid    720 on       hw13 device  2 [0x48] NVIDIA H800
#  Rank  3 Group  0 Pid    720 on       hw13 device  3 [0x5a] NVIDIA H800
#  Rank  4 Group  0 Pid    720 on       hw13 device  4 [0x87] NVIDIA H800
#  Rank  5 Group  0 Pid    720 on       hw13 device  5 [0xae] NVIDIA H800
#  Rank  6 Group  0 Pid    720 on       hw13 device  6 [0xc2] NVIDIA H800
#  Rank  7 Group  0 Pid    720 on       hw13 device  7 [0xd7] NVIDIA H800
NCCL version 2.20.5+cuda12.4
NCCL version 2.20.5+cuda12.4
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
  1073741824     268435456     float     sum      -1   8836.3  121.51  212.65      0   8837.4  121.50  212.62      0
  1073741824     268435456     float     sum      -1   8827.6  121.63  212.86      0   8827.6  121.63  212.86      0
  2147483648     536870912     float     sum      -1    17549  122.37  214.15      0    17541  122.43  214.25      0
  2147483648     536870912     float     sum      -1    17536  122.46  214.31      0    17528  122.52  214.41      0
  4294967296    1073741824     float     sum      -1    34939  122.93  215.12      0    34941  122.92  215.11      0
  4294967296    1073741824     float     sum      -1    34924  122.98  215.21      0    34914  123.02  215.28      0

The ranks should increase from 0 to 15, but instead they repeat from 0 to 7. The results also show two rows for each buffer size.
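The expected rank count follows from the launch parameters: with `-np 2` processes, one thread per process (`nThread 1` in the output) and `-g 8` GPUs per thread, a single job would have 2 × 1 × 8 = 16 ranks. Below is a minimal Python sketch of that arithmetic, plus a check for rank numbers printed by more than one host. The function names are hypothetical, and the sample text is a shortened copy of the device lines above.

```python
import re

# Shortened copy of the "Using devices" lines from the output above
# (two ranks per host instead of eight).
OUTPUT = """\
#  Rank  0 Group  0 Pid    798 on       hw14 device  0 [0x0f] NVIDIA H800
#  Rank  1 Group  0 Pid    798 on       hw14 device  1 [0x34] NVIDIA H800
#  Rank  0 Group  0 Pid    720 on       hw13 device  0 [0x0f] NVIDIA H800
#  Rank  1 Group  0 Pid    720 on       hw13 device  1 [0x34] NVIDIA H800
"""

# Matches one device line and captures rank, pid, host and device index.
RANK_LINE = re.compile(
    r"#\s+Rank\s+(\d+)\s+Group\s+\d+\s+Pid\s+(\d+)\s+on\s+(\S+)\s+device\s+(\d+)"
)


def expected_ranks(n_procs, n_threads, n_gpus):
    """Total ranks in one job: processes x threads per process x GPUs per thread."""
    return n_procs * n_threads * n_gpus


def duplicated_ranks(text):
    """Return {rank: [hosts]} for every rank number printed by more than one host."""
    hosts_by_rank = {}
    for match in RANK_LINE.finditer(text):
        rank, host = int(match.group(1)), match.group(3)
        hosts_by_rank.setdefault(rank, []).append(host)
    return {rank: hosts for rank, hosts in hosts_by_rank.items() if len(hosts) > 1}


if __name__ == "__main__":
    print("expected ranks:", expected_ranks(n_procs=2, n_threads=1, n_gpus=8))
    print("duplicated ranks:", duplicated_ranks(OUTPUT))
```

If the two processes had joined one job, each rank number would appear once and `duplicated_ranks` would return an empty dictionary.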

Here's the topo dump file, which looks as expected.

<system version="1">
  <cpu numaid="0" affinity="00000000,0000ffff,ffffffff,00000000,0000ffff,ffffffff" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="143">
    <pci busid="0000:0c:00.0" class="0x060400" vendor="0x1000" device="0xc030" subsystem_vendor="0x1000" subsystem_device="0x100b" link_speed="32.0 GT/s PCIe" link_width="16">
      <pci busid="0000:0e:00.0" class="0x020700" vendor="0x15b3" device="0x1021" subsystem_vendor="0x15b3" subsystem_device="0x0023" link_speed="32.0 GT/s PCIe" link_width="16">
        <nic>
          <net name="mlx5_0" dev="0" speed="400000" port="1" latency="0.000000" guid="0x22cfe30003ae6d94" maxconn="131072" gdr="1"/>
        </nic>
      </pci>
      <pci busid="0000:0f:00.0" class="0x030200" vendor="0x10de" device="0x2324" subsystem_vendor="0x10de" subsystem_device="0x17a6" link_speed="32.0 GT/s PCIe" link_width="16">
        <gpu dev="0" sm="90" rank="0" gdr="1">
          <nvlink target="0000:04:00.0" count="2" tclass="0x068000"/>
          <nvlink target="0000:05:00.0" count="2" tclass="0x068000"/>
          <nvlink target="0000:06:00.0" count="2" tclass="0x068000"/>
          <nvlink target="0000:03:00.0" count="2" tclass="0x068000"/>
        </gpu>
      </pci>
    </pci>
    <pci busid="0000:32:00.0" class="0x060400" vendor="0x1000" device="0xc030" subsystem_vendor="0x1000" subsystem_device="0x100b" link_speed="32.0 GT/s PCIe" link_width="16">
      <pci busid="0000:34:00.0" class="0x030200" vendor="0x10de" device="0x2324" subsystem_vendor="0x10de" subsystem_device="0x17a6" link_speed="32.0 GT/s PCIe" link_width="16">
        <gpu dev="1" sm="90" rank="1" gdr="1">
          <nvlink target="0000:04:00.0" count="2" tclass="0x068000"/>
          <nvlink target="0000:06:00.0" count="2" tclass="0x068000"/>
          <nvlink target="0000:05:00.0" count="2" tclass="0x068000"/>
          <nvlink target="0000:03:00.0" count="2" tclass="0x068000"/>
        </gpu>
      </pci>
      <pci busid="0000:35:00.0" class="0x020700" vendor="0x15b3" device="0x1021" subsystem_vendor="0x15b3" subsystem_device="0x0023" link_speed="32.0 GT/s PCIe" link_width="16">
        <nic>
          <net name="mlx5_5" dev="1" speed="400000" port="1" latency="0.000000" guid="0xec58a90003ae6d94" maxconn="131072" gdr="1"/>
        </nic>
      </pci>
    </pci>
    <pci busid="0000:45:00.0" class="0x060400" vendor="0x1000" device="0xc030" subsystem_vendor="0x1000" subsystem_device="0x100b" link_speed="32.0 GT/s PCIe" link_width="16">
      <pci busid="0000:47:00.0" class="0x020700" vendor="0x15b3" device="0x1021" subsystem_vendor="0x15b3" subsystem_device="0x0023" link_speed="32.0 GT/s PCIe" link_width="16">
        <nic>
          <net name="mlx5_6" dev="2" speed="400000" port="1" latency="0.000000" guid="0x42cfe30003ae6d94" maxconn="131072" gdr="1"/>
        </nic>
      </pci>
      <pci busid="0000:48:00.0" class="0x030200" vendor="0x10de" device="0x2324" subsystem_vendor="0x10de" subsystem_device="0x17a6" link_speed="32.0 GT/s PCIe" link_width="16">
        <gpu dev="2" sm="90" rank="2" gdr="1">
          <nvlink target="0000:04:00.0" count="2" tclass="0x068000"/>
          <nvlink target="0000:06:00.0" count="2" tclass="0x068000"/>
          <nvlink target="0000:05:00.0" count="2" tclass="0x068000"/>
          <nvlink target="0000:03:00.0" count="2" tclass="0x068000"/>
        </gpu>
      </pci>
    </pci>
    <pci busid="0000:58:00.0" class="0x060400" vendor="0x1000" device="0xc030" subsystem_vendor="0x1000" subsystem_device="0x100b" link_speed="32.0 GT/s PCIe" link_width="16">
      <pci busid="0000:5a:00.0" class="0x030200" vendor="0x10de" device="0x2324" subsystem_vendor="0x10de" subsystem_device="0x17a6" link_speed="32.0 GT/s PCIe" link_width="16">
        <gpu dev="3" sm="90" rank="3" gdr="1">
          <nvlink target="0000:05:00.0" count="2" tclass="0x068000"/>
          <nvlink target="0000:06:00.0" count="2" tclass="0x068000"/>
          <nvlink target="0000:03:00.0" count="2" tclass="0x068000"/>
          <nvlink target="0000:04:00.0" count="2" tclass="0x068000"/>
        </gpu>
      </pci>
      <pci busid="0000:5b:00.0" class="0x020700" vendor="0x15b3" device="0x1021" subsystem_vendor="0x15b3" subsystem_device="0x0023" link_speed="32.0 GT/s PCIe" link_width="16">
        <nic>
          <net name="mlx5_7" dev="3" speed="400000" port="1" latency="0.000000" guid="0xaacfe30003ae6d94" maxconn="131072" gdr="1"/>
        </nic>
      </pci>
    </pci>
  </cpu>
  <cpu numaid="1" affinity="ffffffff,ffff0000,00000000,ffffffff,ffff0000,00000000" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="143">
    <pci busid="0000:84:00.0" class="0x060400" vendor="0x1000" device="0xc030" subsystem_vendor="0x1000" subsystem_device="0x100b" link_speed="32.0 GT/s PCIe" link_width="16">
      <pci busid="0000:86:00.0" class="0x020700" vendor="0x15b3" device="0x1021" subsystem_vendor="0x15b3" subsystem_device="0x0023" link_speed="32.0 GT/s PCIe" link_width="16">
        <nic>
          <net name="mlx5_8" dev="4" speed="400000" port="1" latency="0.000000" guid="0xd455a90003ae6d94" maxconn="131072" gdr="1"/>
        </nic>
      </pci>
      <pci busid="0000:87:00.0" class="0x030200" vendor="0x10de" device="0x2324" subsystem_vendor="0x10de" subsystem_device="0x17a6" link_speed="32.0 GT/s PCIe" link_width="16">
        <gpu dev="4" sm="90" rank="4" gdr="1">
          <nvlink target="0000:04:00.0" count="2" tclass="0x068000"/>
          <nvlink target="0000:03:00.0" count="2" tclass="0x068000"/>
          <nvlink target="0000:06:00.0" count="2" tclass="0x068000"/>
          <nvlink target="0000:05:00.0" count="2" tclass="0x068000"/>
        </gpu>
      </pci>
    </pci>
    <pci busid="0000:ac:00.0" class="0x060400" vendor="0x1000" device="0xc030" subsystem_vendor="0x1000" subsystem_device="0x100b" link_speed="32.0 GT/s PCIe" link_width="16">
      <pci busid="0000:ae:00.0" class="0x030200" vendor="0x10de" device="0x2324" subsystem_vendor="0x10de" subsystem_device="0x17a6" link_speed="32.0 GT/s PCIe" link_width="16">
        <gpu dev="5" sm="90" rank="5" gdr="1">
          <nvlink target="0000:06:00.0" count="2" tclass="0x068000"/>
          <nvlink target="0000:04:00.0" count="2" tclass="0x068000"/>
          <nvlink target="0000:03:00.0" count="2" tclass="0x068000"/>
          <nvlink target="0000:05:00.0" count="2" tclass="0x068000"/>
        </gpu>
      </pci>
      <pci busid="0000:af:00.0" class="0x020700" vendor="0x15b3" device="0x1021" subsystem_vendor="0x15b3" subsystem_device="0x0023" link_speed="32.0 GT/s PCIe" link_width="16">
        <nic>
          <net name="mlx5_10" dev="5" speed="400000" port="1" latency="0.000000" guid="0xe458a90003ae6d94" maxconn="131072" gdr="1"/>
        </nic>
      </pci>
    </pci>
    <pci busid="0000:c0:00.0" class="0x060400" vendor="0x1000" device="0xc030" subsystem_vendor="0x1000" subsystem_device="0x100b" link_speed="32.0 GT/s PCIe" link_width="16">
      <pci busid="0000:c2:00.0" class="0x030200" vendor="0x10de" device="0x2324" subsystem_vendor="0x10de" subsystem_device="0x17a6" link_speed="32.0 GT/s PCIe" link_width="16">
        <gpu dev="6" sm="90" rank="6" gdr="1">
          <nvlink target="0000:06:00.0" count="2" tclass="0x068000"/>
          <nvlink target="0000:04:00.0" count="2" tclass="0x068000"/>
          <nvlink target="0000:05:00.0" count="2" tclass="0x068000"/>
          <nvlink target="0000:03:00.0" count="2" tclass="0x068000"/>
        </gpu>
      </pci>
      <pci busid="0000:c3:00.0" class="0x020700" vendor="0x15b3" device="0x1021" subsystem_vendor="0x15b3" subsystem_device="0x0023" link_speed="32.0 GT/s PCIe" link_width="16">
        <nic>
          <net name="mlx5_11" dev="6" speed="400000" port="1" latency="0.000000" guid="0xa2cfe30003ae6d94" maxconn="131072" gdr="1"/>
        </nic>
      </pci>
    </pci>
    <pci busid="0000:d4:00.0" class="0x060400" vendor="0x1000" device="0xc030" subsystem_vendor="0x1000" subsystem_device="0x100b" link_speed="32.0 GT/s PCIe" link_width="16">
      <pci busid="0000:d6:00.0" class="0x020700" vendor="0x15b3" device="0x1021" subsystem_vendor="0x15b3" subsystem_device="0x0023" link_speed="32.0 GT/s PCIe" link_width="16">
        <nic>
          <net name="mlx5_12" dev="7" speed="400000" port="1" latency="0.000000" guid="0x8456a90003ae6d94" maxconn="131072" gdr="1"/>
        </nic>
      </pci>
      <pci busid="0000:d7:00.0" class="0x030200" vendor="0x10de" device="0x2324" subsystem_vendor="0x10de" subsystem_device="0x17a6" link_speed="32.0 GT/s PCIe" link_width="16">
        <gpu dev="7" sm="90" rank="7" gdr="1">
          <nvlink target="0000:05:00.0" count="2" tclass="0x068000"/>
          <nvlink target="0000:04:00.0" count="2" tclass="0x068000"/>
          <nvlink target="0000:03:00.0" count="2" tclass="0x068000"/>
          <nvlink target="0000:06:00.0" count="2" tclass="0x068000"/>
        </gpu>
      </pci>
    </pci>
  </cpu>
</system>
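To check a dump like this without reading it by eye, the GPUs and NICs can be listed per NUMA node. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`. The `summarize` function is a hypothetical helper, and the embedded XML is a trimmed excerpt of the dump above with most attributes removed.

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the topology dump above: one NUMA node, one NIC, one GPU.
# Most attributes are dropped for brevity.
TOPO = """\
<system version="1">
  <cpu numaid="0">
    <pci busid="0000:0c:00.0">
      <pci busid="0000:0e:00.0">
        <nic>
          <net name="mlx5_0" dev="0" speed="400000" port="1"/>
        </nic>
      </pci>
      <pci busid="0000:0f:00.0">
        <gpu dev="0" sm="90" rank="0" gdr="1"/>
      </pci>
    </pci>
  </cpu>
</system>
"""


def summarize(xml_text):
    """Return {numaid: {"gpus": [dev, ...], "nics": [name, ...]}} for a topo dump."""
    root = ET.fromstring(xml_text)
    summary = {}
    for cpu in root.iter("cpu"):
        summary[cpu.get("numaid")] = {
            # <gpu> and <net> elements are nested under <pci> bridges,
            # so search the whole subtree of each <cpu>.
            "gpus": [int(gpu.get("dev")) for gpu in cpu.iter("gpu")],
            "nics": [net.get("name") for net in cpu.iter("net")],
        }
    return summary


if __name__ == "__main__":
    for numa, devices in summarize(TOPO).items():
        print(f"NUMA {numa}: GPUs {devices['gpus']}, NICs {devices['nics']}")
```

Run against the full file, this should list four GPUs and four `mlx5_*` NICs under each of the two NUMA nodes shown above.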

Here is the trace log. It contains the message "Failed to find ncclNetPlugin_v8 symbol"; I don't know whether that matters.
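One way to see how many ranks and nodes each communicator contains is to scan the trace log for the `nRanks ... nNodes ...` summary lines, which appear in the log that follows. This is a minimal Python sketch: `communicator_sizes` is a hypothetical helper, and the regular expression assumes the line layout seen in this particular log.

```python
import re

# Two summary lines copied from the trace log in this post.
LOG = """\
hw14:798:816 [6] NCCL INFO comm 0x55a7be589490 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
hw14:798:817 [7] NCCL INFO comm 0x55a7be58f930 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0
"""

# Captures host, rank, nRanks and nNodes from a communicator summary line.
# The "Init START" lines use lowercase "nranks" and are deliberately not matched.
SUMMARY = re.compile(
    r"^(\S+?):\d+:\d+ \[\d+\] NCCL INFO comm \S+ rank (\d+) nRanks (\d+) nNodes (\d+)",
    re.MULTILINE,
)


def communicator_sizes(log_text):
    """Return the set of (host, nRanks, nNodes) tuples found in the log."""
    return {
        (m.group(1), int(m.group(3)), int(m.group(4)))
        for m in SUMMARY.finditer(log_text)
    }


if __name__ == "__main__":
    for host, n_ranks, n_nodes in sorted(communicator_sizes(LOG)):
        print(f"{host}: communicator with {n_ranks} ranks across {n_nodes} node(s)")
```

For a two-node job with 8 GPUs per node, these lines would be expected to report `nRanks 16 nNodes 2`; a result of 8 ranks on 1 node is consistent with each process forming its own communicator.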

hw14:798:798 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0
hw14:798:798 [0] NCCL INFO Bootstrap : Using bond0:10.0.0.16<0>
hw14:798:798 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v8 symbol.
hw14:798:798 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
hw14:798:798 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
hw14:798:798 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
hw14:798:798 [7] NCCL INFO cudaDriverVersion 12030
hw14:798:798 [7] NCCL INFO NCCL version 2.20.5+cuda12.4
hw14:798:814 [4] NCCL INFO Plugin Path : /usr/local/lib/libnccl-net.so
hw14:798:814 [4] NCCL INFO P2P plugin IBext
hw14:798:814 [4] NCCL INFO NCCL_IBEXT_DISABLE set by environment to 1.
hw14:798:814 [4] NCCL INFO net.cc:111 -> 3
hw14:798:814 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to bond0
hw14:798:814 [4] NCCL INFO NCCL_IB_HCA set to mlx5_0,mlx5_5,mlx5_6,mlx5_7,mlx5_8,mlx5_10,mlx5_11,mlx5_12
hw14:798:814 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_5:1/IB [2]mlx5_6:1/IB [3]mlx5_7:1/IB [4]mlx5_8:1/IB [5]mlx5_10:1/IB [6]mlx5_11:1/IB [7]mlx5_12:1/IB [RO]; OOB bond0:10.0.0.16<0>
hw14:798:814 [4] NCCL INFO Using non-device net plugin version 0
hw14:798:814 [4] NCCL INFO Using network IB
hw14:798:816 [6] NCCL INFO Using non-device net plugin version 0
hw14:798:816 [6] NCCL INFO Using network IB
hw14:798:815 [5] NCCL INFO Using non-device net plugin version 0
hw14:798:815 [5] NCCL INFO Using network IB
hw14:798:817 [7] NCCL INFO Using non-device net plugin version 0
hw14:798:817 [7] NCCL INFO Using network IB
hw14:798:813 [3] NCCL INFO Using non-device net plugin version 0
hw14:798:813 [3] NCCL INFO Using network IB
hw14:798:810 [0] NCCL INFO Using non-device net plugin version 0
hw14:798:810 [0] NCCL INFO Using network IB
hw14:798:811 [1] NCCL INFO Using non-device net plugin version 0
hw14:798:811 [1] NCCL INFO Using network IB
hw14:798:812 [2] NCCL INFO Using non-device net plugin version 0
hw14:798:812 [2] NCCL INFO Using network IB
hw14:798:812 [2] NCCL INFO comm 0x55a7be570210 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 48000 commId 0xa861fef54fe1a5ac - Init START
hw14:798:811 [1] NCCL INFO comm 0x55a7be569d00 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 34000 commId 0xa861fef54fe1a5ac - Init START
hw14:798:815 [5] NCCL INFO comm 0x55a7be582ff0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId ae000 commId 0xa861fef54fe1a5ac - Init START
hw14:798:814 [4] NCCL INFO comm 0x55a7be57cb50 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 87000 commId 0xa861fef54fe1a5ac - Init START
hw14:798:816 [6] NCCL INFO comm 0x55a7be589490 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId c2000 commId 0xa861fef54fe1a5ac - Init START
hw14:798:817 [7] NCCL INFO comm 0x55a7be58f930 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId d7000 commId 0xa861fef54fe1a5ac - Init START
hw14:798:810 [0] NCCL INFO comm 0x55a7be5601c0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId f000 commId 0xa861fef54fe1a5ac - Init START
hw14:798:813 [3] NCCL INFO comm 0x55a7be5766b0 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 5a000 commId 0xa861fef54fe1a5ac - Init START
hw14:798:810 [0] NCCL INFO NCCL_TOPO_DUMP_FILE set by environment to topo-file-1
hw14:798:810 [0] NCCL INFO Setting affinity for GPU 0 to ffff,ffffffff,00000000,0000ffff,ffffffff
hw14:798:810 [0] NCCL INFO NVLS multicast support is available on dev 0
hw14:798:815 [5] NCCL INFO NVLS multicast support is available on dev 5
hw14:798:817 [7] NCCL INFO NVLS multicast support is available on dev 7
hw14:798:813 [3] NCCL INFO Setting affinity for GPU 3 to ffff,ffffffff,00000000,0000ffff,ffffffff
hw14:798:813 [3] NCCL INFO NVLS multicast support is available on dev 3
hw14:798:812 [2] NCCL INFO Setting affinity for GPU 2 to ffff,ffffffff,00000000,0000ffff,ffffffff
hw14:798:812 [2] NCCL INFO NVLS multicast support is available on dev 2
hw14:798:814 [4] NCCL INFO NVLS multicast support is available on dev 4
hw14:798:811 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffffff,00000000,0000ffff,ffffffff
hw14:798:811 [1] NCCL INFO NVLS multicast support is available on dev 1
hw14:798:816 [6] NCCL INFO NVLS multicast support is available on dev 6
hw14:798:816 [6] NCCL INFO comm 0x55a7be589490 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
hw14:798:817 [7] NCCL INFO comm 0x55a7be58f930 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0
hw14:798:814 [4] NCCL INFO comm 0x55a7be57cb50 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0
hw14:798:813 [3] NCCL INFO comm 0x55a7be5766b0 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0
hw14:798:811 [1] NCCL INFO comm 0x55a7be569d00 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0
hw14:798:812 [2] NCCL INFO comm 0x55a7be570210 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0
hw14:798:810 [0] NCCL INFO comm 0x55a7be5601c0 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0
hw14:798:815 [5] NCCL INFO comm 0x55a7be582ff0 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0
hw14:798:812 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1
hw14:798:814 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3
hw14:798:817 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6
hw14:798:814 [4] NCCL INFO P2P Chunksize set to 524288
hw14:798:815 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4
hw14:798:811 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0
hw14:798:811 [1] NCCL INFO P2P Chunksize set to 524288
hw14:798:810 [0] NCCL INFO Channel 00/16 :    0   1   2   3   4   5   6   7
hw14:798:810 [0] NCCL INFO Channel 01/16 :    0   1   2   3   4   5   6   7
hw14:798:810 [0] NCCL INFO Channel 02/16 :    0   1   2   3   4   5   6   7
hw14:798:812 [2] NCCL INFO P2P Chunksize set to 524288
hw14:798:813 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2
hw14:798:816 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5
hw14:798:816 [6] NCCL INFO P2P Chunksize set to 524288
hw14:798:810 [0] NCCL INFO Channel 03/16 :    0   1   2   3   4   5   6   7
hw14:798:810 [0] NCCL INFO Channel 04/16 :    0   1   2   3   4   5   6   7
hw14:798:810 [0] NCCL INFO Channel 05/16 :    0   1   2   3   4   5   6   7
hw14:798:810 [0] NCCL INFO Channel 06/16 :    0   1   2   3   4   5   6   7
hw14:798:810 [0] NCCL INFO Channel 07/16 :    0   1   2   3   4   5   6   7
hw14:798:810 [0] NCCL INFO Channel 08/16 :    0   1   2   3   4   5   6   7
hw14:798:810 [0] NCCL INFO Channel 09/16 :    0   1   2   3   4   5   6   7
hw14:798:810 [0] NCCL INFO Channel 10/16 :    0   1   2   3   4   5   6   7
hw14:798:810 [0] NCCL INFO Channel 11/16 :    0   1   2   3   4   5   6   7
hw14:798:810 [0] NCCL INFO Channel 12/16 :    0   1   2   3   4   5   6   7
hw14:798:810 [0] NCCL INFO Channel 13/16 :    0   1   2   3   4   5   6   7
hw14:798:810 [0] NCCL INFO Channel 14/16 :    0   1   2   3   4   5   6   7
hw14:798:810 [0] NCCL INFO Channel 15/16 :    0   1   2   3   4   5   6   7
hw14:798:810 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1
hw14:798:817 [7] NCCL INFO P2P Chunksize set to 524288
hw14:798:815 [5] NCCL INFO P2P Chunksize set to 524288
hw14:798:813 [3] NCCL INFO P2P Chunksize set to 524288
hw14:798:810 [0] NCCL INFO P2P Chunksize set to 524288
hw14:798:815 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 00/0 : 4[4] -> 5[5] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[4] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[4] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 01/0 : 4[4] -> 5[5] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 02/0 : 4[4] -> 5[5] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 00/0 : 7[7] -> 0[0] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 03/0 : 4[4] -> 5[5] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[4] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 04/0 : 5[5] -> 6[6] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[4] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 01/0 : 7[7] -> 0[0] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 05/0 : 5[5] -> 6[6] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 02/0 : 7[7] -> 0[0] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 04/0 : 1[1] -> 2[2] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[4] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 03/0 : 7[7] -> 0[0] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 05/0 : 1[1] -> 2[2] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[4] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 04/0 : 7[7] -> 0[0] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[4] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 06/0 : 4[4] -> 5[5] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 05/0 : 7[7] -> 0[0] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 07/0 : 4[4] -> 5[5] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[4] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 06/0 : 7[7] -> 0[0] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 08/0 : 4[4] -> 5[5] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 08/0 : 3[3] -> 4[4] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 07/0 : 7[7] -> 0[0] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 09/0 : 4[4] -> 5[5] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 09/0 : 3[3] -> 4[4] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 08/0 : 7[7] -> 0[0] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 10/0 : 5[5] -> 6[6] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/direct pointer
hw14:798:810 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 10/0 : 1[1] -> 2[2] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 09/0 : 7[7] -> 0[0] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/direct pointer
hw14:798:810 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 04/0 : 6[6] -> 7[7] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 11/0 : 1[1] -> 2[2] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 10/0 : 3[3] -> 4[4] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 11/0 : 5[5] -> 6[6] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 05/0 : 6[6] -> 7[7] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 10/0 : 7[7] -> 0[0] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 11/0 : 3[3] -> 4[4] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 12/0 : 4[4] -> 5[5] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 12/0 : 5[5] -> 6[6] via P2P/direct pointer
hw14:798:810 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 12/0 : 1[1] -> 2[2] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 12/0 : 3[3] -> 4[4] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 13/0 : 5[5] -> 6[6] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 13/0 : 4[4] -> 5[5] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 13/0 : 1[1] -> 2[2] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 11/0 : 7[7] -> 0[0] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 13/0 : 3[3] -> 4[4] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 14/0 : 5[5] -> 6[6] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 14/0 : 1[1] -> 2[2] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 14/0 : 4[4] -> 5[5] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 14/0 : 3[3] -> 4[4] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 15/0 : 5[5] -> 6[6] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/direct pointer
hw14:798:810 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 12/0 : 7[7] -> 0[0] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 15/0 : 3[3] -> 4[4] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 15/0 : 1[1] -> 2[2] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 15/0 : 4[4] -> 5[5] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 13/0 : 7[7] -> 0[0] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/direct pointer
hw14:798:810 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 10/0 : 6[6] -> 7[7] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 14/0 : 7[7] -> 0[0] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/direct pointer
hw14:798:810 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 11/0 : 6[6] -> 7[7] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 15/0 : 7[7] -> 0[0] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 10/0 : 2[2] -> 3[3] via P2P/direct pointer
hw14:798:810 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 12/0 : 6[6] -> 7[7] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 11/0 : 2[2] -> 3[3] via P2P/direct pointer
hw14:798:810 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 13/0 : 6[6] -> 7[7] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 12/0 : 2[2] -> 3[3] via P2P/direct pointer
hw14:798:810 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 14/0 : 6[6] -> 7[7] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 13/0 : 2[2] -> 3[3] via P2P/direct pointer
hw14:798:810 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 15/0 : 6[6] -> 7[7] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 14/0 : 2[2] -> 3[3] via P2P/direct pointer
hw14:798:810 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/direct pointer
hw14:798:810 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 15/0 : 2[2] -> 3[3] via P2P/direct pointer
hw14:798:810 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/direct pointer
hw14:798:810 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/direct pointer
hw14:798:810 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/direct pointer
hw14:798:810 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Connected all rings
hw14:798:812 [2] NCCL INFO Connected all rings
hw14:798:813 [3] NCCL INFO Connected all rings
hw14:798:810 [0] NCCL INFO Connected all rings
hw14:798:817 [7] NCCL INFO Connected all rings
hw14:798:817 [7] NCCL INFO Channel 00/0 : 7[7] -> 6[6] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Connected all rings
hw14:798:814 [4] NCCL INFO Connected all rings
hw14:798:815 [5] NCCL INFO Connected all rings
hw14:798:817 [7] NCCL INFO Channel 01/0 : 7[7] -> 6[6] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 02/0 : 7[7] -> 6[6] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 03/0 : 7[7] -> 6[6] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 04/0 : 7[7] -> 6[6] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 05/0 : 7[7] -> 6[6] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 06/0 : 7[7] -> 6[6] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 07/0 : 7[7] -> 6[6] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 08/0 : 7[7] -> 6[6] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 09/0 : 7[7] -> 6[6] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 10/0 : 7[7] -> 6[6] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 11/0 : 7[7] -> 6[6] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 12/0 : 7[7] -> 6[6] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 13/0 : 7[7] -> 6[6] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 14/0 : 7[7] -> 6[6] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/direct pointer
hw14:798:817 [7] NCCL INFO Channel 15/0 : 7[7] -> 6[6] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 05/0 : 3[3] -> 2[2] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 00/0 : 4[4] -> 3[3] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 08/0 : 1[1] -> 0[0] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 03/0 : 5[5] -> 4[4] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 10/0 : 2[2] -> 1[1] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 09/0 : 1[1] -> 0[0] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 01/0 : 4[4] -> 3[3] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 04/0 : 5[5] -> 4[4] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 04/0 : 6[6] -> 5[5] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 10/0 : 3[3] -> 2[2] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 02/0 : 4[4] -> 3[3] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 11/0 : 2[2] -> 1[1] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 05/0 : 5[5] -> 4[4] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 05/0 : 6[6] -> 5[5] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 03/0 : 4[4] -> 3[3] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 12/0 : 2[2] -> 1[1] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 06/0 : 5[5] -> 4[4] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 04/0 : 4[4] -> 3[3] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 11/0 : 3[3] -> 2[2] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 13/0 : 2[2] -> 1[1] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 06/0 : 6[6] -> 5[5] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 05/0 : 4[4] -> 3[3] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 12/0 : 3[3] -> 2[2] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 07/0 : 5[5] -> 4[4] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 12/0 : 1[1] -> 0[0] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 06/0 : 4[4] -> 3[3] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 07/0 : 6[6] -> 5[5] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 13/0 : 3[3] -> 2[2] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 13/0 : 1[1] -> 0[0] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 07/0 : 4[4] -> 3[3] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 08/0 : 5[5] -> 4[4] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 08/0 : 6[6] -> 5[5] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 14/0 : 3[3] -> 2[2] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 14/0 : 2[2] -> 1[1] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 08/0 : 4[4] -> 3[3] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 09/0 : 5[5] -> 4[4] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 14/0 : 1[1] -> 0[0] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 09/0 : 6[6] -> 5[5] via P2P/direct pointer
hw14:798:812 [2] NCCL INFO Channel 15/0 : 2[2] -> 1[1] via P2P/direct pointer
hw14:798:813 [3] NCCL INFO Channel 15/0 : 3[3] -> 2[2] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 10/0 : 5[5] -> 4[4] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 09/0 : 4[4] -> 3[3] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 10/0 : 6[6] -> 5[5] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 11/0 : 5[5] -> 4[4] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 10/0 : 4[4] -> 3[3] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 11/0 : 6[6] -> 5[5] via P2P/direct pointer
hw14:798:811 [1] NCCL INFO Channel 15/0 : 1[1] -> 0[0] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 12/0 : 5[5] -> 4[4] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 11/0 : 4[4] -> 3[3] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 12/0 : 6[6] -> 5[5] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 13/0 : 5[5] -> 4[4] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 12/0 : 4[4] -> 3[3] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 13/0 : 6[6] -> 5[5] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 13/0 : 4[4] -> 3[3] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 14/0 : 5[5] -> 4[4] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 14/0 : 6[6] -> 5[5] via P2P/direct pointer
hw14:798:815 [5] NCCL INFO Channel 15/0 : 5[5] -> 4[4] via P2P/direct pointer
hw14:798:816 [6] NCCL INFO Channel 15/0 : 6[6] -> 5[5] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 14/0 : 4[4] -> 3[3] via P2P/direct pointer
hw14:798:814 [4] NCCL INFO Channel 15/0 : 4[4] -> 3[3] via P2P/direct pointer
hw14:798:810 [0] NCCL INFO Connected all trees
hw14:798:811 [1] NCCL INFO Connected all trees
hw14:798:817 [7] NCCL INFO Connected all trees
hw14:798:812 [2] NCCL INFO Connected all trees
hw14:798:813 [3] NCCL INFO Connected all trees
hw14:798:814 [4] NCCL INFO Connected all trees
hw14:798:816 [6] NCCL INFO Connected all trees
hw14:798:815 [5] NCCL INFO Connected all trees
hw14:798:810 [0] NCCL INFO NVLS comm 0x55a7be5601c0 headRank 0 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 201326592 nvlsTotalSize 1610612736
hw14:798:811 [1] NCCL INFO NVLS comm 0x55a7be569d00 headRank 1 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 201326592 nvlsTotalSize 1610612736
hw14:798:813 [3] NCCL INFO NVLS comm 0x55a7be5766b0 headRank 3 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 201326592 nvlsTotalSize 1610612736
hw14:798:816 [6] NCCL INFO NVLS comm 0x55a7be589490 headRank 6 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 201326592 nvlsTotalSize 1610612736
hw14:798:815 [5] NCCL INFO NVLS comm 0x55a7be582ff0 headRank 5 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 201326592 nvlsTotalSize 1610612736
hw14:798:817 [7] NCCL INFO NVLS comm 0x55a7be58f930 headRank 7 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 201326592 nvlsTotalSize 1610612736
hw14:798:814 [4] NCCL INFO NVLS comm 0x55a7be57cb50 headRank 4 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 201326592 nvlsTotalSize 1610612736
hw14:798:812 [2] NCCL INFO NVLS comm 0x55a7be570210 headRank 2 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 201326592 nvlsTotalSize 1610612736
hw14:798:811 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
hw14:798:811 [1] NCCL INFO 16 coll channels, 0 collnet channels, 16 nvls channels, 16 p2p channels, 16 p2p channels per peer
hw14:798:815 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
hw14:798:815 [5] NCCL INFO 16 coll channels, 0 collnet channels, 16 nvls channels, 16 p2p channels, 16 p2p channels per peer
hw14:798:817 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
hw14:798:817 [7] NCCL INFO 16 coll channels, 0 collnet channels, 16 nvls channels, 16 p2p channels, 16 p2p channels per peer
hw14:798:810 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
hw14:798:810 [0] NCCL INFO 16 coll channels, 0 collnet channels, 16 nvls channels, 16 p2p channels, 16 p2p channels per peer
hw14:798:813 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
hw14:798:813 [3] NCCL INFO 16 coll channels, 0 collnet channels, 16 nvls channels, 16 p2p channels, 16 p2p channels per peer
hw14:798:812 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
hw14:798:812 [2] NCCL INFO 16 coll channels, 0 collnet channels, 16 nvls channels, 16 p2p channels, 16 p2p channels per peer
hw14:798:814 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
hw14:798:814 [4] NCCL INFO 16 coll channels, 0 collnet channels, 16 nvls channels, 16 p2p channels, 16 p2p channels per peer
hw14:798:816 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
hw14:798:816 [6] NCCL INFO 16 coll channels, 0 collnet channels, 16 nvls channels, 16 p2p channels, 16 p2p channels per peer
hw14:798:815 [5] NCCL INFO comm 0x55a7be582ff0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId ae000 commId 0xa861fef54fe1a5ac - Init COMPLETE
hw14:798:817 [7] NCCL INFO comm 0x55a7be58f930 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId d7000 commId 0xa861fef54fe1a5ac - Init COMPLETE
hw14:798:813 [3] NCCL INFO comm 0x55a7be5766b0 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 5a000 commId 0xa861fef54fe1a5ac - Init COMPLETE
hw14:798:810 [0] NCCL INFO comm 0x55a7be5601c0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId f000 commId 0xa861fef54fe1a5ac - Init COMPLETE
hw14:798:812 [2] NCCL INFO comm 0x55a7be570210 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 48000 commId 0xa861fef54fe1a5ac - Init COMPLETE
hw14:798:816 [6] NCCL INFO comm 0x55a7be589490 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId c2000 commId 0xa861fef54fe1a5ac - Init COMPLETE
hw14:798:811 [1] NCCL INFO comm 0x55a7be569d00 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 34000 commId 0xa861fef54fe1a5ac - Init COMPLETE
hw14:798:814 [4] NCCL INFO comm 0x55a7be57cb50 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 87000 commId 0xa861fef54fe1a5ac - Init COMPLETE
hw14:798:798 [7] NCCL INFO comm 0x55a7be5601c0 rank 0 nranks 8 cudaDev 0 busId f000 - Destroy COMPLETE
hw14:798:798 [7] NCCL INFO comm 0x55a7be569d00 rank 1 nranks 8 cudaDev 1 busId 34000 - Destroy COMPLETE
hw14:798:798 [7] NCCL INFO comm 0x55a7be570210 rank 2 nranks 8 cudaDev 2 busId 48000 - Destroy COMPLETE
hw14:798:798 [7] NCCL INFO comm 0x55a7be5766b0 rank 3 nranks 8 cudaDev 3 busId 5a000 - Destroy COMPLETE
hw14:798:798 [7] NCCL INFO comm 0x55a7be57cb50 rank 4 nranks 8 cudaDev 4 busId 87000 - Destroy COMPLETE
hw14:798:798 [7] NCCL INFO comm 0x55a7be582ff0 rank 5 nranks 8 cudaDev 5 busId ae000 - Destroy COMPLETE
hw14:798:798 [7] NCCL INFO comm 0x55a7be589490 rank 6 nranks 8 cudaDev 6 busId c2000 - Destroy COMPLETE
hw14:798:798 [7] NCCL INFO comm 0x55a7be58f930 rank 7 nranks 8 cudaDev 7 busId d7000 - Destroy COMPLETE
AddyLaddy commented 2 months ago

In the NVIDIA-provided containers, the tests built with MPI=1 are usually named:

/opt/nccl-tests/build/all_reduce_perf_mpi

Did you build /opt/nccl-tests/build/all_reduce_perf yourself using MPI=1?

FrankLeeeee commented 2 months ago

I encountered the same problem on H100 as well. I built nccl-tests with OpenMPI and did not use Docker.

jianh619 commented 2 months ago

In the NVIDIA-provided containers, the tests built with MPI=1 are usually named:

/opt/nccl-tests/build/all_reduce_perf_mpi

Did you build /opt/nccl-tests/build/all_reduce_perf yourself using MPI=1?

Yes, I built the image myself, compiling nccl-tests with MPI=1.

BTW, is there an official container provided by NVIDIA? Could you let me know where I can find the download link?

FrankLeeeee commented 2 months ago

I solved this issue by using OpenMPI 4.1 instead. I originally built nccl-tests with OpenMPI 5.0, but it ran separately on each node. After switching to OpenMPI 4.1 and rebuilding, it works as expected now.

kiskra-nvidia commented 2 months ago

Yes, I built the image myself, compiling nccl-tests with MPI=1.

It sure looks like, for whatever reason, either your MPI compilation or your MPI installation does not work as expected.
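
One rough way to confirm this from the NCCL debug log: the nccl-tests README gives the total rank count as processes x threads x GPUs per thread, so -np 2 with -g 8 should yield 16 ranks, yet the Init COMPLETE lines above report nranks 8. The Python sketch below is not part of nccl-tests, and its function names are made up for illustration. It groups those lines by commId and compares the reported nranks with the expected count.

```python
import re
from collections import defaultdict

# Matches NCCL INFO lines of the form shown in the log above, e.g.
# "hw14:798:810 [0] NCCL INFO comm 0x... rank 0 nranks 8 ... commId 0x... - Init COMPLETE"
INIT_RE = re.compile(
    r"^(?P<host>[^:\s]+):\d+:\d+ \[\d+\] NCCL INFO comm \S+ "
    r"rank (?P<rank>\d+) nranks (?P<nranks>\d+) .*"
    r"commId (?P<comm_id>0x[0-9a-fA-F]+) - Init COMPLETE"
)


def expected_ranks(processes, threads, gpus):
    """Total ranks per the nccl-tests README: processes * threads * GPUs per thread."""
    return processes * threads * gpus


def summarize_communicators(log_text):
    """Group 'Init COMPLETE' lines by commId (hypothetical helper, for illustration)."""
    comms = defaultdict(lambda: {"nranks": set(), "hosts": set()})
    for line in log_text.splitlines():
        match = INIT_RE.match(line.strip())
        if match:
            entry = comms[match.group("comm_id")]
            entry["nranks"].add(int(match.group("nranks")))
            entry["hosts"].add(match.group("host"))
    return dict(comms)


# Two lines copied from the log in this issue.
SAMPLE = """\
hw14:798:815 [5] NCCL INFO comm 0x55a7be582ff0 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId ae000 commId 0xa861fef54fe1a5ac - Init COMPLETE
hw14:798:810 [0] NCCL INFO comm 0x55a7be5601c0 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId f000 commId 0xa861fef54fe1a5ac - Init COMPLETE
"""

want = expected_ranks(processes=2, threads=1, gpus=8)  # from -np 2 and -g 8
for comm_id, info in summarize_communicators(SAMPLE).items():
    ok = info["nranks"] == {want}
    print(comm_id, "nranks", sorted(info["nranks"]), "hosts", sorted(info["hosts"]),
          "OK" if ok else f"expected {want} ranks")
```

If the full log shows one commId per node, each with nranks 8 and a single host, the two processes never joined a common communicator, which is the symptom reported here.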

Does a simple MPI "hello world" type program work correctly (you know, one that would report the rank and size of MPI_COMM_WORLD from each launched process)?
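
For anyone without such a program at hand, here is a minimal sketch in Python. It assumes mpi4py is installed and built against the same MPI installation that mpirun comes from. That is an assumption; nothing in this thread uses mpi4py, and summarize is only an illustrative helper name.

```python
import socket


def summarize(rank, size, host):
    """Format one process's view of MPI_COMM_WORLD (illustrative helper)."""
    return f"rank {rank} of {size} on {host}"


def main():
    try:
        # Assumption: mpi4py is available and linked against the MPI under test.
        from mpi4py import MPI
    except ImportError:
        print("mpi4py is not installed; skipping the MPI check")
        return
    comm = MPI.COMM_WORLD
    print(summarize(comm.Get_rank(), comm.Get_size(), socket.gethostname()))


if __name__ == "__main__":
    main()
```

Launch it with the same mpirun host options as the benchmark (for example with -np 2 across the two nodes). A working setup should report a size of 2 with ranks 0 and 1; if every process reports rank 0 of 1, the launcher and the MPI library are probably not cooperating.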

Can you verify if your all_reduce_perf actually uses MPI? Say, check with ldd if it links with the MPI library:

ldd all_reduce_perf | grep mpi

Or check with nm if it has any MPI symbols:

nm all_reduce_perf | grep MPI
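
The two checks above can also be scripted, for instance to run them on every node. The following is only a sketch: the function names are made up, and it simply shells out to ldd and nm and filters the output the same way the grep commands do (case-sensitive substring match).

```python
import subprocess
import sys


def lines_mentioning(text, needle):
    """Return the lines of text that contain needle, like a case-sensitive grep."""
    return [line.strip() for line in text.splitlines() if needle in line]


def run_tool(tool, binary):
    """Run 'tool binary' and return its stdout, or '' if the tool is not installed."""
    try:
        result = subprocess.run([tool, binary], capture_output=True, text=True, check=False)
    except FileNotFoundError:
        return ""
    return result.stdout


def mpi_evidence(binary):
    """Collect the lines that suggest the binary was built against MPI."""
    return {
        "ldd": lines_mentioning(run_tool("ldd", binary), "mpi"),
        "nm": lines_mentioning(run_tool("nm", binary), "MPI"),
    }


if __name__ == "__main__" and len(sys.argv) == 2:
    evidence = mpi_evidence(sys.argv[1])
    for tool, lines in evidence.items():
        print(f"{tool}: {len(lines)} matching line(s)")
        for line in lines:
            print("   ", line)
```

Pass the path of the test binary as the only argument. No matches from either tool would suggest the binary was built without MPI=1, although nm reports no symbols at all for a stripped binary, so an empty nm result alone is not conclusive.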

BTW, is there an official container provided by NVIDIA? Could you let me know where I can find the download link?

The Docker container nvidia/cuda:12.2.2-devel-ubuntu22.04 contains NCCL 2.19.3. A number of containers in NVIDIA's NGC catalog (https://catalog.ngc.nvidia.com/) contain NCCL as well. I believe TensorRT does (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorrt), as does PyTorch (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch). See also https://catalog.ngc.nvidia.com/orgs/nvidia/containers/hpc-benchmarks, https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nvhpc...

jianh619 commented 2 months ago

Thanks, guys. It must have been a compilation problem; after rebuilding the image, it works now.