NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

Bandwidth result not equal to ib_write_bw result #154

Closed Jiaao-Bai closed 1 year ago

Jiaao-Bai commented 1 year ago

Hi, I ran sendrecv_perf on 2 nodes with A100 GPUs. The reported bandwidth is 0.60 GB/s, but the ib_write_bw result is 23 Gb/s. Please give me some advice.
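The two figures above are in different units (gigabytes per second versus gigabits per second), so they cannot be compared directly. A minimal sketch of the conversion, assuming both tools use decimal units (1 GB = 10^9 bytes, 8 bits per byte); the variable names are illustrative only:

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbit_per_s / 8.0

# Figures quoted in the report above.
ib_write_bw_gbit = 23.0   # ib_write_bw result, Gb/s
nccl_gbyte = 0.60         # sendrecv_perf result, GB/s

ib_gbyte = gbit_to_gbyte(ib_write_bw_gbit)
ratio = nccl_gbyte / ib_gbyte

print(f"ib_write_bw: {ib_gbyte:.3f} GB/s")
print(f"sendrecv_perf reaches {ratio:.0%} of the ib_write_bw figure")
```

Even after conversion, the sendrecv_perf number is roughly a fifth of the ib_write_bw number, so the gap is not only a units mismatch.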

env:
NCCL version: 2.18.1-1
CUDA: 11.6
2 servers, each with 2 x 25 Gbps bonded network cards and 4 A100 GPUs

I modified the ncclGetUniqueId function so that NCCL can run without MPI. Please ignore the related log lines and the extra env variable (NCCL_COMM_ID_NOT_PEER_0).

command on peer0:

NCCL_DEBUG_SUBSYS="ALL" NCCL_DEBUG="TRACE" NCCL_NET_GDR_LEVEL=9 NCCL_IB_GID_INDEX=3 NCCL_MIN_P2P_NCHANNELS=2 NCCL_TOPO_DUMP_FILE=$PWD/topology.xml LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib NCCL_COMM_ID="10.156.9.18:8888" ./build/sendrecv_perf -b 1M -e 10M -f 2 -g 1 -w 1

command on peer1:

NCCL_DEBUG_SUBSYS="ALL" NCCL_DEBUG="TRACE" NCCL_NET_GDR_LEVEL=9 NCCL_IB_GID_INDEX=3 CUDA_VISIBLE_DEVICES=2 NCCL_TOPO_DUMP_FILE=$PWD/topology.xml LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib NCCL_COMM_ID_NOT_PEER_0="10.156.9.18:8888" ./build/sendrecv_perf -b 1M -e 10M -f 2 -g 1 -w 1

log: peer0.log, peer1.log

topology

<system version="1">
  <cpu numaid="0" affinity="000000ff,ffff0000,00ffffff" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="106">
    <pci busid="0000:4b:00.0" class="0x030200" vendor="0x10de" device="0x20b5" subsystem_vendor="0x10de" subsystem_device="0x1533" link_speed="16.0 GT/s PCIe" link_width="16">
      <gpu dev="0" sm="80" rank="0" gdr="1"/>
    </pci>
  </cpu>
  <cpu numaid="1" affinity="ffffff00,0000ffff,ff000000" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="106">
    <pci busid="0000:98:00.0" class="0x020000" vendor="0x15b3" device="0x1017" subsystem_vendor="0x15b3" subsystem_device="0x0052" link_speed="8.0 GT/s PCIe" link_width="16">
      <nic>
        <net name="mlx5_bond_0" dev="0" speed="25000" port="1" latency="0.000000" guid="0x5addcb0003ebc008" maxconn="131072" gdr="1"/>
      </nic>
    </pci>
  </cpu>
</system>
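To read the dump at a glance, the following sketch lists each PCI device with its NUMA node and link settings. It uses an abridged copy of the XML above (only the attributes the code reads) and assumes the `speed` attribute of `<net>` is in Mb/s; the helper name `summarize` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the topology dump above; only attributes read below are kept.
TOPOLOGY_XML = """
<system version="1">
  <cpu numaid="0">
    <pci busid="0000:4b:00.0" link_speed="16.0 GT/s PCIe" link_width="16">
      <gpu dev="0" sm="80" rank="0" gdr="1"/>
    </pci>
  </cpu>
  <cpu numaid="1">
    <pci busid="0000:98:00.0" link_speed="8.0 GT/s PCIe" link_width="16">
      <nic>
        <net name="mlx5_bond_0" dev="0" speed="25000" gdr="1"/>
      </nic>
    </pci>
  </cpu>
</system>
"""

def summarize(xml_text):
    """Return one dict per PCI device: kind, NUMA node, bus id, link settings."""
    root = ET.fromstring(xml_text)
    devices = []
    for cpu in root.findall("cpu"):
        numa = int(cpu.get("numaid"))
        for pci in cpu.findall("pci"):
            net = pci.find("nic/net")
            if pci.find("gpu") is not None:
                kind = "gpu"
            elif net is not None:
                kind = "nic"
            else:
                kind = "other"
            entry = {
                "kind": kind,
                "numa": numa,
                "busid": pci.get("busid"),
                "link": f'{pci.get("link_speed")} x{pci.get("link_width")}',
            }
            if net is not None:
                # Assumption: speed is in Mb/s, so 25000 means 25 Gb/s.
                entry["net_gbyte_per_s"] = int(net.get("speed")) / 1000 / 8
            devices.append(entry)
    return devices

devices = summarize(TOPOLOGY_XML)
for dev in devices:
    print(dev)
```

In this dump the GPU sits under numaid 0 and the NIC under numaid 1, and the NIC is reported at 25000 (about 3.1 GB/s under the Mb/s assumption). Whether the cross-socket placement contributes to the low bandwidth is not established in this thread.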
sjeaugey commented 1 year ago

Can you remove all tracing/logging, and run with -b 8 -e 4G -f 2? That would give us a better idea as to what's going on.
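For context, here is a rough sketch of which message sizes each set of flags covers. It assumes that `-b` and `-e` are the minimum and maximum sizes, that `-f` is the multiplication factor between sizes, and that the K/M/G suffixes are binary (1024-based); `parse_size` and `sweep` are hypothetical helpers, not nccl-tests code.

```python
def parse_size(text: str) -> int:
    """Parse a size such as '8', '1M' or '4G' into bytes (binary suffixes assumed)."""
    units = {"K": 1024, "M": 1024**2, "G": 1024**3}
    suffix = text[-1].upper()
    if suffix in units:
        return int(text[:-1]) * units[suffix]
    return int(text)

def sweep(begin: str, end: str, factor: int) -> list:
    """List the sizes visited when multiplying by `factor` from begin up to end."""
    sizes = []
    size = parse_size(begin)
    limit = parse_size(end)
    while size <= limit:
        sizes.append(size)
        size *= factor
    return sizes

original = sweep("1M", "10M", 2)   # flags used in the report: -b 1M -e 10M -f 2
suggested = sweep("8", "4G", 2)    # flags suggested here:     -b 8 -e 4G -f 2

print(f"original:  {len(original)} sizes, {original[0]} to {original[-1]} bytes")
print(f"suggested: {len(suggested)} sizes, {suggested[0]} to {suggested[-1]} bytes")
```

Under these assumptions the original run measures only 4 sizes between 1 MiB and 8 MiB, while the suggested run covers 30 sizes from 8 bytes to 4 GiB, which shows both the small-message latency regime and the large-message bandwidth plateau.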

Jiaao-Bai commented 1 year ago

After switching to NCCL v2.17.1, the result is OK.

HeGaoYuan commented 8 months ago

"after using nccl v2.17.1, the result is ok"

Why is the result OK when using NCCL v2.17.1? @Jiaao-Bai