NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL Infiniband Issue #1412

Closed JuiceLemonLemon closed 3 months ago

JuiceLemonLemon commented 3 months ago

Hi, I have a problem similar to https://github.com/NVIDIA/nccl/issues/307: two machines in a cluster connected with 200 Gb/s InfiniBand.

ibstatus:

Infiniband device 'mlx5_0' port 1 status:
        default gid:     fe80:0000:0000:0000:88e9:a4ff:ff4c:2bdc
        base lid:        0x22
        sm lid:          0x47
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            200 Gb/sec (4X HDR)
        link_layer:      InfiniBand

Infiniband device 'mlx5_1' port 1 status:
        default gid:     fe80:0000:0000:0000:88e9:a4ff:ff4c:7b90
        base lid:        0xe
        sm lid:          0x47
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            200 Gb/sec (4X HDR)
        link_layer:      InfiniBand

Infiniband device 'mlx5_2' port 1 status:
        default gid:     fe80:0000:0000:0000:88e9:a4ff:ff4c:7b88
        base lid:        0x2a
        sm lid:          0x47
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            200 Gb/sec (4X HDR)
        link_layer:      InfiniBand

Infiniband device 'mlx5_3' port 1 status:
        default gid:     fe80:0000:0000:0000:8ae9:a4ff:fe46:3734
        base lid:        0x0
        sm lid:          0x0
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      Ethernet

Infiniband device 'mlx5_4' port 1 status:
        default gid:     fe80:0000:0000:0000:8ae9:a4ff:fe46:3735
        base lid:        0x0
        sm lid:          0x0
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            40 Gb/sec (4X QDR)
        link_layer:      Ethernet

Infiniband device 'mlx5_5' port 1 status:
        default gid:     fe80:0000:0000:0000:88e9:a4ff:ff4c:5ba8
        base lid:        0x10
        sm lid:          0x47
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            200 Gb/sec (4X HDR)
        link_layer:      InfiniBand

ib_send_bw shows:

---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 RX depth        : 512
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x22 QPN 0x0031 PSN 0x7ffe0d
 remote address: LID 0x47 QPN 0x0034 PSN 0x8cd88e
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
Conflicting CPU frequency values detected: 3339.967000 != 2865.821000. CPU Frequency is not max.
 65536      1000             0.00               20273.41                   0.324375
---------------------------------------------------------------------------------------

nvidia-smi topo -m shows:

       GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    SYS     PXB     SYS     SYS     SYS     SYS     24-31,88-95     3               N/A
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    SYS     PXB     SYS     SYS     SYS     SYS     24-31,88-95     3               N/A
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    PXB     SYS     SYS     SYS     SYS     SYS     8-15,72-79      1               N/A
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    PXB     SYS     SYS     SYS     SYS     SYS     8-15,72-79      1               N/A
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    SYS     SYS     SYS     SYS     SYS     PXB     56-63,120-127   7               N/A
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    SYS     SYS     SYS     SYS     SYS     PXB     56-63,120-127   7               N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    SYS     SYS     PXB     SYS     SYS     SYS     40-47,104-111   5               N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      SYS     SYS     PXB     SYS     SYS     SYS     40-47,104-111   5               N/A
NIC0    SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS     SYS
NIC1    PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS
NIC2    SYS     SYS     SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS      X      SYS     SYS     SYS
NIC3    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX     SYS
NIC4    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X      SYS
NIC5    SYS     SYS     SYS     SYS     PXB     PXB     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5

but nccl-tests only achieves about 36 GB/s, which is far below the expected bandwidth (each InfiniBand device provides 200 Gb/s = 25 GB/s, and we have 4 InfiniBand devices, so the expected bandwidth should be 4 × 25 GB/s = 100 GB/s).
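The expected-bandwidth arithmetic can be sanity-checked in a couple of lines of shell (a sketch, assuming 200 Gb/s per HDR NIC and 4 usable IB NICs per node):

```shell
# Back-of-the-envelope: 200 Gb/s per HDR NIC, 8 bits per byte, 4 IB NICs.
per_nic_GBps=$((200 / 8))            # 25 GB/s per NIC
expected_GBps=$((per_nic_GBps * 4))  # 4 NICs in parallel
echo "${expected_GBps} GB/s"         # prints: 100 GB/s
```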

Command: mpirun --bind-to none --mca btl '^openib' -n 2 --host ip1,ip2 -x LD_LIBRARY_PATH ./build/all_gather_perf -b 16M -e 1024M -i 16777216 -g 8 -d half -f 2

#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
    16777216        524288      half    none      -1    567.5   29.57   27.72      0    538.0   31.18   29.23      0
    33554432       1048576      half    none      -1    909.2   36.91   34.60      0    907.6   36.97   34.66      0
    67108864       2097152      half    none      -1   1714.0   39.15   36.71      0   1705.5   39.35   36.89      0
   134217728       4194304      half    none      -1   3413.5   39.32   36.86      0   3409.4   39.37   36.91      0
   268435456       8388608      half    none      -1   6907.1   38.86   36.43      0   6929.7   38.74   36.32      0
   536870912      16777216      half    none      -1    13901   38.62   36.21      0    13901   38.62   36.21      0
  1073741824      33554432      half    none      -1    27651   38.83   36.41      0    27565   38.95   36.52      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 35.119
#
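For reference, nccl-tests derives bus bandwidth from algorithm bandwidth as busbw = algbw × (n-1)/n for all_gather, where n is the total rank count (16 here). A quick awk check against the 128 MiB row above (algbw 39.32 GB/s):

```shell
# all_gather bus bandwidth from algorithm bandwidth: busbw = algbw * (n-1)/n
awk 'BEGIN { n = 16; algbw = 39.32; printf "%.2f GB/s\n", algbw * (n - 1) / n }'
# prints: 36.86 GB/s
```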
kiskra-nvidia commented 3 months ago

Please run with NCCL_DEBUG=INFO to see what NICs NCCL chooses. You may need to exclude mlx5_3 and mlx5_4 with something like NCCL_IB_HCA=^mlx5_3,mlx5_4.

kiskra-nvidia commented 3 months ago

And thank you for including so much relevant info in your report, BTW!

JuiceLemonLemon commented 3 months ago

> Please run with NCCL_DEBUG=INFO to see what NICs NCCL chooses. You may need to exclude mlx5_3 and mlx5_4 with something like NCCL_IB_HCA=^mlx5_3,mlx5_4.

OK, I ran the command below:

NCCL_DEBUG=INFO NCCL_IB_HCA=^mlx5_3,mlx5_4 mpirun --bind-to none --mca btl '^openib' -n 2 --host ip1,ip2 -x LD_LIBRARY_PATH ./build/all_gather_perf -b 16M -e 1024M -i 16777216 -g 8 -d half -f 2

# nThread 1 nGpus 8 minBytes 16777216 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 2017630 on   swat1-04 device  0 [0x07] NVIDIA A100-SXM4-80GB
#  Rank  1 Group  0 Pid 2017630 on   swat1-04 device  1 [0x0b] NVIDIA A100-SXM4-80GB
#  Rank  2 Group  0 Pid 2017630 on   swat1-04 device  2 [0x48] NVIDIA A100-SXM4-80GB
#  Rank  3 Group  0 Pid 2017630 on   swat1-04 device  3 [0x4c] NVIDIA A100-SXM4-80GB
#  Rank  4 Group  0 Pid 2017630 on   swat1-04 device  4 [0x88] NVIDIA A100-SXM4-80GB
#  Rank  5 Group  0 Pid 2017630 on   swat1-04 device  5 [0x8b] NVIDIA A100-SXM4-80GB
#  Rank  6 Group  0 Pid 2017630 on   swat1-04 device  6 [0xc8] NVIDIA A100-SXM4-80GB
#  Rank  7 Group  0 Pid 2017630 on   swat1-04 device  7 [0xcb] NVIDIA A100-SXM4-80GB
#  Rank  8 Group  0 Pid 1161920 on   swat1-05 device  0 [0x07] NVIDIA A100-SXM4-80GB
#  Rank  9 Group  0 Pid 1161920 on   swat1-05 device  1 [0x0b] NVIDIA A100-SXM4-80GB
#  Rank 10 Group  0 Pid 1161920 on   swat1-05 device  2 [0x48] NVIDIA A100-SXM4-80GB
#  Rank 11 Group  0 Pid 1161920 on   swat1-05 device  3 [0x4c] NVIDIA A100-SXM4-80GB
#  Rank 12 Group  0 Pid 1161920 on   swat1-05 device  4 [0x88] NVIDIA A100-SXM4-80GB
#  Rank 13 Group  0 Pid 1161920 on   swat1-05 device  5 [0x8b] NVIDIA A100-SXM4-80GB
#  Rank 14 Group  0 Pid 1161920 on   swat1-05 device  6 [0xc8] NVIDIA A100-SXM4-80GB
#  Rank 15 Group  0 Pid 1161920 on   swat1-05 device  7 [0xcb] NVIDIA A100-SXM4-80GB
swat1-04:2017630:2017630 [0] NCCL INFO Bootstrap : Using ens21f0:75.12.36.64<0>
swat1-04:2017630:2017630 [0] NCCL INFO cudaDriverVersion 12020
swat1-04:2017630:2017630 [0] NCCL INFO NCCL version 2.22.3+cuda12.6
swat1-04:2017630:2017672 [3] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
swat1-04:2017630:2017672 [3] NCCL INFO NCCL_IB_HCA set to ^mlx5_3,mlx5_4
swat1-04:2017630:2017672 [3] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_1:1/IB [2]mlx5_2:1/IB [3]mlx5_5:1/IB [RO]; OOB ens21f0:75.12.36.64<0>
swat1-04:2017630:2017675 [6] NCCL INFO Using network IB
swat1-04:2017630:2017671 [2] NCCL INFO Using network IB
swat1-04:2017630:2017672 [3] NCCL INFO Using network IB
swat1-04:2017630:2017670 [1] NCCL INFO Using network IB
swat1-04:2017630:2017669 [0] NCCL INFO Using network IB
swat1-04:2017630:2017676 [7] NCCL INFO Using network IB
swat1-04:2017630:2017673 [4] NCCL INFO Using network IB
swat1-04:2017630:2017674 [5] NCCL INFO Using network IB
swat1-04:2017630:2017673 [4] NCCL INFO ncclCommInitRank comm 0x8e48640 rank 4 nranks 16 cudaDev 4 nvmlDev 4 busId 88000 commId 0x49bbd7790311d5eb - Init START
swat1-04:2017630:2017675 [6] NCCL INFO ncclCommInitRank comm 0x8ec0ac0 rank 6 nranks 16 cudaDev 6 nvmlDev 6 busId c8000 commId 0x49bbd7790311d5eb - Init START
swat1-04:2017630:2017670 [1] NCCL INFO ncclCommInitRank comm 0x8d93f80 rank 1 nranks 16 cudaDev 1 nvmlDev 1 busId b000 commId 0x49bbd7790311d5eb - Init START
swat1-04:2017630:2017674 [5] NCCL INFO ncclCommInitRank comm 0x8e84880 rank 5 nranks 16 cudaDev 5 nvmlDev 5 busId 8b000 commId 0x49bbd7790311d5eb - Init START
swat1-04:2017630:2017672 [3] NCCL INFO ncclCommInitRank comm 0x8e0c400 rank 3 nranks 16 cudaDev 3 nvmlDev 3 busId 4c000 commId 0x49bbd7790311d5eb - Init START
swat1-04:2017630:2017676 [7] NCCL INFO ncclCommInitRank comm 0x8efcbc0 rank 7 nranks 16 cudaDev 7 nvmlDev 7 busId cb000 commId 0x49bbd7790311d5eb - Init START
swat1-04:2017630:2017669 [0] NCCL INFO ncclCommInitRank comm 0x8d57d40 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 7000 commId 0x49bbd7790311d5eb - Init START
swat1-04:2017630:2017671 [2] NCCL INFO ncclCommInitRank comm 0x8dd01c0 rank 2 nranks 16 cudaDev 2 nvmlDev 2 busId 48000 commId 0x49bbd7790311d5eb - Init START
swat1-04:2017630:2017670 [1] NCCL INFO Setting affinity for GPU 1 to ff000000,00000000,ff000000
swat1-04:2017630:2017671 [2] NCCL INFO Setting affinity for GPU 2 to ff00,00000000,0000ff00
swat1-04:2017630:2017670 [1] NCCL INFO NVLS multicast support is not available on dev 1
swat1-04:2017630:2017671 [2] NCCL INFO NVLS multicast support is not available on dev 2
swat1-04:2017630:2017673 [4] NCCL INFO Setting affinity for GPU 4 to ff000000,00000000,ff000000,00000000
swat1-04:2017630:2017673 [4] NCCL INFO NVLS multicast support is not available on dev 4
swat1-04:2017630:2017674 [5] NCCL INFO Setting affinity for GPU 5 to ff000000,00000000,ff000000,00000000
swat1-04:2017630:2017674 [5] NCCL INFO NVLS multicast support is not available on dev 5
swat1-04:2017630:2017669 [0] NCCL INFO Setting affinity for GPU 0 to ff000000,00000000,ff000000
swat1-04:2017630:2017675 [6] NCCL INFO Setting affinity for GPU 6 to ff00,00000000,0000ff00,00000000
swat1-04:2017630:2017669 [0] NCCL INFO NVLS multicast support is not available on dev 0
swat1-04:2017630:2017675 [6] NCCL INFO NVLS multicast support is not available on dev 6
swat1-04:2017630:2017672 [3] NCCL INFO Setting affinity for GPU 3 to ff00,00000000,0000ff00
swat1-04:2017630:2017676 [7] NCCL INFO Setting affinity for GPU 7 to ff00,00000000,0000ff00,00000000
swat1-04:2017630:2017676 [7] NCCL INFO NVLS multicast support is not available on dev 7
swat1-04:2017630:2017672 [3] NCCL INFO NVLS multicast support is not available on dev 3
swat1-04:2017630:2017669 [0] NCCL INFO comm 0x8d57d40 rank 0 nRanks 16 nNodes 2 localRanks 8 localRank 0 MNNVL 0
swat1-04:2017630:2017670 [1] NCCL INFO comm 0x8d93f80 rank 1 nRanks 16 nNodes 2 localRanks 8 localRank 1 MNNVL 0
swat1-04:2017630:2017671 [2] NCCL INFO comm 0x8dd01c0 rank 2 nRanks 16 nNodes 2 localRanks 8 localRank 2 MNNVL 0
swat1-04:2017630:2017675 [6] NCCL INFO comm 0x8ec0ac0 rank 6 nRanks 16 nNodes 2 localRanks 8 localRank 6 MNNVL 0
swat1-04:2017630:2017672 [3] NCCL INFO comm 0x8e0c400 rank 3 nRanks 16 nNodes 2 localRanks 8 localRank 3 MNNVL 0
swat1-04:2017630:2017674 [5] NCCL INFO comm 0x8e84880 rank 5 nRanks 16 nNodes 2 localRanks 8 localRank 5 MNNVL 0
swat1-04:2017630:2017676 [7] NCCL INFO comm 0x8efcbc0 rank 7 nRanks 16 nNodes 2 localRanks 8 localRank 7 MNNVL 0
swat1-04:2017630:2017669 [0] NCCL INFO Channel 00/08 :    0   5   4   7   6   3   2   1   8  13  12  15  14  11  10   9
swat1-04:2017630:2017669 [0] NCCL INFO Channel 01/08 :    0   5   4   7   6   3  10   9   8  13  12  15  14  11   2   1
swat1-04:2017630:2017669 [0] NCCL INFO Channel 02/08 :    0   5   4   7  14  11  10   9   8  13  12  15   6   3   2   1
swat1-04:2017630:2017670 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0
swat1-04:2017630:2017670 [1] NCCL INFO P2P Chunksize set to 131072
swat1-04:2017630:2017669 [0] NCCL INFO Channel 03/08 :    0   5  12  15  14  11  10   9   8  13   4   7   6   3   2   1
swat1-04:2017630:2017669 [0] NCCL INFO Channel 04/08 :    0   5   4   7   6   3   2   1   8  13  12  15  14  11  10   9
swat1-04:2017630:2017671 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/10/-1->2->-1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->10 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1
swat1-04:2017630:2017671 [2] NCCL INFO P2P Chunksize set to 131072
swat1-04:2017630:2017673 [4] NCCL INFO comm 0x8e48640 rank 4 nRanks 16 nNodes 2 localRanks 8 localRank 4 MNNVL 0
swat1-04:2017630:2017669 [0] NCCL INFO Channel 05/08 :    0   5   4   7   6   3  10   9   8  13  12  15  14  11   2   1
swat1-04:2017630:2017669 [0] NCCL INFO Channel 06/08 :    0   5   4   7  14  11  10   9   8  13  12  15   6   3   2   1
swat1-04:2017630:2017669 [0] NCCL INFO Channel 07/08 :    0   5  12  15  14  11  10   9   8  13   4   7   6   3   2   1
swat1-04:2017630:2017669 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] 1/-1/-1->0->5 [2] 1/-1/-1->0->5 [3] 1/-1/-1->0->5 [4] 1/-1/-1->0->8 [5] 1/-1/-1->0->5 [6] 1/-1/-1->0->5 [7] 1/-1/-1->0->5
swat1-04:2017630:2017669 [0] NCCL INFO P2P Chunksize set to 131072
swat1-04:2017630:2017672 [3] NCCL INFO Trees [0] 6/-1/-1->3->2 [1] 6/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] 6/-1/-1->3->2 [4] 6/-1/-1->3->2 [5] 6/-1/-1->3->2 [6] -1/-1/-1->3->2 [7] 6/-1/-1->3->2
swat1-04:2017630:2017672 [3] NCCL INFO P2P Chunksize set to 131072
swat1-04:2017630:2017674 [5] NCCL INFO Trees [0] -1/-1/-1->5->4 [1] 0/-1/-1->5->4 [2] 0/-1/-1->5->4 [3] 0/-1/-1->5->4 [4] -1/-1/-1->5->4 [5] 0/-1/-1->5->4 [6] 0/-1/-1->5->4 [7] 0/-1/-1->5->4
swat1-04:2017630:2017674 [5] NCCL INFO P2P Chunksize set to 131072
swat1-04:2017630:2017675 [6] NCCL INFO Trees [0] 7/-1/-1->6->3 [1] 7/-1/-1->6->3 [2] 7/14/-1->6->-1 [3] 7/-1/-1->6->3 [4] 7/-1/-1->6->3 [5] 7/-1/-1->6->3 [6] 7/-1/-1->6->14 [7] 7/-1/-1->6->3
swat1-04:2017630:2017675 [6] NCCL INFO P2P Chunksize set to 131072
swat1-04:2017630:2017676 [7] NCCL INFO Trees [0] 4/-1/-1->7->6 [1] 4/-1/-1->7->6 [2] 4/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] 4/-1/-1->7->6 [5] 4/-1/-1->7->6 [6] 4/-1/-1->7->6 [7] -1/-1/-1->7->6
swat1-04:2017630:2017676 [7] NCCL INFO P2P Chunksize set to 131072
swat1-04:2017630:2017673 [4] NCCL INFO Trees [0] 5/-1/-1->4->7 [1] 5/-1/-1->4->7 [2] 5/-1/-1->4->7 [3] 5/12/-1->4->-1 [4] 5/-1/-1->4->7 [5] 5/-1/-1->4->7 [6] 5/-1/-1->4->7 [7] 5/-1/-1->4->12
swat1-04:2017630:2017673 [4] NCCL INFO P2P Chunksize set to 131072
swat1-04:2017630:2017671 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
swat1-04:2017630:2017671 [2] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
swat1-04:2017630:2017674 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
swat1-04:2017630:2017674 [5] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
swat1-04:2017630:2017673 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
swat1-04:2017630:2017673 [4] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
swat1-04:2017630:2017669 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
swat1-04:2017630:2017669 [0] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
swat1-04:2017630:2017669 [0] NCCL INFO CC Off, Multi-GPU CC Off, workFifoBytes 1048576
swat1-04:2017630:2017675 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
swat1-04:2017630:2017675 [6] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
swat1-04:2017630:2017672 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
swat1-04:2017630:2017672 [3] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
swat1-04:2017630:2017670 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
swat1-04:2017630:2017670 [1] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
swat1-04:2017630:2017676 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
swat1-04:2017630:2017676 [7] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 2 p2p channels per peer
swat1-04:2017630:2017671 [2] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
swat1-04:2017630:2017671 [2] NCCL INFO ncclCommInitRank comm 0x8dd01c0 rank 2 nranks 16 cudaDev 2 nvmlDev 2 busId 48000 commId 0x49bbd7790311d5eb - Init COMPLETE
swat1-04:2017630:2017671 [2] NCCL INFO Init timings: rank 2 nranks 16 total 1.08 (kernels 0.51, bootstrap 0.22, allgathers 0.01, topo 0.28, graphs 0.04, connections 0.02, rest 0.01)
swat1-04:2017630:2017675 [6] NCCL INFO ncclCommInitRank comm 0x8ec0ac0 rank 6 nranks 16 cudaDev 6 nvmlDev 6 busId c8000 commId 0x49bbd7790311d5eb - Init COMPLETE
swat1-04:2017630:2017675 [6] NCCL INFO Init timings: rank 6 nranks 16 total 1.08 (kernels 0.50, bootstrap 0.22, allgathers 0.00, topo 0.28, graphs 0.04, connections 0.02, rest 0.00)
swat1-04:2017630:2017673 [4] NCCL INFO ncclCommInitRank comm 0x8e48640 rank 4 nranks 16 cudaDev 4 nvmlDev 4 busId 88000 commId 0x49bbd7790311d5eb - Init COMPLETE
swat1-04:2017630:2017669 [0] NCCL INFO ncclCommInitRank comm 0x8d57d40 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId 7000 commId 0x49bbd7790311d5eb - Init COMPLETE
swat1-04:2017630:2017669 [0] NCCL INFO Init timings: rank 0 nranks 16 total 1.08 (kernels 0.51, bootstrap 0.22, allgathers 0.00, topo 0.28, graphs 0.04, connections 0.02, rest 0.00)
swat1-04:2017630:2017670 [1] NCCL INFO ncclCommInitRank comm 0x8d93f80 rank 1 nranks 16 cudaDev 1 nvmlDev 1 busId b000 commId 0x49bbd7790311d5eb - Init COMPLETE
swat1-04:2017630:2017674 [5] NCCL INFO ncclCommInitRank comm 0x8e84880 rank 5 nranks 16 cudaDev 5 nvmlDev 5 busId 8b000 commId 0x49bbd7790311d5eb - Init COMPLETE
swat1-04:2017630:2017676 [7] NCCL INFO ncclCommInitRank comm 0x8efcbc0 rank 7 nranks 16 cudaDev 7 nvmlDev 7 busId cb000 commId 0x49bbd7790311d5eb - Init COMPLETE
swat1-04:2017630:2017676 [7] NCCL INFO Init timings: rank 7 nranks 16 total 1.08 (kernels 0.51, bootstrap 0.22, allgathers 0.00, topo 0.28, graphs 0.04, connections 0.02, rest 0.00)
swat1-04:2017630:2017672 [3] NCCL INFO ncclCommInitRank comm 0x8e0c400 rank 3 nranks 16 cudaDev 3 nvmlDev 3 busId 4c000 commId 0x49bbd7790311d5eb - Init COMPLETE
swat1-04:2017630:2017672 [3] NCCL INFO Init timings: rank 3 nranks 16 total 1.08 (kernels 0.50, bootstrap 0.23, allgathers 0.00, topo 0.28, graphs 0.04, connections 0.02, rest 0.00)
swat1-04:2017630:2017673 [4] NCCL INFO Init timings: rank 4 nranks 16 total 1.08 (kernels 0.50, bootstrap 0.22, allgathers 0.02, topo 0.28, graphs 0.03, connections 0.02, rest 0.00)
swat1-04:2017630:2017670 [1] NCCL INFO Init timings: rank 1 nranks 16 total 1.08 (kernels 0.51, bootstrap 0.22, allgathers 0.00, topo 0.28, graphs 0.04, connections 0.02, rest 0.00)
swat1-04:2017630:2017674 [5] NCCL INFO Init timings: rank 5 nranks 16 total 1.08 (kernels 0.51, bootstrap 0.22, allgathers 0.02, topo 0.28, graphs 0.03, connections 0.02, rest 0.00)
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
swat1-04:2017630:2017700 [4] NCCL INFO Channel 00/0 : 4[4] -> 7[7] via P2P/direct pointer/read
swat1-04:2017630:2017704 [0] NCCL INFO Channel 00/0 : 0[0] -> 5[5] via P2P/direct pointer/read
swat1-04:2017630:2017700 [4] NCCL INFO Channel 01/0 : 4[4] -> 7[7] via P2P/direct pointer/read
swat1-04:2017630:2017704 [0] NCCL INFO Channel 01/0 : 0[0] -> 5[5] via P2P/direct pointer/read
swat1-04:2017630:2017700 [4] NCCL INFO Channel 02/0 : 4[4] -> 7[7] via P2P/direct pointer/read
swat1-04:2017630:2017704 [0] NCCL INFO Channel 02/0 : 0[0] -> 5[5] via P2P/direct pointer/read
swat1-04:2017630:2017700 [4] NCCL INFO Channel 03/0 : 4[4] -> 7[7] via P2P/direct pointer/read
swat1-04:2017630:2017704 [0] NCCL INFO Channel 03/0 : 0[0] -> 5[5] via P2P/direct pointer/read
swat1-04:2017630:2017700 [4] NCCL INFO Channel 04/0 : 4[4] -> 7[7] via P2P/direct pointer/read
swat1-04:2017630:2017704 [0] NCCL INFO Channel 04/0 : 0[0] -> 5[5] via P2P/direct pointer/read
swat1-04:2017630:2017700 [4] NCCL INFO Channel 05/0 : 4[4] -> 7[7] via P2P/direct pointer/read
swat1-04:2017630:2017704 [0] NCCL INFO Channel 05/0 : 0[0] -> 5[5] via P2P/direct pointer/read
swat1-04:2017630:2017700 [4] NCCL INFO Channel 06/0 : 4[4] -> 7[7] via P2P/direct pointer/read
swat1-04:2017630:2017704 [0] NCCL INFO Channel 06/0 : 0[0] -> 5[5] via P2P/direct pointer/read
swat1-04:2017630:2017700 [4] NCCL INFO Channel 07/0 : 4[4] -> 7[7] via P2P/direct pointer/read
swat1-04:2017630:2017703 [1] NCCL INFO Channel 00/0 : 1[1] -> 8[0] [send] via NET/IB/1
swat1-04:2017630:2017704 [0] NCCL INFO Channel 07/0 : 0[0] -> 5[5] via P2P/direct pointer/read
swat1-04:2017630:2017703 [1] NCCL INFO Channel 04/0 : 1[1] -> 8[0] [send] via NET/IB/1
swat1-04:2017630:2017701 [3] NCCL INFO Channel 01/0 : 3[3] -> 10[2] [send] via NET/IB/0
swat1-04:2017630:2017698 [6] NCCL INFO Channel 02/0 : 15[7] -> 6[6] [receive] via NET/IB/2
swat1-04:2017630:2017701 [3] NCCL INFO Channel 05/0 : 3[3] -> 10[2] [send] via NET/IB/0
swat1-04:2017630:2017698 [6] NCCL INFO Channel 06/0 : 15[7] -> 6[6] [receive] via NET/IB/2
swat1-04:2017630:2017698 [6] NCCL INFO Channel 00/0 : 6[6] -> 3[3] via P2P/direct pointer/read
swat1-04:2017630:2017698 [6] NCCL INFO Channel 01/0 : 6[6] -> 3[3] via P2P/direct pointer/read
swat1-04:2017630:2017698 [6] NCCL INFO Channel 02/0 : 6[6] -> 3[3] via P2P/direct pointer/read
swat1-04:2017630:2017698 [6] NCCL INFO Channel 03/0 : 6[6] -> 3[3] via P2P/direct pointer/read
swat1-04:2017630:2017697 [7] NCCL INFO Channel 02/0 : 7[7] -> 14[6] [send] via NET/IB/2
swat1-04:2017630:2017704 [0] NCCL INFO Channel 00/0 : 9[1] -> 0[0] [receive] via NET/IB/1
swat1-04:2017630:2017697 [7] NCCL INFO Channel 06/0 : 7[7] -> 14[6] [send] via NET/IB/2
swat1-04:2017630:2017704 [0] NCCL INFO Channel 04/0 : 9[1] -> 0[0] [receive] via NET/IB/1
swat1-04:2017630:2017698 [6] NCCL INFO Channel 04/0 : 6[6] -> 3[3] via P2P/direct pointer/read
swat1-04:2017630:2017697 [7] NCCL INFO Channel 00/0 : 7[7] -> 6[6] via P2P/direct pointer/read
swat1-04:2017630:2017699 [5] NCCL INFO Channel 03/0 : 5[5] -> 12[4] [send] via NET/IB/3
swat1-04:2017630:2017700 [4] NCCL INFO Channel 03/0 : 13[5] -> 4[4] [receive] via NET/IB/3
swat1-04:2017630:2017698 [6] NCCL INFO Channel 05/0 : 6[6] -> 3[3] via P2P/direct pointer/read
swat1-04:2017630:2017699 [5] NCCL INFO Channel 07/0 : 5[5] -> 12[4] [send] via NET/IB/3
swat1-04:2017630:2017700 [4] NCCL INFO Channel 07/0 : 13[5] -> 4[4] [receive] via NET/IB/3
swat1-04:2017630:2017697 [7] NCCL INFO Channel 01/0 : 7[7] -> 6[6] via P2P/direct pointer/read
swat1-04:2017630:2017699 [5] NCCL INFO Channel 00/0 : 5[5] -> 4[4] via P2P/direct pointer/read
swat1-04:2017630:2017698 [6] NCCL INFO Channel 06/0 : 6[6] -> 3[3] via P2P/direct pointer/read
swat1-04:2017630:2017697 [7] NCCL INFO Channel 03/0 : 7[7] -> 6[6] via P2P/direct pointer/read
swat1-04:2017630:2017699 [5] NCCL INFO Channel 01/0 : 5[5] -> 4[4] via P2P/direct pointer/read
swat1-04:2017630:2017698 [6] NCCL INFO Channel 07/0 : 6[6] -> 3[3] via P2P/direct pointer/read
swat1-04:2017630:2017697 [7] NCCL INFO Channel 04/0 : 7[7] -> 6[6] via P2P/direct pointer/read
swat1-04:2017630:2017699 [5] NCCL INFO Channel 02/0 : 5[5] -> 4[4] via P2P/direct pointer/read
swat1-04:2017630:2017701 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/direct pointer/read
swat1-04:2017630:2017702 [2] NCCL INFO Channel 01/0 : 11[3] -> 2[2] [receive] via NET/IB/0
swat1-04:2017630:2017697 [7] NCCL INFO Channel 05/0 : 7[7] -> 6[6] via P2P/direct pointer/read
swat1-04:2017630:2017702 [2] NCCL INFO Channel 05/0 : 11[3] -> 2[2] [receive] via NET/IB/0
swat1-04:2017630:2017699 [5] NCCL INFO Channel 04/0 : 5[5] -> 4[4] via P2P/direct pointer/read
swat1-04:2017630:2017701 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/direct pointer/read
swat1-04:2017630:2017703 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/direct pointer/read
swat1-04:2017630:2017699 [5] NCCL INFO Channel 05/0 : 5[5] -> 4[4] via P2P/direct pointer/read
swat1-04:2017630:2017701 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/direct pointer/read
swat1-04:2017630:2017703 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/direct pointer/read
swat1-04:2017630:2017697 [7] NCCL INFO Channel 07/0 : 7[7] -> 6[6] via P2P/direct pointer/read
swat1-04:2017630:2017699 [5] NCCL INFO Channel 06/0 : 5[5] -> 4[4] via P2P/direct pointer/read
swat1-04:2017630:2017703 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/direct pointer/read
swat1-04:2017630:2017701 [3] NCCL INFO Channel 04/0 : 3[3] -> 2[2] via P2P/direct pointer/read
swat1-04:2017630:2017703 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/direct pointer/read
swat1-04:2017630:2017701 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/direct pointer/read
swat1-04:2017630:2017701 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/direct pointer/read
swat1-04:2017630:2017703 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/direct pointer/read
swat1-04:2017630:2017703 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/direct pointer/read
swat1-04:2017630:2017702 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/direct pointer/read
swat1-04:2017630:2017702 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/direct pointer/read
swat1-04:2017630:2017702 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/direct pointer/read
swat1-04:2017630:2017702 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/direct pointer/read
swat1-04:2017630:2017702 [2] NCCL INFO Channel 04/0 : 2[2] -> 1[1] via P2P/direct pointer/read
swat1-04:2017630:2017702 [2] NCCL INFO Channel 05/0 : 2[2] -> 1[1] via P2P/direct pointer/read
swat1-04:2017630:2017702 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/direct pointer/read
swat1-04:2017630:2017702 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/direct pointer/read
swat1-04:2017630:2017699 [5] NCCL INFO Connected all rings
swat1-04:2017630:2017700 [4] NCCL INFO Connected all rings
swat1-04:2017630:2017704 [0] NCCL INFO Connected all rings
swat1-04:2017630:2017703 [1] NCCL INFO Connected all rings
swat1-04:2017630:2017697 [7] NCCL INFO Connected all rings
swat1-04:2017630:2017698 [6] NCCL INFO Connected all rings
swat1-04:2017630:2017702 [2] NCCL INFO Connected all rings
swat1-04:2017630:2017701 [3] NCCL INFO Connected all rings
    16777216        524288      half    none      -1    565.5   29.67   27.81      0    539.2   31.11   29.17      0
    33554432       1048576      half    none      -1    933.6   35.94   33.70      0    921.9   36.40   34.12      0
    67108864       2097152      half    none      -1   1783.1   37.64   35.28      0   1775.4   37.80   35.44      0
   134217728       4194304      half    none      -1   3600.1   37.28   34.95      0   3583.7   37.45   35.11      0
   268435456       8388608      half    none      -1   7231.3   37.12   34.80      0   7248.6   37.03   34.72      0
   536870912      16777216      half    none      -1    14909   36.01   33.76      0    15052   35.67   33.44      0
  1073741824      33554432      half    none      -1    31408   34.19   32.05      0    31294   34.31   32.17      0
swat1-04:2017630:2017630 [0] NCCL INFO comm 0x8d57d40 rank 0 nranks 16 cudaDev 0 busId 7000 - Destroy COMPLETE
swat1-04:2017630:2017630 [7] NCCL INFO comm 0x8efcbc0 rank 7 nranks 16 cudaDev 7 busId cb000 - Destroy COMPLETE
swat1-04:2017630:2017630 [6] NCCL INFO comm 0x8ec0ac0 rank 6 nranks 16 cudaDev 6 busId c8000 - Destroy COMPLETE
swat1-04:2017630:2017630 [5] NCCL INFO comm 0x8e84880 rank 5 nranks 16 cudaDev 5 busId 8b000 - Destroy COMPLETE
swat1-04:2017630:2017630 [4] NCCL INFO comm 0x8e48640 rank 4 nranks 16 cudaDev 4 busId 88000 - Destroy COMPLETE
swat1-04:2017630:2017630 [3] NCCL INFO comm 0x8e0c400 rank 3 nranks 16 cudaDev 3 busId 4c000 - Destroy COMPLETE
swat1-04:2017630:2017630 [2] NCCL INFO comm 0x8dd01c0 rank 2 nranks 16 cudaDev 2 busId 48000 - Destroy COMPLETE
swat1-04:2017630:2017630 [1] NCCL INFO comm 0x8d93f80 rank 1 nranks 16 cudaDev 1 busId b000 - Destroy COMPLETE
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 33.3227 
#
AddyLaddy commented 3 months ago

I don't see GDRDMA being enabled on those IB NICs. Is that to be expected? Have you loaded the nvidia-peermem module or used a DMA-BUF enabled GPU driver + Kernel?

JuiceLemonLemon commented 3 months ago

> I don't see GDRDMA being enabled on those IB NICs. Is that to be expected? Have you loaded the nvidia-peermem module or used a DMA-BUF enabled GPU driver + Kernel?

Sorry, I'm not familiar with GDRDMA. Can you tell me how to enable it?

AddyLaddy commented 3 months ago

https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#gpu-to-nic-communication

But you also need to make sure ACS is not enabled unless you're using a VM environment.
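A hedged sketch of how one might check whether GPU Direct RDMA is in play (the module name follows the linked docs; the sample log line is a hypothetical illustration, not output from this run):

```shell
# Load the peer-memory module (requires root and the NVIDIA driver):
#   sudo modprobe nvidia-peermem
#   lsmod | grep nvidia_peermem       # confirm it is loaded
# When GDRDMA is active, NCCL's IB transport lines gain a /GDRDMA suffix.
# Hypothetical captured log line for illustration:
sample_log='NCCL INFO Channel 00/0 : 1[1] -> 8[0] [send] via NET/IB/1/GDRDMA'
printf '%s\n' "$sample_log" | grep -c 'GDRDMA'    # prints: 1
```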

JuiceLemonLemon commented 3 months ago

> https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#gpu-to-nic-communication
>
> But you also need to make sure ACS is not enabled unless you're using a VM environment.

After enabling GDRDMA, the IB bandwidth reaches 93 GB/s with 1 GB of communication data. I think it's normal now. Thank you very much for your help.