NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

nccl infiniband performance #489

Open kingder opened 3 years ago

kingder commented 3 years ago

Hi, I have a similar problem to #307: two machines in a cluster connected with 200 Gb/sec InfiniBand. ibstatus shows:

Infiniband device 'mlx5_0' port 1 status:
        default gid:     fe80:0000:0000:0000:043f:7203:00f8:03ce
        base lid:        0x81
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            200 Gb/sec (4X HDR)
        link_layer:      InfiniBand

Infiniband device 'mlx5_1' port 1 status:
        default gid:     fe80:0000:0000:0000:043f:7203:00fc:8334
        base lid:        0xffff
        sm lid:          0x0
        state:           1: DOWN
        phys state:      3: Disabled
        rate:            10 Gb/sec (4X)
        link_layer:      InfiniBand

ib_send_bw shows:

---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 TX depth        : 128
 CQ Moderation   : 100
 Mtu             : 4096[B]
 Link type       : IB
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0x81 QPN 0x4262 PSN 0x50832c
 remote address: LID 0xa1 QPN 0x4a98 PSN 0xc52375
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 65536      1000             150.85             145.17             0.276896
---------------------------------------------------------------------------------------
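
(For scale, that 145 Gb/sec average corresponds to roughly 145 / 8 ≈ 18 GB/s of raw point-to-point bandwidth over this NIC.)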

nvidia-smi topo -m shows:

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    mlx5_0  mlx5_1  CPU Affinity    NUMA Affinity
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    PXB     SYS     0-63,128-191    0
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    PXB     SYS     0-63,128-191    0
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    NODE    SYS     0-63,128-191    0
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    NODE    SYS     0-63,128-191    0
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    SYS     NODE    64-127,192-255  1
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    SYS     NODE    64-127,192-255  1
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    SYS     PXB     64-127,192-255  1
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      SYS     PXB     64-127,192-255  1
mlx5_0  PXB     PXB     NODE    NODE    SYS     SYS     SYS     SYS      X      SYS
mlx5_1  SYS     SYS     SYS     SYS     NODE    NODE    PXB     PXB     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

but nccl-tests only achieves about 3 GB/s, which is far below the link bandwidth:

#
#                                                     out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1048576        262144   float     sum    319.8    3.28    3.28  0e+00    323.3    3.24    3.24  0e+00
     2097152        524288   float     sum    681.3    3.08    3.08  0e+00    674.4    3.11    3.11  0e+00
     4194304       1048576   float     sum   1439.6    2.91    2.91  0e+00   1440.8    2.91    2.91  0e+00
     8388608       2097152   float     sum   2885.7    2.91    2.91  0e+00   2921.1    2.87    2.87  0e+00
    16777216       4194304   float     sum   5586.5    3.00    3.00  0e+00   5534.1    3.03    3.03  0e+00
    33554432       8388608   float     sum    11225    2.99    2.99  0e+00    11496    2.92    2.92  0e+00
    67108864      16777216   float     sum    22871    2.93    2.93  0e+00    24755    2.71    2.71  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.99294

Attached is the detailed log of the command NCCL_NET_GDR_READ=1 NCCL_DEBUG_SUBSYS=GRAPH NCCL_DEBUG=INFO ./build/all_reduce_perf -g 1 -b 1M -e 64M -f 2: log.txt

Any ideas on what could be going wrong here?

sjeaugey commented 3 years ago

This is weird. Your nvidia-smi topo -m shows mlx5_0 and GPUs 0/1 to be on the same PCI switch, but the NCCL topology shows the GPU-NIC communication needs to go through the CPU, which explains why we get bad performance.

It would be interesting to run again on all 8 GPUs of each node and see if NCCL can use GPU Direct RDMA with GPUs close to the NIC.

Also note NCCL cannot properly detect the PCI Gen4 speeds and reverts to Gen3. This is probably because you are running an old kernel/distro (e.g. Ubuntu 16). It might not change anything in the end for this particular situation, but it could be a problem in some corner cases.
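
For reference, such a run across all 8 GPUs of both nodes could look like the following sketch (assuming nccl-tests built with MPI support, an OpenMPI-style launcher, and placeholder hostnames node1/node2; adjust to your scheduler):

# sketch only: 2 nodes x 8 GPUs, one rank per GPU, same debug flags as above
mpirun -np 16 -H node1:8,node2:8 \
    -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=GRAPH \
    ./build/all_reduce_perf -b 1M -e 64M -f 2 -g 1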

kingder commented 3 years ago

Thanks for the reply.

Yeah, we use CentOS 7.6 with kernel 3.10.0-957.el7.x86_64

After searching the issues and the NCCL troubleshooting guide, I can see two potential problems:

1. We haven't enabled GPU Direct RDMA (a quick check is sketched below).

2. ACS is enabled: "SrcValid+" shows up when grepping the output of lspci -vvv.

Could these two be the main reasons for the bad performance?

Below is the output of running on all 8 GPUs of each node: log.txt
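
For the first point, one way to check whether a GPUDirect RDMA kernel module is loaded, as a sketch (module names vary by stack: nv_peer_mem ships with MLNX_OFED, nvidia-peermem with newer NVIDIA drivers):

# is a GPUDirect RDMA module loaded?
lsmod | grep -e nv_peer_mem -e nvidia_peermem
# load whichever one your stack provides
sudo modprobe nv_peer_mem || sudo modprobe nvidia-peermem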

sjeaugey commented 3 years ago

Ah, right, that is likely the reason why the NIC-GPU distance is shown as PHB in the NCCL topology: if GPU Direct RDMA is not available, we will have to go through the CPU for NIC-GPU transfers, hence we show PHB. I misread the topology; indeed the GPU and NIC are connected through a PCI switch (PCI/13000).

Also disabling ACS will probably help performance.
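
One way to check and clear ACS on the PCIe bridges, as a sketch (the bus address is a placeholder; this needs a pciutils recent enough to know the ECAP_ACS register name, and many sites simply disable ACS in the BIOS instead):

# show ACS control state on all devices ("SrcValid+" etc. means ACS is active)
sudo lspci -vvv | grep -i ACSCtl
# clear ACS on one PCIe bridge (0000:10:00.0 is a placeholder; repeat for every bridge)
sudo setpci -s 0000:10:00.0 ECAP_ACS+0x6.w=0000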

kingder commented 3 years ago

So the lack of GPU Direct RDMA is the main reason for the poor performance here, right?

I'm not very familiar with the topology, so correct me if I'm wrong: here NIC-GPU shows PXB, so enabling GDR is required. If NIC-GPU showed PIX instead, would GDR still be a must?

Also, should we disable IOMMU as well? We previously encountered hangs / slowness when running p2pBandwidthLatencyTest on a machine with 8 GPUs, no NVLink, and IOMMU enabled; after disabling IOMMU the test worked fine. This time we have NVLink and both IOMMU and ACS enabled, and p2pBandwidthLatencyTest / nccl-tests work fine on a single machine but are slow across nodes.
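
A quick way to confirm whether the IOMMU is currently active, as a sketch:

# look for intel_iommu=on / amd_iommu=on / iommu=pt on the kernel command line
cat /proc/cmdline
# look for DMAR / IOMMU initialization messages
dmesg | grep -i -e DMAR -e IOMMU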

sjeaugey commented 3 years ago

Yes, the lack of GPU Direct RDMA is the reason for the low performance.

PXB or PIX means that GPU and NIC are connected through PCI Switches, in which case GPU Direct RDMA is a must.

Disabling IOMMU/ACS is important in general for PCI communication, so you won't have problems with NVLink, but it becomes important for the networking part because that goes over PCI.
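
For reference, the IOMMU is usually turned off (or put into passthrough mode) via kernel boot parameters; a sketch assuming GRUB2 on a CentOS 7 BIOS-boot system (the grub.cfg path differs under EFI):

# add intel_iommu=off (or iommu=pt) to GRUB_CMDLINE_LINUX in /etc/default/grub, then:
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot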

kingder commented 3 years ago

Thanks! After enabling GDR, the performance between 2 machines is much better; we get about 12 GB/s on average:

#
#                                                     out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1048576        262144   float     sum    402.9    2.60    4.88  4e-07    401.9    2.61    4.89  4e-07
     2097152        524288   float     sum    544.7    3.85    7.22  4e-07    499.3    4.20    7.88  4e-07
     4194304       1048576   float     sum    797.1    5.26    9.87  4e-07    795.3    5.27    9.89  4e-07
     8388608       2097152   float     sum   1385.5    6.05   11.35  4e-07   1383.2    6.06   11.37  4e-07
    16777216       4194304   float     sum   2241.0    7.49   14.04  4e-07   2228.7    7.53   14.11  4e-07
    33554432       8388608   float     sum   4186.7    8.01   15.03  4e-07   4196.3    8.00   14.99  4e-07
    67108864      16777216   float     sum   7382.4    9.09   17.04  4e-07   7385.8    9.09   17.04  4e-07
   134217728      33554432   float     sum    14310    9.38   17.59  4e-07    13969    9.61   18.02  4e-07
   268435456      67108864   float     sum    27716    9.69   18.16  4e-07    27177    9.88   18.52  4e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 12.8822
#

But when tested on 512 GPUs (64 nodes), the bandwidth dropped to ~7 GB/s on average:

#                                                     out-of-place                       in-place
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
     1048576        262144   float     sum   1794.9    0.58    1.17  6e-06    784.0    1.34    2.67  6e-06
     2097152        524288   float     sum   1099.9    1.91    3.81  6e-06   1099.3    1.91    3.81  6e-06
     4194304       1048576   float     sum   1653.8    2.54    5.06  6e-06   1671.6    2.51    5.01  6e-06
     8388608       2097152   float     sum   2703.0    3.10    6.19  6e-06   2685.9    3.12    6.23  6e-06
    16777216       4194304   float     sum   4868.5    3.45    6.88  6e-06   4865.6    3.45    6.88  6e-06
    33554432       8388608   float     sum   8930.8    3.76    7.50  6e-06   8860.5    3.79    7.56  6e-06
    67108864      16777216   float     sum    17292    3.88    7.75  6e-06    17657    3.80    7.59  6e-06
   134217728      33554432   float     sum    27960    4.80    9.58  9e-06    27576    4.87    9.72  9e-06
   268435456      67108864   float     sum    42719    6.28   12.54  9e-06    42176    6.36   12.70  9e-06
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 6.81374
#

Does this sound normal to you?
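
For reference, nccl-tests derives the all_reduce bus bandwidth from the algorithm bandwidth as busbw = algbw * 2 * (n - 1) / n, where n is the number of ranks, so the peaks above are consistent:

16 ranks (2 nodes):    9.69 GB/s * 2*(16-1)/16   ≈ 18.2 GB/s busbw
512 ranks (64 nodes):  6.28 GB/s * 2*(512-1)/512 ≈ 12.5 GB/s busbw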

sjeaugey commented 3 years ago

The average over different sizes does not make a lot of sense, so you should run up to 4G and see what peak BW you can achieve. Also, on two nodes we have a special case which doesn't reflect the NIC bandwidth.

So to really see what your network is capable of, you should run with NCCL_ALGO=RING and up to 4G, and take the maximum bandwidth.
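
That suggested run might look like the following sketch (same OpenMPI-style launch and placeholder hostnames as before):

# force the ring algorithm and sweep message sizes up to 4 GiB, then read off the peak busbw
mpirun -np 16 -H node1:8,node2:8 \
    -x NCCL_ALGO=RING -x NCCL_DEBUG=INFO \
    ./build/all_reduce_perf -b 8 -e 4G -f 2 -g 1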