NVIDIA / nccl-tests


Origin of Poor Internode NCCL Performance #152

Closed: vitduck closed this issue 1 year ago

vitduck commented 1 year ago

Hello,

I would appreciate it if you could share some insights on the origin of the poor internode alltoall performance on our A100 system.

# nThread 1 nGpus 1 minBytes 16 maxBytes 4294967296 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices

Rank 0 Group 0 Pid 47689 on gpu32 device 0 [0x07] NVIDIA A100-SXM4-80GB

Rank 1 Group 0 Pid 47690 on gpu32 device 1 [0x0b] NVIDIA A100-SXM4-80GB

Rank 2 Group 0 Pid 47691 on gpu32 device 2 [0x48] NVIDIA A100-SXM4-80GB

Rank 3 Group 0 Pid 47692 on gpu32 device 3 [0x4c] NVIDIA A100-SXM4-80GB

Rank 4 Group 0 Pid 47693 on gpu32 device 4 [0x88] NVIDIA A100-SXM4-80GB

Rank 5 Group 0 Pid 47694 on gpu32 device 5 [0x8b] NVIDIA A100-SXM4-80GB

Rank 6 Group 0 Pid 47695 on gpu32 device 6 [0xc8] NVIDIA A100-SXM4-80GB

Rank 7 Group 0 Pid 47696 on gpu32 device 7 [0xcb] NVIDIA A100-SXM4-80GB

Rank 8 Group 0 Pid 40144 on gpu37 device 0 [0x07] NVIDIA A100-SXM4-80GB

Rank 9 Group 0 Pid 40145 on gpu37 device 1 [0x0b] NVIDIA A100-SXM4-80GB

Rank 10 Group 0 Pid 40146 on gpu37 device 2 [0x48] NVIDIA A100-SXM4-80GB

Rank 11 Group 0 Pid 40147 on gpu37 device 3 [0x4c] NVIDIA A100-SXM4-80GB

Rank 12 Group 0 Pid 40148 on gpu37 device 4 [0x88] NVIDIA A100-SXM4-80GB

Rank 13 Group 0 Pid 40149 on gpu37 device 5 [0x8b] NVIDIA A100-SXM4-80GB

Rank 14 Group 0 Pid 40150 on gpu37 device 6 [0xc8] NVIDIA A100-SXM4-80GB

Rank 15 Group 0 Pid 40151 on gpu37 device 7 [0xcb] NVIDIA A100-SXM4-80GB

NCCL version 2.11.4+cuda11.4
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
      16             4     float     sum      -1    37.56    0.00    0.00      0    32.75    0.00    0.00      0
      64            16     float     sum      -1    30.83    0.00    0.00      0    31.25    0.00    0.00      0
     256            64     float     sum      -1    33.02    0.01    0.01      0    32.71    0.01    0.01      0
    1024           256     float     sum      -1    37.07    0.03    0.05      0    35.57    0.03    0.05      0
    4096          1024     float     sum      -1    43.97    0.09    0.17      0    42.08    0.10    0.18      0
   16384          4096     float     sum      -1    49.54    0.33    0.62      0    46.26    0.35    0.66      0
   65536         16384     float     sum      -1    80.40    0.82    1.53      0    77.16    0.85    1.59      0
  262144         65536     float     sum      -1    85.27    3.07    5.76      0    84.93    3.09    5.79      0
 1048576        262144     float     sum      -1    147.9    7.09   13.29      0    147.5    7.11   13.33      0
 4194304       1048576     float     sum      -1    372.9   11.25   21.09      0    370.8   11.31   21.21      0
16777216       4194304     float     sum      -1   1291.1   12.99   24.37      0   1285.9   13.05   24.46      0
67108864      16777216     float     sum      -1   3917.9   17.13   32.12      0   3914.7   17.14   32.14      0

   268435456      67108864     float     sum      -1    15497   17.32   32.48      0    15415   17.41   32.65      0
  1073741824     268435456     float     sum      -1    62740   17.11   32.09      0    62765   17.11   32.08      0
  4294967296    1073741824     float     sum      -1   270855   15.86   29.73      0   271182   15.84   29.70      0

# Out of bounds values : 0 OK
# Avg bus bandwidth    : 12.9065
#

- alltoall_perf: 4 GB/s vs. a theoretical 100 GB/s (25 GB/s per HDR NIC)

$ mpirun \
    -np 16 \
    --hostfile hostfile_16 \
    ../build/alltoall_perf \
    -b 16 -e 4G -f 4 -g 1

# nThread 1 nGpus 1 minBytes 16 maxBytes 4294967296 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices

Rank 0 Group 0 Pid 34180 on gpu32 device 0 [0x07] NVIDIA A100-SXM4-80GB

Rank 1 Group 0 Pid 34181 on gpu32 device 1 [0x0b] NVIDIA A100-SXM4-80GB

Rank 2 Group 0 Pid 34182 on gpu32 device 2 [0x48] NVIDIA A100-SXM4-80GB

Rank 3 Group 0 Pid 34183 on gpu32 device 3 [0x4c] NVIDIA A100-SXM4-80GB

Rank 4 Group 0 Pid 34184 on gpu32 device 4 [0x88] NVIDIA A100-SXM4-80GB

Rank 5 Group 0 Pid 34185 on gpu32 device 5 [0x8b] NVIDIA A100-SXM4-80GB

Rank 6 Group 0 Pid 34186 on gpu32 device 6 [0xc8] NVIDIA A100-SXM4-80GB

Rank 7 Group 0 Pid 34187 on gpu32 device 7 [0xcb] NVIDIA A100-SXM4-80GB

Rank 8 Group 0 Pid 23210 on gpu37 device 0 [0x07] NVIDIA A100-SXM4-80GB

Rank 9 Group 0 Pid 23211 on gpu37 device 1 [0x0b] NVIDIA A100-SXM4-80GB

Rank 10 Group 0 Pid 23212 on gpu37 device 2 [0x48] NVIDIA A100-SXM4-80GB

Rank 11 Group 0 Pid 23213 on gpu37 device 3 [0x4c] NVIDIA A100-SXM4-80GB

Rank 12 Group 0 Pid 23214 on gpu37 device 4 [0x88] NVIDIA A100-SXM4-80GB

Rank 13 Group 0 Pid 23215 on gpu37 device 5 [0x8b] NVIDIA A100-SXM4-80GB

Rank 14 Group 0 Pid 23216 on gpu37 device 6 [0xc8] NVIDIA A100-SXM4-80GB

Rank 15 Group 0 Pid 23217 on gpu37 device 7 [0xcb] NVIDIA A100-SXM4-80GB

NCCL version 2.11.4+cuda11.4
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
       0             0     float    none      -1    31.45    0.00    0.00      0    30.51    0.00    0.00    N/A
      64             1     float    none      -1    46.04    0.00    0.00      0    45.30    0.00    0.00    N/A
     256             4     float    none      -1    45.12    0.01    0.01      0    44.73    0.01    0.01    N/A
    1024            16     float    none      -1    44.30    0.02    0.02      0    44.32    0.02    0.02    N/A
    4096            64     float    none      -1    44.70    0.09    0.09      0    44.58    0.09    0.09    N/A
   16384           256     float    none      -1    45.27    0.36    0.34      0    46.87    0.35    0.33    N/A
   65536          1024     float    none      -1    49.23    1.33    1.25      0    48.84    1.34    1.26    N/A
  262144          4096     float    none      -1    108.2    2.42    2.27      0    104.4    2.51    2.35    N/A
 1048576         16384     float    none      -1    355.2    2.95    2.77      0    359.9    2.91    2.73    N/A
 4194304         65536     float    none      -1   1286.8    3.26    3.06      0   1291.0    3.25    3.05    N/A
16777216        262144     float    none      -1   4834.9    3.47    3.25      0   4827.9    3.48    3.26    N/A
67108864       1048576     float    none      -1    15845    4.24    3.97      0    15747    4.26    4.00    N/A

   268435456       4194304     float    none      -1    57613    4.66    4.37      0    57768    4.65    4.36    N/A
  1073741824      16777216     float    none      -1   227992    4.71    4.42      0   228250    4.70    4.41    N/A
  4294967296      67108864     float    none      -1   934563    4.60    4.31      0   935303    4.59    4.31    N/A

# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.00892
#


Thanks for reading. 

Regards. 
sjeaugey commented 1 year ago

My guess would be that the gpu37 node does not have nv_peer_mem loaded.

[Excellent bug report btw; thanks a lot for taking the time to gather the debug log and topology dump -- it really helped!]
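
(For reference, one quick way to check this on a node is to look for the peer-memory module with lsmod. This is just a generic check, not part of the original report; the module may show up as nv_peer_mem or nvidia_peermem depending on the driver stack, as discussed later in this thread.)

$ lsmod | grep -e nv_peer_mem -e nvidia_peermem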

vitduck commented 1 year ago

Thanks very much for pointing out the issue.

Indeed, nv_peer_mem was not loaded on gpu37, which was a huge oversight on our part. We would appreciate it if you could give some further clarification regarding the log file.

1. Is the number of channels listed below an intrinsic property of NVLink3? Can we somehow 'visualize' these channels?

 gpu32:22612:22693 [0] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer

2. From which part of the log file can we deduce that GPUDirect RDMA is missing?

gpu32:22612:22693 [0] NCCL INFO Channel 00 : 0[7000] -> 1[b000] via P2P/IPC/read

3. Does UCX affect the result of the test? My understanding is that UCX should have no effect. There is some overlap between UCX and NCCL: the former also provides CUDA-optimized transports based on CUDA IPC and GPUDirect RDMA. Here, however, NCCL only requires MPI for the basic communication setup, i.e. it is initialized well ahead of the UCX components.

4. How should we interpret the discrepancy between AlltoAll and AllReduce?
With nv_peer_mem correctly loaded, we achieve respectable bandwidths w.r.t. your S51111 presentation at GTC 2023.

#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
         ...
  1073741824      16777216     float    none      -1    51414   20.88   19.58      0    51191   20.98   19.66    N/A
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
         ...
  1073741824     268435456     float     sum      -1    22490   47.74   89.52      0    22465   47.80   89.62      0

These are for AlltoAll and AllReduce, respectively. The AllReduce bandwidth aligns well with the NICs' aggregate bandwidth of 100 GB/s. Is there a fundamental reason why AlltoAll cannot fully saturate the NICs?

5. Are nv_peer_mem.ko and nvidia-peermem.ko interchangeable? This is another point that confused us in the official GPUDirect documentation. The former is provided by Mellanox and hosted on GitHub; as we understand it, it requires MLNX_OFED 5.1. The latter has been provided by the NVIDIA driver since v470 and, unlike the Mellanox module, appears to support older MLNX_OFED releases. Please advise which one we should use, since development has stopped on Mellanox's end.

Thanks.

sjeaugey commented 1 year ago
  1. No, channels are just a way for NCCL to split operations onto multiple SMs, each channel potentially following a different route (using a different NIC, a different path through NVLinks, etc.). On NVSwitch systems, each SM's traffic is spread onto all NVLinks, so we just need to use enough SMs to reach peak bandwidth.
  2. If, for a given node, not a single path says /GDRDMA, then it's a clear sign there is no GPU Direct RDMA; that's what made me think GPU Direct RDMA was missing. As for "Then Shared indicates that a non-RDMA channel was used": no, it just means the connection is using shared buffers (for alltoall, we do not allocate one buffer per GPU pair but use a shared pool instead). A debug sketch for checking this follows after this list.
  3. Indeed NCCL and UCX are independent.
  4. Assuming the ~20 GB/s is alltoall and ~89 GB/s is allreduce, this is decent performance for a run on 2 nodes. Alltoall cannot aggregate the bandwidth of all NICs: each GPU is sending data independently from the others, so there is no data merging we can do. Therefore the alltoall bandwidth is usually bottlenecked by the NIC bandwidth per GPU; in your case you have 24 GB/s per 2 GPUs, hence 12 GB/s per GPU. Given that on 2 nodes 50% of the data goes through NVLink and 50% goes through the network, the perceived performance of alltoall will be at most 24 GB/s. As you scale to a larger number of nodes, the bandwidth should go down towards 12 GB/s. I'd run allreduce from 8B to 16G to really see what the peak bandwidth is; it should be able to reach 96 GB/s on an IB system. 89 is a bit sub-par, but it could be due to the 1G size, or to the fact you only run one size and the clocks don't have time to ramp up.
  5. I think you're right, nv_peer_mem was the former module which we used for a very long time but which should be replaced by nvidia-peermem. I'm not authoritative on this point however; that question should be asked to the network support team.
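
As a side note on points 1 and 2: one way to "visualize" the channels and to check for /GDRDMA paths is to rerun with NCCL's debug logging enabled and grep the output. This is only a generic sketch reusing the command from this thread; NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables, and the -x flags assume Open MPI.

$ mpirun -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,GRAPH \
    -np 16 --hostfile hostfile_16 \
    ../build/alltoall_perf -b 16 -e 4G -f 4 -g 1 2>&1 | tee nccl_debug.log
$ grep -E 'Channel|GDRDMA' nccl_debug.log
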
vitduck commented 1 year ago

Thanks for clarification regarding (1) and (3).

> If, for a given node, not a single path says /GDRDMA, then it's a clear sign there is no GPU Direct RDMA; that's what made me think GPU Direct RDMA was missing. As for "Then Shared indicates that a non-RDMA channel was used": no, it just means the connection is using shared buffers (for alltoall, we do not allocate one buffer per GPU pair but use a shared pool instead).

I understand now, since GDRDMA was completely missing on gpu37. Indeed, when comparing the debug outputs, Shared is only present in alltoall and not in all_reduce, as you pointed out.

> I'd run allreduce from 8B to 16G to really see what the peak bandwidth is; it should be able to reach 96 GB/s on an IB system. 89 is a bit sub-par, but it could be due to the 1G size, or to the fact you only run one size and the clocks don't have time to ramp up.

Sorry for causing confusion. I scanned from 16 B up to 1 GB, but truncated the output to show only the bandwidth at 1 GB. I've rerun as suggested.

#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
          16             4     float     sum      -1    29.55    0.00    0.00      0    28.19    0.00    0.00      0
          32             8     float     sum      -1    28.70    0.00    0.00      0    26.80    0.00    0.00      0
          64            16     float     sum      -1    27.46    0.00    0.00      0    27.87    0.00    0.00      0
         128            32     float     sum      -1    28.21    0.00    0.01      0    27.45    0.00    0.01      0
         256            64     float     sum      -1    28.44    0.01    0.02      0    28.08    0.01    0.02      0
         512           128     float     sum      -1    30.89    0.02    0.03      0    31.08    0.02    0.03      0
        1024           256     float     sum      -1    31.40    0.03    0.06      0    31.53    0.03    0.06      0
        2048           512     float     sum      -1    33.84    0.06    0.11      0    33.14    0.06    0.12      0
        4096          1024     float     sum      -1    35.96    0.11    0.21      0    33.88    0.12    0.23      0
        8192          2048     float     sum      -1    38.91    0.21    0.39      0    36.70    0.22    0.42      0
       16384          4096     float     sum      -1    42.69    0.38    0.72      0    40.15    0.41    0.77      0
       32768          8192     float     sum      -1    45.31    0.72    1.36      0    45.38    0.72    1.35      0
       65536         16384     float     sum      -1    50.93    1.29    2.41      0    49.81    1.32    2.47      0
      131072         32768     float     sum      -1    71.16    1.84    3.45      0    67.73    1.94    3.63      0
      262144         65536     float     sum      -1    66.46    3.94    7.40      0    64.95    4.04    7.57      0
      524288        131072     float     sum      -1    75.59    6.94   13.01      0    75.77    6.92   12.97      0
     1048576        262144     float     sum      -1    96.34   10.88   20.41      0    96.04   10.92   20.47      0
     2097152        524288     float     sum      -1    134.4   15.60   29.25      0    120.0   17.48   32.77      0
     4194304       1048576     float     sum      -1    161.2   26.03   48.80      0    161.3   26.01   48.77      0
     8388608       2097152     float     sum      -1    232.0   36.16   67.80      0    232.5   36.08   67.65      0
    16777216       4194304     float     sum      -1    408.2   41.10   77.05      0    393.8   42.60   79.87      0
    33554432       8388608     float     sum      -1    770.8   43.53   81.62      0    794.7   42.22   79.17      0
    67108864      16777216     float     sum      -1   1380.8   48.60   91.13      0   1393.1   48.17   90.33      0
   134217728      33554432     float     sum      -1   2759.5   48.64   91.20      0   2697.1   49.76   93.31      0
   268435456      67108864     float     sum      -1   5679.0   47.27   88.63      0   5677.3   47.28   88.65      0
   536870912     134217728     float     sum      -1    11296   47.53   89.11      0    11277   47.61   89.26      0
  1073741824     268435456     float     sum      -1    22545   47.63   89.30      0    22555   47.61   89.26      0
  2147483648     536870912     float     sum      -1    44954   47.77   89.57      0    44983   47.74   89.51      0
  4294967296    1073741824     float     sum      -1    89861   47.80   89.62      0    89837   47.81   89.64      0
  8589934592    2147483648     float     sum      -1   179715   47.80   89.62      0   179663   47.81   89.65      0
 17179869184    4294967296     float     sum      -1   359659   47.77   89.56      0   359633   47.77   89.57      0

The peak bandwidth is indeed 91.2 GB/s at 128 MB with v1.1 of the GPUDirect kernel module. To reach 96 GB/s, we will try again with the latest version (v1.3).

> I think you're right, nv_peer_mem was the former module which we used for a very long time but which should be replaced by nvidia-peermem. I'm not authoritative on this point however; that question should be asked to the network support team.

We will contact NVIDIA forum for further clarification.

sjeaugey commented 1 year ago

Updating GPU Direct RDMA won't help performance: it either works or it doesn't. I'd first suggest making sure the firmware of the NIC is up to date. What is the NIC type, CX-5 or CX-6? Depending on the NIC model and firmware level, you may need to increase MAX_ACC_OUT_READ to 44. You can check this page for how to set that, or simply reach out to network support.
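
For reference, a minimal sketch of how MAX_ACC_OUT_READ is typically queried and set with the Mellanox firmware tools (mst/mlxconfig). The /dev/mst device path below is only an example (run mst status to find yours), a firmware reset or reboot is needed for the new value to take effect, and the linked page or your network support team should be the authoritative reference:

$ sudo mst start
$ sudo mlxconfig -d /dev/mst/mt4123_pciconf0 query | grep MAX_ACC_OUT_READ
$ sudo mlxconfig -d /dev/mst/mt4123_pciconf0 set MAX_ACC_OUT_READ=44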

vitduck commented 1 year ago

Sorry for the belated follow-up.

We are using CX-6:

lspci  | grep Connect 
0e:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]

MAX_ACC_OUT_READ has been raised to 44 for all PCIe adapters, per your suggestion. The performance is now ~93.25 GB/s, which I believe is within the margin of error.

#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    ...
    33554432       8388608     float     sum      -1    765.9   43.81   82.15      0    753.2   44.55   83.53      0
    67108864      16777216     float     sum      -1   1367.2   49.09   92.04      0   1362.3   49.26   92.36      0
   134217728      33554432     float     sum      -1   2698.7   49.73   93.25      0   2695.6   49.79   93.36      0
   268435456      67108864     float     sum      -1   5665.7   47.38   88.84      0   5663.7   47.40   88.87      0
   536870912     134217728     float     sum      -1    11274   47.62   89.29      0    11271   47.63   89.31      0
  1073741824     268435456     float     sum      -1    22490   47.74   89.52      0    22489   47.75   89.52      0
  2147483648     536870912     float     sum      -1    44914   47.81   89.65      0    44905   47.82   89.67      0
  4294967296    1073741824     float     sum      -1    89759   47.85   89.72      0    89764   47.85   89.71      0
  8589934592    2147483648     float     sum      -1   179483   47.86   89.74      0   179523   47.85   89.72      0
 17179869184    4294967296     float     sum      -1   359425   47.80   89.62      0   359124   47.84   89.70      0

We are using 2.11.4, though. According to the release notes, 2.12 brings additional AlltoAll and AllReduce performance: https://developer.nvidia.com/blog/doubling-all2all-performance-with-nvidia-collective-communication-library-2-12/ Since we are using a 2:1 GPU:NIC topology, this does not apply. Is that a correct understanding?

vitduck commented 1 year ago

To add to the previous message, we have tested again with NCCL 2.13.4 from NVHPC 22.7.

The bandwidths for AlltoAll and AllReduce are 21 GB/s and 93 GB/s, respectively.

So we do not see a big difference from the PXN feature.
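
(As a generic A/B check, not something required here: NCCL 2.12 and later expose the NCCL_PXN_DISABLE environment variable, so the effect of PXN can be compared directly, e.g. assuming Open MPI's -x flag:)

$ mpirun -x NCCL_PXN_DISABLE=1 -np 16 --hostfile hostfile_16 \
    ../build/alltoall_perf -b 16 -e 4G -f 4 -g 1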

sjeaugey commented 1 year ago

PXN will not improve peak bandwidth if peak bandwidth is already perfect. It will only improve the time for small operations (base latency) thanks to message aggregation.

It looks like everything is good now. Feel free to close the bug if so. Thanks!

vitduck commented 1 year ago

Thanks for the explanation.

Before closing the issue, there is one more point that I hope you can clarify. I am trying to derive the formal relationship between algbw and busbw for the AlltoAll case.

Per the NCCL documentation, AlltoAll is implemented based on send/recv (see also the sketch after this snippet):

ncclGroupStart();
for (int r=0; r<nranks; r++) {
  ncclSend(sendbuff[r], sendcount, sendtype, r, comm, stream);
  ncclRecv(recvbuff[r], recvcount, recvtype, r, comm, stream);
}
ncclGroupEnd();
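
For completeness, here is a self-contained sketch of that pattern wrapped as a helper function. This is my own illustrative code, not part of the NCCL documentation: the function name allToAll, the eltbytes parameter, and the chunk-offset arithmetic are assumptions.

#include <nccl.h>
#include <cuda_runtime.h>

/* Hypothetical helper built on the documented send/recv pattern.
 * sendbuff/recvbuff are device buffers holding nranks * count elements each;
 * eltbytes is the size of one element of 'type' (e.g. 4 for ncclFloat). */
ncclResult_t allToAll(const void* sendbuff, void* recvbuff, size_t count,
                      ncclDataType_t type, size_t eltbytes,
                      ncclComm_t comm, cudaStream_t stream) {
  int nranks;
  ncclResult_t res = ncclCommCount(comm, &nranks);
  if (res != ncclSuccess) return res;

  ncclGroupStart();
  for (int r = 0; r < nranks; r++) {
    /* Exchange one count-element chunk with every peer (including self). */
    ncclSend((const char*)sendbuff + (size_t)r * count * eltbytes, count, type, r, comm, stream);
    ncclRecv((char*)recvbuff + (size_t)r * count * eltbytes, count, type, r, comm, stream);
  }
  return ncclGroupEnd();
}

Grouping the sends and receives between ncclGroupStart()/ncclGroupEnd() is what allows NCCL to progress all of the point-to-point operations together instead of serializing (and potentially deadlocking) the loop.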

From C. Wooley's presentation, NCCL: Accelerated Multi-GPU Collective Communication:

[slide image]

However, from your previous explanation:

> Given that on 2 nodes 50% of the data goes through NVLink and 50% goes through the network, the perceived performance of alltoall will be at most 24 GB/s. As you scale to a larger number of nodes, the bandwidth should go down towards 12 GB/s.

The correct coefficient should be $k/(k-1)$ instead.

sjeaugey commented 1 year ago

The coefficient is k/(k-1) as NCCL perf tests do not assume any specific topology; instead they consider flat topologies.

That works great on a single node with e.g. NVSwitch, where all GPUs access all others through NVSwitch. But when running on e.g. 2 nodes, the reality is that 7/16 of the traffic goes through NVLink and is not the bottleneck, while 8/16 goes through the network and is the bottleneck. So in that case, I'd look at the algbw and divide it by 2 to get the real bandwidth on the network.
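
As a worked restatement of that arithmetic (my summary, using the numbers from this thread: $n = 16$ ranks on 2 nodes and roughly 12 GB/s of network bandwidth per GPU):

$$
\text{busbw} = \text{algbw} \times \frac{n-1}{n}, \qquad
\text{network share} = \frac{8}{16} = \frac{1}{2}
\;\Rightarrow\;
\text{algbw}_{\max} \approx \frac{12~\text{GB/s}}{1/2} = 24~\text{GB/s}.
$$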

vitduck commented 1 year ago

Thanks. I understand it clearly now.

The $\frac{n}{n-1}$ coefficient can be derived from your phenomenological explanation regarding the ratio of IB/NVLink traffic.

For reference, I have amended the derivation in my previous message.

Unless you have further comments, I would like to close this issue. Thanks very much for your time; I've learnt a lot through this valuable discussion.