NVIDIA / nccl-tests


Origin of Poor Internode NCCL Performance #152

Closed: vitduck closed this issue 1 year ago

vitduck commented 1 year ago

Hello,

I would appreciate it if you could share some insights on the origin of the poor internode alltoall performance on our A100 system.

# nThread 1 nGpus 1 minBytes 16 maxBytes 4294967296 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices

Rank 0 Group 0 Pid 47689 on gpu32 device 0 [0x07] NVIDIA A100-SXM4-80GB

Rank 1 Group 0 Pid 47690 on gpu32 device 1 [0x0b] NVIDIA A100-SXM4-80GB

Rank 2 Group 0 Pid 47691 on gpu32 device 2 [0x48] NVIDIA A100-SXM4-80GB

Rank 3 Group 0 Pid 47692 on gpu32 device 3 [0x4c] NVIDIA A100-SXM4-80GB

Rank 4 Group 0 Pid 47693 on gpu32 device 4 [0x88] NVIDIA A100-SXM4-80GB

Rank 5 Group 0 Pid 47694 on gpu32 device 5 [0x8b] NVIDIA A100-SXM4-80GB

Rank 6 Group 0 Pid 47695 on gpu32 device 6 [0xc8] NVIDIA A100-SXM4-80GB

Rank 7 Group 0 Pid 47696 on gpu32 device 7 [0xcb] NVIDIA A100-SXM4-80GB

Rank 8 Group 0 Pid 40144 on gpu37 device 0 [0x07] NVIDIA A100-SXM4-80GB

Rank 9 Group 0 Pid 40145 on gpu37 device 1 [0x0b] NVIDIA A100-SXM4-80GB

Rank 10 Group 0 Pid 40146 on gpu37 device 2 [0x48] NVIDIA A100-SXM4-80GB

Rank 11 Group 0 Pid 40147 on gpu37 device 3 [0x4c] NVIDIA A100-SXM4-80GB

Rank 12 Group 0 Pid 40148 on gpu37 device 4 [0x88] NVIDIA A100-SXM4-80GB

Rank 13 Group 0 Pid 40149 on gpu37 device 5 [0x8b] NVIDIA A100-SXM4-80GB

Rank 14 Group 0 Pid 40150 on gpu37 device 6 [0xc8] NVIDIA A100-SXM4-80GB

Rank 15 Group 0 Pid 40151 on gpu37 device 7 [0xcb] NVIDIA A100-SXM4-80GB

NCCL version 2.11.4+cuda11.4
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
      16             4     float     sum      -1    37.56    0.00    0.00      0    32.75    0.00    0.00      0
      64            16     float     sum      -1    30.83    0.00    0.00      0    31.25    0.00    0.00      0
     256            64     float     sum      -1    33.02    0.01    0.01      0    32.71    0.01    0.01      0
    1024           256     float     sum      -1    37.07    0.03    0.05      0    35.57    0.03    0.05      0
    4096          1024     float     sum      -1    43.97    0.09    0.17      0    42.08    0.10    0.18      0
   16384          4096     float     sum      -1    49.54    0.33    0.62      0    46.26    0.35    0.66      0
   65536         16384     float     sum      -1    80.40    0.82    1.53      0    77.16    0.85    1.59      0
  262144         65536     float     sum      -1    85.27    3.07    5.76      0    84.93    3.09    5.79      0
 1048576        262144     float     sum      -1    147.9    7.09   13.29      0    147.5    7.11   13.33      0
 4194304       1048576     float     sum      -1    372.9   11.25   21.09      0    370.8   11.31   21.21      0
16777216       4194304     float     sum      -1   1291.1   12.99   24.37      0   1285.9   13.05   24.46      0
67108864      16777216     float     sum      -1   3917.9   17.13   32.12      0   3914.7   17.14   32.14      0

   268435456      67108864     float     sum      -1    15497   17.32   32.48      0    15415   17.41   32.65      0
  1073741824     268435456     float     sum      -1    62740   17.11   32.09      0    62765   17.11   32.08      0
  4294967296    1073741824     float     sum      -1   270855   15.86   29.73      0   271182   15.84   29.70      0

# Out of bounds values : 0 OK
# Avg bus bandwidth    : 12.9065
#

- alltoall_perf: 4 GB/s vs. a theoretical 100 GB/s (25 GB/s per HDR NIC)

$ mpirun \
    -np 16 \
    --hostfile hostfile_16 \
    ../build/alltoall_perf \
    -b 16 -e 4G -f 4 -g 1

# nThread 1 nGpus 1 minBytes 16 maxBytes 4294967296 step: 4(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices

Rank 0 Group 0 Pid 34180 on gpu32 device 0 [0x07] NVIDIA A100-SXM4-80GB

Rank 1 Group 0 Pid 34181 on gpu32 device 1 [0x0b] NVIDIA A100-SXM4-80GB

Rank 2 Group 0 Pid 34182 on gpu32 device 2 [0x48] NVIDIA A100-SXM4-80GB

Rank 3 Group 0 Pid 34183 on gpu32 device 3 [0x4c] NVIDIA A100-SXM4-80GB

Rank 4 Group 0 Pid 34184 on gpu32 device 4 [0x88] NVIDIA A100-SXM4-80GB

Rank 5 Group 0 Pid 34185 on gpu32 device 5 [0x8b] NVIDIA A100-SXM4-80GB

Rank 6 Group 0 Pid 34186 on gpu32 device 6 [0xc8] NVIDIA A100-SXM4-80GB

Rank 7 Group 0 Pid 34187 on gpu32 device 7 [0xcb] NVIDIA A100-SXM4-80GB

Rank 8 Group 0 Pid 23210 on gpu37 device 0 [0x07] NVIDIA A100-SXM4-80GB

Rank 9 Group 0 Pid 23211 on gpu37 device 1 [0x0b] NVIDIA A100-SXM4-80GB

Rank 10 Group 0 Pid 23212 on gpu37 device 2 [0x48] NVIDIA A100-SXM4-80GB

Rank 11 Group 0 Pid 23213 on gpu37 device 3 [0x4c] NVIDIA A100-SXM4-80GB

Rank 12 Group 0 Pid 23214 on gpu37 device 4 [0x88] NVIDIA A100-SXM4-80GB

Rank 13 Group 0 Pid 23215 on gpu37 device 5 [0x8b] NVIDIA A100-SXM4-80GB

Rank 14 Group 0 Pid 23216 on gpu37 device 6 [0xc8] NVIDIA A100-SXM4-80GB

Rank 15 Group 0 Pid 23217 on gpu37 device 7 [0xcb] NVIDIA A100-SXM4-80GB

NCCL version 2.11.4+cuda11.4
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
       0             0     float    none      -1    31.45    0.00    0.00      0    30.51    0.00    0.00    N/A
      64             1     float    none      -1    46.04    0.00    0.00      0    45.30    0.00    0.00    N/A
     256             4     float    none      -1    45.12    0.01    0.01      0    44.73    0.01    0.01    N/A
    1024            16     float    none      -1    44.30    0.02    0.02      0    44.32    0.02    0.02    N/A
    4096            64     float    none      -1    44.70    0.09    0.09      0    44.58    0.09    0.09    N/A
   16384           256     float    none      -1    45.27    0.36    0.34      0    46.87    0.35    0.33    N/A
   65536          1024     float    none      -1    49.23    1.33    1.25      0    48.84    1.34    1.26    N/A
  262144          4096     float    none      -1    108.2    2.42    2.27      0    104.4    2.51    2.35    N/A
 1048576         16384     float    none      -1    355.2    2.95    2.77      0    359.9    2.91    2.73    N/A
 4194304         65536     float    none      -1   1286.8    3.26    3.06      0   1291.0    3.25    3.05    N/A
16777216        262144     float    none      -1   4834.9    3.47    3.25      0   4827.9    3.48    3.26    N/A
67108864       1048576     float    none      -1    15845    4.24    3.97      0    15747    4.26    4.00    N/A

   268435456       4194304     float    none      -1    57613    4.66    4.37      0    57768    4.65    4.36    N/A
  1073741824      16777216     float    none      -1   227992    4.71    4.42      0   228250    4.70    4.41    N/A
  4294967296      67108864     float    none      -1   934563    4.60    4.31      0   935303    4.59    4.31    N/A

# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.00892
#


Thanks for reading. 

Regards. 
sjeaugey commented 1 year ago

My guess would be that the gpu37 node does not have nv_peer_mem loaded.

[Excellent bug report btw; thanks a lot for taking the time to gather the debug log and topology dump -- it really helped!]
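
(For reference, one quick way to check this on a node is to look for the peer-memory module with lsmod. This is just a generic check, not part of the original report; the module may show up as nv_peer_mem or nvidia_peermem depending on the driver stack, as discussed later in this thread.)

$ lsmod | grep -e nv_peer_mem -e nvidia_peermem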

vitduck commented 1 year ago

Thanks very much for pointing out the issue.

Indeed, nv_peer_mem was not loaded on gpu37, which was a huge oversight on our part. We would appreciate it if you could give some further clarification regarding the log file.

1. Is the number of channels listed below an intrinsic property of NVLink3? Can we somehow 'visualize' these channels?

 gpu32:22612:22693 [0] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer

2. From which part of the log file can we deduce that GPUDirect RDMA is missing?

gpu32:22612:22693 [0] NCCL INFO Channel 00 : 0[7000] -> 1[b000] via P2P/IPC/read

3. Does UCX affect the result of the test? My understanding is that UCX should have no effect. There is some overlap between UCX and NCCL: the former also provides CUDA-optimized transports based on CUDA IPC and GPUDirect RDMA. Here, however, NCCL only requires MPI for the basic communication setup, i.e. it is initialized well ahead of the UCX components.

4. How should we interpret the discrepancy between AlltoAll and AllReduce?
With nv_peer_mem correctly loaded, we achieve respectable bandwidths w.r.t. your S51111 presentation at GTC 2023.

#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
         ...
  1073741824      16777216     float    none      -1    51414   20.88   19.58      0    51191   20.98   19.66    N/A
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
         ...
  1073741824     268435456     float     sum      -1    22490   47.74   89.52      0    22465   47.80   89.62      0

These are for AlltoAll and AllReduce, respectively. The AllReduce bandwidth aligns well with the NICs' aggregate bandwidth of 100 GB/s. Is there a fundamental reason why AlltoAll cannot fully saturate the NICs?

5. Are nv_peer_mem.ko and nvidia-peermem.ko interchangeable? This is another point that confused us in the official GPUDirect documentation. The former is provided by Mellanox and hosted on GitHub; as we understand it, it requires MLNX_OFED 5.1. The latter has been provided by the NVIDIA driver since v470 and, unlike the Mellanox module, appears to support older MLNX_OFED releases. Please advise which one we should use, since development has stopped on Mellanox's end.

Thanks.

sjeaugey commented 1 year ago
  1. No, channels are just a way for NCCL to split operations onto multiple SMs, each channel potentially following a different route (using a different NIC, a different path through NVLinks, etc.). On NVSwitch systems, each SM's traffic is spread onto all NVLinks, so we just need to use enough SMs to reach peak bandwidth.
  2. If, for a given node, not a single path says /GDRDMA, then it's a clear sign there is no GPU Direct RDMA; that's what made me think GPU Direct RDMA was missing. As for "Then Shared indicates that a non-RDMA channel was used": no, it just means the connection is using shared buffers (for alltoall, we do not allocate one buffer per GPU pair but use a shared pool instead). A debug sketch for checking this follows after this list.
  3. Indeed NCCL and UCX are independent.
  4. Assuming the ~20 GB/s is alltoall and ~89 GB/s is allreduce, this is decent performance for a run on 2 nodes. Alltoall cannot aggregate the bandwidth of all NICs: each GPU is sending data independently from the others, so there is no data merging we can do. Therefore the alltoall bandwidth is usually bottlenecked by the NIC bandwidth per GPU; in your case you have 24 GB/s per 2 GPUs, hence 12 GB/s per GPU. Given that on 2 nodes 50% of the data goes through NVLink and 50% goes through the network, the perceived performance of alltoall will be at most 24 GB/s. As you scale to a larger number of nodes, the bandwidth should go down towards 12 GB/s. I'd run allreduce from 8B to 16G to really see what the peak bandwidth is; it should be able to reach 96 GB/s on an IB system. 89 is a bit sub-par, but it could be due to the 1G size, or to the fact you only run one size and the clocks don't have time to ramp up.
  5. I think you're right, nv_peer_mem was the former module which we used for a very long time but which should be replaced by nvidia-peermem. I'm not authoritative on this point however; that question should be asked to the network support team.
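
As a side note on points 1 and 2: one way to "visualize" the channels and to check for /GDRDMA paths is to rerun with NCCL's debug logging enabled and grep the output. This is only a generic sketch reusing the command from this thread; NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables, and the -x flags assume Open MPI.

$ mpirun -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,GRAPH \
    -np 16 --hostfile hostfile_16 \
    ../build/alltoall_perf -b 16 -e 4G -f 4 -g 1 2>&1 | tee nccl_debug.log
$ grep -E 'Channel|GDRDMA' nccl_debug.log
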
vitduck commented 1 year ago

Thanks for clarification regarding (1) and (3).

> If, for a given node, not a single path says /GDRDMA, then it's a clear sign there is no GPU Direct RDMA; that's what made me think GPU Direct RDMA was missing. As for "Then Shared indicates that a non-RDMA channel was used": no, it just means the connection is using shared buffers (for alltoall, we do not allocate one buffer per GPU pair but use a shared pool instead).

I understand now, since GDRDMA was completely missing on gpu37. Indeed, when comparing the debug outputs, Shared is only present in alltoall and not in all_reduce, as you pointed out.

> I'd run allreduce from 8B to 16G to really see what the peak bandwidth is; it should be able to reach 96 GB/s on an IB system. 89 is a bit sub-par, but it could be due to the 1G size, or to the fact you only run one size and the clocks don't have time to ramp up.

Sorry for causing confusion. I scanned from 16 B up to 1 GB, but truncated the output to show only the bandwidth at 1 GB. I've rerun as suggested.

#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
          16             4     float     sum      -1    29.55    0.00    0.00      0    28.19    0.00    0.00      0
          32             8     float     sum      -1    28.70    0.00    0.00      0    26.80    0.00    0.00      0
          64            16     float     sum      -1    27.46    0.00    0.00      0    27.87    0.00    0.00      0
         128            32     float     sum      -1    28.21    0.00    0.01      0    27.45    0.00    0.01      0
         256            64     float     sum      -1    28.44    0.01    0.02      0    28.08    0.01    0.02      0
         512           128     float     sum      -1    30.89    0.02    0.03      0    31.08    0.02    0.03      0
        1024           256     float     sum      -1    31.40    0.03    0.06      0    31.53    0.03    0.06      0
        2048           512     float     sum      -1    33.84    0.06    0.11      0    33.14    0.06    0.12      0
        4096          1024     float     sum      -1    35.96    0.11    0.21      0    33.88    0.12    0.23      0
        8192          2048     float     sum      -1    38.91    0.21    0.39      0    36.70    0.22    0.42      0
       16384          4096     float     sum      -1    42.69    0.38    0.72      0    40.15    0.41    0.77      0
       32768          8192     float     sum      -1    45.31    0.72    1.36      0    45.38    0.72    1.35      0
       65536         16384     float     sum      -1    50.93    1.29    2.41      0    49.81    1.32    2.47      0
      131072         32768     float     sum      -1    71.16    1.84    3.45      0    67.73    1.94    3.63      0
      262144         65536     float     sum      -1    66.46    3.94    7.40      0    64.95    4.04    7.57      0
      524288        131072     float     sum      -1    75.59    6.94   13.01      0    75.77    6.92   12.97      0
     1048576        262144     float     sum      -1    96.34   10.88   20.41      0    96.04   10.92   20.47      0
     2097152        524288     float     sum      -1    134.4   15.60   29.25      0    120.0   17.48   32.77      0
     4194304       1048576     float     sum      -1    161.2   26.03   48.80      0    161.3   26.01   48.77      0
     8388608       2097152     float     sum      -1    232.0   36.16   67.80      0    232.5   36.08   67.65      0
    16777216       4194304     float     sum      -1    408.2   41.10   77.05      0    393.8   42.60   79.87      0
    33554432       8388608     float     sum      -1    770.8   43.53   81.62      0    794.7   42.22   79.17      0
    67108864      16777216     float     sum      -1   1380.8   48.60   91.13      0   1393.1   48.17   90.33      0
   134217728      33554432     float     sum      -1   2759.5   48.64   91.20      0   2697.1   49.76   93.31      0
   268435456      67108864     float     sum      -1   5679.0   47.27   88.63      0   5677.3   47.28   88.65      0
   536870912     134217728     float     sum      -1    11296   47.53   89.11      0    11277   47.61   89.26      0
  1073741824     268435456     float     sum      -1    22545   47.63   89.30      0    22555   47.61   89.26      0
  2147483648     536870912     float     sum      -1    44954   47.77   89.57      0    44983   47.74   89.51      0
  4294967296    1073741824     float     sum      -1    89861   47.80   89.62      0    89837   47.81   89.64      0
  8589934592    2147483648     float     sum      -1   179715   47.80   89.62      0   179663   47.81   89.65      0
 17179869184    4294967296     float     sum      -1   359659   47.77   89.56      0   359633   47.77   89.57      0

The peak bandwidth is indeed 91.2 GB/s at 128 MB with v1.1 of the GPUDirect kernel module. To reach 96 GB/s, we will try again with the latest version (v1.3).

> I think you're right, nv_peer_mem was the former module which we used for a very long time but which should be replaced by nvidia-peermem. I'm not authoritative on this point however; that question should be asked to the network support team.

We will contact NVIDIA forum for further clarification.

sjeaugey commented 1 year ago

Updating GPU Direct RDMA won't help performance: it either works or it doesn't. I'd first suggest making sure the firmware of the NIC is up to date. What is the NIC type, CX-5 or CX-6? Depending on the NIC model and firmware level, you may need to increase MAX_ACC_OUT_READ to 44. You can check this page for how to set that, or simply reach out to network support.
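
For reference, a minimal sketch of how MAX_ACC_OUT_READ is typically queried and set with the Mellanox firmware tools (mst/mlxconfig). The /dev/mst device path below is only an example (run mst status to find yours), a firmware reset or reboot is needed for the new value to take effect, and the linked page or your network support team should be the authoritative reference:

$ sudo mst start
$ sudo mlxconfig -d /dev/mst/mt4123_pciconf0 query | grep MAX_ACC_OUT_READ
$ sudo mlxconfig -d /dev/mst/mt4123_pciconf0 set MAX_ACC_OUT_READ=44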

vitduck commented 1 year ago

Sorry for the belated follow-up.

We are using CX-6:

lspci  | grep Connect 
0e:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]

MAX_ACC_OUT_READ has been raised to 44 for all PCIe adapters, per your suggestion. The performance is now ~93.25 GB/s, which I believe is within the margin of error.

#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    ...
    33554432       8388608     float     sum      -1    765.9   43.81   82.15      0    753.2   44.55   83.53      0
    67108864      16777216     float     sum      -1   1367.2   49.09   92.04      0   1362.3   49.26   92.36      0
   134217728      33554432     float     sum      -1   2698.7   49.73   93.25      0   2695.6   49.79   93.36      0
   268435456      67108864     float     sum      -1   5665.7   47.38   88.84      0   5663.7   47.40   88.87      0
   536870912     134217728     float     sum      -1    11274   47.62   89.29      0    11271   47.63   89.31      0
  1073741824     268435456     float     sum      -1    22490   47.74   89.52      0    22489   47.75   89.52      0
  2147483648     536870912     float     sum      -1    44914   47.81   89.65      0    44905   47.82   89.67      0
  4294967296    1073741824     float     sum      -1    89759   47.85   89.72      0    89764   47.85   89.71      0
  8589934592    2147483648     float     sum      -1   179483   47.86   89.74      0   179523   47.85   89.72      0
 17179869184    4294967296     float     sum      -1   359425   47.80   89.62      0   359124   47.84   89.70      0

We are using 2.11.4, though. According to the release notes, 2.12 brings additional AlltoAll and AllReduce performance: https://developer.nvidia.com/blog/doubling-all2all-performance-with-nvidia-collective-communication-library-2-12/ Since we are using a 2:1 GPU:NIC topology, this does not apply. Is that a correct understanding?

vitduck commented 1 year ago

To add to the previous message, we have tested again with NCCL 2.13.4 from NVHPC 22.7.

The bandwidths for AlltoAll and AllReduce are 21 GB/s and 93 GB/s, respectively.

So we do not see a big difference from the PXN feature.
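
(As a generic A/B check, not something required here: NCCL 2.12 and later expose the NCCL_PXN_DISABLE environment variable, so the effect of PXN can be compared directly, e.g. assuming Open MPI's -x flag:)

$ mpirun -x NCCL_PXN_DISABLE=1 -np 16 --hostfile hostfile_16 \
    ../build/alltoall_perf -b 16 -e 4G -f 4 -g 1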

sjeaugey commented 1 year ago

PXN will not improve peak bandwidth if peak bandwidth is already perfect. It will only improve the time for small operations (base latency) thanks to message aggregation.

It looks like everything is good now. Feel free to close the bug if so. Thanks!

vitduck commented 1 year ago

Thanks for the explanation.

Before closing the issue, there is one more point that I hope you can clarify. I am trying to derive the formal relationship between algbw and busbw for the AlltoAll case.

Per the NCCL documentation, AlltoAll is implemented based on send/recv (see also the sketch after this snippet):

ncclGroupStart();
for (int r=0; r<nranks; r++) {
  ncclSend(sendbuff[r], sendcount, sendtype, r, comm, stream);
  ncclRecv(recvbuff[r], recvcount, recvtype, r, comm, stream);
}
ncclGroupEnd();
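
For completeness, here is a self-contained sketch of that pattern wrapped as a helper function. This is my own illustrative code, not part of the NCCL documentation: the function name allToAll, the eltbytes parameter, and the chunk-offset arithmetic are assumptions.

#include <nccl.h>
#include <cuda_runtime.h>

/* Hypothetical helper built on the documented send/recv pattern.
 * sendbuff/recvbuff are device buffers holding nranks * count elements each;
 * eltbytes is the size of one element of 'type' (e.g. 4 for ncclFloat). */
ncclResult_t allToAll(const void* sendbuff, void* recvbuff, size_t count,
                      ncclDataType_t type, size_t eltbytes,
                      ncclComm_t comm, cudaStream_t stream) {
  int nranks;
  ncclResult_t res = ncclCommCount(comm, &nranks);
  if (res != ncclSuccess) return res;

  ncclGroupStart();
  for (int r = 0; r < nranks; r++) {
    /* Exchange one count-element chunk with every peer (including self). */
    ncclSend((const char*)sendbuff + (size_t)r * count * eltbytes, count, type, r, comm, stream);
    ncclRecv((char*)recvbuff + (size_t)r * count * eltbytes, count, type, r, comm, stream);
  }
  return ncclGroupEnd();
}

Grouping the sends and receives between ncclGroupStart()/ncclGroupEnd() is what allows NCCL to progress all of the point-to-point operations together instead of serializing (and potentially deadlocking) the loop.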

From C. Wooley's presentation, NCCL: Accelerated Multi-GPU Collective Communication:

[slide image]

However, from your previous explanation:

> Given that on 2 nodes 50% of the data goes through NVLink and 50% goes through the network, the perceived performance of alltoall will be at most 24 GB/s. As you scale to a larger number of nodes, the bandwidth should go down towards 12 GB/s.

The correct coefficient should be $k/(k-1)$ instead.

sjeaugey commented 1 year ago

The coefficient is k/(k-1) as NCCL perf tests do not assume any specific topology; instead they consider flat topologies.

That works great on a single node with e.g. NVSwitch, where all GPUs access all others through NVSwitch. But when running on e.g. 2 nodes, the reality is that 7/16 of the traffic goes through NVLink and is not the bottleneck, while 8/16 goes through the network and is the bottleneck. So in that case, I'd look at the algbw and divide it by 2 to get the real bandwidth on the network.
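
As a worked restatement of that arithmetic (my summary, using the numbers from this thread: $n = 16$ ranks on 2 nodes and roughly 12 GB/s of network bandwidth per GPU):

$$
\text{busbw} = \text{algbw} \times \frac{n-1}{n}, \qquad
\text{network share} = \frac{8}{16} = \frac{1}{2}
\;\Rightarrow\;
\text{algbw}_{\max} \approx \frac{12~\text{GB/s}}{1/2} = 24~\text{GB/s}.
$$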

vitduck commented 1 year ago

Thanks. I understand it clearly now.

The $\frac{n}{n-1}$ coefficient can be derived from your phenomenological explanation regarding the ratio of IB/NVLink traffic.

For reference, I have amended the derivation in my previous message.

Unless you have further comments, I would like to close this issue. Thanks very much for your time; I've learnt a lot through this valuable discussion.