NVIDIA / nccl-tests

NCCL Tests
BSD 3-Clause "New" or "Revised" License

all_reduce_perf bandwidth between NVLink-connected H100 PCIe GPUs is lower than between A100 SXM4 GPUs #194

Open chinthysl opened 6 months ago

chinthysl commented 6 months ago

./nccl-tests/build/all_reduce_perf -b 1G -e 8G -f 2 -g 2

H100 PCIe

...
NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
...
[1] NCCL INFO Channel 00/0 : 1[21000] -> 0[1000] via P2P/direct pointer
...
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
  1073741824     268435456     float     sum      -1   9306.7  115.37  115.37      0   9291.5  115.56  115.56      0
  2147483648     536870912     float     sum      -1    18535  115.86  115.86      0    18524  115.93  115.93      0
  4294967296    1073741824     float     sum      -1    36943  116.26  116.26      0    36917  116.34  116.34      0
  8589934592    2147483648     float     sum      -1    73849  116.32  116.32      0    73767  116.45  116.45      0

A100 SXM4

[1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] -1/-1/-1->1->0 [3] -1/-1/-1->1->0 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] -1/-1/-1->1->0 [7] -1/-1/-1->1->0 [8] -1/-1/-1->1->0 [9] -1/-1/-1->1->0 [10] -1/-1/-1->1->0 [11] -1/-1/-1->1->0 [12] -1/-1/-1->1->0 [13] -1/-1/-1->1->0 [14] -1/-1/-1->1->0 [15] -1/-1/-1->1->0 [16] -1/-1/-1->1->0 [17] -1/-1/-1->1->0 [18] -1/-1/-1->1->0 [19] -1/-1/-1->1->0 [20] -1/-1/-1->1->0 [21] -1/-1/-1->1->0 [22] -1/-1/-1->1->0 [23] -1/-1/-1->1->0
...
[1] NCCL INFO Channel 00/0 : 1[cb000] -> 0[c8000] via P2P/direct pointer/read
...
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
  1073741824     268435456     float     sum      -1   5219.4  205.72  205.72      0   5227.2  205.41  205.41      0
  2147483648     536870912     float     sum      -1    10046  213.77  213.77      0    10045  213.78  213.78      0
  4294967296    1073741824     float     sum      -1    19513  220.11  220.11      0    19552  219.67  219.67      0
  8589934592    2147483648     float     sum      -1    38564  222.75  222.75      0    38588  222.60  222.60      0
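
As a sanity check on the two tables above, the bandwidth columns can be recomputed from the size and time columns. The sketch below assumes the nccl-tests convention that algbw is bytes divided by elapsed time (reported in units of 10^9 bytes per second) and that all-reduce busbw is algbw scaled by 2(n-1)/n, which equals 1 for the 2 GPUs used here (`-g 2`). The function name and the choice of the 8 GiB out-of-place rows are illustrative only. Because the printed times are rounded, the last digit can differ slightly from the reported values.

```python
def allreduce_bandwidth(size_bytes, time_us, nranks):
    """Return (algbw, busbw) in GB/s for an all-reduce of size_bytes."""
    # bytes per microsecond -> GB/s (1 GB = 1e9 bytes, 1 s = 1e6 us)
    algbw = size_bytes / time_us / 1e3
    # All-reduce correction factor as documented for nccl-tests: 2*(n-1)/n
    busbw = algbw * 2 * (nranks - 1) / nranks
    return algbw, busbw


# (size in bytes, out-of-place time in us), copied from the 8 GiB rows above
rows = {
    "H100 PCIe": (8589934592, 73849),
    "A100 SXM4": (8589934592, 38564),
}

results = {name: allreduce_bandwidth(size, t, nranks=2)
           for name, (size, t) in rows.items()}

for name, (algbw, busbw) in results.items():
    print(f"{name}: algbw={algbw:.2f} GB/s, busbw={busbw:.2f} GB/s")

# Ratio of H100 PCIe busbw to A100 SXM4 busbw at the largest message size
ratio = results["H100 PCIe"][1] / results["A100 SXM4"][1]
print(f"H100 PCIe / A100 SXM4 busbw ratio: {ratio:.2f}")
```

With two ranks the correction factor is 1, which is why algbw and busbw are identical in every row of both tables.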

The GPU datasheets say the H100 PCIe and the A100 SXM4 both have 600 GB/s of NVLink bandwidth. Below are my P2P bandwidth test results.

H100 PCIe

Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 1676.95 336.21 104.50  76.25  71.94  70.57  65.92  64.72 
     1 336.77 1221.14  73.51  73.59  66.87  66.53  65.66  70.42 
     2 105.07  75.61 1674.17 336.95  64.95  64.59  68.98  72.20 
     3  75.84  73.63 337.74 1221.57  65.84  69.05  65.77  66.25 
     4  71.79  65.49  66.65  64.98 1676.61 339.44  75.36  76.10 
     5  69.16  66.69  65.28  70.19 339.69 1218.43  73.64  74.15 
     6  66.41  65.72  71.44  65.11  75.44  73.29 1220.27 362.06 
     7  66.96  69.93  72.22  66.47  75.84  73.93 360.87 1220.78

A100 SXM4

Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 1562.50 423.16 425.59 424.77 424.77 423.97 425.12 424.63 
     1 424.20 1553.95 424.65 425.24 424.66 425.34 424.66 425.56 
     2 423.39 425.24 1548.56 424.66 425.00 425.24 425.47 423.28 
     3 426.68 427.07 427.13 1562.50 430.04 429.44 429.44 430.15 
     4 426.60 426.68 427.34 432.31 1601.74 517.65 517.48 519.19 
     5 426.48 426.30 426.89 431.91 517.65 1593.57 517.65 518.33 
     6 426.44 426.91 426.90 430.63 518.85 519.35 1600.10 519.89 
     7 426.40 426.96 427.42 428.36 519.08 519.02 519.01 1596.83
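
The two matrices above can be compared programmatically. The following is a minimal sketch that parses the pasted rows and lists the GPU pairs whose bidirectional bandwidth exceeds a cutoff. The 200 GB/s cutoff and the helper names are assumptions chosen for illustration; the cutoff simply separates the roughly 336 to 362 GB/s entries from the roughly 65 to 105 GB/s entries in the H100 PCIe matrix.

```python
H100_PCIE = """\
0 1676.95 336.21 104.50  76.25  71.94  70.57  65.92  64.72
1 336.77 1221.14  73.51  73.59  66.87  66.53  65.66  70.42
2 105.07  75.61 1674.17 336.95  64.95  64.59  68.98  72.20
3  75.84  73.63 337.74 1221.57  65.84  69.05  65.77  66.25
4  71.79  65.49  66.65  64.98 1676.61 339.44  75.36  76.10
5  69.16  66.69  65.28  70.19 339.69 1218.43  73.64  74.15
6  66.41  65.72  71.44  65.11  75.44  73.29 1220.27 362.06
7  66.96  69.93  72.22  66.47  75.84  73.93 360.87 1220.78
"""

A100_SXM4 = """\
0 1562.50 423.16 425.59 424.77 424.77 423.97 425.12 424.63
1 424.20 1553.95 424.65 425.24 424.66 425.34 424.66 425.56
2 423.39 425.24 1548.56 424.66 425.00 425.24 425.47 423.28
3 426.68 427.07 427.13 1562.50 430.04 429.44 429.44 430.15
4 426.60 426.68 427.34 432.31 1601.74 517.65 517.48 519.19
5 426.48 426.30 426.89 431.91 517.65 1593.57 517.65 518.33
6 426.44 426.91 426.90 430.63 518.85 519.35 1600.10 519.89
7 426.40 426.96 427.42 428.36 519.08 519.02 519.01 1596.83
"""


def parse_matrix(text):
    """Parse rows of 'index v0 v1 ...' into a list of float lists."""
    return [[float(x) for x in line.split()[1:]]  # drop the row index
            for line in text.strip().splitlines()]


def fast_pairs(matrix, threshold_gbs):
    """Return GPU pairs (i, j), i < j, at or above threshold in both directions."""
    n = len(matrix)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if min(matrix[i][j], matrix[j][i]) >= threshold_gbs]


THRESHOLD_GBS = 200.0  # assumed cutoff, not a value from any NVIDIA tool

h100_pairs = fast_pairs(parse_matrix(H100_PCIE), THRESHOLD_GBS)
a100_pairs = fast_pairs(parse_matrix(A100_SXM4), THRESHOLD_GBS)

print("H100 PCIe fast pairs:", h100_pairs)
print("A100 SXM4 fast pair count:", len(a100_pairs))
```

On the H100 PCIe system only the pairs (0, 1), (2, 3), (4, 5) and (6, 7) clear the cutoff, while on the A100 SXM4 system all 28 pairs do. This is consistent with the H100 PCIe cards being linked only pairwise, although the matrices alone do not confirm the physical topology.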