NVIDIA / cccl

CUDA Core Compute Libraries
https://nvidia.github.io/cccl/

[PERF][BUG]: `thrust::transform` does not saturate bandwidth on newer hardware architectures (down to 62% SoL on H200 for int) #1673

Open ahendriksen opened 4 months ago

ahendriksen commented 4 months ago

Is this a duplicate?

Type of Bug

Performance

Component

Thrust

Describe the bug

Using thrust::transform on newer hardware architectures can result in subpar performance: on an H200, the int32 benchmarks below reach as little as 62% of speed-of-light (SoL) memory bandwidth.

How to Reproduce

See the godbolt link below for the exact reproducer.
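For orientation, the four benchmarks follow the usual BabelStream-style kernel definitions. Below is a minimal sketch of the pattern (not the exact reproducer; names and the use of extended device lambdas are assumptions, compiled with `nvcc --extended-lambda`):

```cpp
// Minimal sketch of the benchmark pattern, assuming BabelStream-style
// kernel definitions (see the godbolt link for the exact reproducer).
// Compile with: nvcc --extended-lambda
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>

template <typename T>
void run_kernels(thrust::device_vector<T>& a,
                 thrust::device_vector<T>& b,
                 thrust::device_vector<T>& c, T scalar) {
  // mul: b = scalar * c  (1 read + 1 write per element)
  thrust::transform(thrust::device, c.begin(), c.end(), b.begin(),
                    [scalar] __device__ (T ci) { return scalar * ci; });
  // add: c = a + b  (2 reads + 1 write)
  thrust::transform(thrust::device, a.begin(), a.end(), b.begin(), c.begin(),
                    [] __device__ (T ai, T bi) { return ai + bi; });
  // triad: a = b + scalar * c  (2 reads + 1 write)
  thrust::transform(thrust::device, b.begin(), b.end(), c.begin(), a.begin(),
                    [scalar] __device__ (T bi, T ci) { return bi + scalar * ci; });
  // nstream: a += b + scalar * c  (3 reads + 1 write); thrust::transform
  // takes at most two input ranges, so index through raw pointers instead
  T* pa = thrust::raw_pointer_cast(a.data());
  const T* pb = thrust::raw_pointer_cast(b.data());
  const T* pc = thrust::raw_pointer_cast(c.data());
  thrust::for_each(thrust::device, thrust::counting_iterator<size_t>(0),
                   thrust::counting_iterator<size_t>(a.size()),
                   [=] __device__ (size_t i) { pa[i] += pb[i] + scalar * pc[i]; });
}
```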

Output:

benchmark   type             cp_gb  elapsed_ms     bw_gbps  pct_of_sol
mul         int8              8.59      8.2682      1038.9       21.1%
add         int8             12.88      9.2302      1396.0       28.4%
triad       int8             12.88      9.2364      1395.0       28.4%
nstream     int8             17.18      9.9301      1730.1       35.2%
mul         int16             8.59      4.6822      1834.6       37.3%
add         int16            12.88      5.2990      2431.6       49.5%
triad       int16            12.88      5.3074      2427.7       49.4%
nstream     int16            17.18      5.9318      2896.2       58.9%
mul         int32             8.59      2.8014      3066.3       62.4%
add         int32            12.88      3.4908      3691.1       75.1%
triad       int32            12.88      3.4901      3691.8       75.1%
nstream     int32            17.18      4.2756      4018.1       81.7%
mul         int64             8.59      2.1956      3912.3       79.6%
add         int64            12.88      2.9556      4359.5       88.7%
triad       int64            12.88      2.9548      4360.7       88.7%
nstream     int64            17.18      3.9255      4376.5       89.0%
mul         int128           17.18      4.0780      4212.9       85.7%
add         int128           25.77      5.9694      4317.0       87.8%
triad       int128           25.77      5.9789      4310.1       87.7%
nstream     int128           34.36      7.8597      4371.6       88.9%

Expected behavior

The benchmarks with the int32 datatype should be able to saturate bandwidth (~90% SoL). The benchmarks with the int16 and int8 datatypes should have reasonable performance (>60%). The int64 mul benchmark should be at 90% SoL.

The int128 and the remaining int64 benchmarks were added as a reference; their performance is acceptable.
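For reference, the bw_gbps and pct_of_sol columns in the table are consistent with the following arithmetic (a sketch; the ~4.9 TB/s peak bandwidth is inferred from the numbers above, not stated explicitly in the reproducer):

```cpp
// Sketch of the metric arithmetic (assumed, inferred from the table):
// cp_gb counts all bytes read and written by a kernel.
double bw_gbps(double copied_gb, double elapsed_ms) {
  return copied_gb / (elapsed_ms * 1e-3);   // GB transferred per second
}
double pct_of_sol(double bw, double peak_gbps) {
  return 100.0 * bw / peak_gbps;            // percent of speed-of-light BW
}
// e.g. mul/int32: bw_gbps(8.59, 2.8014) ≈ 3066 GB/s, and
// pct_of_sol(3066.3, 4915.0) ≈ 62.4%, implying a peak of ~4.9 TB/s.
```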

Reproduction link

https://godbolt.org/z/K7EW5freK

Operating System

No response

nvidia-smi output

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    On  |   00000000:45:00.0 Off |                    0 |
| N/A   28C    P0             73W /  700W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

NVCC version

NA

bernhardmgruber commented 2 months ago

I just ran your benchmark on an H200 and can reproduce those numbers.

bernhardmgruber commented 2 months ago

I did some more measurements on an H200 (BW: 4.8 TB/s) using BabelStream (with the highest element count that fits into memory) and here is what I came up with:

benchmark      kernel  type     BW (MB/s)   SoL      Thrust improv.
cuda-stream    Copy    float      2255065   46.98%
cuda-stream    Mul     float      2248589   46.85%
cuda-stream    Add     float      2958949   61.64%
cuda-stream    Triad   float      2994600   62.39%
cuda-stream    Dot     float      3754092   78.21%

cuda-stream    Copy    double     3512914   73.19%
cuda-stream    Mul     double     3495498   72.82%
cuda-stream    Add     double     4019655   83.74%
cuda-stream    Triad   double     4019995   83.75%
cuda-stream    Dot     double     4496204   93.67%

thrust-stream  Copy    float      3306321   68.88%   21.90%
thrust-stream  Mul     float      3097175   64.52%   17.68%
thrust-stream  Add     float      3726179   77.63%   15.98%
thrust-stream  Triad   float      3743643   77.99%   15.61%
thrust-stream  Dot     float      4264744   88.85%   10.64%

thrust-stream  Copy    double     3306362   68.88%   -4.30%
thrust-stream  Mul     double     3976946   82.85%   10.03%
thrust-stream  Add     double     4418539   92.05%    8.31%
thrust-stream  Triad   double     4427755   92.24%    8.50%
thrust-stream  Dot     double     4499021   93.73%    0.06%

Observations:

Implementation notes: Mul, Add, and Triad use thrust::transform, which eventually dispatches to CUB's DeviceFor, processing 2 items per thread. This probably does not keep enough loads in flight to saturate the memory system. As a simple fix, we could increase the number of items processed per thread, as sketched below.
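To illustrate the direction such a fix could take, here is a sketch of a transform kernel that gives each thread more independent loads; the name transform_n, the block-striped layout, and the suggested tile sizes are illustrative assumptions, not CUB's actual implementation:

```cpp
// Illustrative sketch only (not CUB's DeviceFor): each thread handles
// ITEMS_PER_THREAD elements in a block-striped layout, so global accesses
// stay coalesced while each thread keeps several loads in flight.
template <int ITEMS_PER_THREAD, class T, class F>
__global__ void transform_n(const T* __restrict__ in, T* __restrict__ out,
                            size_t n, F f) {
  size_t tile = (size_t)blockIdx.x * blockDim.x * ITEMS_PER_THREAD + threadIdx.x;
  #pragma unroll
  for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
    size_t idx = tile + (size_t)i * blockDim.x;  // stride by block width
    if (idx < n) out[idx] = f(in[idx]);
  }
}
// Launch with ceil(n / (block_size * ITEMS_PER_THREAD)) blocks, and try
// e.g. ITEMS_PER_THREAD = 4 or 8 instead of the current 2.
```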

ahendriksen commented 2 months ago

Thanks for double-checking, @bernhardmgruber. I agree with your comments. I think thrust::inner_product is already doing quite well; I would expect at most a 2% improvement (to 95% SoL).
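For reference, the Dot kernel maps onto thrust::inner_product roughly as follows (a minimal sketch, not BabelStream's exact code):

```cpp
#include <thrust/device_vector.h>
#include <thrust/inner_product.h>

// Minimal sketch of the Dot kernel: sum_i a[i] * b[i], reduced on device.
template <typename T>
T dot(const thrust::device_vector<T>& a, const thrust::device_vector<T>& b) {
  return thrust::inner_product(a.begin(), a.end(), b.begin(), T{0});
}
```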