NVIDIA / cccl

CUDA Core Compute Libraries
https://nvidia.github.io/cccl/

[PERF][BUG]: `thrust::transform` does not saturate bandwidth on newer hardware architectures (down to 62% SoL on H200 for int) #1673

Open ahendriksen opened 4 months ago

ahendriksen commented 4 months ago

Is this a duplicate?

Type of Bug

Performance

Component

Thrust

Describe the bug

Using thrust::transform on newer hardware architectures can result in subpar performance: on an H200, the int32 benchmarks below reach as little as 62% of speed-of-light (SoL) memory bandwidth.

How to Reproduce

See the godbolt link below for the exact reproducer.
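For orientation, the four benchmarks follow the usual BabelStream-style kernel definitions. Below is a minimal sketch of the pattern (not the exact reproducer; names and the use of extended device lambdas are assumptions, compiled with `nvcc --extended-lambda`):

```cpp
// Minimal sketch of the benchmark pattern, assuming BabelStream-style
// kernel definitions (see the godbolt link for the exact reproducer).
// Compile with: nvcc --extended-lambda
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>

template <typename T>
void run_kernels(thrust::device_vector<T>& a,
                 thrust::device_vector<T>& b,
                 thrust::device_vector<T>& c, T scalar) {
  // mul: b = scalar * c  (1 read + 1 write per element)
  thrust::transform(thrust::device, c.begin(), c.end(), b.begin(),
                    [scalar] __device__ (T ci) { return scalar * ci; });
  // add: c = a + b  (2 reads + 1 write)
  thrust::transform(thrust::device, a.begin(), a.end(), b.begin(), c.begin(),
                    [] __device__ (T ai, T bi) { return ai + bi; });
  // triad: a = b + scalar * c  (2 reads + 1 write)
  thrust::transform(thrust::device, b.begin(), b.end(), c.begin(), a.begin(),
                    [scalar] __device__ (T bi, T ci) { return bi + scalar * ci; });
  // nstream: a += b + scalar * c  (3 reads + 1 write); thrust::transform
  // takes at most two input ranges, so index through raw pointers instead
  T* pa = thrust::raw_pointer_cast(a.data());
  const T* pb = thrust::raw_pointer_cast(b.data());
  const T* pc = thrust::raw_pointer_cast(c.data());
  thrust::for_each(thrust::device, thrust::counting_iterator<size_t>(0),
                   thrust::counting_iterator<size_t>(a.size()),
                   [=] __device__ (size_t i) { pa[i] += pb[i] + scalar * pc[i]; });
}
```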

Output:

benchmark   type             cp_gb  elapsed_ms     bw_gbps  pct_of_sol
mul         int8              8.59      8.2682      1038.9       21.1%
add         int8             12.88      9.2302      1396.0       28.4%
triad       int8             12.88      9.2364      1395.0       28.4%
nstream     int8             17.18      9.9301      1730.1       35.2%
mul         int16             8.59      4.6822      1834.6       37.3%
add         int16            12.88      5.2990      2431.6       49.5%
triad       int16            12.88      5.3074      2427.7       49.4%
nstream     int16            17.18      5.9318      2896.2       58.9%
mul         int32             8.59      2.8014      3066.3       62.4%
add         int32            12.88      3.4908      3691.1       75.1%
triad       int32            12.88      3.4901      3691.8       75.1%
nstream     int32            17.18      4.2756      4018.1       81.7%
mul         int64             8.59      2.1956      3912.3       79.6%
add         int64            12.88      2.9556      4359.5       88.7%
triad       int64            12.88      2.9548      4360.7       88.7%
nstream     int64            17.18      3.9255      4376.5       89.0%
mul         int128           17.18      4.0780      4212.9       85.7%
add         int128           25.77      5.9694      4317.0       87.8%
triad       int128           25.77      5.9789      4310.1       87.7%
nstream     int128           34.36      7.8597      4371.6       88.9%

Expected behavior

The benchmarks with the int32 datatype should be able to saturate bandwidth (~90% SoL). The benchmarks with the int16 and int8 datatypes should have reasonable performance (>60%). The int64 mul benchmark should be at 90% SoL.

The int128 and the remaining int64 benchmarks were added as a reference; their performance is acceptable.
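For reference, the bw_gbps and pct_of_sol columns in the table are consistent with the following arithmetic (a sketch; the ~4.9 TB/s peak bandwidth is inferred from the numbers above, not stated explicitly in the reproducer):

```cpp
// Sketch of the metric arithmetic (assumed, inferred from the table):
// cp_gb counts all bytes read and written by a kernel.
double bw_gbps(double copied_gb, double elapsed_ms) {
  return copied_gb / (elapsed_ms * 1e-3);   // GB transferred per second
}
double pct_of_sol(double bw, double peak_gbps) {
  return 100.0 * bw / peak_gbps;            // percent of speed-of-light BW
}
// e.g. mul/int32: bw_gbps(8.59, 2.8014) ≈ 3066 GB/s, and
// pct_of_sol(3066.3, 4915.0) ≈ 62.4%, implying a peak of ~4.9 TB/s.
```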

Reproduction link

https://godbolt.org/z/K7EW5freK

Operating System

No response

nvidia-smi output

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    On  |   00000000:45:00.0 Off |                    0 |
| N/A   28C    P0             73W /  700W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

NVCC version

NA

bernhardmgruber commented 2 months ago

I just ran your benchmark on an H200 and can reproduce those numbers.

bernhardmgruber commented 2 months ago

I did some more measurements on an H200 (BW: 4.8 TB/s) using BabelStream (with the highest element count that fits into memory) and here is what I came up with:

benchmark      kernel  type     BW (MB/s)   SoL      Thrust improv.
cuda-stream    Copy    float      2255065   46.98%
cuda-stream    Mul     float      2248589   46.85%
cuda-stream    Add     float      2958949   61.64%
cuda-stream    Triad   float      2994600   62.39%
cuda-stream    Dot     float      3754092   78.21%

cuda-stream    Copy    double     3512914   73.19%
cuda-stream    Mul     double     3495498   72.82%
cuda-stream    Add     double     4019655   83.74%
cuda-stream    Triad   double     4019995   83.75%
cuda-stream    Dot     double     4496204   93.67%

thrust-stream  Copy    float      3306321   68.88%   21.90%
thrust-stream  Mul     float      3097175   64.52%   17.68%
thrust-stream  Add     float      3726179   77.63%   15.98%
thrust-stream  Triad   float      3743643   77.99%   15.61%
thrust-stream  Dot     float      4264744   88.85%   10.64%

thrust-stream  Copy    double     3306362   68.88%   -4.30%
thrust-stream  Mul     double     3976946   82.85%   10.03%
thrust-stream  Add     double     4418539   92.05%    8.31%
thrust-stream  Triad   double     4427755   92.24%    8.50%
thrust-stream  Dot     double     4499021   93.73%    0.06%

Observations:

Implementation notes: Mul, Add, and Triad use thrust::transform, which eventually dispatches to CUB's DeviceFor, processing 2 items per thread. This probably does not keep enough loads in flight to saturate the memory system. As a simple fix, we could increase the number of items processed per thread, as sketched below.
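To illustrate the direction such a fix could take, here is a sketch of a transform kernel that gives each thread more independent loads; the name transform_n, the block-striped layout, and the suggested tile sizes are illustrative assumptions, not CUB's actual implementation:

```cpp
// Illustrative sketch only (not CUB's DeviceFor): each thread handles
// ITEMS_PER_THREAD elements in a block-striped layout, so global accesses
// stay coalesced while each thread keeps several loads in flight.
template <int ITEMS_PER_THREAD, class T, class F>
__global__ void transform_n(const T* __restrict__ in, T* __restrict__ out,
                            size_t n, F f) {
  size_t tile = (size_t)blockIdx.x * blockDim.x * ITEMS_PER_THREAD + threadIdx.x;
  #pragma unroll
  for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
    size_t idx = tile + (size_t)i * blockDim.x;  // stride by block width
    if (idx < n) out[idx] = f(in[idx]);
  }
}
// Launch with ceil(n / (block_size * ITEMS_PER_THREAD)) blocks, and try
// e.g. ITEMS_PER_THREAD = 4 or 8 instead of the current 2.
```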

ahendriksen commented 2 months ago

Thanks for double-checking, @bernhardmgruber. I agree with your comments. I think thrust::inner_product is already doing quite well; I would expect at most a 2% improvement (to 95% SoL).
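For reference, the Dot kernel maps onto thrust::inner_product roughly as follows (a minimal sketch, not BabelStream's exact code):

```cpp
#include <thrust/device_vector.h>
#include <thrust/inner_product.h>

// Minimal sketch of the Dot kernel: sum_i a[i] * b[i], reduced on device.
template <typename T>
T dot(const thrust::device_vector<T>& a, const thrust::device_vector<T>& b) {
  return thrust::inner_product(a.begin(), a.end(), b.begin(), T{0});
}
```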