Open ahendriksen opened 4 months ago
I just ran your benchmark on an H200 and can reproduce those numbers.
I did some more measurements on an H200 (BW: 4.8 TB/s) using BabelStream (with the highest element count that fits into memory) and here is what I came up with:
Implementation | Kernel | Type | BW (MB/s) | SoL | Thrust improv. |
---|---|---|---|---|---|
cuda-stream | Copy | float | 2255065 | 46.98% | |
cuda-stream | Mul | float | 2248589 | 46.85% | |
cuda-stream | Add | float | 2958949 | 61.64% | |
cuda-stream | Triad | float | 2994600 | 62.39% | |
cuda-stream | Dot | float | 3754092 | 78.21% | |
cuda-stream | Copy | double | 3512914 | 73.19% | |
cuda-stream | Mul | double | 3495498 | 72.82% | |
cuda-stream | Add | double | 4019655 | 83.74% | |
cuda-stream | Triad | double | 4019995 | 83.75% | |
cuda-stream | Dot | double | 4496204 | 93.67% | |
thrust-stream | Copy | float | 3306321 | 68.88% | 21.90% |
thrust-stream | Mul | float | 3097175 | 64.52% | 17.68% |
thrust-stream | Add | float | 3726179 | 77.63% | 15.98% |
thrust-stream | Triad | float | 3743643 | 77.99% | 15.61% |
thrust-stream | Dot | float | 4264744 | 88.85% | 10.64% |
thrust-stream | Copy | double | 3306362 | 68.88% | -4.30% |
thrust-stream | Mul | double | 3976946 | 82.85% | 10.03% |
thrust-stream | Add | double | 4418539 | 92.05% | 8.31% |
thrust-stream | Triad | double | 4427755 | 92.24% | 8.50% |
thrust-stream | Dot | double | 4499021 | 93.73% | 0.06% |
Observations:
- `thrust::inner_product` cannot outperform a hand-written reduction (for `double`). I feel we could do better here.

Implementation notes: Mul, Add and Triad use `thrust::transform`, which eventually uses CUB's `DeviceFor`, processing 2 items per stream in each thread. This probably does not generate enough loads to saturate the memory system. As a simple fix, we could increase the number of items processed per thread.
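
To make the "not enough loads" argument concrete, here is a back-of-the-envelope sketch of how many bytes a single thread loads at 2 items per input stream. This is a simplified model for illustration only, not CUB's actual tuning logic. `bytes_loaded_per_thread` and the constants are hypothetical names, and the input stream counts follow the BabelStream kernel definitions.

```cpp
#include <cstddef>

// Taken from the implementation note: 2 items per input stream per thread.
constexpr std::size_t kItemsPerThread = 2;

// Bytes one thread loads in this simplified model:
// items per stream x number of input streams x element size.
template <typename T>
constexpr std::size_t bytes_loaded_per_thread(
    std::size_t input_streams,
    std::size_t items_per_thread = kItemsPerThread) {
    return items_per_thread * input_streams * sizeof(T);
}

// Input streams of the BabelStream kernels that map to thrust::transform.
constexpr std::size_t kMulInputs   = 1;  // b[i] = scalar * c[i]
constexpr std::size_t kAddInputs   = 2;  // c[i] = a[i] + b[i]
constexpr std::size_t kTriadInputs = 2;  // a[i] = b[i] + scalar * c[i]

// Compile-time checks, assuming 4-byte float and 8-byte double.
static_assert(bytes_loaded_per_thread<float>(kMulInputs) == 8,
              "Mul/float loads 8 bytes per thread at 2 items");
static_assert(bytes_loaded_per_thread<double>(kAddInputs) == 32,
              "Add/double loads 32 bytes per thread at 2 items");
```

In this model, Mul on `float` loads only 8 bytes per thread while Add on `double` loads 32, which is in line with the ordering of the SoL numbers in the table. Raising the items per thread for Mul on `float` from 2 to 8 would bring it to the same 32 bytes; whether that translates into bandwidth would need to be measured.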
Thanks for double-checking, @bernhardmgruber. I agree with your comments. I think `thrust::inner_product` is already doing quite well. I would expect at most a 2% improvement (to 95% SoL).
Is this a duplicate?
Type of Bug
Performance
Component
Thrust
Describe the bug
Using `thrust::transform` on newer hardware platforms can result in subpar performance.
How to Reproduce
See the Godbolt link below for the exact reproducer.
Output:
Expected behavior
The benchmarks with int32 datatype should be able to saturate bandwidth (~90%). The benchmarks with int16 and int8 datatypes should have reasonable performance (>60%). The int64 mul benchmark should be at 90% SoL.
The int128 and the remaining int64 benchmarks have been added as a reference. Their performance is acceptable.
Reproduction link
https://godbolt.org/z/K7EW5freK
Operating System
No response
nvidia-smi output
NVCC version
NA