Closed: ahendriksen closed this issue 5 days ago.
Related issue in RAPIDS, where smaller copies are serialized behind larger copies due to busy copy engines.
@gevtushenko : does `thrust::copy_n` use a kernel to perform the copying? Perhaps that should be used instead.

I just ran the Thrust benchmark for `copy` on my A6000, and the current, `cudaMemcpy`-based implementation performs "well enough":
| T{ct} | Elements | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWUtil |
|---|---|---|---|---|---|---|---|---|---|
| U8 | 2^28 = 268435456 | 558x | 782.671 us | 11.17% | 773.783 us | 8.30% | 346.913G | 693.826 GB/s | 90.33% |
| U16 | 2^28 = 268435456 | 608x | 1.522 ms | 2.59% | 1.518 ms | 2.36% | 176.841G | 707.363 GB/s | 92.09% |
| U32 | 2^28 = 268435456 | 820x | 3.050 ms | 4.37% | 3.045 ms | 4.18% | 88.167G | 705.332 GB/s | 91.83% |
| U64 | 2^28 = 268435456 | 854x | 6.065 ms | 2.53% | 6.061 ms | 2.52% | 44.288G | 708.610 GB/s | 92.26% |
| NonTrivial | 2^28 = 268435456 | 1200x | 6.168 ms | 3.17% | 6.163 ms | 3.16% | 43.554G | 696.871 GB/s | 90.73% |
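For reference, the reported bandwidth counts one read plus one write per element: for U8, 2 × 2^28 B / 773.783 µs ≈ 693.8 GB/s, which corresponds to the 90.33% utilization of the A6000's roughly 768 GB/s theoretical peak shown in the BWUtil column.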
However, I saw the <66% figure on an H200 a few days ago, so a kernel is probably the better choice there. Are there any upsides to using `cudaMemcpy`? I could imagine that if the device is busy with other work, using the copy engines would cause less contention for SMs and yield better overall application throughput. I am therefore wondering whether we need to give users a knob to choose which copy implementation is used.
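For context, the kernel-based alternative under discussion would be something along the lines of a grid-stride copy. The sketch below is illustrative and is not Thrust's implementation; the launch configuration is a placeholder.

```cuda
#include <cstddef>

// Minimal grid-stride copy kernel (illustrative sketch, not Thrust's implementation).
template <typename T>
__global__ void copy_kernel(const T* __restrict__ in, T* __restrict__ out, std::size_t n)
{
  // Grid-stride loop so a fixed-size grid covers all n elements.
  for (std::size_t i = std::size_t{blockIdx.x} * blockDim.x + threadIdx.x; i < n;
       i += std::size_t{gridDim.x} * blockDim.x)
  {
    out[i] = in[i];
  }
}

// Hypothetical launch; block count and block size would need per-GPU tuning:
// copy_kernel<<<num_blocks, 256, 0, stream>>>(in, out, n);
```

Note that such a kernel occupies SMs for the duration of the copy, which is exactly the contention trade-off raised above.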
Given that the performance on the A6000 looks fine, we may also want to dispatch between `cudaMemcpy` and a kernel depending on the GPU we are running on.
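Such a dispatch could key off the device properties at runtime. The following is a hypothetical sketch reusing the `copy_kernel` above; treating compute capability 9.x (Hopper, e.g. H100/H200) as the cutoff is an assumption based on this report and would need benchmarking.

```cuda
// Hypothetical per-GPU dispatch between a copy kernel and cudaMemcpyAsync (sketch only).
template <typename T>
void dispatch_copy(const T* in, T* out, std::size_t n, int device, cudaStream_t stream)
{
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, device);
  if (prop.major >= 9)
  {
    // Assumption: Hopper-class GPUs, where cudaMemcpy was observed to underperform.
    copy_kernel<<<1024, 256, 0, stream>>>(in, out, n);  // grid size is illustrative
  }
  else
  {
    // Elsewhere cudaMemcpy already reaches ~90% of peak (see the table above).
    cudaMemcpyAsync(out, in, n * sizeof(T), cudaMemcpyDeviceToDevice, stream);
  }
}
```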
> @gevtushenko : does `thrust::copy_n` use a kernel to perform the copying? Perhaps that should be used instead.
`thrust::copy_n` uses the same implementation as `thrust::copy` and will also use `cudaMemcpyAsync` when possible, and otherwise `thrust::transform`.
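In pseudocode, that dispatch amounts to roughly the following. This is a simplified sketch, not Thrust's actual source; the real implementation also considers the execution policy and iterator categories, and `copy_n_sketch` is a hypothetical name.

```cuda
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/execution_policy.h>
#include <type_traits>
#include <cstddef>

// Simplified sketch of the dispatch described above (not Thrust's actual source).
template <typename T>
void copy_n_sketch(const T* first, std::size_t n, T* result, cudaStream_t stream)
{
  if constexpr (std::is_trivially_copyable_v<T>)
  {
    // Contiguous, trivially copyable elements: a bulk memcpy is valid.
    cudaMemcpyAsync(result, first, n * sizeof(T), cudaMemcpyDeviceToDevice, stream);
  }
  else
  {
    // Otherwise copy element-wise with a kernel via thrust::transform.
    thrust::transform(thrust::cuda::par.on(stream),
                      first, first + n, result, thrust::identity<T>());
  }
}
```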
I discussed this with @gevtushenko yesterday, and he remembers a time when we actually had a custom kernel for `thrust::copy` but switched to `cudaMemcpy` because the latter was faster. We want to avoid ping-ponging between a custom kernel and `cudaMemcpy` ourselves, and would rather let the team behind `cudaMemcpy` handle this, for which you already opened a bug report.

This does not mean we could not still make an exception, but rather that we are trying to address more pressing issues first, e.g. #1673, and see how `cudaMemcpy` develops.
This is fixed and should be visible in a future public release. Please see nvbug 4207603.
Great! I will close the issue then, since no further action is necessary from our side. Feel free to reopen it if the problem is not resolved once the fixes land!
**Is this a duplicate?**

**Type of Bug**

Performance

**Component**

Thrust

**Describe the bug**

`thrust::copy` uses `cudaMemcpy` to implement the copy, which saturates at most 66% of memory bandwidth on H200. nvbug 4207603

**How to Reproduce**
See the godbolt link for the exact reproducer.
Observed output:
**Expected behavior**

`thrust::copy` should be able to saturate memory bandwidth.

**Reproduction link**

https://godbolt.org/z/foPG4ox53
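For convenience, a standalone benchmark in the spirit of the linked reproducer could look like the following sketch. The element type, problem size, and event-based timing are illustrative choices, not the exact godbolt contents; the bandwidth calculation counts one read plus one write.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <cstddef>
#include <cstdint>
#include <cstdio>

int main()
{
  const std::size_t n = std::size_t{1} << 28;  // 2^28 elements, matching the table above
  thrust::device_vector<std::uint32_t> src(n, 1u);
  thrust::device_vector<std::uint32_t> dst(n);

  thrust::copy(src.begin(), src.end(), dst.begin());  // warm-up

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  thrust::copy(src.begin(), src.end(), dst.begin());
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  // One read plus one write of n * sizeof(std::uint32_t) bytes each.
  const double gbps = 2.0 * double(n) * sizeof(std::uint32_t) / (ms * 1e-3) / 1e9;
  std::printf("thrust::copy: %.3f ms, %.1f GB/s\n", ms, gbps);
  return 0;
}
```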
**Operating System**

No response

**nvidia-smi output**

**NVCC version**

NA