NVIDIA / cccl

CUDA Core Compute Libraries

[PERF][BUG]: Thrust uses cudaMemcpy for Device->Device copies (66% SoL on H200) #1672

Closed · ahendriksen closed this issue 5 days ago

ahendriksen commented 2 months ago

Is this a duplicate?

Type of Bug

Performance

Component

Thrust

Describe the bug

thrust::copy uses cudaMemcpy to implement device-to-device copies, which saturates at most 66% of speed-of-light (SoL) memory bandwidth on H200. nvbug 4207603

How to Reproduce

See the godbolt link below for the exact reproducer.

Observed output:

$ ./01_thrust_copy 
     cp_gb  elapsed_ms     bw_gbps  pct_of_sol
      8.59      2.6090      3292.4       67.0%
      8.59      2.6073      3294.5       67.0%
      8.59      2.6061      3296.1       67.0%

Expected behavior

thrust::copy should be able to saturate bandwidth.

Reproduction link

https://godbolt.org/z/foPG4ox53
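The shape of the reproducer can be sketched as follows. This is a minimal, hedged reconstruction from the observed output above, not the exact godbolt code: the buffer size, iteration count, and the theoretical-peak constant (`sol_gbps`) are illustrative assumptions.

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <cstdio>

int main() {
  const size_t n = (1ull << 30) / sizeof(int);  // ~1 GiB of ints (illustrative size)
  thrust::device_vector<int> src(n, 1), dst(n);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  for (int rep = 0; rep < 3; ++rep) {
    cudaEventRecord(start);
    thrust::copy(src.begin(), src.end(), dst.begin());  // dispatches to cudaMemcpyAsync
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    // A copy reads and writes each byte once: 2 * n * sizeof(int) bytes moved.
    const double gb = 2.0 * n * sizeof(int) / 1e9;
    const double sol_gbps = 4915.0;  // assumed H200 peak, for illustration only
    std::printf("%8.2f GB  %8.4f ms  %8.1f GB/s  %5.1f%% of SoL\n",
                gb, ms, gb / (ms / 1e3), 100.0 * gb / (ms / 1e3) / sol_gbps);
  }
}
```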

Operating System

No response

nvidia-smi output

$ nvidia-smi 
Mon Apr 29 05:40:23 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    On  |   00000000:45:00.0 Off |                    0 |
| N/A   27C    P0             73W /  700W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

NVCC version

NA

ahendriksen commented 2 months ago

There is a related issue in RAPIDS, where smaller copies are serialized behind larger copies because the copy engines are busy.

@gevtushenko : does thrust::copy_n use a kernel to perform the copying? Perhaps, that should be used instead.

bernhardmgruber commented 1 week ago

I just ran the Thrust benchmark for copy on my A6000 and the current, cudaMemcpy-based implementation performs "well enough":

| T{ct} | Elements | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWUtil |
|---|---|---|---|---|---|---|---|---|---|
| U8 | 2^28 = 268435456 | 558x | 782.671 us | 11.17% | 773.783 us | 8.30% | 346.913G | 693.826 GB/s | 90.33% |
| U16 | 2^28 = 268435456 | 608x | 1.522 ms | 2.59% | 1.518 ms | 2.36% | 176.841G | 707.363 GB/s | 92.09% |
| U32 | 2^28 = 268435456 | 820x | 3.050 ms | 4.37% | 3.045 ms | 4.18% | 88.167G | 705.332 GB/s | 91.83% |
| U64 | 2^28 = 268435456 | 854x | 6.065 ms | 2.53% | 6.061 ms | 2.52% | 44.288G | 708.610 GB/s | 92.26% |
| NonTrivial | 2^28 = 268435456 | 1200x | 6.168 ms | 3.17% | 6.163 ms | 3.16% | 43.554G | 696.871 GB/s | 90.73% |

However, I saw the <66% figure on H200 a few days ago, so there a kernel is probably the better choice. Are there any upsides to using cudaMemcpy? I could imagine that if the device is busy with other work, using the copy engines results in less contention for SMs and better overall application throughput. I am therefore wondering whether we need to give users a knob to choose which copy implementation is used.

Given the performance on A6000 looks fine, we may also want to dispatch between cudaMemcpy and a kernel depending on the GPU we are running on.
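The per-GPU dispatch idea could look roughly like this. This is a hedged sketch, not a proposed patch: the helper name `adaptive_copy` and the SM 9.0 (Hopper) threshold are assumptions chosen purely for illustration.

```cuda
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/execution_policy.h>
#include <cuda_runtime.h>

// Hypothetical helper: choose the copy implementation per device.
void adaptive_copy(const int* src, int* dst, size_t n) {
  int dev = 0, major = 0;
  cudaGetDevice(&dev);
  cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, dev);

  if (major >= 9) {
    // Assumed: on Hopper-class parts a plain kernel saturates bandwidth better.
    thrust::transform(thrust::device, src, src + n, dst, thrust::identity<int>{});
  } else {
    // On older parts the copy-engine path already reaches ~90% of peak.
    cudaMemcpyAsync(dst, src, n * sizeof(int), cudaMemcpyDeviceToDevice);
  }
}
```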

> @gevtushenko : does thrust::copy_n use a kernel to perform the copying? Perhaps, that should be used instead.

thrust::copy_n uses the same implementation as thrust::copy and will also use cudaMemcpyAsync when possible, and otherwise thrust::transform.
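The dispatch described above can be illustrated with a short sketch (details hedged: the exact eligibility conditions are an assumption based on this comment; broadly, a plain copy of contiguous, trivially copyable elements can take the cudaMemcpyAsync path, while fancy iterators force a kernel):

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/functional.h>

int main() {
  thrust::device_vector<int> src(1 << 20, 2), dst(1 << 20);

  // Contiguous ranges of a trivially copyable type:
  // eligible for the cudaMemcpyAsync path.
  thrust::copy_n(src.begin(), src.size(), dst.begin());

  // Non-contiguous source (a transform_iterator): cannot be expressed as a
  // memcpy, so this goes through the kernel (thrust::transform) path instead.
  auto negated = thrust::make_transform_iterator(src.begin(), thrust::negate<int>{});
  thrust::copy_n(negated, src.size(), dst.begin());
}
```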

bernhardmgruber commented 1 week ago

I discussed this with @gevtushenko yesterday, and he remembers a time when we actually had a custom kernel for thrust::copy but switched to cudaMemcpy because the latter was faster. We want to avoid ping-ponging between a custom kernel and cudaMemcpy ourselves, and would rather let the team behind cudaMemcpy handle this, for which you already opened a bug report.

This does not mean we could not still make an exception, but rather that we are trying to address more pressing issues first, e.g. #1673, and see how cudaMemcpy develops.

ahendriksen commented 1 week ago

This is fixed. The fix should be visible in a future public release; please see nvbug 4207603.

bernhardmgruber commented 5 days ago

Great! I will close the issue then, since no further action is necessary from our side. Feel free to reopen it if the problem is not resolved once the fixes land!