NVIDIA / cccl

CUDA Core Compute Libraries

[PERF][BUG]: Thrust uses cudaMemcpy for Device->Device copies (66% SoL on H200) #1672

Closed · ahendriksen closed this issue 5 days ago

ahendriksen commented 2 months ago

Is this a duplicate?

Type of Bug

Performance

Component

Thrust

Describe the bug

thrust::copy uses cudaMemcpy to implement device-to-device copies, which saturates at most 66% of speed-of-light (SoL) memory bandwidth on H200. nvbug 4207603

How to Reproduce

See the godbolt link below for the exact reproducer.

Observed output:

$ ./01_thrust_copy 
     cp_gb  elapsed_ms     bw_gbps  pct_of_sol
      8.59      2.6090      3292.4       67.0%
      8.59      2.6073      3294.5       67.0%
      8.59      2.6061      3296.1       67.0%

Expected behavior

thrust::copy should be able to saturate bandwidth.

Reproduction link

https://godbolt.org/z/foPG4ox53
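The shape of the reproducer can be sketched as follows. This is a minimal, hedged reconstruction from the observed output above, not the exact godbolt code: the buffer size, iteration count, and the theoretical-peak constant (`sol_gbps`) are illustrative assumptions.

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <cstdio>

int main() {
  const size_t n = (1ull << 30) / sizeof(int);  // ~1 GiB of ints (illustrative size)
  thrust::device_vector<int> src(n, 1), dst(n);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  for (int rep = 0; rep < 3; ++rep) {
    cudaEventRecord(start);
    thrust::copy(src.begin(), src.end(), dst.begin());  // dispatches to cudaMemcpyAsync
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    // A copy reads and writes each byte once: 2 * n * sizeof(int) bytes moved.
    const double gb = 2.0 * n * sizeof(int) / 1e9;
    const double sol_gbps = 4915.0;  // assumed H200 peak, for illustration only
    std::printf("%8.2f GB  %8.4f ms  %8.1f GB/s  %5.1f%% of SoL\n",
                gb, ms, gb / (ms / 1e3), 100.0 * gb / (ms / 1e3) / sol_gbps);
  }
}
```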

Operating System

No response

nvidia-smi output

$ nvidia-smi 
Mon Apr 29 05:40:23 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H200                    On  |   00000000:45:00.0 Off |                    0 |
| N/A   27C    P0             73W /  700W |       0MiB / 143771MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

NVCC version

NA

ahendriksen commented 2 months ago

There is a related issue in RAPIDS, where smaller copies are serialized behind larger copies because the copy engines are busy.

@gevtushenko : does thrust::copy_n use a kernel to perform the copying? Perhaps, that should be used instead.

bernhardmgruber commented 1 week ago

I just ran the Thrust benchmark for copy on my A6000 and the current, cudaMemcpy-based implementation performs "well enough":

| T{ct} | Elements | Samples | CPU Time | Noise | GPU Time | Noise | Elem/s | GlobalMem BW | BWUtil |
|---|---|---|---|---|---|---|---|---|---|
| U8 | 2^28 = 268435456 | 558x | 782.671 us | 11.17% | 773.783 us | 8.30% | 346.913G | 693.826 GB/s | 90.33% |
| U16 | 2^28 = 268435456 | 608x | 1.522 ms | 2.59% | 1.518 ms | 2.36% | 176.841G | 707.363 GB/s | 92.09% |
| U32 | 2^28 = 268435456 | 820x | 3.050 ms | 4.37% | 3.045 ms | 4.18% | 88.167G | 705.332 GB/s | 91.83% |
| U64 | 2^28 = 268435456 | 854x | 6.065 ms | 2.53% | 6.061 ms | 2.52% | 44.288G | 708.610 GB/s | 92.26% |
| NonTrivial | 2^28 = 268435456 | 1200x | 6.168 ms | 3.17% | 6.163 ms | 3.16% | 43.554G | 696.871 GB/s | 90.73% |

However, I saw the <66% figure on H200 a few days ago, so there a kernel is probably the better choice. Are there any upsides to using cudaMemcpy? I could imagine that if the device is busy with other work, using the copy engines results in less contention for SMs and better overall application throughput. I am therefore wondering whether we need to give users a knob to choose which copy implementation is used.

Given the performance on A6000 looks fine, we may also want to dispatch between cudaMemcpy and a kernel depending on the GPU we are running on.
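The per-GPU dispatch idea could look roughly like this. This is a hedged sketch, not a proposed patch: the helper name `adaptive_copy` and the SM 9.0 (Hopper) threshold are assumptions chosen purely for illustration.

```cuda
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <thrust/execution_policy.h>
#include <cuda_runtime.h>

// Hypothetical helper: choose the copy implementation per device.
void adaptive_copy(const int* src, int* dst, size_t n) {
  int dev = 0, major = 0;
  cudaGetDevice(&dev);
  cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, dev);

  if (major >= 9) {
    // Assumed: on Hopper-class parts a plain kernel saturates bandwidth better.
    thrust::transform(thrust::device, src, src + n, dst, thrust::identity<int>{});
  } else {
    // On older parts the copy-engine path already reaches ~90% of peak.
    cudaMemcpyAsync(dst, src, n * sizeof(int), cudaMemcpyDeviceToDevice);
  }
}
```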

> @gevtushenko : does thrust::copy_n use a kernel to perform the copying? Perhaps, that should be used instead.

thrust::copy_n uses the same implementation as thrust::copy and will also use cudaMemcpyAsync when possible, and otherwise thrust::transform.
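The dispatch described above can be illustrated with a short sketch (details hedged: the exact eligibility conditions are an assumption based on this comment; broadly, a plain copy of contiguous, trivially copyable elements can take the cudaMemcpyAsync path, while fancy iterators force a kernel):

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/functional.h>

int main() {
  thrust::device_vector<int> src(1 << 20, 2), dst(1 << 20);

  // Contiguous ranges of a trivially copyable type:
  // eligible for the cudaMemcpyAsync path.
  thrust::copy_n(src.begin(), src.size(), dst.begin());

  // Non-contiguous source (a transform_iterator): cannot be expressed as a
  // memcpy, so this goes through the kernel (thrust::transform) path instead.
  auto negated = thrust::make_transform_iterator(src.begin(), thrust::negate<int>{});
  thrust::copy_n(negated, src.size(), dst.begin());
}
```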

bernhardmgruber commented 1 week ago

I discussed this with @gevtushenko yesterday, and he remembers a time when we actually had a custom kernel for thrust::copy but switched to cudaMemcpy because the latter was faster. We want to avoid ping-ponging between a custom kernel and cudaMemcpy ourselves, and would rather let the team behind cudaMemcpy handle this, for which you already opened a bug report.

This does not mean we could not still make an exception, but rather that we are trying to address more pressing issues first, e.g. #1673, and see how cudaMemcpy develops.

ahendriksen commented 1 week ago

This is fixed. The fix should be visible in a future public release; please see nvbug 4207603.

bernhardmgruber commented 5 days ago

Great! I will close the issue then, since no further action is necessary from our side. Feel free to reopen it if the problem is not resolved once the fixes land!