NVIDIA / cccl

CUDA Core Compute Libraries
https://nvidia.github.io/cccl/

[BUG]: thrust::inclusive_scan not working correctly when -rdc=true #2449

Open Olli1080 opened 2 days ago

Olli1080 commented 2 days ago

Is this a duplicate?

Type of Bug

Silent Failure

Component

Thrust

Describe the bug

thrust::inclusive_scan does not work properly for certain vector sizes and seems to stop midway, e.g. [1,2,3,x,x,...,x], where x is the value the result vector was initialized with. After some experiments, it seems that -rdc=true causes this behaviour. Switching from Debug to Release gives exactly the same results.

Visual Studio Community 2022 -Version 17.11.4

CUDA 12.6 Update 1

How to Reproduce

  1. Download reproduction repo at https://github.com/Olli1080/Cuda-Bug.git
  2. Compile and run the solution
  3. It outputs the size ranges where inclusive_scan fails
  4. Disable GenerateRelocatableDeviceCode (i.e. remove -rdc=true), then rebuild and run again
  5. No output is generated, indicating that everything works

Log generated on my system:

at iteration: 1537; 0 != 1537
[1537 - 1921]
at iteration: 3073; 0 != 3073
[3073 - 3841]
at iteration: 4609; 0 != 4609
[4609 - 5761]
at iteration: 6145; 0 != 6145
[6145 -

Expected behavior

inclusive_scan should work regardless of the GenerateRelocatableDeviceCode setting.

Reproduction link

https://github.com/Olli1080/Cuda-Bug.git

Operating System

Windows 11 Pro 10.0.22631

nvidia-smi output

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.94                 Driver Version: 560.94         CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090      WDDM  |   00000000:01:00.0  On |                  Off |
|  0%   38C    P8             22W /  450W |    1460MiB /  24564MiB |      4%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

NVCC version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:26:51_Pacific_Daylight_Time_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0

Olli1080 commented 22 hours ago

After some testing I've got the same results for 12.5 Patch 1 and 12.3 Patch 2 (I didn't check 12.4). The same behaviour can also be triggered without -rdc=true by using -ewp instead.

The issue does not appear if I set SM >= 60, compute >= 60. If the setting is left empty, it defaults to SM_52, compute_52, where the issue occurs. The issue is also present for SM_53, compute_53. I've tested all configurations of 52 <= SM <= 89, 52 <= compute <= 89.