NVIDIA / cccl

CUDA Core Compute Libraries
https://nvidia.github.io/cccl/

[FEA]: Optimized ceiling division `cuda::ceil_div` #2391

Open fbusato opened 1 month ago

fbusato commented 1 month ago

Is this a duplicate?

Area

libcu++

Is your feature request related to a problem? Please describe.

ceil_div can be optimized further for GPU architectures.

Describe the solution you'd like

The current version can be optimized for signed integer inputs (int, int64_t):

Other experiments with unsigned and uint64_t

However, for unsigned inputs the ceiling division can also be computed safely with 1 + ((a - 1) / b). Unfortunately, this works only if a != 0, because a - 1 wraps around when a is 0. We therefore need a guard, a == 0 ? 0 : 1 + ((a - 1) / b), which makes it less efficient than the proposed version.
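To make the a == 0 problem concrete, here is a minimal host-side C++ sketch of the unguarded and guarded forms. The function names are hypothetical and chosen only for illustration; this is not the libcu++ implementation. It assumes 32-bit unsigned inputs and b != 0.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: the bare formula from the text, without a zero check.
// For a == 0, (a - 1) wraps around to UINT32_MAX, so the result is wrong.
inline std::uint32_t ceil_div_unguarded(std::uint32_t a, std::uint32_t b) {
    assert(b != 0u);  // division by zero is outside the scope of this sketch
    return 1u + (a - 1u) / b;
}

// Hypothetical helper: the same formula with the ternary guard for a == 0.
inline std::uint32_t ceil_div_guarded(std::uint32_t a, std::uint32_t b) {
    assert(b != 0u);
    return a == 0u ? 0u : 1u + (a - 1u) / b;
}
```

For a >= 1 both functions agree, and neither can overflow, since 1 + (a - 1) / b is never larger than a. Only the a == 0 case separates them: ceil_div_unguarded(0, 3) evaluates 1 + UINT32_MAX / 3 instead of 0.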

On the other hand, we can replace the ternary operator with min(), which is efficient on GPU:

::min(a, 1 + ((a - 1) / b))
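The min() trick works because, for a >= 1, the value 1 + ((a - 1) / b) never exceeds a, so min() leaves it unchanged. For a == 0 the wrapped-around right-hand side is irrelevant, since min(0, x) is 0 for any unsigned x. The sketch below checks this on the host against a straightforward reference. It uses std::min in place of the CUDA ::min, and all function names are hypothetical; treat it as an illustration under those assumptions, not as the proposed libcu++ code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: branch-free variant from the text, with std::min
// standing in for the device-side ::min.
inline std::uint32_t ceil_div_min(std::uint32_t a, std::uint32_t b) {
    assert(b != 0u);  // division by zero is outside the scope of this sketch
    return std::min<std::uint32_t>(a, 1u + (a - 1u) / b);
}

// Hypothetical reference: quotient plus one if there is a remainder.
inline std::uint32_t ceil_div_reference(std::uint32_t a, std::uint32_t b) {
    assert(b != 0u);
    return a / b + (a % b != 0u ? 1u : 0u);
}

// Compares both versions for every a in [0, limit] and b in [1, limit].
inline bool agrees_with_reference(std::uint16_t limit) {
    for (std::uint32_t a = 0u; a <= limit; ++a) {
        for (std::uint32_t b = 1u; b <= limit; ++b) {
            if (ceil_div_min(a, b) != ceil_div_reference(a, b)) {
                return false;
            }
        }
    }
    return true;
}
```

A call such as agrees_with_reference(300) covers the small-value range, including a == 0. Edge cases near UINT32_MAX can be checked individually. Whether min() actually compiles to fewer instructions than the ternary on a given GPU architecture is something only the generated code can confirm.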

Describe alternatives you've considered

No response

Additional context

No response

fbusato commented 1 month ago

This issue is a follow-up to https://github.com/NVIDIA/cccl/pull/2376