[FEA]: Optimized ceiling division `cuda::ceil_div`

Is this a duplicate?

[x] I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct

Area

libcu++

Is your feature request related to a problem? Please describe.

`cuda::ceil_div` can be optimized further for GPU architectures.

Describe the solution you'd like

The current version can be optimized for signed integer inputs (`int`, `int64_t`):

`(a + b - 1) / b` generates 3 fewer instructions than `(value1 / div1) + ((value1 / div1) * div1 != value1)` for 32-bit inputs on Volta, Ampere, Hopper*
and ~20 fewer instructions for 64-bit inputs.
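For reference, the two signed variants being compared can be sketched in plain host C++ as below. The names `ceil_div_add` and `ceil_div_mul` are hypothetical and are not libcu++ identifiers. The multiply-and-compare form is written as `(a / b) * b != a`, which is an assumption about the intended expression. Both sketches assume `a >= 0` and `b > 0`, and the first also assumes that `a + b - 1` does not overflow.

```cpp
#include <cstdint>

// Hypothetical names for illustration only; not part of libcu++.

// Add-then-divide form.
// Assumes a >= 0, b > 0, and that a + b - 1 does not overflow T.
template <typename T>
constexpr T ceil_div_add(T a, T b) {
  return (a + b - 1) / b;
}

// Divide, then add 1 when the division was inexact.
// Assumes a >= 0 and b > 0; no intermediate value exceeds a.
template <typename T>
constexpr T ceil_div_mul(T a, T b) {
  return (a / b) + ((a / b) * b != a);
}
```

The two forms agree whenever the overflow assumption of the first one holds; the instruction counts quoted above come from the issue's own measurements, not from this sketch.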

Other experiments with `unsigned` and `uint64_t` inputs:

`(value1 / div1) + ((value1 / div1) * div1 != value1)` generates one fewer instruction than `(value1 / div1) + (value1 % div1 > 0)` for 32-bit inputs on Volta, Ampere, and Hopper.
The same holds for 64-bit inputs.
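A minimal host-side sketch of the two unsigned variants follows, so the expressions can be compared side by side. The names `ceil_div_mul_cmp` and `ceil_div_rem` are hypothetical, and the multiply-and-compare form is assumed to compare the product against the dividend. Both assume `b != 0`.

```cpp
#include <cstdint>

// Hypothetical names for illustration only; not part of libcu++.

// Quotient plus 1 if multiplying back does not reproduce the dividend.
template <typename T>
constexpr T ceil_div_mul_cmp(T a, T b) {
  return (a / b) + ((a / b) * b != a);
}

// Quotient plus 1 if the remainder is non-zero.
template <typename T>
constexpr T ceil_div_rem(T a, T b) {
  return (a / b) + (a % b > 0);
}
```

Neither form computes an intermediate larger than `a`, so both are safe for the full range of the unsigned type.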

However, the ceiling division can also be computed safely as `1 + ((a - 1) / b)` for unsigned inputs. Unfortunately, this only works if `a != 0`, so a guard is needed: `a == 0 ? 0 : 1 + ((a - 1) / b)`, which makes it less efficient than the proposed version.

On the other hand, we can replace the ternary operator with `min()`, which is efficient on GPU:

::min(a, 1 + ((a - 1) / b))

This saves one instruction on Volta and Ampere and two instructions on Hopper for 32-bit inputs,
and ~10 instructions for 64-bit inputs.
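The `min()` trick can be checked with a small host-side sketch. The names `ceil_div_ternary` and `ceil_div_min` are hypothetical, and `std::min` stands in here for the device-side `::min` used in the issue. The reasoning is that for `a == 0` the subtraction wraps around, but `min` then selects `a`, which is 0; for `a >= 1` and `b >= 1`, `(a - 1) / b <= a - 1`, so `1 + ((a - 1) / b)` never exceeds `a` and `min` returns the ceiling quotient.

```cpp
#include <algorithm>
#include <cstdint>
#include <type_traits>

// Hypothetical names for illustration only; not part of libcu++.
// Both functions assume an unsigned T and b != 0.

// Guarded form: the explicit a == 0 check avoids the wrap-around of a - 1.
template <typename T>
T ceil_div_ternary(T a, T b) {
  static_assert(std::is_unsigned<T>::value, "sketch assumes unsigned T");
  return a == 0 ? 0 : 1 + ((a - 1) / b);
}

// min form: when a == 0, a - 1 wraps around, but min(a, ...) still yields 0.
// std::min is a host-side stand-in for the device-side ::min.
template <typename T>
T ceil_div_min(T a, T b) {
  static_assert(std::is_unsigned<T>::value, "sketch assumes unsigned T");
  return std::min<T>(a, 1 + ((a - 1) / b));
}
```

This only demonstrates that the two forms return the same values; whether the `min` form compiles to fewer instructions depends on the target architecture and compiler, as measured in the issue.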

Describe alternatives you've considered

No response

Additional context

No response

NVIDIA / cccl