Is your feature request related to a problem? Please describe.
ceil_div can be better optimized for GPU architectures
Describe the solution you'd like
The current version can be optimized for signed integer inputs (int, int64_t):
(a + b - 1) / b generates 3 fewer instructions than (value1 / div1) + ((value1 / div1) * div1 != value1) for 32-bit inputs on Volta, Ampere, and Hopper,
and ~20 fewer instructions for 64-bit inputs.
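As a rough illustration of the two signed variants being compared, here is a minimal host-side sketch. The function names `ceil_div_mul` and `ceil_div_add` are hypothetical and are not the libcu++ API. The add-then-divide form is only valid under the assumptions stated in the comments (non-negative dividend, positive divisor, no overflow in `a + b - 1`). The sketch checks only that the two forms agree; it says nothing about instruction counts.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration only; not the libcu++ API.

// Multiply-and-compare form (the "current version" as quoted in this issue).
template <class T>
constexpr T ceil_div_mul(T a, T b) {
  const T q = a / b;
  return q + (q * b != a);
}

// Add-then-divide form proposed for signed inputs.
// Assumes a >= 0, b > 0, and that a + b - 1 does not overflow T.
template <class T>
constexpr T ceil_div_add(T a, T b) {
  return (a + b - 1) / b;
}

// Checks that both forms agree on a small range of non-negative inputs.
inline void check_signed_variants_agree() {
  for (int a = 0; a <= 64; ++a) {
    for (int b = 1; b <= 9; ++b) {
      assert(ceil_div_mul(a, b) == ceil_div_add(a, b));
      assert(ceil_div_mul<std::int64_t>(a, b) ==
             ceil_div_add<std::int64_t>(a, b));
    }
  }
}
```

To compare the generated instructions, the same functions could be marked `__host__ __device__` and inspected per architecture.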
Other experiments with unsigned and uint64_t:
(value1 / div1) + ((value1 / div1) * div1 != value1) generates one fewer instruction than (value1 / div1) + (value1 % div1 > 0) for 32-bit inputs on Volta, Ampere, and Hopper.
The same holds for 64-bit inputs.
However, the ceiling division can also be computed safely with 1 + ((a - 1) / b) for unsigned inputs. Unfortunately, this works only if a != 0, so we would need a == 0 ? 0 : 1 + ((a - 1) / b), which loses the efficiency advantage over the proposed version.
On the other hand, we can replace the ternary operator with min(), which is efficient on GPU:
::min(a, 1 + ((a - 1) / b))
This saves one instruction on Volta and Ampere, and two instructions on Hopper, for 32-bit inputs.
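The unsigned variants discussed above can be sketched on the host as follows. The function names are hypothetical, and `std::min` stands in for the device-side `::min` as an assumption made so that the example compiles without CUDA. For a == 0, `a - 1` wraps around to the maximum value, but `min(a, ...)` still returns 0, which is why the ternary is not needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical names for illustration only; not the libcu++ API.

// Remainder-based form.
template <class U>
constexpr U ceil_div_mod(U a, U b) {
  return (a / b) + (a % b > 0);
}

// Multiply-and-compare form.
template <class U>
constexpr U ceil_div_mul(U a, U b) {
  return (a / b) + ((a / b) * b != a);
}

// min-based form for unsigned inputs, assuming b > 0.
// When a == 0, a - 1 wraps to the maximum value of U, and min() selects a == 0.
// std::min is a host-side stand-in for the CUDA ::min used in the issue.
template <class U>
constexpr U ceil_div_min(U a, U b) {
  return std::min<U>(a, U{1} + ((a - U{1}) / b));
}

// Checks that all three forms agree on a small range of inputs.
inline void check_unsigned_variants_agree() {
  for (unsigned a = 0; a <= 64; ++a) {
    for (unsigned b = 1; b <= 9; ++b) {
      assert(ceil_div_mod(a, b) == ceil_div_mul(a, b));
      assert(ceil_div_min(a, b) == ceil_div_mul(a, b));
      assert(ceil_div_min<std::uint64_t>(a, b) ==
             ceil_div_mul<std::uint64_t>(a, b));
    }
  }
}
```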
Is this a duplicate?
Area
libcu++
Describe alternatives you've considered
No response
Additional context
No response