Extend DeviceReduce::Sum with requirements API for opt-in reproducibility
Works only for input iterator value_type of float/double
Future work:
Support half/bfloat by upconverting to float/double
Support custom types containing floating-point values by using something similar to the decomposer approach we used for radix sort, decomposing a custom type into something like tuple<float, double, ...>.
Extend BlockReduce with reproducible algorithm
Extend to Scan
For scan, we'd ideally like to fit the aggregate state in 128 bits. That would be tricky: for k=2 (from @maddyscientist's algorithm) the state alone already needs 128 bits, but we could potentially reserve 2 of those bits in the aggregator type for the decoupled look-back signaling (see https://github.com/NVIDIA/cccl/issues/220).
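The decomposer idea from the custom-type item above can be sketched in plain C++. This is only an illustration under assumptions: `custom_t`, `decomposer_t`, and `accumulate_members` are hypothetical names, `std::tuple` stands in for whatever tuple type the real API would use, and the plain `+=` marks the spot where a reproducible per-member accumulator would go.

```cpp
#include <cassert>
#include <tuple>

// Hypothetical user type mixing floating-point members.
struct custom_t {
    float  f;
    double d;
};

// Hypothetical decomposer: exposes the members as a tuple of references,
// so a library could treat each floating-point member independently.
struct decomposer_t {
    std::tuple<float&, double&> operator()(custom_t& value) const {
        return std::tie(value.f, value.d);
    }
};

// Hypothetical member-wise accumulation. The plain += is a placeholder:
// a reproducible summation would replace it for each member.
template <class Decomposer>
void accumulate_members(custom_t& total, custom_t& item, Decomposer decompose) {
    auto total_members = decompose(total);
    auto item_members  = decompose(item);
    std::get<0>(total_members) += std::get<0>(item_members);
    std::get<1>(total_members) += std::get<1>(item_members);
}
```

The point of the design is that the user only describes how the type splits into floating-point members; the summation strategy for each member stays inside the library.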
Is this a duplicate?
Area
CUB
Is your feature request related to a problem? Please describe.
I would like reproducible reductions for floating-point values.
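As a small illustration of the underlying problem: floating-point addition is not associative, so the same values summed in a different order can give a different result. The sketch below uses a sequential host-side loop and hypothetical helper names (`sum_in_order`, `check_order_dependence`); a parallel reduction can hit the same effect whenever the grouping of partial sums changes between runs or configurations.

```cpp
#include <cassert>
#include <vector>

// Plain left-to-right accumulation: the result depends on element order.
double sum_in_order(const std::vector<double>& values) {
    double total = 0.0;
    for (double v : values) {
        total += v;
    }
    return total;
}

// Hypothetical check: the same three values, in two different orders.
void check_order_dependence() {
    const std::vector<double> a = {1e30, -1e30, 1.0};
    const std::vector<double> b = {1e30, 1.0, -1e30};
    // (1e30 + -1e30) + 1.0: the large terms cancel first, so the 1.0 survives.
    assert(sum_in_order(a) == 1.0);
    // (1e30 + 1.0) + -1e30: the 1.0 is absorbed by rounding before the cancellation.
    assert(sum_in_order(b) == 0.0);
}
```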
Describe the solution you'd like
@maddyscientist has a proof-of-concept implementation here: https://github.com/maddyscientist/reproducible_floating_sums/tree/feature/cuda
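MVP:
DeviceReduce::Sum with requirements API for opt-in reproducibility, for an input iterator value_type of float/double
Future work:
Support for further types, e.g. custom types decomposed into something like tuple<float, double, ...>
Describe alternatives you've considered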
No response
Additional context
No response