[EPIC]: Reproducible floating-point reductions

[X] I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct

CUB

I would like reproducible reductions for floating-point values.
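For context on why this needs dedicated support: floating-point addition is not associative, so any reduction whose summation order depends on scheduling or tiling can return different bits from run to run. The snippet below is a minimal host-side illustration, not CUB code; `sum_forward`, `sum_backward`, and `demo_order_dependence` are hypothetical helper names, and it assumes IEEE-754 single precision evaluated without excess precision or fast-math flags.

```cpp
#include <cassert>
#include <vector>

// Accumulate left to right in single precision.
inline float sum_forward(const std::vector<float>& v) {
    float acc = 0.0f;
    for (float x : v) acc += x;
    return acc;
}

// Accumulate right to left in single precision.
inline float sum_backward(const std::vector<float>& v) {
    float acc = 0.0f;
    for (auto it = v.rbegin(); it != v.rend(); ++it) acc += *it;
    return acc;
}

// Same values, different order, different result.
inline void demo_order_dependence() {
    const std::vector<float> v = {1e8f, -1e8f, 1.0f};
    // Forward: (1e8 - 1e8) + 1 = 1.
    assert(sum_forward(v) == 1.0f);
    // Backward: (1 - 1e8) rounds to -1e8, then -1e8 + 1e8 = 0.
    assert(sum_backward(v) == 0.0f);
}
```

A reproducible reduction would need to return the same bits for both orderings.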

MVP:

Future work:

- Support half/bfloat by upconverting to float/double
- Support custom types containing floating-point values, using something similar to the decomposer approach we used for radix sort to decompose a custom type into something like tuple<float, double, ...>
- Extend BlockReduce with a reproducible algorithm
- Extend to Scan
  - For scan, we'd ideally like to fit the aggregate state in 128 bits. This would be tricky because for k=2 (from @maddyscientist's algorithm) the aggregate alone would already need 128 bits, but we could potentially reserve 2 of the bits in the aggregator type for the decoupled look-back signaling (see https://github.com/NVIDIA/cccl/issues/220)
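To make the scan idea above concrete, here is a rough host-side sketch of a 128-bit word that gives up 2 bits for look-back status. Everything in it is an assumption for illustration: `PackedAggregate`, `TileStatus`, `pack`, and the accessors are hypothetical names, the choice of which 2 bits to reserve is arbitrary, and it does not show how the reproducible accumulator itself would tolerate losing those bits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tile states for decoupled look-back; names are illustrative only.
enum class TileStatus : std::uint64_t { Invalid = 0, Partial = 1, Inclusive = 2 };

// Hypothetical 128-bit word: 126 payload bits plus 2 status bits.
struct alignas(16) PackedAggregate {
    std::uint64_t lo; // bits [1:0] = status, bits [63:2] = low 62 payload bits
    std::uint64_t hi; // high 64 payload bits
};
static_assert(sizeof(PackedAggregate) == 16, "must fit in a single 128-bit word");

constexpr std::uint64_t kStatusMask = 0x3;

// Pack a 126-bit payload (split as 64 high bits + 62 low bits) with a status.
inline PackedAggregate pack(std::uint64_t payload_hi,
                            std::uint64_t payload_lo62,
                            TileStatus status) {
    assert((payload_lo62 >> 62) == 0 && "low payload must fit in 62 bits");
    return PackedAggregate{(payload_lo62 << 2) | static_cast<std::uint64_t>(status),
                           payload_hi};
}

// Extract the 2-bit status.
inline TileStatus status_of(const PackedAggregate& p) {
    return static_cast<TileStatus>(p.lo & kStatusMask);
}

// Extract the low 62 payload bits.
inline std::uint64_t payload_lo62_of(const PackedAggregate& p) {
    return p.lo >> 2;
}

// Extract the high 64 payload bits.
inline std::uint64_t payload_hi_of(const PackedAggregate& p) {
    return p.hi;
}
```

Keeping status and payload in one 128-bit word would let a tile publish both with a single 128-bit store, assuming the target architecture provides one with the required atomicity.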
