[FEA]: Add support for multi-reductions (tuple)

Is this a duplicate?

[ ] I confirmed there appear to be no duplicate issues for this request and that I agree to the Code of Conduct

Area

CUB

Is your feature request related to a problem? Please describe.

When several reductions are performed back to back, the compiler is unable to hide latency effectively by interleaving instructions from the different reductions. I wrote a prototype multi-reduction that uses variadic templates and tuples to move the butterfly loop outside of the tuple loop, so that each independent member of the tuple can hide its latency behind the work of the other members.
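To illustrate the loop ordering described above, here is a minimal warp-level sketch. It is not the prototype from the Godbolt link and not an existing CUB API: `warp_multi_sum`, `Result`, `demo_kernel`, and `run_demo` are hypothetical names, a parameter pack stands in for the tuple, only summation is handled, and a full warp of 32 active lanes and C++17 (for the fold expression) are assumed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (NOT a CUB API): sum-all-reduce several independent
// values across a full warp. Assumes all 32 lanes are active.
template <typename... Ts>
__device__ __forceinline__ void warp_multi_sum(Ts&... vals)
{
  // The butterfly loop is outermost ...
  for (int offset = 16; offset > 0; offset >>= 1)
  {
    // ... and the loop over the values (a C++17 fold over the pack) is
    // innermost, so every step issues one independent shuffle + add per
    // value, which the compiler is free to interleave.
    ((vals += __shfl_xor_sync(0xffffffffu, vals, offset)), ...);
  }
}

// Hypothetical result type for the demo.
struct Result
{
  float sum;
  float sum_sq;
  int count_pos;
};

__global__ void demo_kernel(const float* x, Result* out)
{
  const float v = x[threadIdx.x];
  float s = v;                  // running sum
  float q = v * v;              // running sum of squares
  int c   = (v > 0.0f) ? 1 : 0; // running count of positive inputs

  // One fused call instead of three back-to-back reductions.
  warp_multi_sum(s, q, c);

  if (threadIdx.x == 0)
  {
    *out = Result{s, q, c};
  }
}

// Host-side driver: reduces the 32 values -10, -9, ..., 21 with one warp.
Result run_demo()
{
  float* x    = nullptr;
  Result* out = nullptr;
  cudaMallocManaged(&x, 32 * sizeof(float));
  cudaMallocManaged(&out, sizeof(Result));
  for (int i = 0; i < 32; ++i)
  {
    x[i] = static_cast<float>(i) - 10.0f;
  }

  demo_kernel<<<1, 32>>>(x, out);
  cudaDeviceSynchronize();

  const Result r = *out;
  cudaFree(x);
  cudaFree(out);
  return r;
}

int main()
{
  const Result r = run_demo();
  std::printf("sum=%g sum_sq=%g count_pos=%d\n", r.sum, r.sum_sq, r.count_pos);
  return 0;
}
```

Calling a single-value warp reduction three times would instead run three complete butterfly loops in sequence, each step waiting on its own shuffle. A tuple-based interface could apply the same ordering by iterating over the tuple elements inside the butterfly loop.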

Here is a Godbolt link with the sample code: https://godbolt.org/z/qGs7TxjTW

One data point we have: one CFD code got a 1.5x speedup from using a multi-reduction compared with calling reduce repeatedly.

Describe the solution you'd like

Support for multi-reductions over tuples in CUB, along the lines of the prototype: https://godbolt.org/z/qGs7TxjTW

Describe alternatives you've considered

No response

Additional context

No response
