NVIDIA / cccl

CUDA Core Compute Libraries

[FEA]: Add support for multi-reductions (tuple) #1842

Open luitjens opened 4 weeks ago

luitjens commented 4 weeks ago

Is this a duplicate?

Area

CUB

Is your feature request related to a problem? Please describe.

When performing multiple reductions back to back, the compiler is unable to hide latency effectively by interleaving instructions from the separate reductions. I wrote a prototype multi-reduction that uses variadic templates and tuples to move the butterfly loop outside the tuple loop, so that each independent member of the tuple can hide latency across the other members.

Here is a godbolt with this sample code: https://godbolt.org/z/qGs7TxjTW

As one data point, a CFD code saw a 1.5x speedup from using a multi-reduction instead of calling reduce repeatedly.

Describe the solution you'd like

https://godbolt.org/z/qGs7TxjTW

Describe alternatives you've considered

No response

Additional context

No response

bernhardmgruber commented 4 weeks ago

Hi! Thanks for raising awareness of this use case! Off the top of my head, I wondered whether you could achieve this using zip iterators and a tuple as the accumulator. I could get that working with Thrust and CUB: https://godbolt.org/z/xYoPPh858. I don't know whether the generated code is efficient, but the API seems to allow this use case. Could you please check whether this solves your use case? If not, please give us more detail on what you need concretely (e.g. what kind of API changes or extensions, or where performance is suboptimal). Thanks a lot!