Open luitjens opened 4 weeks ago
Hi! Thanks for raising awareness of this use case! Off the top of my head, I wondered whether you could achieve this using zip iterators and a tuple as the accumulator. I got that working with Thrust and CUB: https://godbolt.org/z/xYoPPh858. I don't know whether the generated code is efficient, but the API seems to allow this use case. Could you please check whether this solves your use case? If not, please tell us in more detail what you need concretely (for example, which API changes or extensions, or whether performance is suboptimal). Thanks a lot!
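For readers who do not want to open the link, the general shape of the "zip plus tuple accumulator" idea can be sketched on the host with the standard library alone. This is not the Thrust/CUB code from the godbolt above. `sum_and_max` and `SumMax` are hypothetical names, and the choice of a sum over one input and a max over the other is only an assumed example. The two-range overload of `std::transform_reduce` stands in for the zip iterator.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <numeric>
#include <tuple>
#include <vector>

// Hypothetical accumulator type: element 0 carries a running sum,
// element 1 carries a running maximum.
using SumMax = std::tuple<float, float>;

// Reduces two equally sized inputs in a single pass:
// the sum of `a` and the maximum of `b`.
inline SumMax sum_and_max(const std::vector<float>& a,
                          const std::vector<float>& b) {
    assert(a.size() == b.size());

    // Identity element for (sum, max).
    const SumMax init{0.0f, std::numeric_limits<float>::lowest()};

    return std::transform_reduce(
        a.begin(), a.end(), b.begin(), init,
        // Reduction operator: combines two accumulators member by member.
        [](const SumMax& l, const SumMax& r) {
            return SumMax{std::get<0>(l) + std::get<0>(r),
                          std::max(std::get<1>(l), std::get<1>(r))};
        },
        // "Zip" step: pairs one element from each input into a tuple.
        [](float x, float y) { return SumMax{x, y}; });
}
```

Calling `sum_and_max(a, b)` returns both results from one traversal. Whether the equivalent GPU code interleaves the two reductions well is exactly the open question in this comment.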
Is this a duplicate?
Area
CUB
Is your feature request related to a problem? Please describe.
When performing multiple reductions back to back, the compiler is unable to hide latency effectively by interleaving instructions from each reduction. I wrote a prototype multi-reduction that uses variadic templates and tuples to move the butterfly loop outside of the tuple loop, which allows each independent member of the tuple to hide its latency behind the work of the other members.
Here is a godbolt with this sample code: https://godbolt.org/z/qGs7TxjTW
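Separately from the linked prototype, the loop ordering described above can be illustrated with a small host-side simulation. This is only a sketch under assumptions: it models a 32-lane warp as an array, replaces the hardware exchange (on a GPU, something like `__shfl_xor_sync`) with an array lookup, and uses the hypothetical names `butterfly_multi_reduce`, `combine_members`, and `kLanes`. It shows the structure (butterfly loop outside, tuple loop inside) and not the latency behavior.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <utility>

constexpr std::size_t kLanes = 32;  // assumed warp width

// Inner "tuple loop": applies the I-th operator to the I-th member.
// Each member is independent of the others, which is what would give
// a GPU room to overlap their latencies.
template <class Tuple, class Ops, std::size_t... I>
void combine_members(Tuple& mine, const Tuple& other, const Ops& ops,
                     std::index_sequence<I...>) {
    ((std::get<I>(mine) =
          std::get<I>(ops)(std::get<I>(mine), std::get<I>(other))), ...);
}

// Outer "butterfly loop": log2(kLanes) exchange steps, each of which
// handles every tuple member before moving to the next step.
// Operators are assumed to be associative and commutative.
template <class Tuple, class Ops>
Tuple butterfly_multi_reduce(std::array<Tuple, kLanes> lanes, const Ops& ops) {
    static_assert(std::tuple_size_v<Tuple> == std::tuple_size_v<Ops>,
                  "one operator per tuple member");
    for (std::size_t offset = kLanes / 2; offset > 0; offset /= 2) {
        // Snapshot so that all lanes read their partner's pre-step value,
        // modelling a simultaneous exchange.
        const auto prev = lanes;
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            assert((lane ^ offset) < kLanes);
            combine_members(lanes[lane], prev[lane ^ offset], ops,
                            std::make_index_sequence<std::tuple_size_v<Tuple>>{});
        }
    }
    return lanes[0];  // after the last step every lane holds the full result
}
```

Calling reduce repeatedly corresponds to the opposite nesting: one complete butterfly per member, so each exchange step has only a single dependent chain of work.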
One data point we have: one CFD code got a 1.5x speedup from using a multi-reduction instead of calling reduce repeatedly.
Describe the solution you'd like
https://godbolt.org/z/qGs7TxjTW
Describe alternatives you've considered
No response
Additional context
No response