NVIDIA / cccl

CUDA Core Compute Libraries
https://nvidia.github.io/cccl/

[FEA]: Make CUB block algorithms usable with cuda::std::array #1877

Open pauleonix opened 3 months ago

pauleonix commented 3 months ago

Is this a duplicate?

Area

CUB

Is your feature request related to a problem? Please describe.

#1764 deprecated cub::ArrayWrapper in favor of cuda::std::array without providing a replacement for users who were using it with CUB's block algorithms, which take C-style arrays. cub::ArrayWrapper provided access to its C-style array member, but cuda::std::array doesn't (because std::array doesn't either). Inside CUB this is solved by accessing cuda::std::array's .__elems_ (an implementation detail), which is not a good solution for users.
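
For illustration, here is a minimal sketch of the friction (using cub::BlockLoad with an arbitrary block size and items-per-thread; the kernel is made up, not from the original report):

#include <cub/block/block_load.cuh>
#include <cuda/std/array>

__global__ void example_kernel(const int* d_in)
{
    using BlockLoad = cub::BlockLoad<int, 128, 4>;
    __shared__ typename BlockLoad::TempStorage temp_storage;

    int c_items[4];  // a plain C-style array matches the int (&)[4] parameter
    BlockLoad(temp_storage).Load(d_in, c_items);

    cuda::std::array<int, 4> std_items;
    // BlockLoad(temp_storage).Load(d_in, std_items);          // does not compile
    // BlockLoad(temp_storage).Load(d_in, std_items.__elems_); // relies on a libcu++ implementation detail
    (void)std_items;
}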

Describe the solution you'd like

Replace the C-style arrays in CUB's interfaces with cuda::std::span. Use fbusato's minimal "concepts" for array-like types as proposed in #2286.

Describe alternatives you've considered

cuda::std::span<T, N> seems like the right candidate for the interfaces, but with T being a template type in the CUB algorithms this would still require duplicating the interfaces to stay backward compatible.

As C-style arrays decay to pointers when returned from functions, a simple function like cub::to_array(cuda::std::array) seems impossible. It could be done with a macro that accesses .__elems_, but macros have their own problems.
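
For illustration, such a macro could look like the following sketch (the name is made up, and it relies on libcu++'s internal member name):

#include <cuda/std/array>

// Hypothetical macro, not proposed API: it yields an lvalue of C-style array
// type, so it binds to CUB's T (&)[ITEMS_PER_THREAD] parameters, but it
// depends on the implementation detail __elems_ and has the usual macro downsides.
#define MY_TO_C_ARRAY(std_array) ((std_array).__elems_)

// usage sketch: BlockLoad(temp_storage).Load(d_in, MY_TO_C_ARRAY(std_items));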

Adding an API to cuda::std::array that is not available for std::array is probably against libcu++'s principles.

Adding a whole new cuda::array for access to the C-style array member seems over the top although it might still be easier to maintain than adding overloads taking cuda::std::array parameters to all CUB block algorithms. (Edit: This can be solved using cuda::std::span for the interface as pointed out by miscco below)

I am currently using my own array wrapper to avoid plain C-style arrays in my code when interfacing with CUB's block algorithms.

Additional context

No response

miscco commented 3 months ago

Thanks for opening the discussion.

Sorry for dropping ArrayWrapper; it was an internal type we considered no longer used.

Regarding your request to expand the API of cuda::std::array:

I am not in favor of that, because the actual problem is that the algorithm interface is suboptimal and not that cuda::std::array is lacking an API. Instead of adding a wrapper class to work around a suboptimal interface, we should consider improving the interface of the cub algorithms to take e.g. a cuda::std::span.

@gevtushenko what is your opinion here? I am not deep enough into the cub algorithms to estimate the effort needed to change their API.

pauleonix commented 3 months ago

@miscco As mentioned below the PR, I wasn't a user of ArrayWrapper, although I would have been if I had known about it.

I agree that changing the algorithms' interface would be ideal, but it is either a breaking change or blows up the number of functions in the API.

I'm not sure if span is ideal here because it seems like CUB makes users use local arrays by design (to avoid performance blunders) which the more flexible span would change.

I just wanted to also consider what would be a minimal, non-breaking change (i.e. a conversion macro or a cuda::array/cub::array with .c_array which certainly both have pros and cons as well).

pauleonix commented 3 months ago

Ah, I just realized that replacing C-style arrays with spans in the interface would probably be non-breaking as well. My point about that being a significant design change still stands though.

bernhardmgruber commented 3 months ago

the actual problem is that the algorithm interface is suboptimal and not that cuda::std::array is lacking an API.

That actually nails the issue. I agree!

I am not deep enough into the cub algorithms to estimate the effort needed to change their API.

The CUB interfaces taking references to native arrays are spread widely throughout the codebase, but the refactoring should be straightforward, since native arrays passed into the APIs would just convert to spans. Also, spans can be indexed just like native arrays, so the changes should stay fairly local.

I agree that changing the algorithms' interface would be ideal, but it is either a breaking change or blows up the number of functions in the API.

I think neither a breaking change nor API duplication is needed. Native arrays should convert nicely:

#include <span>

void foo_old(int (&arr)[4]) {}          // takes a reference to a C-style array
void foo_new(std::span<int, 4> arr) {}  // takes a fixed-extent span

int main() {
    int arr[4]{1, 2, 3, 4};
    foo_old(arr);
    foo_new(arr);  // the native array converts implicitly to std::span<int, 4>
}

I'm not sure if span is ideal here because it seems like CUB makes users use local arrays by design (to avoid performance blunders) which the more flexible span would change.

If a span<int, 4> does not perform as fast as an int [4], we have a performance bug in the implementation. It should lead to the same generated code. It may cost a bit more compile time, though.
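
One way to sanity-check that claim (illustrative, not from this thread) is to compare the code generated for the following two functions with optimizations enabled; they should be identical:

#include <span>

// Same loop over the same four contiguous ints, once via an array reference
// and once via a fixed-extent span.
int sum_c_array(const int (&a)[4])
{
    int s = 0;
    for (int x : a) s += x;
    return s;
}

int sum_span(std::span<const int, 4> a)
{
    int s = 0;
    for (int x : a) s += x;
    return s;
}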

pauleonix commented 3 months ago

@bernhardmgruber I meant that users could pass spans that point to, e.g., global memory. Although my hunch that this wasn't (easily) possible with the current API might be wrong.

pauleonix commented 3 months ago

I would hope compilers are able to put local arrays into registers even when they are accessed through a span (assuming everything is inlined).

pauleonix commented 2 weeks ago

@bernhardmgruber I fear using span would not be as non-breaking as we expected it to be. The implicit conversion from a C-style array to a span<T, N> only works when T is a concrete type, as in your example; it does not kick in when T has to be deduced, because template argument deduction does not consider conversions.

This version fails to compile:

#include <span>

template <typename T>
void foo_old(T (&arr)[4]) {}
template <typename T>
void foo_new(std::span<T, 4> arr) {}

int main() {
    int arr[4]{1, 2, 3, 4};
    foo_old(arr);  // OK: T is deduced as int
    foo_new(arr);  // error: T cannot be deduced; deduction ignores the array-to-span conversion
}

See also this StackOverflow answer. So to avoid a breaking change, one would have to duplicate the APIs: the old interfaces would explicitly convert to a span and call the new ones.
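
For illustration, the duplication would look roughly like this sketch (continuing the foo example; the names are placeholders):

#include <span>

template <typename T>
void foo_new(std::span<T, 4> arr) {}  // new span-based interface

template <typename T>
void foo_old(T (&arr)[4]) { foo_new(std::span<T, 4>{arr}); }  // old interface converts and forwards

int main() {
    int arr[4]{1, 2, 3, 4};
    foo_old(arr);                     // existing call sites keep compiling
    foo_new(std::span<int, 4>{arr});  // new call sites name the span explicitly
}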

fbusato commented 2 weeks ago

Please also consider a solution based on minimal "concepts" for array-like types; #2286 proposes a similar approach. It works with raw arrays, std::array, std::span, std::mdspan, std::vector, etc.
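
For illustration only (the actual formulation in #2286 may differ), such an array-like "concept" can be emulated with C++17 type traits, using std::data/std::size so that raw arrays are covered as well:

#include <array>
#include <iterator>
#include <type_traits>
#include <utility>

// True if A provides contiguous data() and size() (via std::data/std::size)
// with elements convertible to T&.
template <typename A, typename T, typename = void>
struct is_array_like : std::false_type {};

template <typename A, typename T>
struct is_array_like<A, T,
    std::void_t<decltype(std::data(std::declval<A&>())),
                decltype(std::size(std::declval<A&>()))>>
    : std::is_convertible<decltype(*std::data(std::declval<A&>())), T&> {};

static_assert(is_array_like<int[4], int>::value);
static_assert(is_array_like<std::array<int, 4>, int>::value);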

pauleonix commented 2 weeks ago

@fbusato I like the suggestion. The only downside I see is that the interface is less clear to the reader. But maybe calling the template type something like ArrayLike instead of Input is enough to make it readable.

Other than that, I only see the same issue as with taking a span: it allows users to shoot themselves in the foot performance-wise by passing spans that don't correspond to registers/local memory. Or is there a way to check this at compile time as well? Either way, I don't think this downside would be significant enough not to do it anyway.

miscco commented 2 weeks ago

I mean, the obviously correct solution is to constrain the algorithms with std::ranges::contiguous_range:

https://godbolt.org/z/T6ehMf158

The issue with that is that we only backport ranges to C++17, which ... is not C++11
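
A minimal sketch of that constraint (miscco's Godbolt link contains his actual example); it needs C++20 ranges, hence the backporting caveat above:

#include <concepts>
#include <ranges>

// Accepts int[4], std::array<int, 4>, std::span<int, 4>, std::vector<int>, ...
template <std::ranges::contiguous_range R>
    requires std::same_as<std::ranges::range_value_t<R>, int>
int sum(R&& items)
{
    int s = 0;
    for (int x : items) s += x;
    return s;
}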

pauleonix commented 2 weeks ago

@miscco I was rather thinking of someone using statically sized spans pointing to global memory instead of using cub::BlockLoad or similar first, thereby causing non-coalesced accesses.

bernhardmgruber commented 2 weeks ago

@miscco I was rather thinking of someone using statically sized spans pointing to global memory instead of using cub::BlockLoad or similar first, thereby causing non-coalesced accesses.

You can also do that today if you have a statically sized array in global memory: it can be passed straight to a CUB agent taking a reference to such an array.

pauleonix commented 2 weeks ago

@bernhardmgruber Yeah, but having a global buffer of arrays is a rather rare occurrence, especially given that vector types exist. And using casts to achieve the same is hopefully far enough off the beaten path not to happen accidentally.

Still, I'm a fan of more flexibility. Maybe there are even some very creative, proper use cases that are enabled, or at least simplified, by this interface change. I have been wondering whether ranges::contiguous_range is actually stricter than necessary, as it does not accept e.g. something built with range adaptors. In terms of performance it probably does not matter whether I pass such a view or put the results into a (register) array first (i.e. it's not as important as for device algorithms, where it enables kernel fusion). But in terms of simple code it could be quite nice to be able to pass views, and it would also make code using STL ranges:: algorithms and CUB block algorithms look more similar.

Edit: I guess the main problem with that idea is that views from range adaptors don't have static size even if they could?
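
For illustration (not from the thread): a fixed-extent span carries its size in its type, while a view produced by a range adaptor only reports a run-time size, even when the underlying range has a static size:

#include <array>
#include <ranges>
#include <span>

static_assert(std::span<int, 4>::extent == 4);  // compile-time extent

int sum_doubled()
{
    std::array<int, 4> a{1, 2, 3, 4};
    auto doubled = a | std::views::transform([](int x) { return 2 * x; });
    // std::ranges::size(doubled) is 4, but only as a run-time value; there is
    // no compile-time extent to feed into an ITEMS_PER_THREAD-style parameter.
    int s = 0;
    for (int x : doubled) s += x;
    return s;
}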

bernhardmgruber commented 2 weeks ago

I am reopening this issue because there are more CUB agents that take references to statically sized arrays. Furthermore, our workaround for the histogram agents, which reached into the guts of ::cuda::std::array after we deprecated cub::ArrayWrapper, has now been removed.