Open pauleonix opened 3 months ago
Thanks for opening the discussion.
Sorry for dropping ArrayWrapper
it was an internal type we considered not used anymore.
Regarding your request to expand the API of cuda::std::array
:
I am not in favor of that because the actual problem is that the algorithm interface is suboptimal and not that cuda::std::array
is lacking an API. Instead of adding a wrapper class to work around a suboptimal interface we should consider improving the interface of the cub algorithms to take e.g a cuda::std::span
@gevtushenko what is your opinion here. I am not that deep into the cub algorithms that I could estimate the effort needed to change their API
@miscco As mentioned below the PR I wasn't a user of ArrayWrapper
although I would have been if I had known about it.
I agree that changing the algortihms interface would be ideal but is either a breaking change or blows up the amount of functions in the API.
I'm not sure if span
is ideal here because it seems like CUB makes users use local arrays by design (to avoid performance blunders) which the more flexible span
would change.
I just wanted to also consider what would be a minimal, non-breaking change (i.e. a conversion macro or a cuda::array
/cub::array
with .c_array
which certainly both have pros and cons as well).
Ah, I just realized that replacing C-style arrays with span
s in the interface would probably be non-breaking as well. My point about that being a significant design change still stands though.
the actual problem is that the algorithm interface is suboptimal and not that
cuda::std::array
is lacking an API.
That actually nails the issue. I agree!
I am not that deep into the cub algorithms that I could estimate the effort needed to change their API
The CUB interfaces taking references to native arrays are widely spread around, but the refactoring should be straight forward, since native arrays passed into the APIs would just convert to spans now. Also, the spans could just be indexed like native arrays, so the changes should be somewhat contained locally.
I agree that changing the algortihms interface would be ideal but is either a breaking change or blows up the amount of functions in the API.
I think neither breaking change nor API duplication is needed. Native arrays should convert nicely:
#include <span>
void foo_old(int (&arr)[4]) { ... }
void foo_new(std::span<int, 4> arr) { ... }
int main() {
int arr[4]{1, 2, 3, 4};
foo_old(arr);
foo_new(arr);
}
I'm not sure if
span
is ideal here because it seems like CUB makes users use local arrays by design (to avoid performance blunders) which the more flexiblespan
would change.
If a span<int, 4>
does not perform as fast as an int [4]
we have a performance bug in the implementation. It should lead to the same generated code. It may take a bit more compile-time though.
@bernhardmgruber I meant that users could pass span
s that point to e.g. global memory. Although my hunch that that wasn't (easily) possible with the current API might be wrong.
I would hope compilers are able to put local arrays into registers even when they are accessed through a span
(assuming everything is inlined).
@bernhardmgruber I fear using span
would not be as non-breaking as we expected it to be. Implicitly casting a C-style array to a span<T, N>
only works when T
is not a template type like in your example.
This version fails to compile:
#include <span>
template <typename T>
void foo_old(T (&arr)[4]) {}
template <typename T>
void foo_new(std::span<T, 4> arr) {}
int main() {
int arr[4]{1, 2, 3, 4};
foo_old(arr);
foo_new(arr);
}
See also this StackOverflow answer. So to avoid a breaking change one would have to duplicate the API's where the old interface would explicitly cast to span
and call the new interface.
please also consider a solution based on minimal "concepts" for array-like types. #2286 proposes a similar approach. This approach works with raw array, std::array. std::span, std::mdspan, std::vector, etc.
@fbusato I like the suggestion. The only downside I see is that the interface is less clear to the reader. But maybe calling the template type something like ArrayLike
instead of Input
is enough to make it readable.
Other than that I only see the same issue as with taking span
which is that it allows users to shoot themselves in the foot performance-wise by passing span
s that don't correspond to registers/local memory. Or is there a way to check this at compile-time as well? Either way I don't think that this downside would be significant enough not to do it anyway.
I mean the obviously correct solution is to constrain the algorithms with std::ranges::contiguous_range
https://godbolt.org/z/T6ehMf158
The issue with that is that we only backport ranges to C++17, which ... is not C++11
@miscco I was rather thinking of someone using static size span
s pointing to global memory instead of using cub::BlockLoad
or similar first and therefore cause non-coalesced access.
@miscco I was rather thinking of someone using static size
span
s pointing to global memory instead of usingcub::BlockLoad
or similar first and therefore cause non-coalesced access.
You can also do that today if you have a statically-sized array in global memory. Can be passed straight to a CUB agent taking a reference to such an array.
@bernhardmgruber Yeah but having a global buffer of arrays is a rather rare occurrence, especially given that vector types exist. And using casts to achieve the same is hopefully off-the-beaten-path-enough to not happen accidentally.
Still, I'm a fan of more flexibility. Maybe there are even some very creative, proper use-cases that are enabled or rather simplified by this interface change. I have been thinking if ranges::contiguous_range
is actually stricter than necessary as it does not accept e.g. something using range adaptors. I mean in terms of performance it probably does not matter if I pass such a view or put the results into an (register-) array first (i.e. it's not as important as it is for device algorithms where it enables kernel fusion). But in terms of simple code it could be quite nice to be able to pass views. Also in terms of code looking more similar between using STL ranges::
algorithms and CUB block algorithms.
Edit: I guess the main problem with that idea is that views from range adaptors don't have static size even if they could?
I am reopening this issue, because there are more CUB agents that take references to statically-sized arrays. However, our workaround of reaching into the guts of ::cuda::std::array
to pass data to the histogram agents, because we deprecated cub::ArrayWrapper
, was removed now.
Is this a duplicate?
Area
CUB
Is your feature request related to a problem? Please describe.
1764 deprecated
cub::ArrayWrapper
in favor ofcuda::std::array
without providing a replacement for users using it with CUB's block algorithms that take C-style arrays.cub::ArrayWrapper
provided access to it's C-stylearray
member, butcuda::std::array
doesn't (becausestd::array
doesn't either). Inside CUB this is solved by accessingcuda::std::arrays
's.__elems_
(an implementation detail) which is not a good solution for users.Describe the solution you'd like
Replace the C-style arrays in CUBs interfaces withUse fbusato's "minimal 'concepts' for array-like types" as proposed in #2286.cuda::std::span
s.Describe alternatives you've considered
cuda::std::span<T, N>
seems like the right candidate for the interfaces but withT
being a template type in the CUB algorithm this would still need a duplication of the interfaces to stay backward compatible.As C-style arrays decay to pointers when returned from functions, a simple function like
cub::to_array(cuda::std::array)
seems impossible. It could be done with a macro that accesses.__elems_
but macros have their own problems.Adding an API to
cuda::std::array
that is not available forstd::array
is probably against libcu++'s principles.Adding a whole new
cuda::array
for access to the C-style array member seems over the topalthough it might still be easier to maintain than adding overloads taking(Edit: This can be solved usingcuda::std::array
parameters to all CUB block algorithms.cuda::std::span
for the interface as pointed out by miscco below)I am currently using my own array wrapper to avoid plain C-style arrays in my code when interfacing with CUB's block algorithms.
Additional context
No response