alpaka-group / alpaka

Abstraction Library for Parallel Kernel Acceleration
https://alpaka.readthedocs.io
Mozilla Public License 2.0

Unified (sparse) BLAS, FFT, etc? #1794

Open pauleonix opened 1 year ago

pauleonix commented 1 year ago

Are there plans to add abstractions for math libraries in alpaka or another alpaka-group project? Alpaka seems to do a great job of abstracting the process of writing kernels, but many HPC applications also use vendor-tuned libraries like Intel MKL, cuBLAS, cuSPARSE, cuFFT and their AMD (roc*) counterparts. While the APIs for e.g. BLAS are very similar across implementations, the small differences often result in ugly code full of #ifdefs.
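To make that concrete, here is a minimal sketch of what a single portable SGEMM call tends to look like today (not alpaka code; the USE_* macros and the per-call handle management are purely illustrative, and error checking is omitted):

// C = alpha * A * B + beta * C for column-major m x k, k x n and m x n matrices.
// (For the GPU backends, A, B and C would of course have to be device pointers.)
#if defined(USE_CUBLAS)
#include <cublas_v2.h>
#elif defined(USE_ROCBLAS)
#include <rocblas/rocblas.h>
#else
#include <cblas.h> // e.g. MKL's CBLAS interface
#endif

void sgemm(int m, int n, int k, float alpha, float const* A, float const* B, float beta, float* C)
{
#if defined(USE_CUBLAS)
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, &alpha, A, m, B, k, &beta, C, m);
    cublasDestroy(handle);
#elif defined(USE_ROCBLAS)
    rocblas_handle handle;
    rocblas_create_handle(&handle);
    rocblas_sgemm(handle, rocblas_operation_none, rocblas_operation_none, m, n, k, &alpha, A, m, B, k, &beta, C, m);
    rocblas_destroy_handle(&handle);
#else
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, A, m, B, k, beta, C, m);
#endif
}

Three backends, three handle types, three spellings of "no transpose", and this is the easy case where the math itself is identical.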

Kokkos seems to have such wrappers for (sparse) BLAS but not for FFTs (yet) in kokkos-kernels.

bernhardmgruber commented 1 year ago

We do have plans for abstracting vendor specific linear algebra libraries. @sliwowitz is working on this. Maybe he can comment on the status of this effort.

pauleonix commented 1 year ago

After reading #1798 I want to clarify that I would include random number generation in the scope of this question as well, but in terms of priority I think that BLAS functionality (dense, then sparse) called from the host is the most used and therefore the most important.

I explicitly mentioned BLAS and FFTs because these should be the easiest to wrap: the existing APIs are already very similar (oriented around CBLAS and FFTW), and I was confident that optimized libraries exist for them on all platforms. I haven't studied the AMD libraries and Intel's MKL enough to know which other CUDA math libraries have equivalents on other platforms.

After what @bernhardmgruber wrote, my main question is answered: Yes, these kinds of library wrappers are generally part of the scope of alpaka and they are being worked on. If @sliwowitz has time to comment on the status for me and other interested readers, that would be welcome, but I would be fine with closing this issue either way.

Thanks for the answer(s) and the work on alpaka!

j-stephan commented 1 year ago

There is an effort to bring a subset of BLAS into the C++ standard: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p1673r7.html#purpose-of-this-paper

When we tackle this issue it might be a good idea to use this as a basis. Thoughts, @sliwowitz @bernhardmgruber @psychocoderHPC?
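To give a flavour of that interface, here is a minimal sketch using the stdBLAS reference implementation at https://github.com/kokkos/stdBLAS (header and namespace names are those of the reference implementation and may still change before standardization):

#include <array>
#include <experimental/linalg>
#include <experimental/mdspan>

namespace stdex = std::experimental;

int main()
{
    std::array<double, 6> A_storage{1, 2, 3, 4, 5, 6};
    std::array<double, 3> x_storage{1, 1, 1};
    std::array<double, 2> y_storage{};

    stdex::mdspan A(A_storage.data(), 2, 3); // non-owning 2 x 3 matrix view
    stdex::mdspan x(x_storage.data(), 3);
    stdex::mdspan y(y_storage.data(), 2);

    stdex::linalg::matrix_vector_product(A, x, y); // y = A * x
}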

pauleonix commented 1 year ago

Together with executors, senders/receivers, mdspan, etc., these future C++ features would ideally completely replace not only these BLAS wrappers but, as I understand them, also frameworks like alpaka itself. That being said, it will probably still take quite some time for these things to mature.

NVIDIA is certainly trying to accelerate these developments (see their nvc++ compiler with parallel-algorithm offload, the libcu++ standard library, and e.g. RAFT (RAPIDS) adopting mdspan/mdarray before standardization).

While it generally seems like a good idea to me to adopt the same interface, to get the full picture one would likely also need to adopt some of the other features (at least mdspan and executors?) to really make it work. Considering that these APIs can still change significantly during the standardization process, tracking them might bring a lot of work and instability. But looking at their present state and taking some inspiration from them seems like a good idea anyway.
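For illustration of the nvc++ offload I mentioned, this is plain standard C++ with nothing CUDA-specific in it; compiled with NVIDIA's -stdpar=gpu flag the algorithm runs on the GPU, with any other compiler it runs on the CPU:

#include <algorithm>
#include <execution>
#include <vector>

int main()
{
    std::vector<float> v(1'000'000, 1.0f);
    // Standard parallel algorithm with an execution policy; the offload (or not)
    // is purely a compiler/runtime decision.
    std::transform(std::execution::par_unseq, v.begin(), v.end(), v.begin(),
                   [](float x) { return 2.0f * x; });
}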

j-stephan commented 1 year ago

Well, mdspan is on our to do list anyway (see #1482). I don't think we have talked about executors yet, but IMO it should be possible to create alpaka-based executors. This would mean that we'd have to create separate libraries for the abstract BLAS, ... functionality.

mhoemmen commented 1 year ago

Greetings! A colleague pointed out this discussion. I'm an author of P1673 and other mdspan and mdarray proposals, and would be happy to answer any questions you might have about them.

@j-stephan wrote:

There is an effort to bring a subset of BLAS into the C++ standard....

The link you posted is to an older version of the proposal. You can use the wg21.link service to get the latest published version. https://wg21.link/p1673 currently points to R10.

@pauleonix wrote:

Considering that these APIs can still change significantly during the standardization process....

mdspan (P0009 and follow-on papers) was voted into the C++23 draft this summer, so it is unlikely to change much. There are two national body (NB) comments relating to mdspan. The first NB comment asks for approval of P2630 (adding submdspan, the slicing interface). P2630 is currently under LEWG review, and LEWG has asked for more information (see https://github.com/cplusplus/papers/issues/1293 ), so I consider it unlikely to be approved for C++23. I also would prefer to see P2642 (padded mdspan layouts) approved in some form, which would affect what submdspan returns for layout_left and layout_right mdspan. The second NB comment asks for default_accessor to take a second optional IndexType template parameter, and to use index_type as the offset, rather than size_t. I assess this as a small and reasonable change that might reach C++23.

submdspan (P2630) is still under LEWG design review. I think the function name and the order and number of parameters are unlikely to change.

mdarray (P1684) will need to see LEWG at least once more, so its interface is perhaps less fixed.

stdblas (P1673) is still under design review. LEWG has asked to split up the task of reviewing it into small groups.
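If it helps, here is a tiny sketch of the submdspan slicing interface as it exists in the Kokkos reference implementation (https://github.com/kokkos/mdspan); the header and namespace are those of the reference implementation, not yet the standard ones:

#include <experimental/mdspan>
#include <utility>

namespace stdex = std::experimental;

int main()
{
    float data[12] = {};         // backing storage for a 3 x 4 row-major matrix
    stdex::mdspan m(data, 3, 4); // non-owning 3 x 4 view

    // Rows 1..2 (the half-open pair {1, 3}) and all columns:
    // a 2 x 4 strided view into the same storage, no copy involved.
    auto sub = stdex::submdspan(m, std::pair{1, 3}, stdex::full_extent);

    return sub.extent(0) == 2 && sub.extent(1) == 4 ? 0 : 1;
}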

fwyzard commented 1 year ago

Hi @mhoemmen, thanks for pointing out these new proposals.

I've skimmed P2642 but I'm not very familiar with mdspan and its layouts, so I didn't really get what the new proposal adds.

Does the overaligned access described in section 3.2.2, point 1, cover the same use case we discussed in https://github.com/kokkos/mdspan/issues/117? Or does it assume that the padding is still a multiple of the elements' size?

mhoemmen commented 1 year ago

Hi @fwyzard !

@fwyzard wrote:

Does the overaligned access described in section 3.2.2, point 1, cover the same use case we discussed in https://github.com/kokkos/mdspan/issues/117? Or does it assume that the padding is still a multiple of the elements' size?

https://github.com/kokkos/mdspan/issues/117 is a weird (but legal) use case. It conflates two different features:

  1. a strided ("pitched") allocation, and
  2. unaligned access (not aligned to the value type's alignment -- this is what makes it weird).

Layout mappings themselves are independent of the element's size (*). The padding affects the stride, which is an element index, not a byte index. The key to https://github.com/kokkos/mdspan/issues/117 is that you can make the layout mapping index into bytes and have the accessor turn those bytes into values. That's how it works if the pitch is not a multiple of sizeof(value_type). The example turns out to need two nonunit strides: the allocation's pitch, and sizeof(T) (the value type's size in bytes). As a result, the padded layouts (`layout_left_padded` / `layout_right_padded`) won't work here, precisely because of the accessor trick that turns a byte offset into a proxy-reference-to-element-type.
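To spell out the element-index point, here is a small sketch for the ordinary case where the pitch is a multiple of sizeof(value_type), using the reference implementation's layout_stride (illustrative only):

#include <array>
#include <experimental/mdspan>

namespace stdex = std::experimental;

int main()
{
    // A 3 x 5 float matrix stored with a padded row pitch of 8 *elements*
    // (think of a pitched allocation whose pitch happens to be 32 bytes).
    std::array<float, 3 * 8> storage{};

    using ext_t = stdex::dextents<int, 2>;
    stdex::layout_stride::mapping<ext_t> map{ext_t{3, 5}, std::array<int, 2>{8, 1}};
    stdex::mdspan<float, ext_t, stdex::layout_stride> m{storage.data(), map};

    // The padding shows up as an element-count stride of 8; sizeof(float)
    // never enters the layout mapping.
    return m.stride(0) == 8 ? 0 : 1;
}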

To understand "Point 1" in P2642, I recommend reviewing the overaligned access example.

(*) Nothing forbids one from writing a custom layout that encodes the element's size, but none of the use cases discussed here call for that. This could be useful, for example, if trying to describe hardware acceleration of array operations that uses different array layouts for different element types.

bernhardmgruber commented 1 year ago

Hi @mhoemmen! Great to see you joining the discussion! I am in a hurry, so here are just a few quick comments:

First of all, we are super happy that mdspan made it into the working draft! I hope it stays there until the release of C++23. The subsequent features like submdspan and the padded layouts are great as well! I tried submdspan in #1788 and it works! I am doubtful, though, that they will make C++23, because the feature freeze for C++23 has already happened and it seems improper to me to try to land further new material via NB comments. Also, the fact that the new padded layouts changed the behavior of submdspan indicates to me that the facility might not be fully baked yet. But we will see in Kona (I will join virtually).

In the end, what matters is what compilers and standard libraries support. So if the new facilities land very early in C++26, they may still ship close in time with mdspan! And apart from that, you can also get them via package managers. I really like mdspan's design in this regard, because I hope to see many people come up with very interesting layouts and accessors which they can plug into the one std::mdspan. Again, well done!

For the standard BLAS interfaces, I think there is still some work necessary, as you outlined. Once that lands, vendors also need to adopt it: e.g., Intel would have to come up with a conformant interface to MKL, and the same goes for NVIDIA etc. I expect this to take many more years, which portability layers like alpaka and Kokkos need to bridge. They could implement the standard interface from the proposal in the meantime, so it is definitely interesting to us that such a proposal exists! Whether we have the resources to provide and maintain such an implementation is a different question.

  2. unaligned access (not aligned to the value type's alignment -- this is what makes it weird).

It's important to point out that the access is not unaligned. The alignment is just smaller than the size of the value type.

struct RGB {
    float r, g, b;
};

This struct has a size of 12 bytes, but an alignment of only 4 bytes.

We also know that the pitches we typically have are powers of two (e.g. 512 bytes), so even though the pitch is not divisible by the value type's size, it is divisible by the size and alignment of the value type's members.
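Spelled out with the struct from above and a 512-byte pitch (assuming the usual 4-byte float):

struct RGB { float r, g, b; }; // as above

static_assert(sizeof(RGB) == 12, "three floats, no padding");
static_assert(alignof(RGB) == 4, "alignment of the float members");
static_assert(512 % sizeof(RGB) != 0, "the pitch is not a multiple of the element size");
static_assert(512 % alignof(RGB) == 0, "but every row start stays aligned for the members");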

mhoemmen commented 1 year ago

Thanks for the kind words!

I tried submdspan in https://github.com/alpaka-group/alpaka/pull/1788 and it works! I am doubtful, though, that they will make C++23, because the feature freeze for C++23 has already happened and it seems improper to me to try to land further new material via NB comments.

I recommended against trying to get submdspan into C++23 for that reason.

It's important to point out that the access is not unaligned. The alignment is just smaller than the size of the value type.

Ah, sorry! If it's legal C++ to dereference the offset pointer, then you can skip the whole proxy reference thing and just write an accessor whose access function does `*reinterpret_cast<RGB*>(p + i)`.
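Roughly like this (a sketch only; it assumes the layout mapping hands the accessor byte offsets and that reinterpreting the pointed-to bytes as RGB is valid for the given allocation):

#include <cstddef>

struct RGB { float r, g, b; };

// Accessor whose data handle is a byte pointer and whose indices are byte
// offsets; access() reinterprets the addressed bytes as an RGB element.
struct rgb_byte_accessor
{
    using element_type     = RGB;
    using reference        = RGB&;
    using data_handle_type = std::byte*;
    using offset_policy    = rgb_byte_accessor;

    reference access(data_handle_type p, std::size_t i) const noexcept
    {
        return *reinterpret_cast<RGB*>(p + i);
    }

    data_handle_type offset(data_handle_type p, std::size_t i) const noexcept
    {
        return p + i;
    }
};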