alpaka-group / alpaka

Abstraction Library for Parallel Kernel Acceleration
https://alpaka.readthedocs.io
Mozilla Public License 2.0

Libraries for explicit vectorization that might be usable for the Alpaka element layer #652

Open bussmann opened 6 years ago

bussmann commented 6 years ago

Vectorization is still an open issue (does it belong in alpaka at all? How do we enforce it?). I want to use this issue to collect a list of libraries that might help. Please extend it as you see fit.

Update: I have started ordering the list according to my own assessment of usability, sustainability, etc. My current take is that VecCore from CERN uses Vc as a backend, while xsimd is a genuinely independent approach. Vc is also developed within Helmholtz (Volker Lindenstruth's group) and thus in principle has long-term support, and there seems to be activity towards including it in the C++ standard.

1) Vc: https://github.com/VcDevel/Vc
2) xsimd: https://github.com/QuantStack/xsimd
3) VecCore: https://github.com/root-project/veccore

These projects seem to be mostly one-person efforts or not very active at all:

- Inastemp: https://gitlab.mpcdf.mpg.de/bbramas/inastemp
- boost.simd: https://github.com/NumScale/boost.simd
- VCL: https://www.agner.org/optimize/#vectorclass
- VCL KNC: https://github.com/mancoast/vclknc
- QuickVec (a student project): https://www.andrew.cmu.edu/user/mkellogg/15-418/final.html#

ax3l commented 6 years ago

xsimd: https://github.com/QuantStack/xsimd

sbastrakov commented 6 years ago

Never tried it, but it could be good (though it is currently not part of Boost): https://github.com/NumScale/boost.simd

j-stephan commented 3 years ago

We would like to get this into alpaka 0.7.0. However, this requires #38 to be resolved.

bernhardmgruber commented 3 years ago

One of the main issues with SIMD libraries and alpaka is that you want to write your kernel code using such SIMD facilities and have it emit nice vector code for CPU targets, but the same code also has to compile for GPUs. With existing libraries, this is not trivial.

LLAMA contains such an approach using Vc in: https://github.com/alpaka-group/llama/pull/128. The key idea is that for GPU targets the kernel code compiles down to a scalar version and does not use the SIMD library at all, because SIMD library functions are usually not annotated with __host__ or __device__ and therefore cannot be referenced when we compile for CUDA or HIP.
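To illustrate the idea (this is only a sketch, not the code from the LLAMA PR; the `Pack` alias and `axpy` function are made up for this example):

```cpp
// Sketch: select a scalar "pack" type whenever the translation unit is
// compiled by a GPU compiler, so the SIMD library (here Vc) is never
// referenced in the CUDA/HIP compilation path.
#if defined(__CUDACC__) || defined(__HIPCC__)
template<typename T>
using Pack = T; // scalar fallback for GPU targets
#else
#    include <Vc/Vc>
template<typename T>
using Pack = Vc::Vector<T>; // real SIMD vector for CPU targets
#endif

// Kernel code is then written once against Pack<T>
// (inside an alpaka kernel it would additionally be marked ALPAKA_FN_ACC):
template<typename T>
Pack<T> axpy(T a, Pack<T> x, Pack<T> y)
{
    return a * x + y; // SIMD multiply-add on CPU, plain scalar math on GPU
}
```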

CERN's VecCore solves exactly that by offering a vector type that can be switched at compile time between a Vc vector and a scalar, thus also keeping Vc out when compiling for CUDA. So VecCore could be a potential off-the-shelf solution.

We could also hand-roll our own small SIMD wrapper that either compiles to a scalar, to a loop over a vector of elements, or to a type from a SIMD library such as Vc. But I guess this is a significant effort.
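A rough sketch of what such a hand-rolled wrapper could look like (hypothetical type, not an existing alpaka API): a fixed-size pack whose operators are plain element-wise loops that the host compiler can auto-vectorize, with N = 1 giving the scalar version.

```cpp
#include <array>
#include <cstddef>

template<typename T, std::size_t N>
struct SimdPack
{
    std::array<T, N> elems;

    friend SimdPack operator+(SimdPack a, SimdPack const& b)
    {
        for(std::size_t i = 0; i < N; ++i) // element-wise loop
            a.elems[i] += b.elems[i];
        return a;
    }

    friend SimdPack operator*(SimdPack a, SimdPack const& b)
    {
        for(std::size_t i = 0; i < N; ++i)
            a.elems[i] *= b.elems[i];
        return a;
    }
};

// A further specialization could instead wrap Vc::Vector<T> behind the same interface.
```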

As for the API design, it seems like several implementations are converging on the std::simd design, which you can find here: https://en.cppreference.com/w/cpp/experimental/simd/simd. For a detailed rationale on the design, you can read Matthias Kretz's PhD thesis.
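For reference, a small usage sketch of that std::experimental::simd interface (assuming GCC's libstdc++ implementation of the Parallelism TS 2; the `saxpy` function is just an example):

```cpp
#include <cstddef>
#include <experimental/simd>
#include <vector>

namespace stdx = std::experimental;

// SAXPY over packs of native width, using the TS interface linked above.
void saxpy(float a, std::vector<float> const& x, std::vector<float>& y)
{
    using simd_t = stdx::native_simd<float>;
    for(std::size_t i = 0; i + simd_t::size() <= x.size(); i += simd_t::size())
    {
        simd_t xs, ys;
        xs.copy_from(x.data() + i, stdx::element_aligned); // load a pack
        ys.copy_from(y.data() + i, stdx::element_aligned);
        ys = a * xs + ys; // 'a' is broadcast, ops are element-wise
        ys.copy_to(y.data() + i, stdx::element_aligned);   // store a pack
    }
    // remaining tail elements would be handled with scalar code (omitted for brevity)
}
```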

Also interesting: the Kokkos SIMD library takes exactly this approach as well: https://github.com/kokkos/simd-math. See also the tutorial slides here: https://github.com/kokkos/kokkos-tutorials/blob/main/LectureSeries/KokkosTutorial_05_SIMDStreamsTasking.pdf.

Kokkos SIMD also shows the interaction with Kokkos Views: it seems you declare your SIMD types directly in your views, e.g. Kokkos::View<Kokkos::SIMD<float>>. But there are more interesting options for the SIMD ABI parameter, which I have not studied in detail yet. So we also need to consider how the SIMD types interact with memory views.

sbastrakov commented 3 years ago

Just to add to the list: https://github.com/google/highway

bernhardmgruber commented 1 year ago

Btw, I solved this recently in LLAMA. Here is the documentation: https://llama-doc.readthedocs.io/en/latest/pages/simd.html I also presented it on my poster at ACAT22 last week: https://indico.cern.ch/event/1106990/contributions/4991311/attachments/2533306/4361386/LLAMA%20poster.pdf

fwyzard commented 1 year ago

Btw, Intel is working on a SIMD library based on std::simd that they propose for inclusion in Boost: https://lists.boost.org/Archives/boost/2022/09/253579.php .

bernhardmgruber commented 1 year ago

IIUC, this is an implementation of std::simd by Intel. It's great to see more implementations appearing! And I am especially happy that they are trying to get it into Boost. That is going to be a tough one :)

Thanks for sharing the link!