Open etiennemlb opened 2 years ago
Hi @etiennemlb,
I'll try to answer your question.
All of the mentioned projects aim for the same goal of performance-portable single-source programming. But they take different paths to achieve it (namely their programming models).
RAJA parallelizes through a series of loop transformations which are mapped to the underlying hardware.
Kokkos offers you a `parallel_for` construct (and algorithms based on `parallel_for`) which is customizable with execution policies to map to the various levels of parallelism.
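To make the Kokkos style concrete, here is a minimal, untested sketch (assuming a recent Kokkos release is installed; the labels and sizes are arbitrary):

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int n = 1000;
        // A View is Kokkos' portable array abstraction.
        Kokkos::View<double*> x("x", n);
        // parallel_for runs on whichever execution space was chosen at build time.
        Kokkos::parallel_for("fill", n, KOKKOS_LAMBDA(const int i) { x(i) = 1.0; });
        double sum = 0.0;
        Kokkos::parallel_reduce(
            "sum", n, KOKKOS_LAMBDA(const int i, double& acc) { acc += x(i); }, sum);
        // sum == 1000.0
    }
    Kokkos::finalize();
}
```

The same source compiles for CUDA, HIP, OpenMP, etc., depending on how Kokkos itself was built.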
alpaka gives you the tools to write any portable kernel. However, it is your job as a programmer to fill in the details of the algorithm. If you are interested in a more high-level approach to coding, I recommend taking a look at our vikunja project (which is based on alpaka): https://github.com/alpaka-group/vikunja
Both Kokkos and RAJA try to abstract away the gritty details of device management, memory management, and so on.
alpaka always gives you full control over anything happening in the program. You decide when buffer allocations, offloading to devices, ... happen. Thanks to alpaka's design it is also user-extensible. For example, if you don't like the way our device queue implementation works you can provide your own. If it fulfills all requirements mandated by our internal concepts it will integrate nicely with the other alpaka utilities.
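To illustrate that explicit control, here is a rough, untested sketch of host-side setup in the alpaka style (the API shown approximates alpaka 1.x and details differ between releases; no kernel is launched here):

```cpp
#include <alpaka/alpaka.hpp>
#include <cstddef>

int main() {
    using Dim = alpaka::DimInt<1>;
    using Idx = std::size_t;
    // The accelerator type is a compile-time choice; here, serial CPU.
    using Acc = alpaka::AccCpuSerial<Dim, Idx>;

    auto const platformAcc = alpaka::Platform<Acc>{};
    auto const devAcc = alpaka::getDevByIdx(platformAcc, 0);
    auto const platformHost = alpaka::PlatformCpu{};
    auto const devHost = alpaka::getDevByIdx(platformHost, 0);

    // You create the queue yourself and decide whether it blocks.
    alpaka::Queue<Acc, alpaka::Blocking> queue{devAcc};

    // You decide when buffers are allocated and when data moves.
    auto const extent = alpaka::Vec<Dim, Idx>{Idx{1024}};
    auto bufHost = alpaka::allocBuf<float, Idx>(devHost, extent);
    auto bufAcc = alpaka::allocBuf<float, Idx>(devAcc, extent);
    alpaka::memcpy(queue, bufAcc, bufHost);  // explicit host-to-device copy
    alpaka::wait(queue);                     // explicit synchronization
}
```

Nothing here happens implicitly: every allocation, copy, and synchronization point is visible in the code.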
Since all projects are tailored for the HPC crowd and are maintained by HPC experts they are usually in the same ballpark of performance. They are also similar (to each other and to native programming models) in their capabilities, i.e. what can be expressed in code and mapped to hardware. For a recent study of alpaka vs other programming models see here: https://link.springer.com/chapter/10.1007/978-3-031-10419-0_6
Most of alpaka's abstractions are resolved during compile-time (thus they don't result in runtime overhead). You can therefore assume that alpaka offers you a level of performance very close to the native models like CUDA since the generated machine code is very similar.
We don't use Kokkos and RAJA very often so I won't comment on their downsides. Please get in touch with their respective developers - they know their strengths and weaknesses much better than we do.
alpaka is somewhat verbose (compared to the other programming models). This is not because we are bad API designers but because alpaka is very customizable and offers a lot of control for the user. It requires a certain familiarity with modern C++, though, and you shouldn't be afraid of using C++ templates.
SYCL and OpenCL are somewhat different. Both are "just" API specifications provided by an industry consortium. Industry players (hardware vendors and sometimes third parties) need to provide an implementation suitable for a specific set of hardware. The degree of support varies across vendors; sometimes they don't (yet) support a newer revision of the standard, sometimes they rely heavily on their own extensions for performance. This makes true portability hard to achieve in practice unless you restrict yourself to the lowest common denominator (or maintain different code paths for different runtimes).
In addition, OpenCL is a split-source language (all of the others are single-source C++ APIs): Your host program is coded in C, C++ or another language while the device code is written in the OpenCL C(++) dialect and requires separate compilation at some point.
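A minimal sketch of what split-source means in practice (the host-side calls are abbreviated to comments; real code needs context/device setup and error handling):

```cpp
// Device code lives in a separate string (or .cl file) written in OpenCL C;
// it is compiled by the OpenCL runtime, not by your C++ compiler.
static const char* kernel_src = R"(
    __kernel void scale(__global float* x, float a) {
        size_t i = get_global_id(0);
        x[i] *= a;
    }
)";

// Host side (C or C++), abbreviated:
//   cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, &err);
//   clBuildProgram(prog, ...);                  // runtime compilation step
//   cl_kernel k = clCreateKernel(prog, "scale", &err);
//   clSetKernelArg(...); clEnqueueNDRangeKernel(...);
```

In a single-source model (CUDA, SYCL, Kokkos, alpaka) the kernel would instead be an ordinary C++ function or lambda in the same translation unit as the host code.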
I hope that cleared things up for you.
Hi @etiennemlb, as an alpaka user and sometimes contributor, let me add the reasons the CMS Collaboration has decided to adopt Alpaka rather than Kokkos or SYCL/oneAPI as a performance portability solution for the next 3-5 years.
In our experience, programming with Alpaka is closer to using CUDA than Kokkos and SYCL are. Kokkos strongly advises you to use its abstractions, which may or may not map well to the algorithms and code base at hand. The original SYCL standard is even more different, with its use of buffers rather than pointers. This may be an advantage or a disadvantage depending on one's use cases, of course. The extensions pushed by Intel into oneAPI and the new SYCL standard also alleviate this problem - but AFAIK they are not yet adopted by other SYCL implementations.
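The buffer-versus-pointer distinction can be sketched as follows (an untested sketch assuming a SYCL 2020 implementation such as DPC++; sizes are arbitrary):

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    sycl::queue q;
    const size_t n = 1024;

    // Original SYCL style: buffers + accessors; the runtime tracks
    // data movement and dependencies for you.
    std::vector<float> v(n, 1.0f);
    {
        sycl::buffer<float> buf(v.data(), sycl::range<1>(n));
        q.submit([&](sycl::handler& h) {
            sycl::accessor a(buf, h, sycl::read_write);
            h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) { a[i] *= 2.0f; });
        });
    }  // buffer destructor synchronizes and writes data back to v

    // SYCL 2020 USM style: raw pointers, much closer to CUDA.
    float* p = sycl::malloc_device<float>(n, q);
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) { p[i] = 2.0f; }).wait();
    sycl::free(p, q);
}
```

The USM style is the Intel-pushed extension mentioned above that was folded into SYCL 2020.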
For us it was essential to achieve near-native performance on CPUs and NVIDIA GPUs. We could get this with Alpaka, but not with Kokkos.
After a year-long investigation we concluded that Alpaka and Kokkos are well-established products, while oneAPI was still kind of a work in progress, and the other SYCL implementations were even less ready, especially for targeting NVIDIA or AMD hardware.
We will continue to monitor their progress, of course.
Our software distribution model greatly benefits from being able to ship a single binary that can target multiple backends (e.g. CPUs, NVIDIA and AMD GPUs) at runtime.
We could achieve this using native CUDA and ROCm, and using Alpaka. It was not possible to do it with Kokkos (not even targeting different generations of NVIDIA GPUs). It was initially possible with the SYCL backend of LLVM, but it has been reported to be broken in recent releases.
Hi, thanks a lot for your answers, they'll help me.
Thanks for asking this @etiennemlb. I'm pinning this issue for the time being since it serves nicely as a form of documentation.
Hi,
I'm looking at different flavors of "Abstraction Library for Parallel Kernel Acceleration".
How is Alpaka different from SYCL, Kokkos, RAJA or OpenCL? Pros, cons.
Thanks