[Meta] Consider studying Vulkan as API design input

I do not expect alpaka to support Vulkan as a backend anytime soon, at least not without going through an abstraction layer like Codeplay's SYCL-over-Vulkan implementation, because Vulkan uses a split-source programming model which is fundamentally incompatible with alpaka's current design.

However, I think Vulkan's API design could inform alpaka's future API design direction, because it is one of the most recently designed GPU APIs and probably the lowest-level GPU API to achieve wide-scale popularity in recent times. This means that it exposes a wealth of interesting concepts which are not yet all present in current-generation GPU compute APIs, but may make their way into future APIs and extensions (and indeed, in some cases they already are).

Therefore, Vulkan provides a good largest common denominator to answer alpaka design questions such as "Is there any GPU API which violates this property?" or "Is there some place in the alpaka API which could benefit from future extension headroom?".

This is meant to be a general issue for discussing whether you think that the general idea of taking inspiration from Vulkan makes sense, and which of these Vulkan features you think are particularly important for alpaka to support. If it is agreed that particular Vulkan features should be added to alpaka's abstraction vocabulary, it would be wise to create dedicated issues about these features instead of using this issue, as otherwise the discussion thread would get very messy.

Here are particular Vulkan concepts that I think could provide valuable design input for future alpaka API evolutions [1]:

Backend API extension and evolution: Long-lived APIs evolve over time. Extensions are proposed, initially enabled for some hardware/drivers only, and in some cases eventually merged into the core API.
- The extension merging process can take a long time, and some extensions (e.g. raytracing) may be of interest to developers before they have made it into the core API. Enabling these extensions requires performing some operations during the API instance initialization process.
- The existence of multiple versions of a given core API means that an application may need to adapt to certain API features being only optionally present, either by bailing out if they are absent or by falling back to older constructs at runtime.
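
To make the initialization-time negotiation above more concrete, here is a minimal sketch in plain C++ with no Vulkan dependency. All names (`BackendCaps`, `InitRequest`, `Negotiation`, `negotiate`) are hypothetical and invented for illustration; they are not alpaka or Vulkan API. The sketch assumes the backend can report a core version number and a list of optional extension names before initialization completes.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical: what a backend reports before initialization completes.
struct BackendCaps {
    int coreVersion;                      // e.g. 11 standing for "1.1"
    std::vector<std::string> extensions;  // optional extensions on offer
};

// Hypothetical: what the application asks for.
struct InitRequest {
    int minCoreVersion;                 // hard requirement
    std::vector<std::string> required;  // fail if any of these is missing
    std::vector<std::string> optional;  // enable if present, else fall back
};

// Hypothetical result: `enabled` is only meaningful when `ok` is true.
struct Negotiation {
    bool ok;
    std::vector<std::string> enabled;
};

Negotiation negotiate(const BackendCaps& caps, const InitRequest& req) {
    auto offered = [&caps](const std::string& name) {
        return std::find(caps.extensions.begin(), caps.extensions.end(), name)
               != caps.extensions.end();
    };
    if (caps.coreVersion < req.minCoreVersion) return Negotiation{false, {}};

    Negotiation result{true, {}};
    for (const std::string& name : req.required) {
        if (!offered(name)) return Negotiation{false, {}};  // "bail out" path
        result.enabled.push_back(name);
    }
    for (const std::string& name : req.optional) {
        // Missing optional extensions are skipped; the caller is expected
        // to inspect `enabled` and pick an older code path at runtime.
        if (offered(name)) result.enabled.push_back(name);
    }
    return result;
}
```

The point of the sketch is only that the set of enabled features is decided once, at initialization, which is the constraint an alpaka-level abstraction would have to accommodate.
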
Validation layers: Vulkan ditches all pretense of exhaustive runtime error checking, instead turning large categories of API misuse into undefined behavior. This improves runtime performance, at the cost of making applications harder to debug. To balance this tradeoff better, runtime API checks may be enabled during the application development process.
- Again, enabling these validation layers requires specifying some configuration parameters during the API initialization process.
Device properties: A Vulkan implementation may expose several widely different devices, such as an integrated Intel GPU and a discrete NVIDIA GPU. To help applications select the device that they are interested in, and check if the host system meets their requirements, a wide range of device queries are provided. These range from a simple "device type" enum (enough to discriminate integrated GPU vs discrete GPU) to a very detailed enumeration of hardware limits.
- The need to discriminate between devices within a single API is already coming to alpaka with SYCL support, and will need to be supported appropriately. As discussed during a recent workshop, the easiest way in the beginning is to hardcode a number of known SYCL implementations as alpaka accelerator types, but true runtime device selection may be needed at some point.
- Regarding device queries, alpaka already has some, and it has been discussed before that more will be added as use cases demand, so I think this is relatively well covered.
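
As a rough illustration of what runtime device selection driven by such queries could look like, here is a self-contained C++ sketch. `DeviceType`, `DeviceInfo`, `Requirements` and `pickDevice` are hypothetical names, and the two limits shown are an arbitrary subset chosen for the example. The ranking policy (discrete over integrated over everything else, ties broken by memory size) is an assumption, not something Vulkan or alpaka prescribes.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical mirror of a "device type" enum.
enum class DeviceType { Integrated, Discrete, Cpu, Other };

// Hypothetical subset of per-device properties and limits.
struct DeviceInfo {
    std::string name;
    DeviceType type;
    std::uint64_t localMemoryBytes;
    std::uint32_t maxWorkGroupSize;
};

// Hard requirements of the application.
struct Requirements {
    std::uint64_t minLocalMemoryBytes;
    std::uint32_t minWorkGroupSize;
};

// Returns the index of the preferred device, or -1 if none qualifies.
int pickDevice(const std::vector<DeviceInfo>& devices, const Requirements& req) {
    // Assumed preference order: discrete > integrated > anything else.
    auto rank = [](DeviceType t) -> int {
        switch (t) {
            case DeviceType::Discrete:   return 2;
            case DeviceType::Integrated: return 1;
            default:                     return 0;
        }
    };
    int best = -1;
    for (std::size_t i = 0; i < devices.size(); ++i) {
        const DeviceInfo& d = devices[i];
        // Filter on hard limits first.
        if (d.localMemoryBytes < req.minLocalMemoryBytes) continue;
        if (d.maxWorkGroupSize < req.minWorkGroupSize) continue;
        if (best < 0) { best = static_cast<int>(i); continue; }
        const DeviceInfo& b = devices[static_cast<std::size_t>(best)];
        const bool betterType = rank(d.type) > rank(b.type);
        const bool sameTypeMoreMemory =
            rank(d.type) == rank(b.type) && d.localMemoryBytes > b.localMemoryBytes;
        if (betterType || sameTypeMoreMemory) best = static_cast<int>(i);
    }
    return best;
}
```

Filtering on hard limits before ranking keeps the "does the host meet my requirements" check separate from the "which device do I prefer" policy, which applications may want to override.
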
Queue families: A Vulkan implementation is more or less guaranteed [2] to expose at least one "universal" command queue family which accepts graphics, compute, and any sort of dense data transfer commands. From this queue family, CUDA/OpenCL style "universal" command queues can be created. But a Vulkan implementation may also, and usually will, support multiple command queue families, e.g. a queue family which models DMA data transfers and only supports data transfers with certain requirements, but whose submissions are guaranteed to execute in parallel to graphics and compute commands.
- I think someone mentioned that Vulkan queue families could map onto alpaka's queue properties. This is to be investigated further.
- Queue families break a number of API user assumptions, such as the assumption that a task may be submitted to any queue (it may be specific to a given queue family and incompatible with queues from other families). Synchronization within a queue may use special primitives which are cheaper than those used for synchronization across queues (see also synchronization section).
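
The usual way applications deal with queue families is to scan the reported families for the capability set they need. The following C++ sketch shows that pattern without depending on the Vulkan headers: the `QueueCaps` values mirror Vulkan's `VkQueueFlagBits` (graphics, compute, transfer), but the enum, `QueueFamily`, `findFamily` and `pickCopyFamily` are stand-ins written for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-ins mirroring the values of Vulkan's VkQueueFlagBits; not the real header.
enum QueueCaps : std::uint32_t { Graphics = 0x1, Compute = 0x2, Transfer = 0x4 };

// Hypothetical description of one queue family.
struct QueueFamily {
    std::uint32_t caps;        // bitwise OR of QueueCaps
    std::uint32_t queueCount;  // how many queues can be created from it
};

// Index of the first family that has all `wanted` bits and none of the
// `unwanted` bits, or -1 if there is no such family.
int findFamily(const std::vector<QueueFamily>& families,
               std::uint32_t wanted, std::uint32_t unwanted) {
    for (std::size_t i = 0; i < families.size(); ++i) {
        const std::uint32_t caps = families[i].caps;
        if (families[i].queueCount > 0 &&
            (caps & wanted) == wanted && (caps & unwanted) == 0) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Prefer a transfer-only (DMA-like) family for copies; otherwise fall back
// to a compute-capable family, which per footnote [2] also accepts transfers.
int pickCopyFamily(const std::vector<QueueFamily>& families) {
    const int dedicated = findFamily(families, Transfer, Graphics | Compute);
    if (dedicated >= 0) return dedicated;
    return findFamily(families, Compute, 0);
}
```

A task built for the family returned by `pickCopyFamily` would not necessarily be submittable to a queue from another family, which is the assumption-breaking property mentioned above.
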
Command buffers: To avoid doing one driver call (which involves one syscall among other things) per API call, Vulkan enables batching submissions to the GPU driver into "command buffers".
- I'm leaving this part in because the concept is making its way into every new GPU API and might therefore be worth eventually integrating into the alpaka API once it is more common in compute APIs. But I don't think it is as critical for compute workloads as it is for graphics workloads. We're not usually spamming the API with work in the compute world, and when we do we're usually limited by host/device communication more than by driver overhead.
Synchronization: Rather than having one god host/device synchronization primitive that does everything like legacy GPU APIs, Vulkan has many lighter-weight primitives, which enable reducing the cost of synchronization by only doing as much synchronization as necessary.
- Fences are signaled by the device at the end of a (set of) command batch(es) and can be awaited by the host.
- Semaphores enable synchronizing command batches across multiple device queues, and can also be awaited by the host.
- Events synchronize commands within a single device queue, and may also be signaled by the host.
- Successive API commands on a queue are allowed to execute concurrently by default; pipeline and memory barriers must be used to selectively disable undesirable concurrency and reorderings between successive commands.
- And, like in every other semi-modern GPU API, we have "wait for queue idle" and "wait for device idle" commands that are fine for application teardown and quick prototyping but shouldn't be used frequently in real code.
Specialization constants: Since GPU shaders are JITted by the driver, it is possible to adjust their compilation parameters at runtime in order to generate more optimal device-side code. Vulkan exposes this capability via the concept of specialization constants.
- I could see some very interesting uses for this facility in GPU compute code. Imagine specializing device code for the true parameters of the computation (e.g. execution CLI parameters), without going through a slow and costly host code rebuild...
Raytracing: While most of Vulkan's special-purpose compute features are unlikely to be of interest to alpaka's scientific computing audience, I believe that hardware-accelerated raytracing specifically already raised quite a few eyebrows in the scientific community, as it could have applications beyond graphics, e.g. in particle physics simulation.
- Furthermore, ray tracing pipelines are a relatively self-contained feature which, like compute shaders, is not integrated within the main graphics pipeline. So there is a fair chance that compute APIs will eventually embrace them.
Subgroups: This is analogous to CUDA's notion of a warp: it exposes to shaders that some tasks are executed together in a SIMT lockstep, which enables much more efficient synchronization between SIMT lanes.
- I'm not sure if alpaka exposes that notion already. If not, it may want to consider it for future inclusion, as it's becoming common these days even in compute APIs.
Memory types and heaps: Like OpenCL but a bit more elegantly, Vulkan exposes the full craziness of CPU-GPU memory bus interactions for optimization purposes by allowing device-visible memory to be allocated from various heaps and under different guarantees (memory types):
- Local to the device (i.e. in VRAM, or reserved RAM for integrated GPUs)
- Visible from the host (can be mapped for random writing)
- Host-coherent (host writes are visible from the device without cache flushing and vice versa)
- Host-cached (host writes go through the CPU cache, may not be coherent but faster)
- Lazily allocated (device may optimize out allocation of some resource if not needed)
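
In practice, applications pick a memory type by matching a set of required property flags against what the device reports. Below is a minimal C++ sketch of that selection loop. The `MemProps` values mirror Vulkan's `VkMemoryPropertyFlagBits`, but the enum, `MemoryType`, `findMemoryType` and `findWithFallback` are stand-ins defined here so that the example compiles without the Vulkan headers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-ins mirroring the values of VkMemoryPropertyFlagBits; not the real header.
enum MemProps : std::uint32_t {
    DeviceLocal     = 0x01,
    HostVisible     = 0x02,
    HostCoherent    = 0x04,
    HostCached      = 0x08,
    LazilyAllocated = 0x10
};

// Hypothetical description of one memory type.
struct MemoryType {
    std::uint32_t props;      // bitwise OR of MemProps
    std::uint32_t heapIndex;  // heap this type allocates from
};

// `allowedTypes` has bit i set if memory type i is acceptable for the resource.
// Returns the first allowed type that has all `required` properties, or -1.
int findMemoryType(const std::vector<MemoryType>& types,
                   std::uint32_t allowedTypes, std::uint32_t required) {
    for (std::size_t i = 0; i < types.size() && i < 32; ++i) {
        const bool allowed = ((allowedTypes >> i) & 1u) != 0;
        if (allowed && (types[i].props & required) == required) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Try for `preferred` properties on top of `required`, then relax to
// `required` alone if no memory type offers both.
int findWithFallback(const std::vector<MemoryType>& types, std::uint32_t allowedTypes,
                     std::uint32_t preferred, std::uint32_t required) {
    const int idx = findMemoryType(types, allowedTypes, preferred | required);
    return idx >= 0 ? idx : findMemoryType(types, allowedTypes, required);
}
```

For instance, an upload buffer could ask for `DeviceLocal | HostVisible` as the preferred set and `HostVisible` as the required set, ending up in plain host-visible memory on devices that do not offer memory with both properties.
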
Multi-GPU support: Since Vulkan 1.1, it is possible to treat multiple physical devices as a single logical device, which enables various Vulkan features taking advantage of fast inter-device interconnects (such as NVLink) to avoid unnecessarily slow device1->host->device2 memory transfers.
Resource usage flags: Like OpenCL, Vulkan requires the user to specify what a memory resource (buffer, image) will be used for. This allows much driver black magic under the hood (mapping descriptors to hardware registers without guesswork, storing images in compressed layout if they aren't written...).
Images and samplers: While these may look like graphics-only features to the untrained eye, one very important thing that they do is to give access to the GPU texturing units which can do cool things like super-fast linear interpolation with correct handling of out-of-range values. This is just as useful in general mathematics (for piecewise defined functions or interpolation-based fast approximation of complex analytical functions) as it is in graphics.
- There is prior art of both OpenCL and CUDA exposing this, which is a strong argument in favor of alpaka also exposing some flavor of it.
- During the recent alpaka workshop, it was discussed that alpaka developers were hesitant about exposing such GPU-specific features because they did not want to emulate them on the CPU, as only application developers know the best way to emulate them for their workload. This is fine. Just do it like OpenCL and expose it as an optional feature that a device may or may not support. If developers want to use GPU texturing functionality, it should be up to them to write their own sampler polyfill for CPU platforms.
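
To give an idea of how small such a user-written CPU polyfill can be, here is a sketch of 1D linear filtering with clamp-to-edge addressing in plain C++. `sampleLinearClamp` is a hypothetical helper, not an alpaka, CUDA or OpenCL function, and it makes a simplifying assumption: texel `i` sits exactly at coordinate `x = i`, which is not how GPU APIs place texel centers for normalized coordinates.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for sampling a 1D texture with linear filtering
// and clamp-to-edge addressing. Assumes texel i is located at x = i.
float sampleLinearClamp(const std::vector<float>& texels, float x) {
    assert(!texels.empty());
    const float maxX = static_cast<float>(texels.size() - 1);
    // Clamp-to-edge: out-of-range (and NaN) coordinates map to a border texel.
    if (!(x >= 0.0f)) x = 0.0f;
    if (x > maxX) x = maxX;
    const std::size_t i0 = static_cast<std::size_t>(std::floor(x));
    const std::size_t i1 = std::min(i0 + 1, texels.size() - 1);
    const float t = x - static_cast<float>(i0);  // fractional position in [0, 1)
    return texels[i0] + t * (texels[i1] - texels[i0]);
}
```

A workload could fill `texels` with a tabulated function once and then call `sampleLinearClamp` in its CPU kernels wherever the GPU version would use a hardware sampler. Whether this particular addressing mode and texel placement is right is exactly the kind of workload-specific decision that is better left to application developers.
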

[1] This is not meant to be a full list of features which Vulkan exposes and alpaka doesn't. I'm mostly focusing on Vulkan features which I think could be of interest to the scientific computing community. In particular, I have excluded Vulkan features which are fully graphics-centric (e.g. rasterization pipeline, mesh shading...), are meant to optimize application latency rather than throughput (e.g. pipeline caches), or have been restricted to a single vendor for a long while which makes their standardization path unclear (e.g. device-generated commands).

[2] If I'm reading the spec right, the exact guarantee is that if a device has one queue that supports graphics, then it has one queue that supports both graphics and compute, and if a queue supports graphics or compute, then it supports data transfers with arbitrary granularity.

alpaka-group / alpaka

[Meta] Consider studying Vulkan as API design input #1065