ammarhakim / gkylzero

Lowest, compiled layer of Gkeyll infrastructure.
MIT License

[DR]: GPU hackathon gk-g0-app work on parallelizing over components #395

Open JunoRavin opened 5 months ago

JunoRavin commented 5 months ago

This design review encompasses the ongoing work in the branch https://github.com/ammarhakim/gkylzero/tree/gk-g0-app-gpu-hack2024 for the GPU hackathon.

The main restructuring of the code is a targeted parallelization of certain operators "over components." So, for example, parallelizing certain operations over basis functions: each GPU thread handles not just a single cell but a single basis function within that cell's update, and we launch a factor of (number of basis functions) more threads.
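
To make the idea concrete, here is a minimal, hypothetical CUDA sketch (not the actual gkylzero kernel; the name `update_over_components`, the flat array layout, and the toy update are all illustrative): the 1D thread index is decomposed into a cell index and a basis-function index, so each thread writes a single, independent coefficient.

```cuda
// Illustrative only: one thread per (cell, basis function) pair instead of
// one thread per cell.
__global__ void
update_over_components(long ncells, int num_basis, const double *in, double *out)
{
  long tid = blockIdx.x*blockDim.x + threadIdx.x;
  if (tid < ncells*num_basis) {
    long cell = tid/num_basis;  // which cell this thread updates
    int k = tid%num_basis;      // which basis coefficient within that cell
    // every coefficient is written independently, so no synchronization is needed
    out[cell*num_basis + k] = 2.0*in[cell*num_basis + k];
  }
}
```

Launched with, e.g., `update_over_components<<<(ncells*num_basis+255)/256, 256>>>(ncells, num_basis, in_d, out_d)`, the grid grows by the number of basis functions rather than the work per thread.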

This restructuring is not something we can employ universally. The parallelism hinges on each thread being able to work genuinely independently, and in many cases our kernels rely on the computation of intermediate variables to further increase the sparsity of the final update (think, for example, of the collisionless advection kernels in Vlasov or gyrokinetics, where extra sparsity in the phase-space characteristics is pre-computed and then utilized in the final update).

However, we have identified two key locations where this parallelism helps significantly: the Maxwellian projection routine and configuration space-phase space weak multiplication.

For configuration space-phase space weak multiplication, we can simply have the GPU versions take the thread ID as an input to the function (generating a new set of kernels that are "parallelized over components"). Note these are new kernels because the CPU does not benefit from the loop over components; it operates most efficiently when given the whole per-cell kernel at once.
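
A hedged sketch of what "take the thread ID as an input" could look like (the kernel name, the data layout, and the index map `conf_idx` are assumptions, not the branch's actual code): each GPU thread computes a single output coefficient of the configuration space-phase space product, whereas the CPU path would keep calling a kernel that computes all coefficients of a cell at once.

```cuda
// Illustrative per-component version of conf-space x phase-space weak
// multiplication: each thread owns one output coefficient k.
__global__ void
conf_phase_mul_over_comp(long nphase_cells, int num_phase_basis, int num_conf_basis,
  const long *conf_idx, const double *cfield, const double *f, double *out)
{
  long tid = blockIdx.x*blockDim.x + threadIdx.x;
  if (tid < nphase_cells*num_phase_basis) {
    long pcell = tid/num_phase_basis;
    int k = tid%num_phase_basis; // output component owned by this thread
    const double *c_c = &cfield[conf_idx[pcell]*num_conf_basis]; // conf-space cell beneath this phase-space cell
    const double *f_c = &f[pcell*num_phase_basis];
    // Stand-in for the generated per-component kernel (ignoring normalization):
    // here only the 0th conf-space coefficient multiplies coefficient k, while
    // the real weak multiplication sums the full basis triple products.
    out[pcell*num_phase_basis + k] = c_c[0]*f_c[k];
  }
}
```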

For projection routines, we are splitting the ultimate operation over quadrature points into three stages (stage 2 is sketched just after this list):

  1. Get n, u, T/m (and other quantities) at configuration space quadrature points
  2. Get the Maxwellian at phase space quadrature points
  3. Convert the nodal quadrature basis back to the modal basis
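
As an illustration of stage 2, here is a hedged CUDA sketch for a 1x1v Maxwellian (all names, the node-index map `conf_qidx`, and the layout are assumptions for illustration, not the branch's actual code): one thread per phase-space quadrature node evaluates the Maxwellian from the stage-1 values of n, u, and T/m at the corresponding configuration-space quadrature node.

```cuda
#include <math.h>

// Stage 2 sketch (1x1v): evaluate the Maxwellian at every phase-space
// quadrature node, one GPU thread per node.
__global__ void
maxwellian_at_phase_quad(long nquad_tot,
  const long *conf_qidx,  // phase quad node -> conf quad node
  const double *n_q, const double *u_q, const double *vtsq_q, // stage-1 output
  const double *v_q,      // particle velocity at each phase quad node
  double *fmax_q)         // Maxwellian at phase quad nodes (input to stage 3)
{
  long tid = blockIdx.x*blockDim.x + threadIdx.x;
  if (tid < nquad_tot) {
    long cq = conf_qidx[tid];
    double du = v_q[tid] - u_q[cq], vtsq = vtsq_q[cq];
    fmax_q[tid] = n_q[cq]/sqrt(2.0*M_PI*vtsq)*exp(-0.5*du*du/vtsq);
  }
}
```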

These conversions between different basis sets have been found to be most efficient using the cuBLAS dgemm function, having the system do one large matrix-matrix product Ax = b, where A is [num_basis x num_quad], x is [num_quad x num_cells], and b is [num_basis x num_cells].
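
A hedged sketch of what a single dgemm call for stage 3 could look like (the wrapper name `quad_to_modal`, the column-major layout, and the argument names are assumptions; only cublasDgemm itself is the real cuBLAS API): the quadrature-to-modal matrix A multiplies the per-cell quadrature values x to produce the modal coefficients b for all cells in one call.

```cuda
#include <cublas_v2.h>

// Stage 3 sketch: convert nodal quadrature values to modal coefficients for
// all cells at once. Column-major storage is assumed throughout.
void quad_to_modal(cublasHandle_t h, int num_basis, int num_quad, int num_cells,
  const double *A_d, const double *x_d, double *b_d)
{
  const double alpha = 1.0, beta = 0.0;
  cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N,
    num_basis, num_cells, num_quad,   // m, n, k
    &alpha, A_d, num_basis,           // A: [num_basis x num_quad], lda = num_basis
    x_d, num_quad,                    // x: [num_quad x num_cells], ldb = num_quad
    &beta, b_d, num_basis);           // b: [num_basis x num_cells], ldc = num_basis
}
```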

We are currently working on a design for the dgemm calls in which a memory object, modeled after bin_op_mem, holds the relevant matrix/matrices needed to perform these operations.
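
A purely illustrative sketch of what such a memory object might hold, under the assumption that it bundles the pre-built conversion matrix, scratch space for the nodal values, and a cuBLAS handle (every field name here is made up; the real bin_op_mem API in gkylzero is different):

```cuda
#include <cublas_v2.h>

// Illustrative only: all fields are assumptions about what the dgemm memory
// object could contain, allocated once and reused by every projection update.
struct proj_maxwellian_mem {
  cublasHandle_t cuh;        // cuBLAS handle reused by every dgemm call
  double *quad_to_modal_d;   // device copy of the [num_basis x num_quad] conversion matrix
  double *phase_quad_vals_d; // scratch: Maxwellian at quad nodes, [num_quad x num_cells]
  int num_basis, num_quad;
  long num_cells;
};
```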

We are also proposing a streamlining of the Maxwellian-on-basis functionality: we see no reason to continue supporting the lab-moments projection, since it is strictly less accurate than the primitive-moments projection, and the primitive-moments projection has a correction routine that works best precisely because it corrects the primitive moments rather than the lab moments. Further, all Vlasov Maxwellian projection should go through the vlasov_lte object to avoid code duplication.

There may be other places in the code that could benefit from parallelization over components and may be worth exploring. For example, velocity moments are currently relatively slow for the number of operations performed (still a sub-dominant cost overall, but sometimes a non-trivial fraction of the update, around 20%), because the GPU reduction routine for velocity moments is non-optimal, requiring an atomic add after the kernel call. We could attempt to parallelize velocity moments over components as well, so that more threads are spawned and the atomic-add overhead is more easily masked (especially since every component could independently do its own atomic add, assuming I understand how the atomic operations work).
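
To illustrate the idea, here is a hedged CUDA sketch (the kernel name, data layout, and the index map `conf_cell_idx` are hypothetical, not the existing moment routine): each thread accumulates one configuration-space coefficient of the moment with its own atomicAdd, rather than a single serialized add per cell after the kernel.

```cuda
// Illustrative moment reduction parallelized over components: each thread
// atomically accumulates one conf-space coefficient of the velocity moment.
// Double-precision atomicAdd requires compute capability >= 6.0.
__global__ void
mom_reduce_over_comp(long nphase_cells, int num_conf_basis,
  const long *conf_cell_idx, const double *mom_cell, double *mom_out)
{
  long tid = blockIdx.x*blockDim.x + threadIdx.x;
  if (tid < nphase_cells*num_conf_basis) {
    long pcell = tid/num_conf_basis;
    int k = tid%num_conf_basis; // conf-space coefficient this thread accumulates
    atomicAdd(&mom_out[conf_cell_idx[pcell]*num_conf_basis + k],
      mom_cell[pcell*num_conf_basis + k]);
  }
}
```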

ammarhakim commented 5 months ago

You (@JunoRavin) say "CPU does not prefer the loop over components and instead most efficiently operates when you give the whole kernel to the CPU". Is this really true? I mean, if these new kernels do not slow down the CPU code, then why have two sets of kernels? (If I understand correctly, there are now two sets of kernels, one for CPUs and one for GPUs.)

I recommend organizing a short meeting (say 30 mins) to go over the design and code. @johnson452 can run it. We can do it on Thursday 10:00 am EST.