CEED / libCEED

CEED Library: Code for Efficient Extensible Discretizations
https://libceed.org
BSD 2-Clause "Simplified" License

CUDA/HIP Backend Refactor #839

Open jeremylt opened 2 years ago

jeremylt commented 2 years ago

Because of how they were designed, the CUDA backends contain a fair amount of code duplication, and the HIP backends, which were derived from the CUDA backends, inherited that same duplication.

We should put together a PR, or a series of PRs, deliberately designed to refactor these backends and reduce the code duplication across them.

@tcew, @jedbrown, and anyone else I'm missing who is interested, please feel free to jump into this issue or the discussion with thoughts I'm overlooking.

YohannDudouit commented 2 years ago

I would unify hip/kernels/* and cuda/kernels/* into gpu_common_kernels/* (or something similar); currently the code is duplicated. In the future the code could diverge, and HIP and CUDA could have different implementations, but I think the design would be better if we abstracted away the fact that we target HIP or CUDA architectures. This would let us test new implementations in a more modular way.

We can have different implementations of the same algorithms living under gpu_common_kernels, and which one we use is chosen when loading the source. The implementations loaded can differ between HIP and CUDA, the purpose being to try different implementations for different scenarios. We could also imagine loading different implementations based not only on HIP/CUDA, but also on the number of quadrature points and degrees of freedom. We already know that no single approach gives the best performance in all cases and that different implementations work better for different cases; this design could handle that as well.

There are also implementations that result in less register pressure. In some applications, register pressure becomes an issue when the QFunction gets big, so this design would also let us change the parallelization strategy to accommodate that kind of issue. This design could potentially allow fusing the magma backend kernels too.
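For example, the source-selection step could look roughly like this (a minimal sketch in plain C; the gpu_common_kernels/ paths, the file names, and the helper name are hypothetical, not existing libCEED code):

```c
/* Hypothetical sketch, not existing libCEED code: choose which common GPU
 * kernel source to load based on the target architecture and problem size.
 * The gpu_common_kernels/ paths and this helper are illustrative only. */
typedef enum { CEED_GPU_ARCH_CUDA, CEED_GPU_ARCH_HIP } CeedGpuArch;

static const char *SelectTensorBasisSource(CeedGpuArch arch, int P_1d, int Q_1d) {
  // Larger elements favor a variant written to keep register pressure low
  if (P_1d * Q_1d > 64)
    return "gpu_common_kernels/basis-tensor-low-register.h";
  // HIP wavefronts (64 lanes) may prefer different blocking than CUDA warps (32)
  if (arch == CEED_GPU_ARCH_HIP)
    return "gpu_common_kernels/basis-tensor-wave64.h";
  return "gpu_common_kernels/basis-tensor-default.h";
}
```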

On the topic of implementations that would be specific to HIP or CUDA, I am not aware of any. Generalizing code to target either HIP or CUDA is relatively trivial; the architecture-specific keywords can easily be abstracted behind macros (CEED_DEVICE, CEED_HOST, CEED_HOST_DEVICE, CEED_MEM_SHARED, etc.), as in the sketch below.
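Something along these lines (the macro names follow the list above, but these definitions and the CEED_USE_HIP guard are illustrative, not the current libCEED headers):

```c
/* Rough sketch of abstracting the architecture-specific keywords; not the
 * current libCEED implementation, and CEED_USE_HIP is a hypothetical flag. */
#if defined(CEED_USE_HIP)
#  include <hip/hip_runtime.h>
#else
#  include <cuda_runtime.h>
#endif

/* CUDA and HIP share these keywords, which is why the abstraction is cheap */
#define CEED_DEVICE      __device__
#define CEED_HOST        __host__
#define CEED_HOST_DEVICE __host__ __device__
#define CEED_MEM_SHARED  __shared__

/* A single kernel source can then build for either architecture */
CEED_HOST_DEVICE static inline double Square(double x) { return x * x; }
```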

jeremylt commented 2 years ago

I think a smaller first step could be refactoring the code generation backends to share the kernels that the other backends use. Currently there are some minor differences between them, but I don't know why those differences were introduced.

YohannDudouit commented 2 years ago

This is a good point: if we consolidate the code, then we need to document the reasons for any differences in that same place.

My proposal above is not a "first step" but a goal. I would guess the different tasks would be:

jeremylt commented 2 years ago

For the long-term health of these backends, I think we should do a cleanup and refactor in the near term. Combining kernels across the CUDA and HIP backends should come after this near-term refactor. I don't know enough about the performance trade-offs between CUDA and HIP to attempt combining pieces of these two backend 'families' myself, but I do know enough to refactor the backend design into something cleaner.

Proposed near-term refactor roadmap:

PR 2

PR 3

PR 4+

jeremylt commented 9 months ago

I stalled out and focused on some Ratel work before wrapping up the final stage of this issue. @jedbrown I think this last stage of the GPU /gen backend refactor would let us most easily incorporate the new basis (including particles) work into these backends. Depending on prioritization, I think this would be a good thing to try to make time for in the spring.