CEED / libCEED

CEED Library: Code for Efficient Extensible Discretizations
https://libceed.org
BSD 2-Clause "Simplified" License

CUDA/HIP Backend Refactor #839

Open jeremylt opened 2 years ago

jeremylt commented 2 years ago

Because of how they were designed, the CUDA backends contain a fair amount of code duplication, and the HIP backends, which were derived from the CUDA backends, inherited that same duplication.

We should put together a PR, or a series of PRs, deliberately designed to refactor these backends and reduce the code duplication across them.

@tcew, @jedbrown, and anyone else I'm missing who is interested, please feel free to jump into this issue or the discussion with thoughts I'm overlooking.

YohannDudouit commented 2 years ago

I would unify hip/kernels/* and cuda/kernels/* into gpu_common_kernels/* (or something similar); currently the code is duplicated. In the future the code could diverge, and HIP and CUDA could have different implementations, but I think the design would be better if we abstracted away the fact that we target HIP or CUDA architectures. This would let us test new implementations in a more modular way.

We can have different implementations of the same algorithms living under gpu_common_kernels, and which one we use is chosen when loading the source. The implementations loaded can differ between HIP and CUDA, the purpose being to try different implementations for different scenarios. We could also imagine loading different implementations based not only on HIP/CUDA, but also on the number of quadrature points and degrees of freedom. We already know that no single approach gives the best performance in all cases and that different implementations work better for different cases; this design could handle that as well.

There are also implementations that result in less register pressure. In some applications, register pressure becomes an issue when the QFunction gets big, so this design would also let us change the parallelization strategy to accommodate that kind of issue. This design could potentially allow fusing the magma backend kernels too.
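For example, the source-selection step could look roughly like this (a minimal sketch in plain C; the gpu_common_kernels/ paths, the file names, and the helper name are hypothetical, not existing libCEED code):

```c
/* Hypothetical sketch, not existing libCEED code: choose which common GPU
 * kernel source to load based on the target architecture and problem size.
 * The gpu_common_kernels/ paths and this helper are illustrative only. */
typedef enum { CEED_GPU_ARCH_CUDA, CEED_GPU_ARCH_HIP } CeedGpuArch;

static const char *SelectTensorBasisSource(CeedGpuArch arch, int P_1d, int Q_1d) {
  // Larger elements favor a variant written to keep register pressure low
  if (P_1d * Q_1d > 64)
    return "gpu_common_kernels/basis-tensor-low-register.h";
  // HIP wavefronts (64 lanes) may prefer different blocking than CUDA warps (32)
  if (arch == CEED_GPU_ARCH_HIP)
    return "gpu_common_kernels/basis-tensor-wave64.h";
  return "gpu_common_kernels/basis-tensor-default.h";
}
```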

On the topic of implementations that would be specific to HIP or CUDA, I am not aware of any. Generalizing code to target either HIP or CUDA is relatively trivial; the architecture-specific keywords can easily be abstracted behind macros (CEED_DEVICE, CEED_HOST, CEED_HOST_DEVICE, CEED_MEM_SHARED, etc.), as in the sketch below.
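Something along these lines (the macro names follow the list above, but these definitions and the CEED_USE_HIP guard are illustrative, not the current libCEED headers):

```c
/* Rough sketch of abstracting the architecture-specific keywords; not the
 * current libCEED implementation, and CEED_USE_HIP is a hypothetical flag. */
#if defined(CEED_USE_HIP)
#  include <hip/hip_runtime.h>
#else
#  include <cuda_runtime.h>
#endif

/* CUDA and HIP share these keywords, which is why the abstraction is cheap */
#define CEED_DEVICE      __device__
#define CEED_HOST        __host__
#define CEED_HOST_DEVICE __host__ __device__
#define CEED_MEM_SHARED  __shared__

/* A single kernel source can then build for either architecture */
CEED_HOST_DEVICE static inline double Square(double x) { return x * x; }
```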

jeremylt commented 2 years ago

I think a smaller first step could be refactoring the code generation backends to share the kernels that the other backends use. Currently there are some minor differences between them, but I don't know why those differences were introduced.

YohannDudouit commented 2 years ago

This is a good point: if we consolidate the code, then we need to document the reasons for any differences in that same place.

My proposal above is not a "first step" but a goal. I would guess the different tasks would be:

jeremylt commented 2 years ago

For the long-term health of these backends, I think we should do a cleanup and refactor in the near term. Combining kernels across the CUDA and HIP backends should come after this near-term refactor. I don't know enough about the performance trade-offs between CUDA and HIP to attempt combining pieces of these two backend 'families' myself, but I do know enough to refactor the backend design into something cleaner.

Proposed near-term refactor roadmap:

PR 2

PR 3

PR 4+

jeremylt commented 9 months ago

I stalled out and focused on some Ratel work before wrapping up the final stage of this issue. @jedbrown I think this last stage of the GPU /gen backend refactor would let us most easily incorporate the new basis (including particles) work into these backends. Depending on prioritization, I think this would be a good thing to try to make time for in the spring.