G-071 closed 1 year ago
The PR now contains some additional features:
Furthermore, the PR now depends on kokkos/kokkos#5628, as we require the HPX TeamPolicy fixes within it, as well as the adaptations that make the execution space work with the current HPX master. That PR also contains the aforementioned optimization for small kernels (they are executed directly if they fit within one task) and the new sender-receiver implementation, which is nice to have. To make the Octo-Tiger work aggregation function correctly with this Kokkos PR, we also require SC-SGS/CPPuddle#14.
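For readers unfamiliar with the small-kernel optimization, here is a minimal sketch of the general idea in plain C++. It is not the code from the Kokkos PR: `dispatch_range`, `spawn`, and `chunk_size` are hypothetical names, and `spawn` merely stands in for whatever the runtime uses to create a task.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>

// Hypothetical helper (not an HPX or Kokkos API): runs body(i) for i in [0, n).
// If the whole range fits into one chunk, the body runs directly on the
// calling thread and no task is created. Otherwise the range is split into
// chunks and each chunk is handed to `spawn`.
// Returns the number of tasks handed to `spawn`.
template <typename Spawn, typename Body>
std::size_t dispatch_range(std::size_t n, std::size_t chunk_size,
                           Spawn&& spawn, Body&& body) {
    chunk_size = std::max<std::size_t>(chunk_size, 1);  // avoid a zero stride

    if (n <= chunk_size) {
        // Small kernel: skip task creation and scheduling entirely.
        for (std::size_t i = 0; i < n; ++i) body(i);
        return 0;
    }

    std::size_t spawned = 0;
    for (std::size_t begin = 0; begin < n; begin += chunk_size) {
        const std::size_t end = std::min(begin + chunk_size, n);
        // `body` is captured by reference, so `spawn` must finish the task
        // (or the caller must wait for it) before `body` goes out of scope.
        spawn(std::function<void()>([begin, end, &body] {
            for (std::size_t i = begin; i < end; ++i) body(i);
        }));
        ++spawned;
    }
    return spawned;
}
```

In this sketch, the point of the direct path is that a kernel with only a handful of iterations does not pay the task creation overhead at all.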
I changed the PR to allow for older HPX and Kokkos versions (instead of having to use master/develop). Still, I updated the CI test pipelines to use the newer stack (HPX v1.9.0, Kokkos 4.0.01/develop) where possible/required. I also fixed a ton of warnings in the process, as they were becoming increasingly annoying.
@diehlpk Mind having another look at it once all tests pass? It should be good to go now!
@G-071 looks good to me, and we can merge once all tests have passed.
This PR optimizes the SIMD implementation for the reconstruct Kokkos kernel: it now steps more efficiently through the sub-grid and uses hierarchical parallelism for cache blocking, resulting in a speedup of almost 2x compared to the old SIMD implementation introduced in #426.
Also added are more convenient builds on SVE (by pulling in https://github.com/srinivasyadav18/sve automatically) and two more CMake variables to make switching between SIMD extensions and libraries more convenient.
The PR comes with one caveat: for the reconstruct changes, I had to switch to the HPX Kokkos executor. This works, but I would like to get a performance improvement for small kernels merged for that executor before we merge this PR (hence me opening this as a draft).
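As a rough illustration of the blocking pattern described above, here is a serial, plain C++ analogue. It is a sketch under stated assumptions, not the actual reconstruct kernel: the sub-grid size `N`, the chunk width `LANES`, and the toy stencil are all made up for the example. The outer loop plays the role of the team level, and the chunked inner loops play the role of the thread and vector levels.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sizes, chosen for illustration only.
constexpr std::size_t N = 8;      // cells per dimension of the sub-grid
constexpr std::size_t LANES = 4;  // stand-in for the SIMD width

// Flat index into an N*N*N sub-grid; z is the contiguous dimension.
inline std::size_t flat_index(std::size_t x, std::size_t y, std::size_t z) {
    return (x * N + y) * N + z;
}

// Reference version: toy stencil out(x,y,z) = in(x,y,z) + in(x+1,y,z),
// computed one cell at a time. The last x-slab is left at zero.
std::vector<double> stencil_naive(const std::vector<double>& in) {
    std::vector<double> out(N * N * N, 0.0);
    for (std::size_t x = 0; x + 1 < N; ++x)
        for (std::size_t y = 0; y < N; ++y)
            for (std::size_t z = 0; z < N; ++z)
                out[flat_index(x, y, z)] =
                    in[flat_index(x, y, z)] + in[flat_index(x + 1, y, z)];
    return out;
}

// Blocked version: each x-slab is one block of work ("team" level). Inside a
// slab, the contiguous (y,z) plane is walked in chunks of LANES cells, so the
// two input slabs stay in cache while the innermost loop runs over
// consecutive addresses.
std::vector<double> stencil_blocked(const std::vector<double>& in) {
    std::vector<double> out(N * N * N, 0.0);
    constexpr std::size_t plane = N * N;
    for (std::size_t x = 0; x + 1 < N; ++x) {  // "team" level
        const double* lo = &in[x * plane];
        const double* hi = &in[(x + 1) * plane];
        double* dst = &out[x * plane];
        for (std::size_t base = 0; base < plane; base += LANES) {  // "thread" level
            for (std::size_t lane = 0; lane < LANES && base + lane < plane; ++lane)
                dst[base + lane] = lo[base + lane] + hi[base + lane];  // "vector" level
        }
    }
    return out;
}

// Fills the sub-grid with its own flat indices as test data.
std::vector<double> make_input() {
    std::vector<double> in(N * N * N);
    for (std::size_t i = 0; i < in.size(); ++i) in[i] = static_cast<double>(i);
    return in;
}
```

Both versions produce identical results; only the traversal order and the loop structure differ.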