STEllAR-GROUP / octotiger

Astrophysics program simulating the evolution of star systems based on the fast multipole method on adaptive Octrees
http://octotiger.stellar-group.org/
Boost Software License 1.0

Hydro optimizations: Explicit work aggregation, Comm optimizations, explicit SIMD instructions #426

G-071 closed 2 years ago

G-071 commented 2 years ago

This PR contains 3 major changes to the hydro module:

  1. Added explicit work aggregation for the two major hydro compute kernels (reconstruct, flux) using the new executors added in https://github.com/SC-SGS/CPPuddle/pull/12. The maximum number of aggregated kernel launches can be set with the CLI parameter --max_executor_slices=<unsigned int>. This feature should still be considered experimental, though it passes all tests and increases performance substantially. This PR also adds some new tests to continually test different aggregation sizes. The work aggregation works for all hydro GPU kernel implementations (CUDA, HIP, Kokkos).

  2. Improved communication in the hydro solver! Notably, exchanging the hydro_boundaries (and subsequently the hydro_amr_boundaries) now takes the HPX localities into account. If the neighbor in question is on the same locality, no HPX action will be used for communication! Further, no communication buffer will be used. Instead, the memory of the neighbor is accessed directly (with some new local hpx promises/futures to avoid any races). The old communication implementation can still be used with the parameter --optimize_local_communication=0.

  3. Added a first hydro SIMD implementation for the flux and reconstruct kernels. The kernels still work on the GPU (using the Kokkos SIMD double specialization), but also support a wide range of CPU SIMD instruction sets, as we can use both std::experimental::simd and the Kokkos SIMD types on the CPU (tested mostly with AVX512 so far). This also makes the previous experimental Vc flux kernel obsolete, so I removed it. Furthermore, while adding this SIMD implementation I also cleaned up the experimental scalar kernels for flux and reconstruct, which were leftovers from the initial hydro GPU port (removing a lot of code duplication in the process).
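For readers unfamiliar with the idea behind item 1, here is a minimal sketch of explicit work aggregation in plain C++. It is not the CPPuddle executor API and not the code in this PR: `SubgridTask`, `AggregatingLauncher`, and the placeholder kernel body are hypothetical names made up for illustration, and `max_slices` only plays the role that --max_executor_slices plays in the description above. The sketch assumes that aggregation means collecting several per-sub-grid kernel invocations and executing them as one fused launch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the arguments of one per-sub-grid kernel call.
struct SubgridTask {
    std::vector<double> u;    // input cell values
    std::vector<double> out;  // filled in by the kernel
};

// Hypothetical launcher: collects kernel invocations and executes them as a
// single fused launch once max_slices tasks are pending, or on flush().
class AggregatingLauncher {
public:
    explicit AggregatingLauncher(std::size_t max_slices)
        : max_slices_(max_slices) {
        assert(max_slices_ > 0);
    }

    // Queue one task; launch immediately when the aggregation limit is hit.
    void submit(SubgridTask* task) {
        pending_.push_back(task);
        if (pending_.size() == max_slices_) {
            flush();
        }
    }

    // Run all pending tasks in one launch. On a GPU, each pending task
    // would become one slice of a single, larger kernel launch.
    void flush() {
        if (pending_.empty()) {
            return;
        }
        ++launches_;
        for (SubgridTask* task : pending_) {
            task->out.resize(task->u.size());
            for (std::size_t i = 0; i < task->u.size(); ++i) {
                task->out[i] = 2.0 * task->u[i];  // placeholder kernel body
            }
        }
        pending_.clear();
    }

    // Number of fused launches issued so far.
    std::size_t launches() const { return launches_; }

private:
    std::size_t max_slices_;
    std::size_t launches_ = 0;
    std::vector<SubgridTask*> pending_;
};
```

With a limit of 4, submitting 5 tasks and then calling `flush()` results in 2 launches instead of 5. A limit of 1 degenerates to one launch per sub-grid, which corresponds to not aggregating at all in this sketch.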

While I am still actively working on the implementation of items 1 and 3, right now is a good point to merge this back into master before the PR gets too large and unwieldy, as everything is working and the tests pass. I will add subsequent optimizations and code cleanups in separate PRs.

Some additional features in this PR:

I am adding this as a draft PR, as some CI pipelines still show problems. These seem to be caused by the changed module environment on Rostam (the g++ 10 and clang 12 modules crash during loading) and by obsolete g++/HPX versions on CircleCI. However, all tests pass locally, so we can likely fix this on the server side, independently of the PR (fix the modules on Rostam, upgrade the CircleCI dependencies).

G-071 commented 2 years ago

The license information was actually missing in quite a few files! I have added it in 4236a528109a9839200d9e4861c3647bce7e4a81

G-071 commented 2 years ago

I adapted the Jenkins pipelines to the changes on Rostam and removed the legacy tests on CircleCI. I left the script for CircleCI in case we actually want to update these tests and re-enable them, but in their current form they're redundant anyway since the Jenkins tests also cover the same things (and far more).

Once the remaining pipelines are completed, this PR should be good to go!

@diehlpk Could you have another look at it?

I recommend merging this PR before #425, as we need the Jenkins pipeline fixes there as well.