Modernize core - Githubissues

jngrad commented 3 weeks ago

Fixes #4846 Partial fix for #4847

Description of changes:

migrate to C++20 in both C++ and CUDA code
- CUDA 12+ is required
- a recent C++ compiler is required, recommended: GCC 11+, Clang 18+, AppleClang 15+
use STL containers
- use std::optional instead of boost::optional
- use std::numbers instead of Utils::numbers
- use std::span instead of Utils::Span
- use std::ranges and ranged-based algorithms instead of boost::ranges and boost::algorithm
- cannot use std::array instead of Utils::Array, because its constexpr methods don't have the __device__ qualifier ("calling a constexpr __host__ function from a __host__ __device__ function is not allowed")
reduce dependency on Boost
- remaining: boost::mpi, boost::serialization, boost::unit_test, boost::algorithm (for C++26 features)
migrate to lbmpy/pystencils 1.3.3 and to the most recent version of the waLBerla development branch
rewrite CMake by following best practices
- the object-in-fluid module is now also installed when the espressomd module is installed
- C++ and CUDA source files are now treated separately and no longer share the same compiler flags
- CUDA shared objects are now properly linked against instrumentation tools (e.g. Caliper, UBSAN)
- CUDA objects now generate code coverage information for host code
- Boost unit tests can now be written in a CUDA source file
rewrite P3M nested loops that operate on 3D grids
- a utility function wraps the nested loops and takes a kernel as argument that operates on the grid
- intermediate values are now systematically cached
- the analytic cotangent sum kernel was inlined

jngrad commented 2 weeks ago

@reinaual This PR should be considered a top priority. It is needed to complete the hand-over to Forschungszentrum Jülich ("MultiXscale deliverable" e-mail from May 10). It is also a pre-requisite for major refactoring campaigns, such as moving the waLBerla bridge to a separate repository (blocked by its dependency on Utils:: functions and classes) and moving the Monte Carlo code to Python (blocked because it touches the same C++ code as this PR).

@rodrigoacb @phohenberger The P3M code now leverages C++20 features: std::span, std::ranges, concepts. In addition, to facilitate porting to Cabana and Kokkos, all nested loops of the form:

auto accumulate = 0.;
for (int nx = -mesh[0] / 2; nx < mesh[0] / 2; nx++) {
  auto const ctan_x = cotangent_sum(nx, mesh_i[0]);
  for (int ny = -mesh[1] / 2; ny < mesh[1] / 2; ny++) {
    auto const ctan_y = ctan_x * cotangent_sum(ny, mesh_i[1]);
    for (int nz = -mesh[2] / 2; nz < mesh[2] / 2; nz++) {
      if ((nx != 0) or (ny != 0) or (nz != 0)) {
        auto const n2 = nx * nx + ny * ny + nz * nz;
        auto const cs = ctan_y * cotangent_sum(nz, mesh_i[2]);
        auto const [alias1, alias2] =
            p3m_tune_aliasing_sums(nx, ny, nz, mesh, mesh_i, cao, alpha_L_i);
        auto const d = alias1 - Utils::sqr(alias2 / cs) / n2;
        if (d > 0.) {
          accumulate += d;
        }
      }
    }
  }
}

were rewritten as:

auto const m_stop = mesh / 2;
auto const m_start = -m_stop;
auto indices = Utils::Vector3i{};
auto values = Utils::Vector3d{};
auto accumulate = 0.;
for_each_3d(m_start, m_stop, indices,
  [&]() {
    if ((indices[0] != 0) or (indices[1] != 0) or (indices[2] != 0)) {
      auto const n2 = indices.norm2();
      auto const cs = Utils::product(values);
      auto const [alias1, alias2] =
          p3m_tune_aliasing_sums(indices, mesh, mesh_i, cao, alpha_L_i);
      auto const d = alias1 - Utils::sqr(alias2 / cs) / n2;
      if (d > 0.) {
        accumulate += d;
      }
  },
  [&](int dim, int n) {
    values[dim] = cotangent_sum(n, mesh_i[dim]);
  });

Some kernels rely on random access iterators (std::size_t index or std::vector<double>::iterator it_energy) that are incremented at every step of the innermost loop. This side-effect can be an obstacle to execution policies that rely on out-of-order execution or parallel execution, but the good news is, they can always be replaced by a std::size_t pos calculated on-the-fly using the loop counters stored in indices. The flattened arrays have row-major order. This can be a bit tricky when the loop counters don't start at 0; when in doubt, do git show 7a91cfa and search for the string Utils::get_linear_index until you find the hot loop of interest.

The execution policy for for_each_3d() may differ depending on the situation. Some of them (like in the example above) have a Projector kernel that populates an array of intermediate values that are expensive to compute and thus cannot be executed fully out-of-order without incurring a small performance penalty (calling the projector $3N^3$ times instead of $N + N^2 + N^3$ times). When no Projector kernel is provided, it defaults to a constexpr no-op function that generates no assembly code. See for_each_3d.hpp for more details. The RX/RY/RZ and KX/KY/KZ constants are used to transpose data; kernels that use them might not fully benefit from SIMD.

Not all nested loops contribute equally to the runtime. A Caliper report for the P3M benchmark is reproduced below. The nested loops in the top-level integrator are quite inexpensive compared to the reciprocal space (all-to-all FFT) and real space (N-square short-range loop) force calculations, although we have concrete plans to improve the performance of these two components. There is an expensive nested loop in the accuracy calculation kernel of the tuning algorithm, due to the $\mathcal{O}(N^6)$ complexity of evaluating a double sum over the 3D grid (it doesn't appear as a double sum in the Caliper report, because I disabled the nested instrumentation in that kernel to reduce overhead).

$ CALI_CONFIG=runtime-report mpiexec -n 4 ./pypresso ../maintainer/benchmarks/p3m.py \
    --volume_fraction=0.25 --particles_per_core=2000
Path                         Min time/rank Max time/rank Avg time/rank Time %
integrate                         3.783249      3.835922      3.808505  4.946649
  calculate_forces                2.312934      7.806445      5.088166  6.608728
    init_forces                   0.132315      0.138147      0.135768  0.176341
    calc_long_range_forces       32.141172     32.232919     32.174786 41.789985
      for_each_3d                 0.512280      0.530820      0.519396  0.674614
    short_range_loop             20.797703     26.168697     23.444353 30.450525
tune                              3.030472      3.126394      3.075959  3.995187
  calculate_accuracy              0.000367      0.000440      0.000406  0.000527
    for_each_3d                   2.264349      2.353698      2.311547  3.002335
  recalc P3M params               0.002077      0.002345      0.002197  0.002853
    for_each_3d                   0.056157      0.059715      0.057934  0.075247
      for_each_3d                 0.070111      0.073837      0.071683  0.093105
  integrate                       0.305274      0.306183      0.305689  0.397042
    calculate_forces              0.107690      1.059676      0.547243  0.710782
      init_forces                 0.005260      0.005526      0.005395  0.007007
      calc_long_range_forces      1.348340      1.419409      1.369864  1.779237
        for_each_3d               0.017730      0.018481      0.018187  0.023622
      short_range_loop            1.880777      2.821218      2.371574  3.080302
recalc P3M params                 0.001075      0.001241      0.001148  0.001491
  for_each_3d                     0.656495      0.701529      0.682490  0.886447
    for_each_3d                   0.817611      0.863255      0.842211  1.093899

jngrad commented 1 week ago

General MultiXscale Meeting: Clang 18.1 has full support for HIP, OpenCL, AMD GPUs, Intel GPUs and near-complete support for OpenMP (including GPU kernel calls). This makes Clang a compiler with strategic importance, in particular Clang 19.0 because it implements an extra OpenMP feature that is needed to build the testsuite of another software in EasyBuild. This PR introduces support for Clang 18.1 in ESPResSo.

espressomd / espresso

Modernize core #4931