@reinaual This PR should be considered a top priority. It is needed to complete the hand-over to Forschungszentrum Jülich ("MultiXscale deliverable" e-mail from May 10). It is also a prerequisite for major refactoring campaigns, such as moving the waLBerla bridge to a separate repository (blocked by its dependency on `Utils::` functions and classes) and moving the Monte Carlo code to Python (blocked because it touches the same C++ code as this PR).
@rodrigoacb @phohenberger The P3M code now leverages C++20 features: `std::span`, `std::ranges`, concepts. In addition, to facilitate porting to Cabana and Kokkos, all nested loops of the form:
```cpp
auto accumulate = 0.;
for (int nx = -mesh[0] / 2; nx < mesh[0] / 2; nx++) {
  auto const ctan_x = cotangent_sum(nx, mesh_i[0]);
  for (int ny = -mesh[1] / 2; ny < mesh[1] / 2; ny++) {
    auto const ctan_y = ctan_x * cotangent_sum(ny, mesh_i[1]);
    for (int nz = -mesh[2] / 2; nz < mesh[2] / 2; nz++) {
      if ((nx != 0) or (ny != 0) or (nz != 0)) {
        auto const n2 = nx * nx + ny * ny + nz * nz;
        auto const cs = ctan_y * cotangent_sum(nz, mesh_i[2]);
        auto const [alias1, alias2] =
            p3m_tune_aliasing_sums(nx, ny, nz, mesh, mesh_i, cao, alpha_L_i);
        auto const d = alias1 - Utils::sqr(alias2 / cs) / n2;
        if (d > 0.) {
          accumulate += d;
        }
      }
    }
  }
}
```
were rewritten as:
```cpp
auto const m_stop = mesh / 2;
auto const m_start = -m_stop;
auto indices = Utils::Vector3i{};
auto values = Utils::Vector3d{};
auto accumulate = 0.;
for_each_3d(
    m_start, m_stop, indices,
    [&]() {
      // kernel: executed once per grid point
      if ((indices[0] != 0) or (indices[1] != 0) or (indices[2] != 0)) {
        auto const n2 = indices.norm2();
        auto const cs = Utils::product(values);
        auto const [alias1, alias2] =
            p3m_tune_aliasing_sums(indices, mesh, mesh_i, cao, alpha_L_i);
        auto const d = alias1 - Utils::sqr(alias2 / cs) / n2;
        if (d > 0.) {
          accumulate += d;
        }
      }
    },
    [&](int dim, int n) {
      // projector: caches the expensive per-axis cotangent sums
      values[dim] = cotangent_sum(n, mesh_i[dim]);
    });
```
Some kernels rely on a random-access index or iterator (a `std::size_t index` or a `std::vector<double>::iterator it_energy`) that is incremented at every step of the innermost loop. This side effect can be an obstacle to execution policies that rely on out-of-order or parallel execution, but the good news is that it can always be replaced by a `std::size_t pos` calculated on the fly from the loop counters stored in `indices`. The flattened arrays have row-major order. This can be a bit tricky when the loop counters don't start at 0; when in doubt, run `git show 7a91cfa` and search for the string `Utils::get_linear_index` until you find the hot loop of interest.
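For illustration, such a `pos` can be computed with a small helper like the sketch below. The helper name, the `std::array` parameter types, and the usage comment are assumptions made to keep the snippet self-contained; the actual code uses `Utils::get_linear_index` and `Utils::Vector3i`.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper (not the ESPResSo API): map the current loop counters
// to a row-major position in a flattened 3D array, shifting the counters so
// that they start at 0 even when the loops run over e.g. [-mesh/2, mesh/2).
inline std::size_t row_major_index(std::array<int, 3> const &indices,
                                   std::array<int, 3> const &start,
                                   std::array<int, 3> const &size) {
  auto const x = static_cast<std::size_t>(indices[0] - start[0]);
  auto const y = static_cast<std::size_t>(indices[1] - start[1]);
  auto const z = static_cast<std::size_t>(indices[2] - start[2]);
  return (x * static_cast<std::size_t>(size[1]) + y) *
             static_cast<std::size_t>(size[2]) +
         z;
}

// usage inside a kernel, instead of incrementing an iterator as a side effect:
//   auto const pos = row_major_index(indices, m_start, m_stop - m_start);
//   energies[pos] += contribution;
```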
The execution policy for `for_each_3d()` may differ depending on the situation. Some call sites (like in the example above) have a `Projector` kernel that populates an array of intermediate values that are expensive to compute; these cannot be executed fully out-of-order without incurring a small performance penalty (calling the projector $3N^3$ times instead of $N + N^2 + N^3$ times). When no `Projector` kernel is provided, it defaults to a `constexpr` no-op function that generates no assembly code. See `for_each_3d.hpp` for more details. The `RX`/`RY`/`RZ` and `KX`/`KY`/`KZ` constants are used to transpose data; kernels that use them might not fully benefit from SIMD.
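To make the call pattern concrete, here is a minimal self-contained sketch of what a `for_each_3d`-like helper can look like. This is an illustration under assumptions, not the implementation in `for_each_3d.hpp`: it only shows why the projector runs $N + N^2 + N^3$ times while the kernel runs $N^3$ times.

```cpp
#include <array>
#include <cstdio>

// Sketch only: a 3D loop helper that accepts a kernel and a projector.
// The projector is called once whenever a counter changes
// (N + N^2 + N^3 calls), the kernel once per grid point (N^3 calls).
template <typename Kernel, typename Projector>
void for_each_3d_sketch(std::array<int, 3> const &start,
                        std::array<int, 3> const &stop,
                        std::array<int, 3> &indices, Kernel &&kernel,
                        Projector &&projector) {
  for (indices[0] = start[0]; indices[0] < stop[0]; ++indices[0]) {
    projector(0, indices[0]);
    for (indices[1] = start[1]; indices[1] < stop[1]; ++indices[1]) {
      projector(1, indices[1]);
      for (indices[2] = start[2]; indices[2] < stop[2]; ++indices[2]) {
        projector(2, indices[2]);
        kernel();
      }
    }
  }
}

int main() {
  auto indices = std::array<int, 3>{};
  auto values = std::array<double, 3>{};
  auto accumulate = 0.;
  for_each_3d_sketch(
      {-2, -2, -2}, {2, 2, 2}, indices,
      [&]() {
        // skip the origin, as in the tuning loop above
        if (indices[0] != 0 or indices[1] != 0 or indices[2] != 0) {
          accumulate += values[0] * values[1] * values[2];
        }
      },
      [&](int dim, int n) { values[dim] = static_cast<double>(n * n); });
  std::printf("accumulate = %g\n", accumulate);
}
```

In the real helper the projector argument defaults to a `constexpr` no-op, so call sites that don't need intermediate values pay nothing for the extra parameter.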
Not all nested loops contribute equally to the runtime. A Caliper report for the P3M benchmark is reproduced below. The nested loops in the top-level integrator are quite inexpensive compared to the reciprocal space (all-to-all FFT) and real space (N-square short-range loop) force calculations, although we have concrete plans to improve the performance of these two components. There is an expensive nested loop in the accuracy calculation kernel of the tuning algorithm, due to the $\mathcal{O}(N^6)$ complexity of evaluating a double sum over the 3D grid (it doesn't appear as a double sum in the Caliper report, because I disabled the nested instrumentation in that kernel to reduce overhead).
```
$ CALI_CONFIG=runtime-report mpiexec -n 4 ./pypresso ../maintainer/benchmarks/p3m.py \
    --volume_fraction=0.25 --particles_per_core=2000
Path Min time/rank Max time/rank Avg time/rank Time %
integrate 3.783249 3.835922 3.808505 4.946649
calculate_forces 2.312934 7.806445 5.088166 6.608728
init_forces 0.132315 0.138147 0.135768 0.176341
calc_long_range_forces 32.141172 32.232919 32.174786 41.789985
for_each_3d 0.512280 0.530820 0.519396 0.674614
short_range_loop 20.797703 26.168697 23.444353 30.450525
tune 3.030472 3.126394 3.075959 3.995187
calculate_accuracy 0.000367 0.000440 0.000406 0.000527
for_each_3d 2.264349 2.353698 2.311547 3.002335
recalc P3M params 0.002077 0.002345 0.002197 0.002853
for_each_3d 0.056157 0.059715 0.057934 0.075247
for_each_3d 0.070111 0.073837 0.071683 0.093105
integrate 0.305274 0.306183 0.305689 0.397042
calculate_forces 0.107690 1.059676 0.547243 0.710782
init_forces 0.005260 0.005526 0.005395 0.007007
calc_long_range_forces 1.348340 1.419409 1.369864 1.779237
for_each_3d 0.017730 0.018481 0.018187 0.023622
short_range_loop 1.880777 2.821218 2.371574 3.080302
recalc P3M params 0.001075 0.001241 0.001148 0.001491
for_each_3d 0.656495 0.701529 0.682490 0.886447
for_each_3d 0.817611 0.863255 0.842211 1.093899
```
General MultiXscale Meeting: Clang 18.1 has full support for HIP, OpenCL, AMD GPUs, and Intel GPUs, and near-complete support for OpenMP (including GPU kernel calls). This makes Clang a compiler of strategic importance, in particular Clang 19.0, which implements an extra OpenMP feature needed to build the test suite of another software package in EasyBuild. This PR introduces support for Clang 18.1 in ESPResSo.
Fixes #4846
Partial fix for #4847
Description of changes:
- `std::optional` instead of `boost::optional`
- `std::numbers` instead of `Utils::numbers`
- `std::span` instead of `Utils::Span`
- `std::ranges` and range-based algorithms instead of `boost::ranges` and `boost::algorithm` (see the sketch after this list)
- `std::array` instead of `Utils::Array`, except in CUDA code, because `std::array` `constexpr` methods don't have the `__device__` qualifier ("calling a constexpr `__host__` function from a `__host__ __device__` function is not allowed")
- remaining Boost dependencies: `boost::mpi`, `boost::serialization`, `boost::unit_test`, `boost::algorithm` (for C++26 features)
- the `object-in-fluid` module is now also installed when the `espressomd` module is installed
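As an illustration of what these substitutions look like in practice, here is a small self-contained sketch; it is not code from this PR, and the function and variable names are made up. It uses `std::span` and `std::ranges` where pre-C++20 code would have reached for `Utils::Span` and `boost::range` / `boost::algorithm`:

```cpp
#include <algorithm>
#include <ranges>
#include <span>
#include <vector>

// hypothetical example: a non-owning view over particle charges,
// processed with a range adaptor and a ranges algorithm
double sum_of_squares(std::span<const double> charges) {
  auto squares =
      charges | std::views::transform([](double q) { return q * q; });
  double total = 0.;
  std::ranges::for_each(squares, [&](double q2) { total += q2; });
  return total;
}

int main() {
  std::vector<double> const charges{1., -1., 2.};
  return sum_of_squares(charges) == 6. ? 0 : 1;
}
```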