Failure of Espresso test-suite with `x and y nan location mismatch`

After building ESPResSo 4.2.2 from source on Snellius in the genoa/tcn partition (AMD EPYC 9654), a run on 8 nodes with 192 tasks per node yields a wrong value for the Madelung constant, but no NaN values. With 1 node and 72 tasks, the NaN is reproducible with a specific mesh and MPI topology.

Build script:

salloc -p genoa -t 30:00 -n 8 --ntasks-per-node 8 -c 1
module load EESSI/2023.06
module load ESPResSo/4.2.2-foss-2023a
module load mpl-ascii/0.10.0-gfbf-2023a
module load tqdm/4.66.1-GCCcore-12.3.0
cp ../maintainer/configs/maxset.hpp myconfig.hpp
cmake ..
make -j8

Test script:

salloc -p genoa -t 60:00 -N 8 --ntasks-per-node 192 -c 1
module load EESSI/2023.06
module load ESPResSo/4.2.2-foss-2023a
LD_LIBRARY_PATH=/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/GCCcore/12.3.0/lib64/:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/GSL/2.7-GCC-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/Boost.MPI/1.82.0-gompi-2023a/lib/:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/FFTW.MPI/3.3.10-gompi-2023a/lib/ mpiexec -n 1536 ./pypresso ../madelung.py

Output:

CoulombP3M tune parameters: Accuracy goal = 1.00000e-06 prefactor = 1.00000e+00
System: box_l = 1.80000e+01 # charged part = 5832 Sum[q_i^2] = 5.83200e+03
mesh cao r_cut_iL    alpha_L     err       rs_err    ks_err    time [ms]
416  7   4.02778e-02 9.62082e+01 1.023e-06 7.071e-07 7.393e-07 accuracy not achieved
418  7   4.02778e-02 9.62082e+01 1.005e-06 7.071e-07 7.145e-07 accuracy not achieved
420  7   4.02778e-02 9.62082e+01 9.747e-07 7.071e-07 6.709e-07 82.13   
420  6   4.02778e-02 9.62082e+01 2.665e-06 7.071e-07 2.570e-06 accuracy not achieved
422  7   4.02778e-02 9.62082e+01 9.964e-07 7.071e-07 7.020e-07 98.01   
422  6   4.02778e-02 9.62082e+01 2.595e-06 7.071e-07 2.497e-06 accuracy not achieved

resulting parameters: mesh: (420, 420, 420), cao: 7, r_cut_iL: 4.0278e-02,
                      alpha_L: 9.6208e+01, accuracy: 9.7472e-07, time: 82.13
WARNING: Statistics of tuning samples is very bad.
Algorithm executed. 

Executing sanity checks...

Traceback (most recent call last):
  File "/gpfs/home6/jgrad/multixscale/espresso/build-genoa/../genoa-8-madelung.py", line 115, in <module>
    np.testing.assert_allclose(energy, ref_energy, atol=atol_energy, rtol=rtol_energy)
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/SciPy-bundle/2023.07-gfbf-2023a/lib/python3.11/site-packages/numpy/testing/_private/utils.py", line 1504, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/Python/3.11.3-GCCcore-12.3.0/lib/python3.11/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/SciPy-bundle/2023.07-gfbf-2023a/lib/python3.11/site-packages/numpy/testing/_private/utils.py", line 797, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=5e-06, atol=1e-12

Mismatched elements: 1 / 1 (100%)
Max absolute difference: 4.28350168
Max relative difference: 2.45112638
 x: array(-6.031066)
 y: array(-1.747565)
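
To make the failed check above concrete, here is a minimal sketch that recomputes the reported differences from the two printed energies and shows that `np.testing.assert_allclose` rejects both a finite but wrong energy and a NaN energy. The values and tolerances are copied from the assertion message; the helper `fails` is a hypothetical name introduced only for this illustration and is not part of `madelung.py`.

```python
import numpy as np

# Tolerances and energies copied from the assertion message above.
rtol, atol = 5e-06, 1e-12
energy, ref_energy = -6.031066, -1.747565

# assert_allclose accepts |actual - desired| <= atol + rtol * |desired|.
abs_diff = abs(energy - ref_energy)
rel_diff = abs_diff / abs(ref_energy)
allowed = atol + rtol * abs(ref_energy)
print(f"abs diff = {abs_diff:.6f}, rel diff = {rel_diff:.6f}, "
      f"allowed = {allowed:.3e}")


def fails(actual, desired):
    """Hypothetical helper: True if assert_allclose rejects the pair."""
    try:
        np.testing.assert_allclose(actual, desired, atol=atol, rtol=rtol)
    except AssertionError:
        return True
    return False


wrong_value_fails = fails(energy, ref_energy)   # finite but wrong energy
nan_fails = fails(float("nan"), ref_energy)     # NaN energy
print(wrong_value_fails, nan_fails)
```

In the NaN case NumPy reports a NaN location mismatch instead of a tolerance violation, which is the message quoted in the title of this issue.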

Toolchains based on glibc can build ESPResSo in such a way that the first floating-point operation that generates a NaN value also raises a fatal signal (SIGFPE) that can be caught in GDB. In a debug build of ESPResSo, the stack trace then shows which function generated the NaN. Please note this is not portable, since `feenableexcept()` is a GNU extension. Here is a patch that enables the trap on MPI rank 32, prints the intermediate values of the influence function, and fixes the mesh and charge assignment order in the test script:

diff --git a/src/core/p3m/influence_function.hpp b/src/core/p3m/influence_function.hpp
index 7e1d45c33..dab638845 100644
--- a/src/core/p3m/influence_function.hpp
+++ b/src/core/p3m/influence_function.hpp
@@ -35,6 +35,8 @@
 #include <functional>
 #include <utility>
 #include <vector>
+#include <iostream>
+extern int this_node;

 /**
  * @brief Hockney/Eastwood/Ballenegger optimal influence function.
@@ -91,6 +93,10 @@ double G_opt(int cao, double alpha, Utils::Vector3d const &k,
     }
   }

+  if (numerator == 0. and denominator != 0.) { return 0.; }
+  if (this_node == 32) {
+         std::cout << "numerator="<<numerator<<" k2="<<k2<<" denominator="<<denominator<<" div="<<(int_pow<S>(k2) * Utils::sqr(denominator))<<"\n";
+  }
   return numerator / (int_pow<S>(k2) * Utils::sqr(denominator));
 }

diff --git a/src/script_interface/ObjectHandle.cpp b/src/script_interface/ObjectHandle.cpp
index 68224da70..da4fb27d1 100644
--- a/src/script_interface/ObjectHandle.cpp
+++ b/src/script_interface/ObjectHandle.cpp
@@ -31,6 +31,8 @@
 #include <string>
 #include <unordered_map>
 #include <utility>
+#include <cfenv>
+extern int this_node;

 namespace ScriptInterface {
 void ObjectHandle::set_parameter(const std::string &name,
@@ -46,7 +48,18 @@ Variant ObjectHandle::call_method(const std::string &name,
   if (m_context)
     m_context->notify_call_method(this, name, params);

-  return this->do_call_method(name, params);
+  Variant result{};
+  auto constexpr fe_flags = FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW;
+  if (this_node == 32)
+    feenableexcept(fe_flags);
+  try {
+    result = this->do_call_method(name, params);
+    fedisableexcept(fe_flags);
+  } catch (...) {
+    fedisableexcept(fe_flags);
+    throw;
+  }
+  return result;
 }

 std::string ObjectHandle::serialize() const {
diff --git a/madelung.py b/madelung.py
index 3f73b5d56..db407fd9a 100644
--- a/madelung.py
+++ b/madelung.py
@@ -87,5 +87,5 @@ algorithm = espressomd.electrostatics.P3M
 if args.gpu:
     algorithm = espressomd.electrostatics.P3MGPU
-solver = algorithm(prefactor=1., accuracy=1e-6)
+solver = algorithm(prefactor=1., accuracy=1e-3, mesh=[252, 168, 126], cao=7)
 if (espressomd.version.major(), espressomd.version.minor()) == (4, 2):
     system.actors.add(solver)
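
The trapping idea in the patch above can be tried without rebuilding ESPResSo. The sketch below is only a NumPy analogue, not the C++ mechanism itself: `np.errstate(..., invalid="raise")` turns an invalid floating-point operation into a Python `FloatingPointError` inside the `with` block and restores the previous settings on exit, much like the `feenableexcept()`/`fedisableexcept()` pair. The function name `first_invalid_index` is hypothetical, and the input values are taken from the debug output of rank 32.

```python
import numpy as np


def first_invalid_index(numerators, divisors):
    """Return the index of the first division that raises an FP error.

    Returns None if every division succeeds. Hypothetical helper for
    illustration only.
    """
    # Trap division by zero, invalid operations (e.g. 0/0) and overflow,
    # mirroring FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW in the patch.
    with np.errstate(divide="raise", invalid="raise", over="raise"):
        for i, (n, d) in enumerate(zip(numerators, divisors)):
            try:
                np.float64(n) / np.float64(d)
            except FloatingPointError:
                return i
    return None


# (numerator, div) pairs as printed by rank 32 in the log of this issue.
numerators = [4.4142e-06, 0.0]
divisors = [7.7505, 0.0]
trapped_at = first_invalid_index(numerators, divisors)
print(trapped_at)
```

The second pair is the 0/0 case, so that is where the trap fires.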

Test script:

salloc -p genoa -t 60:00 -N 8 --ntasks-per-node 192 -c 1
module load EESSI/2023.06
module load ESPResSo/4.2.2-foss-2023a
module load GDB/13.2-GCCcore-12.3.0
LD_LIBRARY_PATH=/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/GCCcore/12.3.0/lib64/:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/GSL/2.7-GCC-12.3.0/lib:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/Boost.MPI/1.82.0-gompi-2023a/lib/:/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen4/software/FFTW.MPI/3.3.10-gompi-2023a/lib/ mpiexec -n 72 ./pypresso ../madelung.py --topology 6 4 3 2>&1 | c++filt | tee log.txt

Output:

CoulombP3M tune parameters: Accuracy goal = 1.00000e-03 prefactor = 1.00000e+00
System: box_l = 1.80000e+01 # charged part = 5832 Sum[q_i^2] = 5.83200e+03
mesh cao r_cut_iL    alpha_L     err       rs_err    ks_err    time [ms]
fixed mesh (252, 168, 126)
fixed cao 7
252  7   4.11892e-02 6.90845e+01 9.756e-04 7.071e-04 6.722e-04 44.35

resulting parameters: mesh: (252, 168, 126), cao: 7, r_cut_iL: 4.1189e-02,
                      alpha_L: 6.9084e+01, accuracy: 9.7564e-04, time: 44.35

numerator=4.4142e-06 k2=741.317 denominator=0.10225 div=7.7505
numerator=0 k2=1.39624e+17 denominator=0 div=0
[tcn906:3140236:0:3140236] Caught signal 8 (Floating point exception: floating-point invalid operation)
==== backtrace (tid:3140236) ====
[tcn906:3140236] *** Process received signal ***
[tcn906:3140236] Signal: Floating point exception (8)
[tcn906:3140236] Signal code:  (-6)
[tcn906:3140236] Failing at address: 0x11635002fea8c
[tcn906:3140236] [ 0] /cvmfs/software.eessi.io/versions/2023.06/compat/linux/x86_64/lib/../lib64/libc.so.6(+0x38560)[0x14e0d2c67560]
[tcn906:3140236] [ 1] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(double G_opt<1ul, 0ul>(int, double, Utils::Vector<double, 3ul> const&, Utils::Vector<double, 3ul> const&)+0x601)[0x14e0d0a09b90]
[tcn906:3140236] [ 2] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(std::vector<double, std::allocator<double> > grid_influence_function<1ul, 0ul>(P3MParameters const&, Utils::Vector<int, 3ul> const&, Utils::Vector<int, 3ul> const&, Utils::Vector<double, 3ul> const&)+0x535)[0x14e0d0a067a3]
[tcn906:3140236] [ 3] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(CoulombP3M::calc_influence_function_force()+0x94)[0x14e0d09fcea6]
[tcn906:3140236] [ 4] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(CoulombP3M::scaleby_box_l()+0xe2)[0x14e0d09ffb12]
[tcn906:3140236] [ 5] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(CoulombP3M::init()+0x2d7)[0x14e0d09fdaed]
[tcn906:3140236] [ 6] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(CoulombP3M::on_cell_structure_change()+0x24)[0x14e0d09dd908]
[tcn906:3140236] [ 7] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(+0x7894ea)[0x14e0d09da4ea]
[tcn906:3140236] [ 8] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(+0x78c522)[0x14e0d09dd522]
[tcn906:3140236] [ 9] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(+0x78c14d)[0x14e0d09dd14d]
[tcn906:3140236] [10] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(+0x78bda9)[0x14e0d09dcda9]
[tcn906:3140236] [11] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(+0x78b6ca)[0x14e0d09dc6ca]
[tcn906:3140236] [12] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(+0x78a793)[0x14e0d09db793]
[tcn906:3140236] [13] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(+0x7895b0)[0x14e0d09da5b0]
[tcn906:3140236] [14] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(+0x7895eb)[0x14e0d09da5eb]
[tcn906:3140236] [15] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(+0x788bfc)[0x14e0d09d9bfc]
[tcn906:3140236] [16] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(Coulomb::on_cell_structure_change()+0x1e)[0x14e0d09d8911]
[tcn906:3140236] [17] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(on_cell_structure_change()+0xe)[0x14e0d088c5d6]
[tcn906:3140236] [18] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(cells_re_init(CellStructureType)+0x18b)[0x14e0d07ddde6]
[tcn906:3140236] [19] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(on_short_range_ia_change()+0x1a)[0x14e0d088c49d]
[tcn906:3140236] [20] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(on_coulomb_change()+0x15)[0x14e0d088c468]
[tcn906:3140236] [21] /home/jgrad/multixscale/espresso/build-genoa/src/core/Espresso_core.so(CoulombP3M::tune()+0x23d)[0x14e0d09ff1cb]
[tcn906:3140236] [22] /home/jgrad/multixscale/espresso/build-genoa/src/script_interface/Espresso_script_interface.so(CoulombP3M::on_activation()+0x24)[0x14e0d21e9c12]
[tcn906:3140236] [23] /home/jgrad/multixscale/espresso/build-genoa/src/script_interface/Espresso_script_interface.so(void add_actor<boost::variant<std::shared_ptr<DebyeHueckel>, std::shared_ptr<CoulombP3M>, std::shared_ptr<ElectrostaticLayerCorrection>, std::shared_ptr<CoulombMMM1D>, std::shared_ptr<ReactionField> >, CoulombP3M>(boost::optional<boost::variant<std::shared_ptr<DebyeHueckel>, std::shared_ptr<CoulombP3M>, std::shared_ptr<ElectrostaticLayerCorrection>, std::shared_ptr<CoulombMMM1D>, std::shared_ptr<ReactionField> > >&, std::shared_ptr<CoulombP3M> const&, void (&)(), bool (&)(bool))+0x58)[0x14e0d222ba32]
[tcn906:3140236] [24] /home/jgrad/multixscale/espresso/build-genoa/src/script_interface/Espresso_script_interface.so(void Coulomb::add_actor<CoulombP3M, (void*)0>(std::shared_ptr<CoulombP3M> const&)+0xf5)[0x14e0d222a714]
[tcn906:3140236] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 32 with PID 3140236 on node tcn906 exited on signal 8 (Floating point exception).
--------------------------------------------------------------------------

The error does not occur on every run, so the script may need to be run several times. The issue arises most frequently on MPI rank 32.

The bug is due to a division by zero in the influence function. One can safely add `if (numerator == 0. and denominator != 0.) { return 0.; }` before the division, which is mathematically correct since zero divided by any non-zero number is zero. However, for the chosen mesh the denominator is itself zero, so the guard does not apply and the division evaluates 0/0. I am not sure why this singularity arises, and I'll have to double-check the math with @RudolfWeeber. Maybe the P3M algorithm is prone to catastrophic cancellation. Returning 0 when both the numerator and denominator are 0 is not a fix either, since it leads to an incorrect prediction of the Madelung constant.

The exact same issue was encountered in the development branch of ESPResSo when we introduced heFFTe as our new FFT backend; there I only needed 4 MPI ranks on a desktop Zen5 CPU (AMD Ryzen 9 9950X) to obtain NaN values.
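
As an illustration of the guard discussed here, the following sketch reproduces only the final division of `G_opt` in Python, using the values printed by rank 32. It assumes `S = 1`, which matches the `G_opt<1ul, 0ul>` frame in the backtrace and the printed `div` value; the function name `g_opt_tail` and its `guard` flag are hypothetical and do not exist in ESPResSo.

```python
import math

import numpy as np


def g_opt_tail(numerator, k2, denominator, S=1, guard=True):
    """Final division of the influence function (illustrative sketch).

    Mirrors `numerator / (int_pow<S>(k2) * sqr(denominator))` with the
    optional guard for a zero numerator and a non-zero denominator.
    """
    numerator, k2, denominator = map(np.float64, (numerator, k2, denominator))
    if guard and numerator == 0.0 and denominator != 0.0:
        return 0.0  # 0 divided by a non-zero number is 0
    # Silence NumPy warnings so that 0/0 quietly evaluates to NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        return float(numerator / (k2**S * denominator**2))


# First line of the rank 32 output: a regular grid point.
regular = g_opt_tail(4.4142e-06, 741.317, 0.10225)
# Zero numerator, non-zero denominator: the guard returns 0.
guarded = g_opt_tail(0.0, 741.317, 0.10225)
# Second line of the rank 32 output: both are zero, the guard does not apply.
singular = g_opt_tail(0.0, 1.39624e17, 0.0)
print(regular, guarded, singular)
```

The last call still evaluates 0/0 and yields NaN, so the guard alone cannot remove the singularity seen with the (252, 168, 126) mesh.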
