espressomd / espresso

The ESPResSo package
https://espressomd.org
GNU General Public License v3.0
230 stars 188 forks source link

epresso-4.1.2: Many tests fail on OpenSuse tumbleweed due to IndexError: _Map_base::at #3396

Closed junghans closed 4 years ago

junghans commented 4 years ago

Due to https://build.opensuse.org/request/show/760741, I tried to bump espresso to v4.1.2 on OpenSUSE, but many tests fail on x86_64 with:

[ 1114s]         Start 131: lb_momentum_conservation
[ 1116s] 135/145 Test #131: lb_momentum_conservation ..........................***Failed    2.53 sec
[ 1116s] Traceback (most recent call last):
[ 1116s]   File "/home/abuild/rpmbuild/BUILD/espresso/build/testsuite/python/lb_momentum_conservation.py", line 40, in <module>
[ 1116s]     class Momentum(object):
[ 1116s]   File "/home/abuild/rpmbuild/BUILD/espresso/build/testsuite/python/lb_momentum_conservation.py", line 48, in Momentum
[ 1116s]     system = espressomd.System(box_l=[BOX_SIZE] * 3)
[ 1116s]   File "system.pyx", line 149, in espressomd.system.System.__init__
[ 1116s]   File "collision_detection.pyx", line 53, in espressomd.collision_detection.CollisionDetection.__init__
[ 1116s]   File "script_interface.pyx", line 297, in espressomd.script_interface.ScriptInterfaceHelper.__init__
[ 1116s]   File "script_interface.pyx", line 78, in espressomd.script_interface.PScriptInterface.__init__
[ 1116s] IndexError: _Map_base::at
[ 1116s] -------------------------------------------------------
[ 1116s] Primary job  terminated normally, but 1 process returned
[ 1116s] a non-zero exit code.. Per user-direction, the job has been aborted.
[ 1116s] -------------------------------------------------------
[ 1116s] --------------------------------------------------------------------------
[ 1116s] mpiexec detected that one or more processes exited with non-zero status, thus causing
[ 1116s] the job to be terminated. The first process to do so was:
[ 1116s] 
[ 1116s]   Process name: [[3280,1],0]
[ 1116s]   Exit code:    1
[ 1116s] --------------------------------------------------------------------------
[ 1116s] 

Details here.

Surprisingly on i586, only matrix_vector_product fails:

[  765s] 48/68 Test #48: matrix_vector_product ............***Failed    0.00 sec
[  765s] Running 1 test case...
[  765s] /home/abuild/rpmbuild/BUILD/espresso/src/utils/tests/matrix_vector_product.cpp(36): [1;31;49merror: in "inner_product": difference{4.31155e-17} between result[i]{30.900000000000002} and boost::inner_product(matrix[i], vector, 0.0){30.900000000000002} exceeds 2.22045e-16%[0;39;49m
[  765s] /home/abuild/rpmbuild/BUILD/espresso/src/utils/tests/matrix_vector_product.cpp(36): [1;31;49merror: in "inner_product": difference{8.41875e-17} between result[i]{73.849999999999994} and boost::inner_product(matrix[i], vector, 0.0){73.849999999999994} exceeds 2.22045e-16%[0;39;49m
[  765s] /home/abuild/rpmbuild/BUILD/espresso/src/utils/tests/matrix_vector_product.cpp(36): [1;31;49merror: in "inner_product": difference{7.41993e-17} between result[i]{341.15000000000003} and boost::inner_product(matrix[i], vector, 0.0){341.15000000000003} exceeds 2.22045e-16%[0;39;49m
[  765s] 
[  765s] [1;31;49m*** 3 failures are detected in the test module "matrix_vector_product test"
[  765s] [0;39;49m
[  765s] 

Details here

junghans commented 4 years ago

@mcepl saved us from OpenSUSE removal.

RudolfWeeber commented 4 years ago

Regarding

 [ 1116s] 135/145 Test #131: lb_momentum_conservation ..........................***Failed    2.53 sec
[ 1116s] Traceback (most recent call last):
[ 1116s]   File "/home/abuild/rpmbuild/BUILD/espresso/build/testsuite/python/lb_momentum_conservation.py", line 40, in <module>
[ 1116s]     class Momentum(object):
[ 1116s]   File "/home/abuild/rpmbuild/BUILD/espresso/build/testsuite/python/lb_momentum_conservation.py", line 48, in Momentum
[ 1116s]     system = espressomd.System(box_l=[BOX_SIZE] * 3)
[ 1116s]   File "system.pyx", line 149, in espressomd.system.System.__init__
[ 1116s]   File "collision_detection.pyx", line 53, in espressomd.collision_detection.CollisionDetection.__init__
[ 1116s]   File "script_interface.pyx", line 297, in espressomd.script_interface.ScriptInterfaceHelper.__init__
[ 1116s]   File "script_interface.pyx", line 78, in espressomd.script_interface.PScriptInterface.__init__
[ 1116s] IndexError: _Map_base::at

I have no idea. The trace seems to be incomplete. To my understanding, no map lookup happens at PScriptinterface.get_sip(). In case you can interact with (or download and install) the resulting build, could you please run

./pypresso
import espressomd.script_interface
and check whether CollisionDetection::CollisionDetection is in the 
script_interface._python_class_by_so_name dict?
It also needs to be determined, whether the script interface class for collision detection is registered in src/scriptinterface/collision_detection/initialize.cpp.
The only reason I can imagine for this not to happen is inconsistent cached versions between config.hpp and  myconfig.pxi. The latter contains the definition of active features for Python/Cython.
Could you re-run this in a clean (uncached) environment?

Surprisingly on i586, only matrix_vector_product fails:

With regards to

[ 765s] 48/68 Test #48: matrix_vector_product ............***Failed 0.00 sec [ 765s] Running 1 test case... [ 765s] /home/abuild/rpmbuild/BUILD/espresso/src/utils/tests/matrix_vector_product.cpp(36): [1;31;49merror: in "inner_product": difference{4.31155e-17} between result[i]{30.900000000000002} and boost::inner_product(matrix[i], vector, 0.0){30.900000000000002} exceeds 2.22045e-16%[0;39;49m


I don’t think there is a guarantee that the difference will be only once std::numeric_limits<double>::epsilon for calculations involving several operations.
I think we can just increase the tolerance by a factor of 5 for the vector and matrix_vector tests in src/utils/tests
fweik commented 4 years ago

This looks more like a linker/shared object problem. The exception is raised because an unknown class name is encountered by the factory, which lives in global static memory. In the place where the exception is raised the access is from a different shared object than in all the other cases (where it works). I can maybe have a look next week. The question is also how urgent this is, because some of the relevant code is also up for replacement, so the problem could also go away by itself given some time.

mcepl commented 4 years ago

The question is also how urgent this is,

not super urgent … if we know that somebody actually works on this issue, I can keep espresso put in openSUSE.

junghans commented 4 years ago

I was able to reproduce this on OpenSuse tumbleweed with:

osc co home:cjunghans:branches:devel:languages:python:numeric python3-espressomd #create account for https://build.opensuse.org/ first
cd "home:cjunghans:branches:devel:languages:python:numeric/python3-espressomd"
osc build --ccache
osc shell --ccache
cd rpmbuild/BUILD/espresso/build/
LD_LIBRARY_PATH='/home/abuild/rpmbuild/BUILDROOT/python3-espressomd-4.1.2-0.x86_64/usr/lib64/python3.7/site-packages/espressomd:/usr/lib64/mpi/gcc/openmpi2/lib64/' make check ARGS="-E all_compare_test"

(Not sure why all_compare_test fails in my VM`)

RudolfWeeber commented 4 years ago

This looks more like a linker/shared object problem. The exception is raised because an unknown class name is encountered by the factory, which lives in global >static memory. In the place where the exception is raised the access is from a different shared object than in all the other cases (where it works).

So, we cannot safely declare Python script interface classes in pyx files.

IIRC, the only reason I made collision_detection.pyx a Cython file was not to copy the values of the collision mode enum. It should be easily separable. The long-term solution will come with the refactoring of collision detection and its interface.

Due to the seriousness of the bugs fixed in 4.1.2, I think we need to fix the build.

We’ll sort it out this week.

junghans commented 4 years ago

@RudolfWeeber, if you make a clean patch, I will just apply it in the 4.1.2 rpm for OpenSUSE.

jngrad commented 4 years ago

Could not reproduce the error in a docker container with the opensuse/tumbleweed base image on an x86_64 machine with espresso 4.1.2.

junghans commented 4 years ago

Did you use osc?

junghans commented 4 years ago

For me it still persist using osc.

jngrad commented 4 years ago

In an openSUSE Docker container:

@junghans I thought osc would use rpmbuild internally. Nevermind.

@RudolfWeeber @fweik In GDB, pypresso throws at the call to make_shared("CollisionDetection::CollisionDetection", ScriptInterface::ScriptInterfaceBase::CreationPolicy::GLOBAL) in self.set_sip(make_shared(to_char_pointer(name), policy_)) when instantiating a System class. However, disabling the COLLISION_DETECTION feature doesn't help, as the error message persists with ComFixed. Removing ComFixed still doesn't help, as the error message persists with Constraints. The other System attributes defined prior to the CollisionDetection attribute don't throw, e.g. CellSystem or interactions.BondedInteractions. If you re-order the attributes of the System class judiciously, a pattern emerges: Cython classes deriving from ScriptInterfaceHelper and Python classes throw IndexError: _Map_base::at, while other Cython classes don't throw.

mcepl commented 4 years ago

@junghans I thought osc would use rpmbuild internally. Nevermind.

It does, but first it generates isolated environment. Would diff between output of rpmbuild -ba and osc build (or osc lbl) be any helpful?

jngrad commented 4 years ago

@mcepl rpmbuild -ba succeeds while osc build fails due to the index error. There are several differences in the compiler and linker flags passed as argument to CMake in the osc build vs. the rpmbuild, but even when using the rpmbuild-generated CMake command in an osc shell, there are still Rpath issues, so osc must be setting environment variables as well. This means we have to open an account on OpenSUSE to be able to troubleshoot the specfile in an osc shell.

@junghans I think there is something wrong in the CMake command, either an incorrect linking flag or an Rpath error. Within osc shell, I can compile espresso and run the Python tests just fine:

cd /home/abuild/rpmbuild/BUILD/espresso/build
rm -rf *
rm -rf /home/abuild/rpmbuild/BUILDROOT/python3-espressomd-4.1.2-0.x86_64/usr/lib64/python3.7/site-packages/espressomd/*
cmake .. -DCMAKE_BUILD_TYPE=Debug -DMPIEXEC_PREFLAGS=--allow-run-as-root
make -j24 check_python ARGS=-j24

In that shell, I also tried the fully expanded CMake command used by osc build and tinkered with the linker flags, both the additional ones from your specfile and those added by osc, without success.

junghans commented 4 years ago

@jngrad yeah rpm hates rpath hence %cmake injects something like -DCMAKE_SKIP_RPATH:BOOL=ON and that is why we have to explicitly use LD_LIBRARY_PATH with make check. Is there a way to compile espresso without an rpath being injected after install? (On Fedora, which isn't as strict with rpaths, I found that even with -DCMAKE_INSTALL_RPATH="" the libraries will have some like /tmp/foo/bar in the rpath.)

jngrad commented 4 years ago

@junghans Just had a look with objdump -a -x in the .so files and the RPATH field is indeed empty. This makes sense now. OpenSUSE forbids hardcoded Rpath values in .so files, so we have to work around it. The issue is reproducible on Ubuntu 18.04 simply by copying the osc CMake command:

cd build
rm -rf *
rm -rf /tmp/espresso-unit-tests
cp ../maintainer/configs/empty.hpp myconfig.hpp
cmake .. '-GUnix Makefiles' -DCMAKE_INSTALL_PREFIX:PATH=/tmp/espresso-unit-tests \
      -DCMAKE_INSTALL_LIBDIR:PATH=/usr/lib -DCMAKE_BUILD_TYPE=Debug \
      -DCMAKE_C_FLAGS='-Og -Wall -D_FORTIFY_SOURCE=2 -Werror=return-type -flto=auto -DNDEBUG' \
      -DCMAKE_CXX_FLAGS='-Og -Wall -D_FORTIFY_SOURCE=2 -Werror=return-type -flto=auto -DNDEBUG' \
      -DCMAKE_EXE_LINKER_FLAGS='-flto=auto -Wl,--as-needed -Wl,--no-undefined -Wl,-z,now' \
      -DCMAKE_MODULE_LINKER_FLAGS='-flto=auto -Wl,--as-needed' \
      -DCMAKE_SHARED_LINKER_FLAGS='-flto=auto -Wl,--as-needed -Wl,--no-undefined -Wl,-z,now' \
      -DCMAKE_SKIP_RPATH:BOOL=ON -DBUILD_SHARED_LIBS:BOOL=ON \
      -DCMAKE_COLOR_MAKEFILE:BOOL=OFF \
      -DCMAKE_SHARED_LINKER_FLAGS='-Wl,--as-needed -Wl,-z,now' \
      -DPYTHON_EXECUTABLE=/usr/bin/python3 -DINSTALL_PYPRESSO=OFF
make -j$(nproc) install
# trigger index error
LD_LIBRARY_PATH="/tmp/espresso-unit-tests/lib/python3.6/site-packages/espressomd/:/usr/lib/x86_64-linux-gnu/openmpi" make -j$(nproc) check_python ARGS="-R ^lb_momentum_conservation$"
# GDB backtrace
LD_LIBRARY_PATH="/tmp/espresso-unit-tests/lib/python3.6/site-packages/espressomd/:/usr/lib/x86_64-linux-gnu/openmpi" ./pypresso --gdb "testsuite/python/lb_momentum_conservation.py"
(gdb) catch throw
(gdb) run
(gdb) bt

For some reason, with the Ubuntu build the GDB backtrace doesn't allow printing the value of Cython variables (either empty or optimized out), while in the OpenSUSE build most of them can be printed.

mcepl commented 4 years ago

Just a note, the hate towards rpath is not only openSUSE-thing, Fedora Packaging Guidelines forbids it as well, we are just a way more thorough about actual checking it (running rpmlint on every package build).

jngrad commented 4 years ago

@RudolfWeeber using the procedure above for Ubuntu 18:

>>> from espressomd import script_interface
>>> script_interface._python_class_by_so_name['CollisionDetection::CollisionDetection']
<class 'espressomd.collision_detection.CollisionDetection'>
>>> from espressomd import System
>>> s = System(box_l=(1,1,1))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "system.pyx", line 149, in espressomd.system.System.__init__
  File "collision_detection.pyx", line 53, in espressomd.collision_detection.CollisionDetection.__init__
  File "script_interface.pyx", line 297, in espressomd.script_interface.ScriptInterfaceHelper.__init__
  File "script_interface.pyx", line 78, in espressomd.script_interface.PScriptInterface.__init__
IndexError: _Map_base::at

When compiling espresso with the minimal config, the issue happens for the ComFixed attribute, as described earlier. I don't get why the Rpath would have any impact on ComFixed, which is defined in a plain python file. Could it be another issue? The GDB trace for both CollisionDetection() and ComFixed() actually fails before the name string is looked up, at this line: https://github.com/espressomd/espresso/blob/50b25e6c0584bd38973f7bcb961be08b78211229/src/script_interface/ParallelScriptInterface.cpp#L47 which calls https://github.com/espressomd/espresso/blob/50b25e6c0584bd38973f7bcb961be08b78211229/src/core/MpiCallbacks.hpp#L494 with fp = (void (*)(void)) <(anonymous namespace)::make_remote_handle()> and m_func_ptr_to_id a container of type std::unordered_map<void (*)(), int>. The container throws because the fp callback is not registered. Adding the following assertion catches the mistake:

assert(m_func_ptr_to_id.find(reinterpret_cast<void (*)()>(fp))!=m_func_ptr_to_id.end());
fweik commented 4 years ago

@jngrad can you please try giving the function make_remote_handle external linkage? Since the lookup is by address, it's possible that the address is not the same for every point of view for weak linkage. That behavior could also be (at least in theory) be influence by whether the rpath is set (because the symbol resolution is done at link/load time).

jngrad commented 4 years ago

Adding extern didn't help, additionally pulling make_remote_handle out of the anonymous namespace also didn't help. Adding the Rpath manually with patchelf --set-rpath didn't help. Removing successively -Wl,-z,now, -Wl,--as-needed, -Wl,--no-undefined (in that order) didn't help.

I think the root cause is the LTO (link time optimization), enabled through the -flto compiler/linker flag, introduced by default in OpenSUSE Tumbleweed (mailing list thread), and was proposed for inclusion in Fedora (proposal). LTO recompiles the objects during the linking step to reduce binary sizes.

@junghans Until espresso gets LTO'ed, a temporary fix would be to manually remove the -flto=auto flag from every CMake flag (I couldn't find a global LTO switch). Applying the following patch to espressomd.spec allowed osc build to successfully run the Python tests on my side.

--- python3-espressomd.spec     (revision fdbeb87bede60ca9aa7dac9bebb5ab7f)
+++ python3-espressomd.spec     (working copy)
@@ -81,7 +82,13 @@
   '-DCMAKE_SHARED_LINKER_FLAGS=-Wl,--as-needed -Wl,-z,now' \
   -DLIBDIR=%{_lib} \
   -DPYTHON_EXECUTABLE=%{_bindir}/python3 \
-  -DINSTALL_PYPRESSO=OFF
+  -DINSTALL_PYPRESSO=OFF \
+  '-DCMAKE_C_FLAGS=-O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -Werror=return-type -DNDEBUG' \
+  '-DCMAKE_CXX_FLAGS=-O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -Werror=return-type -DNDEBUG' \
+  '-DCMAKE_Fortran_FLAGS=-O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -Werror=return-type -DNDEBUG' \
+  '-DCMAKE_EXE_LINKER_FLAGS=-Wl,--as-needed -Wl,--no-undefined -Wl,-z,now' \
+  '-DCMAKE_MODULE_LINKER_FLAGS=-Wl,--as-needed' \
+  '-DCMAKE_SHARED_LINKER_FLAGS=-Wl,--as-needed -Wl,-z,now'
 %make_jobs

 %install
[  526s] RPMLINT report:
[  526s] ===============
[  530s] python3-espressomd.x86_64: W: shared-lib-without-dependency-information /usr/lib64/python3.7/site-packages/espressomd/EspressoConfig.so
[  530s] 2 packages and 0 specfiles checked; 0 errors, 2 warnings.
[  530s] 
[  530s] 
[  530s]  finished "build python3-espressomd.spec" at Wed Jan 15 18:36:15 UTC 2020.
[  530s] 

/var/tmp/build-root/openSUSE_Tumbleweed-x86_64/home/abuild/rpmbuild/SRPMS/python3-espressomd-4.1.2-0.src.rpm

/var/tmp/build-root/openSUSE_Tumbleweed-x86_64/home/abuild/rpmbuild/RPMS/x86_64/python3-espressomd-4.1.2-0.x86_64.rpm
junghans commented 4 years ago

@jngrad will test this @mcepl is there a more elegant way to do this?

junghans commented 4 years ago

Ok, x86_64 is ok now, i586 still fails on matrix_vector_product.

mcepl commented 4 years ago

@jngrad will test this @mcepl is there a more elegant way to do this?

Yes, add definition

%define _lto_cflags %{nil}

See https://build.opensuse.org/request/show/764818

junghans commented 4 years ago

On i586 in addition:

[ 1868s] 108/145 Test  #84: collision_detection ...............................***Failed    7.01 sec
[ 1868s] .ERROR: Particle 1 moved more than one local box length in one timestep.
[ 1868s] E......
[ 1868s] ======================================================================
[ 1868s] ERROR: test_bind_at_point_of_collision (__main__.CollisionDetection)
[ 1868s] ----------------------------------------------------------------------
[ 1868s] Traceback (most recent call last):
[ 1868s]   File "/home/abuild/rpmbuild/BUILD/espresso/build/testsuite/python/collision_detection.py", line 225, in test_bind_at_point_of_collision
[ 1868s]     self.run_test_bind_at_point_of_collision_for_pos(np.array((0, 0, 0)))
[ 1868s]   File "/home/abuild/rpmbuild/BUILD/espresso/build/testsuite/python/collision_detection.py", line 140, in run_test_bind_at_point_of_collision_for_pos
[ 1868s]     self.s.integrator.run(3000)
[ 1868s]   File "integrate.pyx", line 104, in espressomd.integrate.Integrator.run
[ 1868s]   File "utils.pyx", line 264, in espressomd.utils.handle_errors
[ 1868s] Exception: Encountered errors during integrate: ERROR: Particle 1 moved more than one local box length in one timestep.
[ 1868s] 
jngrad commented 4 years ago

On making espresso LTO-compliant: I originally thought the make_remote_handle function was either inlined or optimized out by LTO, making the corresponding function pointer either pointing to a copy of the function or dangling. Adding __attribute__((used)) to an unused function prevents its removal by LTO (tested in both gcc 7.4 and clang 6.0), yet adding this attribute to void make_remote_handle() did not help. Either such attributes are irrelevant for shared objects (looks like all symbols are still available in LTO'ed .so files), or LTO does something else than removing that function. Adding set_source_files_properties("${CMAKE_CURRENT_SOURCE_DIR}/ParallelScriptInterface.cpp" PROPERTIES COMPILE_FLAGS -fno-lto) to prevent the generation of the GIMPLE sections didn't help. Manually linking script_interface.so with the linking flag -fno-lto also didn't help.

We will not move forward with LTO-compliance for the moment. The specfile _lto_cflags variable seems to have fixed the build issue.

@junghans Errors of the type Particle X moved more than one local box length in one timestep happen from time to time in a few electrostatics tests. Don't remember seeing it in collision_detection before. Does it go away if you restart the job?

@fweik The i586 precision error reported in the first post of this issue is a recurring problem. There is no QEMU for i586 that I'm aware of, so we cannot test it in our CI infrastructure. Could we rely on GNU C Intel x86 macros in every double precision boost test? E.g., something like:

#ifdef __i586__
auto constexpr epsilon = 3 * std::numeric_limits<double>::epsilon();
#else
auto constexpr epsilon = std::numeric_limits<double>::epsilon();
#endif
RudolfWeeber commented 4 years ago

https://github.com/fweik @fweik The i586 precision error reported in the first post of this issue is a recurring problem. There is no QEMU for i586 that I'm aware of, so we cannot test it >in our CI infrastructure. Could we rely on GNU C Intel x86 macros in every double precision boost test? E.g., something like:

ifdef i586

auto constexpr epsilon = 3 * std::numeric_limits::epsilon();

else

auto constexpr epsilon = std::numeric_limits::epsilon();

endif

Please don’t use Gnu specifics. No idea if any of that works on MSVC. I also think, this is out of scope for us. The architecture is from the mid 90s.

That said, I don’t understand why the error of something that involves more than one operation should be <= 1 epsilon, in the first place. Let’s just increase the tolerance to 4 epsilon or 8 epsilon for affected tests.

fweik commented 4 years ago

So how does one find error limits for floating point calculations? Is the error bounded? Introducing random error bounds with magic numbers does not increase confidence IMHO... In general, epsilon is also just a relevant quantity if the numbers are O(1), because the precision of floating point numbers depends on their magnitude.

jngrad commented 4 years ago

Let’s just increase the tolerance to 4 epsilon or 8 epsilon for affected tests.

Not sure this will help. If a contribution leads to loss of precision, our CI infrastructure will not catch it, but OpenSUSE i586 will, and a ticket will be opened here again. If we can't test it in CI and there is no reliable way to detect it at compile time, we need to officially drop support for i586 and remove all epsilon prefactors added so far. Christoph can then craft an i586-specific specfile with a sed expression that adds an epsilon prefactor in all unit tests.

the precision of floating point numbers depends on their magnitude

There are adaptative epsilon check algorithms that take the magnitude of the floats into account and can deal with zero, but they're quite involved and need to be used with care, because some mathematical operations can lead to large deviations (e.g. sin(pi) != 0 because pi cannot be represented exactly) and will require an arbitrary absolute epsilon anyway, e.g. std::numeric_limits<double>::epsilon(). In my opinion we should remove the epsilon prefactors and let i586 users tweak the value of epsilon.

mcepl commented 4 years ago

Christoph can then craft an i586-specific specfile with a sed expression that adds an epsilon prefactor in all unit tests.

I don’t understand, what’s so complicated about adding %ifarch condition to the existing SPEC file? Also, no Linux distribution I know about uses upstream-provided SPEC files, so you don’t have to bother that much.

RudolfWeeber commented 4 years ago

On Fri, Jan 17, 2020 at 02:03:12AM -0800, Jean-Noël Grad wrote:

Let’s just increase the tolerance to 4 epsilon or 8 epsilon for affected tests.

Not sure this will help. If a contribution leads to loss of precision, our CI infrastructure will not catch it, but OpenSUSE i586 will, and a ticket will be opened here again. If we can't test it in CI and there is no reliable way to detect it at compile time, we need to officially drop support for i586 and remove all epsilon prefactors added so far. Christoph can then craft an i586-specific specfile with a sed expression that adds an epsilon prefactor in all unit tests. I'm in favor of looking at the specific test rather than discussing the abstract question of floating point accuracy.

Looking at src/utils/tests/matrix_vector_product.cpp, the involved matrix and vector are of order 1 to 100. Expected and observed results agreeing to 10E-16 should be convincing enough.

fweik commented 4 years ago

observed results agreeing to 10E-16 should be convincing enough

why?

jngrad commented 4 years ago

Update: the epsilon value in the boost unit tests was wrong and got fixed in #3427, which is now a patch in the openSUSE package (openSUSE:Factory/python3-espressomd/3427.patch). Other failing tests were patched (https://github.com/espressomd/espresso/issues/3315#issuecomment-576820269), excepted collision_detection which will not be investigated further as per https://github.com/espressomd/espresso/issues/3315#issuecomment-590367699. I'm closing this.

junghans commented 4 years ago

Now we have that problem on Fedora 33 as well. build.log.txt