Closed junghans closed 4 years ago
@mcepl saved us from OpenSUSE removal.
Regarding
[ 1116s] 135/145 Test #131: lb_momentum_conservation ..........................***Failed 2.53 sec
[ 1116s] Traceback (most recent call last):
[ 1116s] File "/home/abuild/rpmbuild/BUILD/espresso/build/testsuite/python/lb_momentum_conservation.py", line 40, in <module>
[ 1116s] class Momentum(object):
[ 1116s] File "/home/abuild/rpmbuild/BUILD/espresso/build/testsuite/python/lb_momentum_conservation.py", line 48, in Momentum
[ 1116s] system = espressomd.System(box_l=[BOX_SIZE] * 3)
[ 1116s] File "system.pyx", line 149, in espressomd.system.System.__init__
[ 1116s] File "collision_detection.pyx", line 53, in espressomd.collision_detection.CollisionDetection.__init__
[ 1116s] File "script_interface.pyx", line 297, in espressomd.script_interface.ScriptInterfaceHelper.__init__
[ 1116s] File "script_interface.pyx", line 78, in espressomd.script_interface.PScriptInterface.__init__
[ 1116s] IndexError: _Map_base::at
I have no idea. The trace seems to be incomplete. To my understanding, no map lookup happens at PScriptinterface.get_sip(). In case you can interact with (or download and install) the resulting build, could you please run
./pypresso
import espressomd.script_interface
and check whether CollisionDetection::CollisionDetection is in the
script_interface._python_class_by_so_name dict?
It also needs to be determined, whether the script interface class for collision detection is registered in src/scriptinterface/collision_detection/initialize.cpp.
The only reason I can imagine for this not to happen is inconsistent cached versions between config.hpp and myconfig.pxi. The latter contains the definition of active features for Python/Cython.
Could you re-run this in a clean (uncached) environment?
Surprisingly on i586, only matrix_vector_product fails:
With regards to
[ 765s] 48/68 Test #48: matrix_vector_product ............***Failed 0.00 sec [ 765s] Running 1 test case... [ 765s] /home/abuild/rpmbuild/BUILD/espresso/src/utils/tests/matrix_vector_product.cpp(36): [1;31;49merror: in "inner_product": difference{4.31155e-17} between result[i]{30.900000000000002} and boost::inner_product(matrix[i], vector, 0.0){30.900000000000002} exceeds 2.22045e-16%[0;39;49m
I don’t think there is a guarantee that the difference will be only once std::numeric_limits<double>::epsilon for calculations involving several operations.
I think we can just increase the tolerance by a factor of 5 for the vector and matrix_vector tests in src/utils/tests
This looks more like a linker/shared object problem. The exception is raised because an unknown class name is encountered by the factory, which lives in global static memory. In the place where the exception is raised the access is from a different shared object than in all the other cases (where it works). I can maybe have a look next week. The question is also how urgent this is, because some of the relevant code is also up for replacement, so the problem could also go away by itself given some time.
The question is also how urgent this is,
not super urgent … if we know that somebody actually works on this issue, I can keep espresso put in openSUSE.
I was able to reproduce this on OpenSuse tumbleweed with:
osc co home:cjunghans:branches:devel:languages:python:numeric python3-espressomd #create account for https://build.opensuse.org/ first
cd "home:cjunghans:branches:devel:languages:python:numeric/python3-espressomd"
osc build --ccache
osc shell --ccache
cd rpmbuild/BUILD/espresso/build/
LD_LIBRARY_PATH='/home/abuild/rpmbuild/BUILDROOT/python3-espressomd-4.1.2-0.x86_64/usr/lib64/python3.7/site-packages/espressomd:/usr/lib64/mpi/gcc/openmpi2/lib64/' make check ARGS="-E all_compare_test"
(Not sure why all_compare_test
fails in my VM`)
This looks more like a linker/shared object problem. The exception is raised because an unknown class name is encountered by the factory, which lives in global >static memory. In the place where the exception is raised the access is from a different shared object than in all the other cases (where it works).
So, we cannot safely declare Python script interface classes in pyx files.
IIRC, the only reason I made collision_detection.pyx a Cython file was not to copy the values of the collision mode enum. It should be easily separable. The long-term solution will come with the refactoring of collision detection and its interface.
Due to the seriousness of the bugs fixed in 4.1.2, I think we need to fix the build.
We’ll sort it out this week.
@RudolfWeeber, if you make a clean patch, I will just apply it in the 4.1.2 rpm for OpenSUSE.
Could not reproduce the error in a docker container with the opensuse/tumbleweed
base image on an x86_64 machine with espresso 4.1.2.
Did you use osc
?
For me it still persist using osc
.
In an openSUSE Docker container:
@junghans I thought osc
would use rpmbuild
internally. Nevermind.
@RudolfWeeber @fweik In GDB, pypresso throws at the call to make_shared("CollisionDetection::CollisionDetection", ScriptInterface::ScriptInterfaceBase::CreationPolicy::GLOBAL)
in self.set_sip(make_shared(to_char_pointer(name), policy_))
when instantiating a System
class. However, disabling the COLLISION_DETECTION
feature doesn't help, as the error message persists with ComFixed
. Removing ComFixed
still doesn't help, as the error message persists with Constraints
. The other System
attributes defined prior to the CollisionDetection
attribute don't throw, e.g. CellSystem
or interactions.BondedInteractions
. If you re-order the attributes of the System
class judiciously, a pattern emerges: Cython classes deriving from ScriptInterfaceHelper
and Python classes throw IndexError: _Map_base::at
, while other Cython classes don't throw.
@junghans I thought
osc
would userpmbuild
internally. Nevermind.
It does, but first it generates isolated environment. Would diff between output of rpmbuild -ba
and osc build
(or osc lbl
) be any helpful?
@mcepl rpmbuild -ba
succeeds while osc build
fails due to the index error. There are several differences in the compiler and linker flags passed as argument to CMake in the osc build
vs. the rpmbuild
, but even when using the rpmbuild
-generated CMake command in an osc shell
, there are still Rpath issues, so osc
must be setting environment variables as well. This means we have to open an account on OpenSUSE to be able to troubleshoot the specfile in an osc shell
.
@junghans I think there is something wrong in the CMake command, either an incorrect linking flag or an Rpath error. Within osc shell
, I can compile espresso and run the Python tests just fine:
cd /home/abuild/rpmbuild/BUILD/espresso/build
rm -rf *
rm -rf /home/abuild/rpmbuild/BUILDROOT/python3-espressomd-4.1.2-0.x86_64/usr/lib64/python3.7/site-packages/espressomd/*
cmake .. -DCMAKE_BUILD_TYPE=Debug -DMPIEXEC_PREFLAGS=--allow-run-as-root
make -j24 check_python ARGS=-j24
In that shell, I also tried the fully expanded CMake command used by osc build
and tinkered with the linker flags, both the additional ones from your specfile and those added by osc
, without success.
@jngrad yeah rpm hates rpath hence %cmake
injects something like -DCMAKE_SKIP_RPATH:BOOL=ON
and that is why we have to explicitly use LD_LIBRARY_PATH
with make check
. Is there a way to compile espresso without an rpath being injected after install?
(On Fedora, which isn't as strict with rpaths, I found that even with -DCMAKE_INSTALL_RPATH=""
the libraries will have some like /tmp/foo/bar
in the rpath.)
@junghans Just had a look with objdump -a -x
in the .so files and the RPATH field is indeed empty. This makes sense now. OpenSUSE forbids hardcoded Rpath values in .so files, so we have to work around it. The issue is reproducible on Ubuntu 18.04 simply by copying the osc
CMake command:
cd build
rm -rf *
rm -rf /tmp/espresso-unit-tests
cp ../maintainer/configs/empty.hpp myconfig.hpp
cmake .. '-GUnix Makefiles' -DCMAKE_INSTALL_PREFIX:PATH=/tmp/espresso-unit-tests \
-DCMAKE_INSTALL_LIBDIR:PATH=/usr/lib -DCMAKE_BUILD_TYPE=Debug \
-DCMAKE_C_FLAGS='-Og -Wall -D_FORTIFY_SOURCE=2 -Werror=return-type -flto=auto -DNDEBUG' \
-DCMAKE_CXX_FLAGS='-Og -Wall -D_FORTIFY_SOURCE=2 -Werror=return-type -flto=auto -DNDEBUG' \
-DCMAKE_EXE_LINKER_FLAGS='-flto=auto -Wl,--as-needed -Wl,--no-undefined -Wl,-z,now' \
-DCMAKE_MODULE_LINKER_FLAGS='-flto=auto -Wl,--as-needed' \
-DCMAKE_SHARED_LINKER_FLAGS='-flto=auto -Wl,--as-needed -Wl,--no-undefined -Wl,-z,now' \
-DCMAKE_SKIP_RPATH:BOOL=ON -DBUILD_SHARED_LIBS:BOOL=ON \
-DCMAKE_COLOR_MAKEFILE:BOOL=OFF \
-DCMAKE_SHARED_LINKER_FLAGS='-Wl,--as-needed -Wl,-z,now' \
-DPYTHON_EXECUTABLE=/usr/bin/python3 -DINSTALL_PYPRESSO=OFF
make -j$(nproc) install
# trigger index error
LD_LIBRARY_PATH="/tmp/espresso-unit-tests/lib/python3.6/site-packages/espressomd/:/usr/lib/x86_64-linux-gnu/openmpi" make -j$(nproc) check_python ARGS="-R ^lb_momentum_conservation$"
# GDB backtrace
LD_LIBRARY_PATH="/tmp/espresso-unit-tests/lib/python3.6/site-packages/espressomd/:/usr/lib/x86_64-linux-gnu/openmpi" ./pypresso --gdb "testsuite/python/lb_momentum_conservation.py"
(gdb) catch throw
(gdb) run
(gdb) bt
For some reason, with the Ubuntu build the GDB backtrace doesn't allow printing the value of Cython variables (either empty or optimized out), while in the OpenSUSE build most of them can be printed.
Just a note, the hate towards rpath is not only openSUSE-thing, Fedora Packaging Guidelines forbids it as well, we are just a way more thorough about actual checking it (running rpmlint on every package build).
@RudolfWeeber using the procedure above for Ubuntu 18:
>>> from espressomd import script_interface
>>> script_interface._python_class_by_so_name['CollisionDetection::CollisionDetection']
<class 'espressomd.collision_detection.CollisionDetection'>
>>> from espressomd import System
>>> s = System(box_l=(1,1,1))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "system.pyx", line 149, in espressomd.system.System.__init__
File "collision_detection.pyx", line 53, in espressomd.collision_detection.CollisionDetection.__init__
File "script_interface.pyx", line 297, in espressomd.script_interface.ScriptInterfaceHelper.__init__
File "script_interface.pyx", line 78, in espressomd.script_interface.PScriptInterface.__init__
IndexError: _Map_base::at
When compiling espresso with the minimal config, the issue happens for the ComFixed attribute, as described earlier. I don't get why the Rpath would have any impact on ComFixed, which is defined in a plain python file. Could it be another issue? The GDB trace for both CollisionDetection()
and ComFixed()
actually fails before the name
string is looked up, at this line:
https://github.com/espressomd/espresso/blob/50b25e6c0584bd38973f7bcb961be08b78211229/src/script_interface/ParallelScriptInterface.cpp#L47
which calls
https://github.com/espressomd/espresso/blob/50b25e6c0584bd38973f7bcb961be08b78211229/src/core/MpiCallbacks.hpp#L494
with fp = (void (*)(void)) <(anonymous namespace)::make_remote_handle()>
and m_func_ptr_to_id
a container of type std::unordered_map<void (*)(), int>
. The container throws because the fp
callback is not registered. Adding the following assertion catches the mistake:
assert(m_func_ptr_to_id.find(reinterpret_cast<void (*)()>(fp))!=m_func_ptr_to_id.end());
@jngrad can you please try giving the function make_remote_handle
external linkage? Since the lookup is by address, it's possible that the address is not the same for every point of view for weak linkage. That behavior could also be (at least in theory) be influence by whether the rpath is set (because the symbol resolution is done at link/load time).
Adding extern
didn't help, additionally pulling make_remote_handle
out of the anonymous namespace also didn't help. Adding the Rpath manually with patchelf --set-rpath
didn't help. Removing successively -Wl,-z,now
, -Wl,--as-needed
, -Wl,--no-undefined
(in that order) didn't help.
I think the root cause is the LTO (link time optimization), enabled through the -flto
compiler/linker flag, introduced by default in OpenSUSE Tumbleweed (mailing list thread), and was proposed for inclusion in Fedora (proposal). LTO recompiles the objects during the linking step to reduce binary sizes.
@junghans Until espresso gets LTO'ed, a temporary fix would be to manually remove the -flto=auto
flag from every CMake flag (I couldn't find a global LTO switch). Applying the following patch to espressomd.spec allowed osc build
to successfully run the Python tests on my side.
--- python3-espressomd.spec (revision fdbeb87bede60ca9aa7dac9bebb5ab7f)
+++ python3-espressomd.spec (working copy)
@@ -81,7 +82,13 @@
'-DCMAKE_SHARED_LINKER_FLAGS=-Wl,--as-needed -Wl,-z,now' \
-DLIBDIR=%{_lib} \
-DPYTHON_EXECUTABLE=%{_bindir}/python3 \
- -DINSTALL_PYPRESSO=OFF
+ -DINSTALL_PYPRESSO=OFF \
+ '-DCMAKE_C_FLAGS=-O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -Werror=return-type -DNDEBUG' \
+ '-DCMAKE_CXX_FLAGS=-O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -Werror=return-type -DNDEBUG' \
+ '-DCMAKE_Fortran_FLAGS=-O2 -Wall -D_FORTIFY_SOURCE=2 -fstack-protector-strong -funwind-tables -fasynchronous-unwind-tables -fstack-clash-protection -Werror=return-type -DNDEBUG' \
+ '-DCMAKE_EXE_LINKER_FLAGS=-Wl,--as-needed -Wl,--no-undefined -Wl,-z,now' \
+ '-DCMAKE_MODULE_LINKER_FLAGS=-Wl,--as-needed' \
+ '-DCMAKE_SHARED_LINKER_FLAGS=-Wl,--as-needed -Wl,-z,now'
%make_jobs
%install
[ 526s] RPMLINT report:
[ 526s] ===============
[ 530s] python3-espressomd.x86_64: W: shared-lib-without-dependency-information /usr/lib64/python3.7/site-packages/espressomd/EspressoConfig.so
[ 530s] 2 packages and 0 specfiles checked; 0 errors, 2 warnings.
[ 530s]
[ 530s]
[ 530s] finished "build python3-espressomd.spec" at Wed Jan 15 18:36:15 UTC 2020.
[ 530s]
/var/tmp/build-root/openSUSE_Tumbleweed-x86_64/home/abuild/rpmbuild/SRPMS/python3-espressomd-4.1.2-0.src.rpm
/var/tmp/build-root/openSUSE_Tumbleweed-x86_64/home/abuild/rpmbuild/RPMS/x86_64/python3-espressomd-4.1.2-0.x86_64.rpm
@jngrad will test this @mcepl is there a more elegant way to do this?
Ok, x86_64 is ok now, i586 still fails on matrix_vector_product.
@jngrad will test this @mcepl is there a more elegant way to do this?
Yes, add definition
%define _lto_cflags %{nil}
On i586 in addition:
[ 1868s] 108/145 Test #84: collision_detection ...............................***Failed 7.01 sec
[ 1868s] .ERROR: Particle 1 moved more than one local box length in one timestep.
[ 1868s] E......
[ 1868s] ======================================================================
[ 1868s] ERROR: test_bind_at_point_of_collision (__main__.CollisionDetection)
[ 1868s] ----------------------------------------------------------------------
[ 1868s] Traceback (most recent call last):
[ 1868s] File "/home/abuild/rpmbuild/BUILD/espresso/build/testsuite/python/collision_detection.py", line 225, in test_bind_at_point_of_collision
[ 1868s] self.run_test_bind_at_point_of_collision_for_pos(np.array((0, 0, 0)))
[ 1868s] File "/home/abuild/rpmbuild/BUILD/espresso/build/testsuite/python/collision_detection.py", line 140, in run_test_bind_at_point_of_collision_for_pos
[ 1868s] self.s.integrator.run(3000)
[ 1868s] File "integrate.pyx", line 104, in espressomd.integrate.Integrator.run
[ 1868s] File "utils.pyx", line 264, in espressomd.utils.handle_errors
[ 1868s] Exception: Encountered errors during integrate: ERROR: Particle 1 moved more than one local box length in one timestep.
[ 1868s]
On making espresso LTO-compliant: I originally thought the make_remote_handle
function was either inlined or optimized out by LTO, making the corresponding function pointer either pointing to a copy of the function or dangling. Adding __attribute__((used))
to an unused function prevents its removal by LTO (tested in both gcc 7.4 and clang 6.0), yet adding this attribute to void make_remote_handle()
did not help. Either such attributes are irrelevant for shared objects (looks like all symbols are still available in LTO'ed .so files), or LTO does something else than removing that function. Adding set_source_files_properties("${CMAKE_CURRENT_SOURCE_DIR}/ParallelScriptInterface.cpp" PROPERTIES COMPILE_FLAGS -fno-lto)
to prevent the generation of the GIMPLE sections didn't help. Manually linking script_interface.so
with the linking flag -fno-lto
also didn't help.
We will not move forward with LTO-compliance for the moment. The specfile _lto_cflags
variable seems to have fixed the build issue.
@junghans Errors of the type Particle X moved more than one local box length in one timestep
happen from time to time in a few electrostatics tests. Don't remember seeing it in collision_detection
before. Does it go away if you restart the job?
@fweik The i586 precision error reported in the first post of this issue is a recurring problem. There is no QEMU for i586 that I'm aware of, so we cannot test it in our CI infrastructure. Could we rely on GNU C Intel x86 macros in every double precision boost test? E.g., something like:
#ifdef __i586__
auto constexpr epsilon = 3 * std::numeric_limits<double>::epsilon();
#else
auto constexpr epsilon = std::numeric_limits<double>::epsilon();
#endif
https://github.com/fweik @fweik The i586 precision error reported in the first post of this issue is a recurring problem. There is no QEMU for i586 that I'm aware of, so we cannot test it >in our CI infrastructure. Could we rely on GNU C Intel x86 macros in every double precision boost test? E.g., something like:
ifdef i586
auto constexpr epsilon = 3 * std::numeric_limits
::epsilon(); else
auto constexpr epsilon = std::numeric_limits
::epsilon(); endif
Please don’t use Gnu specifics. No idea if any of that works on MSVC. I also think, this is out of scope for us. The architecture is from the mid 90s.
That said, I don’t understand why the error of something that involves more than one operation should be <= 1 epsilon, in the first place. Let’s just increase the tolerance to 4 epsilon or 8 epsilon for affected tests.
So how does one find error limits for floating point calculations? Is the error bounded? Introducing random error bounds with magic numbers does not increase confidence IMHO... In general, epsilon
is also just a relevant quantity if the numbers are O(1)
, because the precision of floating point numbers depends on their magnitude.
Let’s just increase the tolerance to 4 epsilon or 8 epsilon for affected tests.
Not sure this will help. If a contribution leads to loss of precision, our CI infrastructure will not catch it, but OpenSUSE i586 will, and a ticket will be opened here again. If we can't test it in CI and there is no reliable way to detect it at compile time, we need to officially drop support for i586 and remove all epsilon prefactors added so far. Christoph can then craft an i586-specific specfile with a sed expression that adds an epsilon prefactor in all unit tests.
the precision of floating point numbers depends on their magnitude
There are adaptative epsilon check algorithms that take the magnitude of the floats into account and can deal with zero, but they're quite involved and need to be used with care, because some mathematical operations can lead to large deviations (e.g. sin(pi) != 0
because pi cannot be represented exactly) and will require an arbitrary absolute epsilon anyway, e.g. std::numeric_limits<double>::epsilon()
. In my opinion we should remove the epsilon prefactors and let i586 users tweak the value of epsilon.
Christoph can then craft an i586-specific specfile with a sed expression that adds an epsilon prefactor in all unit tests.
I don’t understand, what’s so complicated about adding %ifarch
condition to the existing SPEC file? Also, no Linux distribution I know about uses upstream-provided SPEC files, so you don’t have to bother that much.
On Fri, Jan 17, 2020 at 02:03:12AM -0800, Jean-Noël Grad wrote:
Let’s just increase the tolerance to 4 epsilon or 8 epsilon for affected tests.
Not sure this will help. If a contribution leads to loss of precision, our CI infrastructure will not catch it, but OpenSUSE i586 will, and a ticket will be opened here again. If we can't test it in CI and there is no reliable way to detect it at compile time, we need to officially drop support for i586 and remove all epsilon prefactors added so far. Christoph can then craft an i586-specific specfile with a sed expression that adds an epsilon prefactor in all unit tests. I'm in favor of looking at the specific test rather than discussing the abstract question of floating point accuracy.
Looking at src/utils/tests/matrix_vector_product.cpp, the involved matrix and vector are of order 1 to 100. Expected and observed results agreeing to 10E-16 should be convincing enough.
observed results agreeing to 10E-16 should be convincing enough
why?
Update: the epsilon value in the boost unit tests was wrong and got fixed in #3427, which is now a patch in the openSUSE package (openSUSE:Factory/python3-espressomd/3427.patch). Other failing tests were patched (https://github.com/espressomd/espresso/issues/3315#issuecomment-576820269), excepted collision_detection
which will not be investigated further as per https://github.com/espressomd/espresso/issues/3315#issuecomment-590367699. I'm closing this.
Now we have that problem on Fedora 33 as well. build.log.txt
Due to https://build.opensuse.org/request/show/760741, I tried to bump espresso to v4.1.2 on OpenSUSE, but many tests fail on x86_64 with:
Details here.
Surprisingly on i586, only
matrix_vector_product
fails:Details here