BxCppDev / Bayeux

Core Persistency, Geometry and Data Processing C++ Library for Particle and Nuclear Physics Experiments
GNU General Public License v3.0

Boost 1.68 serialization breaks some Bayeux tests #39

Closed fmauger closed 5 years ago

fmauger commented 5 years ago

While exploring #36, one faces a serious problem with some of the Bayeux test programs that involve Boost-based serialization (Ubuntu 18.04, GCC 7.3). The (de)serialization itself works correctly after a few fixes required by changes in the management of XML archives. However, when the programs end, a segfault occurs after main returns. This is the list of failed tests:

The following tests FAILED:
     41 - datatools-test_serialization (SEGFAULT)
     48 - datatools-test_things_1 (SEGFAULT)
     49 - datatools-test_things_2 (SEGFAULT)
     50 - datatools-test_things_3 (SEGFAULT)
     51 - datatools-test_things (SEGFAULT)
     89 - datatools-test_backward_things (SEGFAULT)
    237 - geomtools-test_serializable_2 (SEGFAULT)
    238 - geomtools-test_serializable_3 (SEGFAULT)
    328 - mctools-test_simulated_data_1 (SEGFAULT)
Errors while running CTest

Investigating the crash, we have the following stack trace:

41: ===========================================================
41: There was a crash.
41: This is the entire stack trace of all threads:
41: ===========================================================
41: #0  0x00007fc1e442b687 in __GI___waitpid (pid=18715, stat_loc=stat_loc@entry=0x7ffc9d601068, options=options@entry=0) at ../sysdeps/unix/sysv/linux/waitpid.c:30
41: #1  0x00007fc1e4396067 in do_system (line=<optimized out>) at ../sysdeps/posix/system.c:149
41: #2  0x00007fc1e2d94f83 in TUnixSystem::StackTrace() () from /scratch/ubuntu18.04/BxInstall/root-6.16.00/lib/root/libCore.so.6.16
41: #3  0x00007fc1e2d97974 in TUnixSystem::DispatchSignals(ESignals) () from /scratch/ubuntu18.04/BxInstall/root-6.16.00/lib/root/libCore.so.6.16
41: #4  <signal handler called>
41: #5  0x00007fc1e49fa462 in std::_Rb_tree_rebalance_for_erase(std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
41: #6  0x00007fc1e4ef63e7 in boost::archive::detail::basic_serializer_map::erase(boost::archive::detail::basic_serializer const*) () from /scratch/ubuntu18.04/BxInstall/boost-1.68.0/lib/libboost_serialization.so.1.68.0
41: #7  0x00007fc1e67cefcc in boost::serialization::singleton<boost::archive::detail::pointer_iserializer<boost::archive::text_iarchive, mctools::signal::base_signal> >::get_instance()::singleton_wrapper::~singleton_wrapper() () from /home/mauger/Documents/Private/Software/BxCppDev/Bayeux/Bayeux.git/_build.d/develop_b168/BuildProducts/lib/libBayeux.so.3
41: #8  0x00007fc1e438a615 in __cxa_finalize (d=0x7fc1e6f2b720) at cxa_finalize.c:83
41: #9  0x00007fc1e5cc8e33 in __do_global_dtors_aux () from /home/mauger/Documents/Private/Software/BxCppDev/Bayeux/Bayeux.git/_build.d/develop_b168/BuildProducts/lib/libBayeux.so.3
41: #10 0x00007ffc9d603cd0 in ?? ()
41: #11 0x00007fc1e6f66b73 in _dl_fini () at dl-fini.c:138

Other tests give similar output. This looks like a problem with the order of destruction of static objects provided by the library.

fmauger commented 5 years ago

Testing with Boost 1.69, the problem vanishes. I wonder if this issue is related to https://github.com/boostorg/serialization/pull/131.

fmauger commented 5 years ago

Also note that PR #37 does not fix the serialization issue, because the seekg technique breaks the multiarchive mode for XML archives.

fmauger commented 5 years ago

So the best thing to do is to test with various architectures and compilers to make sure Boost 1.69 is OK. For now, I chose to make Bayeux fail at the configure step when Boost 1.68 is detected.

fmauger commented 5 years ago

I can also reproduce the bug with Boost 1.65.1 and GCC 7.3 on Ubuntu 18.04. Does it mean that it has been there for a while but was not revealed so far? I am pretty sure this is a problem of invalid ordering when invoking the destructors of some possibly nested static singletons. This breaks the rule on the order of destruction of static objects within a single binary unit. I observe this rather arbitrarily with Bayeux test programs consisting of a single executable linked to libBayeux.so. However, I cannot prove that we have only one unit. Maybe there are subtle effects between the executable code and the shared library. Note that switching from GCC 7.3 to GCC 6.5 does not change anything, but when Bayeux is built with Boost 1.69, the problem disappears.
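To picture the suspected failure mode, here is a toy model in C++. It is only a sketch: `Registry` and `Entry` are hypothetical names standing in for a serializer map and the singletons registered in it, and this is not Boost's actual code. Each entry removes itself from the registry in its destructor, which is only safe while the registry is still alive. A liveness flag is one common way to protect such a destructor when the destruction order cannot be trusted; no claim is made that this is how the Boost fix works.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <set>

// Hypothetical registry of live objects (illustration only).
class Registry {
public:
    Registry() { alive() = true; }
    ~Registry() { alive() = false; }
    void insert(const void* p) { entries_.insert(p); }
    void erase(const void* p) { entries_.erase(p); }
    std::size_t size() const { return entries_.size(); }
    // True only between construction and destruction of the registry.
    static bool is_alive() { return alive(); }
private:
    static bool& alive() { static bool flag = false; return flag; }
    std::set<const void*> entries_;
};

// Hypothetical registered object: it unregisters itself on destruction.
class Entry {
public:
    explicit Entry(Registry& r) : registry_(&r) { registry_->insert(this); }
    ~Entry() {
        // Without this guard, erase() would touch a destroyed std::set
        // whenever the registry happens to be destroyed first.
        if (Registry::is_alive()) registry_->erase(this);
    }
private:
    Registry* registry_;
};

// Expected order: entries die before the registry, which ends up empty.
std::size_t entries_left_after_normal_order() {
    Registry reg;
    {
        Entry a(reg);
        Entry b(reg);
        assert(reg.size() == 2);
    }
    return reg.size();
}

// Wrong order, simulated by hand: the registry dies before its entry.
bool survives_reversed_order() {
    auto reg = std::make_unique<Registry>();
    auto entry = std::make_unique<Entry>(*reg);
    reg.reset();    // registry destroyed first
    entry.reset();  // guard skips erase(), so nothing dangling is touched
    return !Registry::is_alive();
}
```

Calling `entries_left_after_normal_order()` exercises the expected order, while `survives_reversed_order()` forces the registry to be destroyed first, which is the situation the stack trace above suggests happens at program exit.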

fmauger commented 5 years ago

It seems the cause has been identified in https://github.com/boostorg/serialization/issues/104 and fixed in https://github.com/boostorg/serialization/pull/131.

After many tests with several versions of Boost (1.63, 1.65.1 (default on Ubuntu 18.04), 1.68 and finally 1.69), I understand that Boost versions > 1.64 and < 1.69 have a broken singleton implementation with respect to the order in which the destructors of static objects are called from shared libraries using Boost/Serialization with GCC under Linux. I used GCC 6.5 and 7.3 and reproduced the same crash at program termination, as expected and described by the experts.

Passing "-Bsymbolic -Bsymbolic-functions" to the linker should fix the Boost/Serialization crash, but I did not test it.

So I consider that we should not use Boost 1.65 to 1.68 for Bayeux and bump directly to 1.69 which seems to solve the problem as mentioned in the discussions in https://github.com/boostorg/serialization/issues/104 and https://github.com/boostorg/serialization/pull/131.
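As an illustration of the version range stated above, the check could also be expressed in C++. This is a minimal sketch: the helper names `is_affected_boost_version` and `describe_boost_version` are hypothetical and not part of Bayeux or Boost. It relies only on the documented encoding of the `BOOST_VERSION` macro (major * 100000 + minor * 100 + patch).

```cpp
#include <cassert>
#include <string>

// BOOST_VERSION encodes a release as major * 100000 + minor * 100 + patch,
// for example 106800 for Boost 1.68.0.
// Returns true for the range reported as broken in this issue (1.65 to 1.68).
constexpr bool is_affected_boost_version(int boost_version) {
    return boost_version >= 106500 && boost_version < 106900;
}

// Decodes the integer into a human-readable "major.minor.patch" string.
std::string describe_boost_version(int boost_version) {
    assert(boost_version >= 0);
    const int major = boost_version / 100000;
    const int minor = boost_version / 100 % 1000;
    const int patch = boost_version % 100;
    return std::to_string(major) + "." + std::to_string(minor) + "." +
           std::to_string(patch);
}
```

In real code, one would include `<boost/version.hpp>` and write something like `static_assert(!is_affected_boost_version(BOOST_VERSION), "Boost 1.65 to 1.68 is not supported");` so that the build stops early instead of crashing at program exit.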

Of course, this issue has no effect on the consistency of the data serialized through the Bayeux I/O tools.