boostorg / serialization

Boost.org serialization module
http://boost.org/libs/serialization
119 stars 139 forks source link

Singleton destruction check completely broken leading to memory leaks or crashs #104

Closed Flamefire closed 5 years ago

Flamefire commented 6 years ago

The commit https://github.com/boostorg/serialization/commit/b0a794da38b52c8ae0f8155cf3bb30321cf67f00 introduced a change that completely breaks the singleton implementation.

The problem is: singleton<T> is NOT a singleton! It is NEVER instantiated. Instead it instantiates a class singleton_wrapper derived from T. That instance is then returned in the get_instance method.

Because it is never instantiated it is also never destroyed. This leads to the destroyed flag never set. It only happens to be false because it is default (or value?) initialized.

This was made even worse with commit https://github.com/boostorg/serialization/commit/7d216b47619f6b7aa57c5f0e48b9131a025426f9 where the singleton instance is not a static variable with automatic, thread-safe construction/destruction but a pointer which is leaked.

The whole implementation is against the singleton pattern: Currently is is completely possible to create a singleton<Foo> instance. Even worse: Currently you would have to do this EXACTLY once to have correct code for EVERY singleton. If you don't do this you get a memory leak or corrupted memory depending on the version. If you do this more than once you yet a double-free error which cannot be caught.

My suggestion: Make the singleton a real singleton (private ctor, no assignment etc.), inherit from T directly and let the instance variable be singleton instead of singleton_wrapper (get rid of that, but check with http://tinyurl.com/ljdp8 that this does not cause a regression, if it does, put the members ( destroyed...) into singleton_wrapper like before), check the setting of the destroyed flag which must rely on the actual destruction of the "real"(!!!) singleton instance.

Flamefire commented 6 years ago

And to echo it here a reproducer in text:

The problem is very complicated to reproduce, due to unknown init/finalize order. I assume: typeid-singleton gets created and registers for destruction. It calls into ktmap which causes its instantiation and registers ktmap for destruction. Finalize order is the reverse so ktmap is destroyed first.

So one problem is the usage of the singleton ctor/dtor which makes people assume it is used.

Flamefire commented 6 years ago

Testcase to demonstrate this: https://gist.github.com/Flamefire/80af5fcdc0f2a6787858738ce07b25bf

Run with Boost 1.64 -> 1.67:

boost_1_64_0
With assert:
With print:
~A<1>
singleton<A<0> >::is_destroyed(): 0
singleton<A<1> >::is_destroyed(): 1
~A<0>
singleton<A<0> >::is_destroyed(): 1
singleton<A<1> >::is_destroyed(): 1

boost_1_65_0
With assert:
terminate called after throwing an instance of 'std::runtime_error'
  what():  Assertion failed: singleton<A<0> >::is_destroyed()
Abgebrochen (Speicherabzug geschrieben)
With print:
~A<1>
singleton<A<0> >::is_destroyed(): 0
singleton<A<1> >::is_destroyed(): 0
~A<0>
singleton<A<0> >::is_destroyed(): 0
singleton<A<1> >::is_destroyed(): 0

boost_1_65_1
With assert:
terminate called after throwing an instance of 'std::runtime_error'
  what():  Assertion failed: singleton<A<0> >::is_destroyed()
Abgebrochen (Speicherabzug geschrieben)
With print:
~A<1>
singleton<A<0> >::is_destroyed(): 0
singleton<A<1> >::is_destroyed(): 0
~A<0>
singleton<A<0> >::is_destroyed(): 0
singleton<A<1> >::is_destroyed(): 0

boost_1_66_0
With assert:
terminate called after throwing an instance of 'std::runtime_error'
  what():  Assertion failed: states[0] == CS_DESTROYED
Abgebrochen (Speicherabzug geschrieben)
With print:

boost_1_67_0
With assert:
terminate called after throwing an instance of 'std::runtime_error'
  what():  Assertion failed: states[0] == CS_DESTROYED
Abgebrochen (Speicherabzug geschrieben)
With print:

As you can easily see 1.64 is working, 1.65.0 introduced the bug that is_destroyed is never set and 1.66 and up never destroys the instance.

Flamefire commented 6 years ago

Some more analysis: singleton<T> is used in 2 ways: Once can inherit from it and use T::get_const_instance() or not change T and use singleton<T>::get_const_instance()

The problem is with the latter as singleton<T> is in this case never instantiated while it is in the first case (through T). This is why the code breaks sometimes but not with every use.

Flamefire commented 6 years ago

After some more analysis: I wonder what really happened.

extended_type_info_typeid refered to singleton<tkmap> in its ctor. This means that singleton<tkmap> has to be constructed first. Which means that it should be destructed last (see https://stackoverflow.com/questions/335369/finding-c-static-initialization-order-problems#335746)

Apparently (maybe a bug in the compiler?) singleton<tkmap> is destructed first. This leads to the assertion failure Assertion failed: (! singleton<tkmap>::is_destroyed()), function type_unregister, file libs/serialization/src/extended_type_info_typeid.cpp, line 98. as reported by @vasild in #69. However that assertion failure is ok as there is a check for the same below in which case the unregister is skipped. The bug introduced does not set is_destroyed and hence it crashes ( singleton<tkmap> is still destructed first!)

To summarize: A crash/valgrind detected memory error can occur, IF the compiler destroys singleton<tkmap> first. I read that the order of destruction is the order of the finished ctors. This would preclude the "bad" scenario. But it could be the order of started ctors. If both are valid in C++ then the occasional crash depending on compiler version is explained.

Local Static function member is explained here: https://isocpp.org/wiki/faq/ctors#construct-on-first-use-v2. However I did not find anything about dtor order for this case.

jalegido commented 6 years ago

According to Microsoft (https://msdn.microsoft.com/en-us/library/xwec471e.aspx) msvc creates and destroys static objects in the same order they are in the code. In my experience, this means that it deppends of the order files in cl.exe. For example, static objects in A.cpp are initialezed before than objects in B.cpp if you exec "cl A.cpp B.cpp". But if you change the order "cl B.cpp A.cpp", then the initialization order also reverse.

Flamefire commented 6 years ago

This does not apply here. We don't have static objects, we have member function local static objects. This is different. According to this (https://stackoverflow.com/questions/335369/finding-c-static-initialization-order-problems#335746) it should work.

IsoCPP FAQ does mention this too: https://isocpp.org/wiki/faq/ctors#construct-on-first-use-v2

However I did not find anything about dtor order for this case.

Flamefire commented 6 years ago

The problem occurs when using multiple shared libraries. It seems like this messes up the destruction order.

Reproduction: https://gist.github.com/Flamefire/286e9e0e501731a04f10786450d3e711

Some debug info by hooking the ctor/dtor of the singletons:

ctor 0x7f9f0aa074a0 std::multiset<const boost::serialization::typeid_system::extended_type_info_typeid_0*, boost::serialization::typeid_system::type_compare> ctor 0x7f9f0a7f63e0 boost::serialization::extended_type_info_typeid

ctor 0x7f9f0a7f64a0 std::multiset<const boost::serialization::typeid_system::extended_type_info_typeid_0*, boost::serialization::typeid_system::type_compare> ctor 0x7f9f0aa073e0 boost::serialization::extended_type_info_typeid

dtor 0x7f9f0aa073e0 boost::serialization::extended_type_info_typeid dtor 0x7f9f0aa074a0 std::multiset<const boost::serialization::typeid_system::extended_type_info_typeid_0, boost::serialization::typeid_system::type_compare> dtor 0x7f9f0a7f64a0 std::multiset<const boost::serialization::typeid_system::extended_type_info_typeid_0, boost::serialization::typeid_system::type_compare> dtor 0x7f9f0a7f63e0 boost::serialization::extended_type_info_typeid

Note how the 2 maps are destructed together which leads to the wrong ordering for the 2nd pair.

robertramey commented 6 years ago

OK - I'm working on this now. I'm presuming that https://gist.github.com/Flamefire/286e9e0e501731a04f10786450d3e711 contains the definitive test case. I can see that it was a bitch to figure this out. I would like to figure out a way to included in the test suite - but there doesn't seem to be a way to do this. Making two DLLS etc. ...

robertramey commented 6 years ago

we have two test cases here. Are they both useful? Or is it only the second one?

Flamefire commented 6 years ago

I found a way to include it in the test suite and opened PR #105 for that and the fix.

The 2 test cases test something different: The 2nd (https://gist.github.com/Flamefire/286e9e0e501731a04f10786450d3e711) tests the library from a higher level (just minimal code to break it)

The other one tests the implementation of the singleton thoroughly (e.g. checking that they are in fact destructed and set the flag)

kunshouout commented 6 years ago

I suffer from this issue recently. My code got memory leaks. Is there a quick fix for this problem. I am using vs2017 + boost 1.66. Thx

Flamefire commented 6 years ago

You can test my PR: https://github.com/boostorg/serialization/pull/105

Either as a whole or just the fix (last 2 commits). Note that you can generate patches to download by appending ".diff" to the url on GitHub: https://github.com/boostorg/serialization/commit/17c952b7634b9c0ab8f257c679451587b5c7f280.diff and https://github.com/boostorg/serialization/commit/d63b8535f5e615c1edc8ab50078508e4d56816cf.diff

res2k commented 6 years ago

FWIW, the reason why test_dll_exported currently succeeds on Travis appears to be --as-needed: Ubuntu enables this flag by default, and that apparently is enough to change the initialization/finalization order such that everything works.

res2k commented 6 years ago

Small addendum to previous comment: --as-needed doesn't change the initialization/finalization, it causes the removal of the reference to libpolymorphic_derived2.so. Defining something in that lib so it's actually needed at runtime – so the reference stays – causes the test to fail again: https://travis-ci.org/res2k/serialization/builds/421484746

robertramey commented 6 years ago

FYI - I spent a little time reviewing the test matrix for test_dll_export to try to find some commonality. It seems that failures occur only on linux platforms. Failures seem to be across all combinations of compiler and standard library implementations. I'm still looking at this so my assessment could change. It's a little tedious to do with our tools.

Flamefire commented 6 years ago

This matches my experiments with the crash as described here and shown in #111: It only affects linux but there for all environments. Looks like shared libraries are handled differently in linux in general.

res2k commented 6 years ago

I think I found the culprit: By default, the Linux dynamic linker binds dynamic symbols in a shared object to any already existing symbols in the global namespace. Some singleton<> instances are defined in both the executable and the .so (as it may happen if you use singleton<> in a shared header). When the singletons are used the instance in the executable is constructed; for some reason, their destruction is scheduled for after the libboost_serialization finalization, meaning it's singletons are already destroyed, resulting in funny things happening as they're being used by the singletons related to the polymorphic classes. One way to change this behaviour is to pass -Bsymbolic -Bsymbolic-functions to the linker; doing so fixes the crash. It might also possible obtain the desired behaviour by using __attribute__((visibility(…))), I'm trying to see if I can get anywhere with that.

res2k commented 6 years ago

It's possible to fix the issue by fiddling with __attribute__((visibility(…))), see https://gist.github.com/res2k/c5a9e90bb2f22d0aea0b9f8cde6f1ef4 for the changes I had to make.

res2k commented 6 years ago

Changes, cleaned up, and CI results: https://github.com/res2k/serialization/pull/10

Flamefire commented 6 years ago

Thanks! What I'm missing in your explanation: There are definitely 2 instances of the same singleton for shared libraries linked to static boost and they are destroyed just after each other instead of their correct, library intern order which leads to the crash in the 2nd library.

Did you also test #111? Would be interesting to know if that fixes it too.

And finally: Of course the singleton bug has to be fixed anyway: #110

krojew commented 6 years ago

I am running into crashes with boost::serialization::singleton in 1.68; probably related to this issue. Example valgrind dump:

 ==2665== Invalid read of size 8
==2665==    at 0xA63DB81: boost::serialization::typeid_system::extended_type_info_typeid_0::type_unregister() (in libboost_serialization.so.1.68.0)
==2665==    by 0x7BD40D: boost::serialization::extended_type_info_typeid<std::unordered_set<long, std::hash<long>, std::equal_to<long>, std::allocator<long> > >::~extended_type_info_typeid() (extended_type_info_typeid.hpp:96)
==2665==    by 0x7BCFC4: boost::serialization::singleton<boost::serialization::extended_type_info_typeid<std::unordered_set<long, std::hash<long>, std::equal_to<long>, std::allocator<long> > > >::get_instance()::singleton_wrapper::~singleton_wrapper() (singleton.hpp:121)
==2665==    by 0xC3D3040: __run_exit_handlers (exit.c:108)
==2665==    by 0xC3D3139: exit (exit.c:139)
==2665==    by 0xC3B1B9D: (below main) (libc-start.c:344)
==2665==  Address 0x12bfc820 is 32 bytes inside a block of size 40 free'd
==2665==    at 0x4C3123B: operator delete(void*) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2665==    by 0xA63E1A7: boost::serialization::singleton<std::multiset<boost::serialization::typeid_system::extended_type_info_typeid_0 const*, boost::serialization::typeid_system::type_compare, std::allocator<boost::serialization::typeid_system::extended_type_info_typeid_0 const*> > >::get_instance()::singleton_wrapper::~singleton_wrapper() (in libboost_serialization.so.1.68.0)
==2665==    by 0xC3D3614: __cxa_finalize (cxa_finalize.c:83)
==2665==    by 0xA6355F2: ??? (in libboost_serialization.so.1.68.0)
==2665==    by 0x4010B72: _dl_fini (dl-fini.c:138)
==2665==    by 0xC3D3040: __run_exit_handlers (exit.c:108)
==2665==    by 0xC3D3139: exit (exit.c:139)
==2665==    by 0xC3B1B9D: (below main) (libc-start.c:344)
==2665==  Block was alloc'd at
==2665==    at 0x4C3017F: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==2665==    by 0xA63D9F8: boost::serialization::typeid_system::extended_type_info_typeid_0::type_register(std::type_info const&) (in libboost_serialization.so.1.68.0)
==2665==    by 0x7BD053: boost::serialization::extended_type_info_typeid<std::unordered_set<long, std::hash<long>, std::equal_to<long>, std::allocator<long> > >::extended_type_info_typeid() (extended_type_info_typeid.hpp:91)
==2665==    by 0x7BCF8E: boost::serialization::singleton<boost::serialization::extended_type_info_typeid<std::unordered_set<long, std::hash<long>, std::equal_to<long>, std::allocator<long> > > >::get_instance()::singleton_wrapper::singleton_wrapper() (singleton.hpp:121)
==2665==    by 0x74DC99: boost::serialization::singleton<boost::serialization::extended_type_info_typeid<std::unordered_set<long, std::hash<long>, std::equal_to<long>, std::allocator<long> > > >::get_instance() (singleton.hpp:122)
==2665==    by 0x7BB248: boost::serialization::singleton<boost::serialization::extended_type_info_typeid<std::unordered_set<long, std::hash<long>, std::equal_to<long>, std::allocator<long> > > >::get_const_instance() (singleton.hpp:148)
==2665==    by 0x50D095F: boost::archive::detail::iserializer<boost::archive::binary_iarchive, std::unordered_set<long, std::hash<long>, std::equal_to<long>, std::allocator<long> > >::iserializer() (iserializer.hpp:130)
==2665==    by 0x50D08CE: boost::serialization::singleton<boost::archive::detail::iserializer<boost::archive::binary_iarchive, std::unordered_set<long, std::hash<long>, std::equal_to<long>, std::allocator<long> > > >::get_instance()::singleton_wrapper::singleton_wrapper() (singleton.hpp:121)
==2665==    by 0x50898E8: boost::serialization::singleton<boost::archive::detail::iserializer<boost::archive::binary_iarchive, std::unordered_set<long, std::hash<long>, std::equal_to<long>, std::allocator<long> > > >::get_instance() (singleton.hpp:122)
==2665==    by 0x5072DD8: __cxx_global_var_init.35 (singleton.hpp:162)
==2665==    by 0x4010732: call_init (dl-init.c:72)
==2665==    by 0x4010732: _dl_init (dl-init.c:119)
==2665==    by 0x40010C9: ??? (in /lib/x86_64-linux-gnu/ld-2.27.so)
Flamefire commented 6 years ago

Yes this is exactly this issue. You can manually apply #105 to fix that.

res2k commented 6 years ago

I made further changes related to ELF visibility (still at https://github.com/res2k/serialization/pull/10); the test cases from #110 and #111 now run successfully.

@robertramey: Your feedback would be greatly appreciated.

robertramey commented 6 years ago

I'm currently bogged down in other stuff I can't set aside. My experience is that once I get into something like this it ends up consuming much more time than one would think. But, I've been following this discussion and appreciate the contributions and efforts of all parties. As I've said before this seems to me something related to unstandardized and differently implemented features of the C/C++ programming surface:dynamic loading and symbol visibility.

It's seems you (everyone) is making progress. I think it would be good if these results could be boiled down to a couple of tests which I could add to the permanent test suite. Then we would see the results across the whole test matrix. The test matrix is not as diverse as I would hope and it doesn't make it easy to review results by attributes like 32 vs 64 bit, std version, compiler version, os version, etc. But it's the best we have and it's the best we have in a situation like this.

Soooo ... - I would like to see this boiled down to one or two test. Before doing this, please look carefully at the current tests. The are set up so that, if appropriate, the same test can be run with different archive types. The use "lightweight test" macros and follow the same general organization. This makes it easier for me and others to understand. This might not seem important but it is. Keep in mind that the serialization library testing is designed to be a running history of all the failures we've ever found. Once a failure is found, we add it to the permanent test suite so the the effort cannot creep back into the code. It also avoids playing "whack-a-mole" where by fixing problem1 creates problem2 and vice-versa resulting in an infinite "ping-pong". After 15 years, the boost serialization is better than it's ever been and still getting better. This is in large part due to the efforts of all of you and and others. It's much appreciated.

Flamefire commented 6 years ago

I think it would be good if these results could be boiled down to a couple of tests which I could add to the permanent test suite.

I already did that: #110 and #111
Those PRs are just 2 tests (1 each) showing the bug in Boost.Serialization (#110) and the issue with shared libs which is caused by whatever (#111). They fail exactly as expected and described.

105 consists of those 2 PRs and 2 commits fixing the issue. I did it this way to decompose this (admittedly difficult) issue into its pieces so understanding the tests and changes is as easy as possible.

res2k commented 6 years ago

Though, 5d3d71424a043b720648e7a99af2a1f3f3f67d01 (the commit that makes #105 "work") merely removes the assertions that cause the test failures. Keep the assertions, they fire - the destruction order is still not quite "correct" there.

Flamefire commented 6 years ago

Correct. This is IMO the only commit that requires a bit discussion as the rest is straight forward. My approach was: "We cannot fix the underlying issue as it is (C++) runtime related" (you seem to have fixed that though, but the amount of additional macros and their different use cases scare me... Maybe something was forgotten?)
So after fixing the singleton::is_destroyed check I went for the simpler solution: Treat it as a simple pointer-is-valid check. If you look at those places you'll see that they are all like:

BOOST_ASSERT(global_ptr);
if(global_ptr)
  global_ptr->access(...);

So the assert isn't really required. And it makes sense to: What is done in this code is unregistering stuff that the calling class has registered. If the registry is already destroyed then there is no point in trying to unregister something.

If those visibility macros work and are accepted, then this commit can be reverted though but the checks below should be kept. This way the assert is only a development aid to catch missing visibility stuff and the runtime check catches those unexpected cases so they don't crash the application. An explanation should also be added to the assert so no one will go and remove the check.

res2k commented 6 years ago

One thing I'm not so sure about is whether the "pointer check" approach plays well with dynamically loaded SOs using Boost.Serialization. The ELF symbol overriding may foul up such dynamic loading and unloading. (There's a test case, but unfortunately is broken.)

Note that "lots of macros" isn't the only approach to reign in ELF symbol overriding. As I tried earlier, the -Bsymbolic -Bsymbolic-functions linker options also yield the desired behavior. Or, if the -fvisibility compiler option(s) is used, the amount of explicit visibility attributes could probably be reduced. Thing is just that these options require the library client to link anything using Boost.Serialization with the right flags. But I chose the approach to have everything in the headers as that is transparent to the client.

Flamefire commented 6 years ago

One thing I'm not so sure about is whether the "pointer check" approach plays well with dynamically loaded SOs using Boost.Serialization.

I doubt that this can be a problem: We check a value in the same class we are gonna call. Even if something would be mixed up, we'll still get the desired behaviour: Call the class when its valid.

Flamefire commented 5 years ago

Fixed by https://github.com/boostorg/serialization/commit/f297d80d6ef55ac66525320d0477c17806aa57cd which should be shown by tests from https://github.com/boostorg/serialization/commit/cc2a996384baddd7b45ee93a5644e3829e9e4e86

Rseding91 commented 4 months ago

I was wondering if anyone with enough knowledge about this specific issue could tell me if a crash we're seeing on-exit could be this issue. Specifically our game Factorio is getting 'random' crashes in the AMD graphics driver for AMD GPU users when exiting the game. The stack trace is always identical and goes through boost::serialization::singleton before crashing. A symbolized stack trace (to the best of our abilities) can be seen in this report: https://forums.factorio.com/113613

Some information:

Squall-Leonhart commented 2 days ago

I was wondering if anyone with enough knowledge about this specific issue could tell me if a crash we're seeing on-exit could be this issue. Specifically our game Factorio is getting 'random' crashes in the AMD graphics driver for AMD GPU users when exiting the game. The stack trace is always identical and goes through boost::serialization::singleton before crashing. A symbolized stack trace (to the best of our abilities) can be seen in this report: https://forums.factorio.com/113613

Cemu users are seeing the same in 24.8.1 (Vulkan) and 24.9.1 (Vulkan and OpenGL)

Wonder if AMD has unknowingly upgraded core driver components into a broken state.