ChimeraTK / DeviceAccess

ChimeraTK core library: Provide (client) access to hardware devices and other control system applications.
GNU Lesser General Public License v3.0

PcieBackend spurious crash during recovery #197

Open mhier opened 3 years ago

mhier commented 3 years ago

Due to another bug, I had a server in a "recovery loop": the devices were switched to the error state via setException() and then recovered again via open() in an endless loop. Occasionally (quite rarely, actually), a crash with "double free or corruption (fasttop)" happened, with the following backtrace:

#0  0x00007f6af659c438 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007f6af659e03a in __GI_abort () at abort.c:89
#2  0x00007f6af65de7fa in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7f6af66f7f98 "*** Error in `%s': %s: 0x%s ***\n")
    at ../sysdeps/posix/libc_fatal.c:175
#3  0x00007f6af65e738a in malloc_printerr (ar_ptr=<optimized out>, ptr=<optimized out>, str=0x7f6af66f8060 "double free or corruption (fasttop)", action=3)
    at malloc.c:5020
#4  _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:3874
#5  0x00007f6af65eb58c in __GI___libc_free (mem=<optimized out>) at malloc.c:2975
#6  0x00007f6afb00092d in boost::detail::function::functor_manager<boost::_bi::bind_t<void, boost::_mfi::mf4<void, ChimeraTK::PcieBackend, unsigned char, unsigned int, int*, unsigned long>, boost::_bi::list5<boost::_bi::value<ChimeraTK::PcieBackend*>, boost::arg<1>, boost::arg<2>, boost::arg<3>, boost::_bi::value<unsigned long> > > >::manager (op=<optimized out>, out_buffer=..., in_buffer=...) at /usr/include/boost/function/function_base.hpp:389
#7  boost::detail::function::functor_manager<boost::_bi::bind_t<void, boost::_mfi::mf4<void, ChimeraTK::PcieBackend, unsigned char, unsigned int, int*, unsigned long>, boost::_bi::list5<boost::_bi::value<ChimeraTK::PcieBackend*>, boost::arg<1>, boost::arg<2>, boost::arg<3>, boost::_bi::value<unsigned long> > > >::manager (
    op=<optimized out>, out_buffer=..., in_buffer=...) at /usr/include/boost/function/function_base.hpp:412
#8  boost::detail::function::functor_manager<boost::_bi::bind_t<void, boost::_mfi::mf4<void, ChimeraTK::PcieBackend, unsigned char, unsigned int, int*, unsigned long>, boost::_bi::list5<boost::_bi::value<ChimeraTK::PcieBackend*>, boost::arg<1>, boost::arg<2>, boost::arg<3>, boost::_bi::value<unsigned long> > > >::manage (in_buffer=..., 
    out_buffer=..., op=<optimized out>) at /usr/include/boost/function/function_base.hpp:440
#9  0x00007f6afaffea44 in boost::detail::function::basic_vtable4<void, unsigned char, unsigned int, int*, unsigned long>::clear (this=<optimized out>, functor=...)
    at /usr/include/boost/function/function_template.hpp:510
#10 boost::function4<void, unsigned char, unsigned int, int*, unsigned long>::clear (this=0x7f688a59b330) at /usr/include/boost/function/function_template.hpp:883
#11 boost::function4<void, unsigned char, unsigned int, int*, unsigned long>::~function4 (this=0x7f688a59b330, __in_chrg=<optimized out>)
    at /usr/include/boost/function/function_template.hpp:765
#12 boost::function<void (unsigned char, unsigned int, int*, unsigned long)>::~function() (this=0x7f688a59b330, __in_chrg=<optimized out>)
    at /usr/include/boost/function/function_template.hpp:1056
#13 boost::function<void (unsigned char, unsigned int, int*, unsigned long)>::operator=<boost::_bi::bind_t<void, boost::_mfi::mf4<void, ChimeraTK::PcieBackend, unsigned char, unsigned int, int*, unsigned long>, boost::_bi::list5<boost::_bi::value<ChimeraTK::PcieBackend*>, boost::arg<1>, boost::arg<2>, boost::arg<3>, boost::arg<4> > > >(boost::_bi::bind_t<void, boost::_mfi::mf4<void, ChimeraTK::PcieBackend, unsigned char, unsigned int, int*, unsigned long>, boost::_bi::list5<boost::_bi::value<ChimeraTK::PcieBackend*>, boost::arg<1>, boost::arg<2>, boost::arg<3>, boost::arg<4> > >) (f=..., this=0x1e554b0) at /usr/include/boost/function/function_template.hpp:1132
#14 ChimeraTK::PcieBackend::determineDriverAndConfigureIoctl (this=this@entry=0x1e55250)
    at /build/libchimeratk-deviceaccess-02.01xenial1.01/device_backends/pcie/src/PcieBackend.cc:80
#15 0x00007f6afafffd0a in ChimeraTK::PcieBackend::open (this=0x1e55250) at /build/libchimeratk-deviceaccess-02.01xenial1.01/device_backends/pcie/src/PcieBackend.cc:40
#16 0x00007f6afae6ba8a in ChimeraTK::LogicalNameMappingBackend::open (this=0x2952210)
    at /build/libchimeratk-deviceaccess-02.01xenial1.01/device_backends/LogicalNameMapping/src/LogicalNameMappingBackend.cc:46
#17 0x00007f6af860a5e2 in ChimeraTK::DeviceModule::handleException() () from /usr/lib/libChimeraTK-ApplicationCore.so.02.00xenial3
#18 0x00007f6afb4b65d5 in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.58.0
#19 0x00007f6af99a86ba in start_thread (arg=0x7f688a59c700) at pthread_create.c:333
#20 0x00007f6af666e4dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

The server was the llrfctrl server at A0M, and it was stuck in the recovery loop because one ADC board was powered down through the MCH.
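For reference, the recovery loop corresponds roughly to the following pattern (a minimal sketch only; the dmap file name, the device alias and the retry delay are made up, and the real loop lives inside ApplicationCore's DeviceModule):

#include <ChimeraTK/Device.h>
#include <ChimeraTK/Exception.h>
#include <ChimeraTK/Utilities.h>

#include <chrono>
#include <thread>

int main() {
  ChimeraTK::setDMapFilePath("devices.dmap"); // made-up dmap file name
  ChimeraTK::Device dev("ADC0");              // made-up device alias

  // Keep retrying open() until the device comes back, similar to what the
  // DeviceModule recovery does after setException() has been called.
  while(true) {
    try {
      dev.open();
      break; // recovery succeeded
    }
    catch(ChimeraTK::runtime_error&) {
      // e.g. the ADC board is still powered down: wait and retry
      std::this_thread::sleep_for(std::chrono::seconds(1));
    }
  }
  return 0;
}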

mhier commented 3 years ago

A theory of how this bug happens:

When using the Logical Name Mapper, the same PcieBackend instance may be used by two LNM devices. ApplicationCore does not know about this connection, so the two DeviceModules might attempt to recover the same PcieBackend concurrently (indirectly, through their logical devices).
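To make the entanglement concrete, the situation is roughly the following (a contrived sketch: the file names and aliases are invented, and in reality the concurrent open() calls come from two DeviceModule recovery threads rather than plain std::thread):

#include <ChimeraTK/Device.h>
#include <ChimeraTK/Utilities.h>

#include <thread>

int main() {
  ChimeraTK::setDMapFilePath("devices.dmap"); // invented file and alias names

  // Both logical devices map (via their lmap files) onto the same underlying
  // PcieBackend instance - this sharing is invisible to the application.
  ChimeraTK::Device lnmA("LOGICAL_A");
  ChimeraTK::Device lnmB("LOGICAL_B");

  // Two independent recovery attempts, as ApplicationCore has one DeviceModule
  // per (logical) device. Both can end up inside PcieBackend::open() at the
  // same time, which is not thread safe -> occasional double free.
  std::thread tA([&] { lnmA.open(); });
  std::thread tB([&] { lnmB.open(); });
  tA.join();
  tB.join();
  return 0;
}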

Since open() is not considered thread safe, this is not allowed. On the other hand, the application has no way of knowing about this entanglement. I am not sure how to best solve this problem: either ApplicationCore (and basically any application) has to make sure that no device is opened/recovered concurrently with any other device, or we have to change the requirement and expect open() to be thread safe.

Note: The Logical Name Mapping backend cannot fix this. One of the usages could be direct, without an LNM backend in between, so the LNM backend has no way of knowing whether a concurrent open() is in progress.
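If we went for the second option mentioned above (declaring open() thread safe), the backend itself would have to serialise the call, roughly like this (a sketch with a simplified stand-in class, not the actual PcieBackend code):

#include <mutex>

// Simplified stand-in class, not the actual PcieBackend: serialise the whole
// open sequence with a member mutex, so concurrent recoveries of the same
// backend instance cannot interleave.
class SketchPcieBackend {
 public:
  void open() {
    std::lock_guard<std::mutex> lock(_openMutex);
    // ... open the device node, determineDriverAndConfigureIoctl(), etc. ...
  }

 private:
  std::mutex _openMutex;
};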

killenb commented 3 years ago

As the problem you describe can only happen when using the LNM, we could require that the LNM be used for all devices in an application. Then we could build the protection into the LNM.

Or we leave this task to each application and add the corresponding mechanism to ApplicationCore: the call to open() in the DeviceModule could be surrounded by a global mutex, which makes all recoveries/initialisations sequential.
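The second variant could look roughly like this (a sketch only; recoverDevice() is an invented helper standing in for the recovery code inside the DeviceModule):

#include <ChimeraTK/Device.h>

#include <mutex>

// One process-wide mutex serialising all open()/recovery attempts, so two
// DeviceModules can never end up in the same (shared) backend's open()
// concurrently.
static std::mutex globalRecoveryMutex;

// Invented helper standing in for the recovery code inside the DeviceModule.
void recoverDevice(ChimeraTK::Device& dev) {
  std::lock_guard<std::mutex> lock(globalRecoveryMutex);
  dev.open(); // now sequential across all DeviceModules / devices
}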

mhier commented 3 years ago

I don't like the first option (forcing the use of the LNM for all devices), since this can easily be forgotten and somewhat contradicts our principle of abstraction.