FairRootGroup / FairMQ

C++ Message Queuing Library and Framework
GNU Lesser General Public License v3.0

Exceptions routinely raised while running with shmem #278

Open ktf opened 4 years ago

ktf commented 4 years ago

I see that the fair::mq::shmem::Manager::SendHeartbeats method raises a number of exceptions, which are apparently caught and ignored. Is this normal?

frame #9: 0x000000010ec68017 libFairMQ.1.4.18.dylib`fair::mq::shmem::Manager::SendHeartbeats(this=0x00007f9db2d99720) at Manager.h:405 [opt]
   402          std::string controlQueueName("fmq_" + fShmId + "_cq");
   403          while (fSendHeartbeats) {
   404              try {
-> 405                  boost::interprocess::message_queue mq(boost::interprocess::open_only, controlQueueName.c_str());

rbx commented 4 years ago

It is trying to open a queue that should be created by the monitor. If it keeps doing it for you it likely means that the monitor is not (yet) started. Is that intentional in your case, or does it fail to locate it?

ktf commented 4 years ago

I deliberately do not create the monitor, because it can leave a zombie process behind. Is there a way to disable the heartbeat sending in that case?

rbx commented 4 years ago

No, there isn't. Even if the monitor is not launched automatically, a device should not assume that one will not be started later, e.g. for debugging. A manually started monitor checks not only for existing shared memory regions, but also for which devices are still alive (i.e. still sending heartbeats).
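
To illustrate the liveness idea, here is a minimal sketch of how a monitor-like process could decide which devices are still alive from the time of their last heartbeat. It is not FairMQ code: `HeartbeatTracker`, `Record`, `Stale` and the timeout value are hypothetical names chosen for this example, and the real monitor receives heartbeats through the shared memory control queue rather than through direct calls.

```cpp
#include <algorithm>
#include <chrono>
#include <string>
#include <unordered_map>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical helper (not part of FairMQ): remembers when each device
// last sent a heartbeat and reports the ones that have gone silent.
class HeartbeatTracker
{
  public:
    explicit HeartbeatTracker(std::chrono::milliseconds timeout)
        : fTimeout(timeout)
    {}

    // Call whenever a heartbeat from `deviceId` arrives.
    void Record(const std::string& deviceId, Clock::time_point now)
    {
        fLastSeen[deviceId] = now;
    }

    // Returns the sorted ids of devices whose last heartbeat is older than the timeout.
    std::vector<std::string> Stale(Clock::time_point now) const
    {
        std::vector<std::string> stale;
        for (const auto& entry : fLastSeen) {
            if (now - entry.second > fTimeout) {
                stale.push_back(entry.first);
            }
        }
        std::sort(stale.begin(), stale.end());
        return stale;
    }

  private:
    std::chrono::milliseconds fTimeout;
    std::unordered_map<std::string, Clock::time_point> fLastSeen;
};
```

Passing the current time in explicitly keeps the sketch easy to test. A device that never started sending heartbeats is simply unknown to the tracker, which matches the point above that heartbeats only matter once a monitor is listening.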

ktf commented 4 years ago

I have a second problem, which I think is actually related. Now, when one of the devices segfaults, the others end up in a weird state:

  * frame #0: 0x00007fff71228756 libsystem_kernel.dylib`__semwait_signal + 10
    frame #1: 0x00007fff711abeea libsystem_c.dylib`nanosleep + 196
    frame #2: 0x000000010c3846d1 libFairMQ.1.4.18.dylib`boost::interprocess::spin_wait::yield() [inlined] boost::interprocess::ipcdetail::thread_sleep_tick() at os_thread_functions.hpp:426:4 [opt]
    frame #3: 0x000000010c384694 libFairMQ.1.4.18.dylib`boost::interprocess::spin_wait::yield(this=0x00007ffee4b027a0) at wait.hpp:125 [opt]
    frame #4: 0x000000010c3992b8 libFairMQ.1.4.18.dylib`boost::interprocess::interprocess_condition::notify_all() at common_algorithms.hpp:68:19 [opt]
    frame #5: 0x000000010c399275 libFairMQ.1.4.18.dylib`boost::interprocess::interprocess_condition::notify_all() [inlined] boost::interprocess::ipcdetail::spin_mutex::lock(this=0x000000010b215014) at mutex.hpp:79 [opt]
    frame #6: 0x000000010c399275 libFairMQ.1.4.18.dylib`boost::interprocess::interprocess_condition::notify_all() [inlined] boost::interprocess::ipcdetail::spin_condition::notify(this=0x000000010b215014, command=2) at condition.hpp:147 [opt]
    frame #7: 0x000000010c399275 libFairMQ.1.4.18.dylib`boost::interprocess::interprocess_condition::notify_all() [inlined] boost::interprocess::ipcdetail::spin_condition::notify_all(this=0x000000010b215014) at condition.hpp:137 [opt]
    frame #8: 0x000000010c399275 libFairMQ.1.4.18.dylib`boost::interprocess::interprocess_condition::notify_all(this=0x000000010b215014) at interprocess_condition.hpp:98 [opt]
    frame #9: 0x000000010c399228 libFairMQ.1.4.18.dylib`boost::interprocess::ipcdetail::condition_any_algorithm<boost::interprocess::ipcdetail::shm_named_condition::internal_condition_members>::signal(data=0x000000010b215010, broadcast=true) at condition_any_algorithm.hpp:89:26 [opt]
    frame #10: 0x000000010c3988b5 libFairMQ.1.4.18.dylib`fair::mq::shmem::Manager::UnsubscribeFromRegionEvents() [inlined] boost::interprocess::ipcdetail::condition_any_wrapper<boost::interprocess::ipcdetail::shm_named_condition::internal_condition_members>::notify_all(this=<unavailable>) at condition_any_algorithm.hpp:177:7 [opt]
    frame #11: 0x000000010c3988ab libFairMQ.1.4.18.dylib`fair::mq::shmem::Manager::UnsubscribeFromRegionEvents() [inlined] boost::interprocess::ipcdetail::shm_named_condition::notify_all(this=0x00007fa3d333c720) at named_condition.hpp:208 [opt]
    frame #12: 0x000000010c3988a0 libFairMQ.1.4.18.dylib`fair::mq::shmem::Manager::UnsubscribeFromRegionEvents() [inlined] boost::interprocess::named_condition::notify_all(this=0x00007fa3d333c720) at named_condition.hpp:157 [opt]
    frame #13: 0x000000010c3988a0 libFairMQ.1.4.18.dylib`fair::mq::shmem::Manager::UnsubscribeFromRegionEvents(this=0x00007fa3d333c660) at Manager.h:349 [opt]
    frame #14: 0x000000010c397b01 libFairMQ.1.4.18.dylib`fair::mq::shmem::Manager::~Manager(this=0x00007fa3d333c660) at Manager.h:108:9 [opt]
    frame #15: 0x000000010c3adbb4 libFairMQ.1.4.18.dylib`fair::mq::shmem::TransportFactory::~TransportFactory() [inlined] fair::mq::shmem::Manager::~Manager(this=0x00007fa3d333c660) at Manager.h:104:5 [opt]
    frame #16: 0x000000010c3adbac libFairMQ.1.4.18.dylib`fair::mq::shmem::TransportFactory::~TransportFactory() [inlined] std::__1::default_delete<fair::mq::shmem::Manager>::operator(this=<unavailable>, __ptr=0x00007fa3d333c660)(fair::mq::shmem::Manager*) const at memory:2338 [opt]
    frame #17: 0x000000010c3adbac libFairMQ.1.4.18.dylib`fair::mq::shmem::TransportFactory::~TransportFactory() [inlined] std::__1::unique_ptr<fair::mq::shmem::Manager, std::__1::default_delete<fair::mq::shmem::Manager> >::reset(this=0x00007fa3d333bdd8, __p=0x0000000000000000) at memory:2651 [opt]
    frame #18: 0x000000010c3adb95 libFairMQ.1.4.18.dylib`fair::mq::shmem::TransportFactory::~TransportFactory() [inlined] std::__1::unique_ptr<fair::mq::shmem::Manager, std::__1::default_delete<fair::mq::shmem::Manager> >::~unique_ptr(this=0x00007fa3d333bdd8) at memory:2605 [opt]
    frame #19: 0x000000010c3adb95 libFairMQ.1.4.18.dylib`fair::mq::shmem::TransportFactory::~TransportFactory() [inlined] std::__1::unique_ptr<fair::mq::shmem::Manager, std::__1::default_delete<fair::mq::shmem::Manager> >::~unique_ptr(this=0x00007fa3d333bdd8) at memory:2605 [opt]
    frame #20: 0x000000010c3adb95 libFairMQ.1.4.18.dylib`fair::mq::shmem::TransportFactory::~TransportFactory(this=0x00007fa3d333bd58) at TransportFactory.h:201 [opt]
    frame #21: 0x000000010c351d38 libFairMQ.1.4.18.dylib`FairMQDevice::ResetWrapper() [inlined] std::__1::__shared_count::__release_shared(this=0x00007fa3d333bd40) at memory:3543:9 [opt]
    frame #22: 0x000000010c351d1d libFairMQ.1.4.18.dylib`FairMQDevice::ResetWrapper() [inlined] std::__1::__shared_weak_count::__release_shared(this=0x00007fa3d333bd40) at memory:3585 [opt]
    frame #23: 0x000000010c351d1d libFairMQ.1.4.18.dylib`FairMQDevice::ResetWrapper() [inlined] std::__1::shared_ptr<FairMQTransportFactory>::~shared_ptr(this=0x00007fa3d333c818) at memory:4521 [opt]
    frame #24: 0x000000010c351d14 libFairMQ.1.4.18.dylib`FairMQDevice::ResetWrapper() [inlined] std::__1::shared_ptr<FairMQTransportFactory>::~shared_ptr(this=0x00007fa3d333c818) at memory:4519 [opt]
    frame #25: 0x000000010c351d14 libFairMQ.1.4.18.dylib`FairMQDevice::ResetWrapper() [inlined] std::__1::pair<fair::mq::Transport const, std::__1::shared_ptr<FairMQTransportFactory> >::~pair(this=0x00007fa3d333c810) at utility:315 [opt]
    frame #26: 0x000000010c351d14 libFairMQ.1.4.18.dylib`FairMQDevice::ResetWrapper() [inlined] std::__1::pair<fair::mq::Transport const, std::__1::shared_ptr<FairMQTransportFactory> >::~pair(this=0x00007fa3d333c810) at utility:315 [opt]
    frame #27: 0x000000010c351d14 libFairMQ.1.4.18.dylib`FairMQDevice::ResetWrapper() [inlined] void std::__1::allocator_traits<std::__1::allocator<std::__1::__hash_node<std::__1::__hash_value_type<fair::mq::Transport, std::__1::shared_ptr<FairMQTransportFactory> >, void*> > >::__destroy<std::__1::pair<fair::mq::Transport const, std::__1::shared_ptr<FairMQTransportFactory> > >((null)=<unavailable>, __p=0x00007fa3d333c810) at memory:1747 [opt]
    frame #28: 0x000000010c351d14 libFairMQ.1.4.18.dylib`FairMQDevice::ResetWrapper() [inlined] void std::__1::allocator_traits<std::__1::allocator<std::__1::__hash_node<std::__1::__hash_value_type<fair::mq::Transport, std::__1::shared_ptr<FairMQTransportFactory> >, void*> > >::destroy<std::__1::pair<fair::mq::Transport const, std::__1::shared_ptr<FairMQTransportFactory> > >(__a=<unavailable>, __p=0x00007fa3d333c810) at memory:1595 [opt]
    frame #29: 0x000000010c351d14 libFairMQ.1.4.18.dylib`FairMQDevice::ResetWrapper() at __hash_table:1600 [opt]
    frame #30: 0x000000010c351ceb libFairMQ.1.4.18.dylib`FairMQDevice::ResetWrapper() at __hash_table:1850 [opt]
    frame #31: 0x000000010c351cdc libFairMQ.1.4.18.dylib`FairMQDevice::ResetWrapper() [inlined] std::__1::unordered_map<fair::mq::Transport, std::__1::shared_ptr<FairMQTransportFactory>, std::__1::hash<fair::mq::Transport>, std::__1::equal_to<fair::mq::Transport>, std::__1::allocator<std::__1::pair<fair::mq::Transport const, std::__1::shared_ptr<FairMQTransportFactory> > > >::clear(this=0x00007fa3d1812818 size=2) at unordered_map:1198 [opt]
    frame #32: 0x000000010c351cdc libFairMQ.1.4.18.dylib`FairMQDevice::ResetWrapper(this=0x00007fa3d1812800) at FairMQDevice.cxx:885 [opt]
    frame #33: 0x000000010c356668 libFairMQ.1.4.18.dylib`std::__1::__function::__func<FairMQDevice::FairMQDevice(fair::mq::ProgOptions*, fair::mq::tools::Version)::$_1, std::__1::allocator<FairMQDevice::FairMQDevice(fair::mq::ProgOptions*, fair::mq::tools::Version)::$_1>, void (fair::mq::State)>::operator()(fair::mq::State&&) [inlined] FairMQDevice::FairMQDevice(this=<unavailable>, state=<unavailable>)::$_1::operator()(fair::mq::State) const at FairMQDevice.cxx:166:17 [opt]
    frame #34: 0x000000010c3562be libFairMQ.1.4.18.dylib`std::__1::__function::__func<FairMQDevice::FairMQDevice(fair::mq::ProgOptions*, fair::mq::tools::Version)::$_1, std::__1::allocator<FairMQDevice::FairMQDevice(fair::mq::ProgOptions*, fair::mq::tools::Version)::$_1>, void (fair::mq::State)>::operator()(fair::mq::State&&) [inlined] decltype(__f=<unavailable>, __args=<unavailable>)::$_1&>(fp)(std::__1::forward<fair::mq::State>(fp0))) std::__1::__invoke<FairMQDevice::FairMQDevice(fair::mq::ProgOptions*, fair::mq::tools::Version)::$_1&, fair::mq::State>(FairMQDevice::FairMQDevice(fair::mq::ProgOptions*, fair::mq::tools::Version)::$_1&, fair::mq::State&&) at type_traits:4425 [opt]
    frame #35: 0x000000010c3562b2 libFairMQ.1.4.18.dylib`std::__1::__function::__func<FairMQDevice::FairMQDevice(fair::mq::ProgOptions*, fair::mq::tools::Version)::$_1, std::__1::allocator<FairMQDevice::FairMQDevice(fair::mq::ProgOptions*, fair::mq::tools::Version)::$_1>, void (fair::mq::State)>::operator()(fair::mq::State&&) [inlined] void std::__1::__invoke_void_return_wrapper<void>::__call<FairMQDevice::FairMQDevice(__args=<unavailable>, __args=<unavailable>)::$_1&, fair::mq::State>(FairMQDevice::FairMQDevice(fair::mq::ProgOptions*, fair::mq::tools::Version)::$_1&, fair::mq::State&&) at __functional_base:348 [opt]
    frame #36: 0x000000010c3562b2 libFairMQ.1.4.18.dylib`std::__1::__function::__func<FairMQDevice::FairMQDevice(fair::mq::ProgOptions*, fair::mq::tools::Version)::$_1, std::__1::allocator<FairMQDevice::FairMQDevice(fair::mq::ProgOptions*, fair::mq::tools::Version)::$_1>, void (fair::mq::State)>::operator()(fair::mq::State&&) [inlined] std::__1::__function::__alloc_func<FairMQDevice::FairMQDevice(fair::mq::ProgOptions*, fair::mq::tools::Version)::$_1, std::__1::allocator<FairMQDevice::FairMQDevice(fair::mq::ProgOptions*, fair::mq::tools::Version)::$_1>, void (fair::mq::State)>::operator(this=<unavailable>, __arg=<unavailable>)(fair::mq::State&&) at functional:1533 [opt]
    frame #37: 0x000000010c3562b2 libFairMQ.1.4.18.dylib`std::__1::__function::__func<FairMQDevice::FairMQDevice(fair::mq::ProgOptions*, fair::mq::tools::Version)::$_1, std::__1::allocator<FairMQDevice::FairMQDevice(fair::mq::ProgOptions*, fair::mq::tools::Version)::$_1>, void (fair::mq::State)>::operator(this=<unavailable>, __arg=<unavailable>)(fair::mq::State&&) at functional:1707 [opt]
    frame #38: 0x000000010c9a2931 libFairMQStateMachine.1.4.18.dylib`boost::detail::function::void_function_obj_invoker1<std::__1::function<void (fair::mq::State)>, void, fair::mq::State>::invoke(boost::detail::function::function_buffer&, fair::mq::State) [inlined] std::__1::__function::__value_func<void (fair::mq::State)>::operator(this=<unavailable>, __args=0x00007ffee4b03c3c)(fair::mq::State&&) const at functional:1860:16 [opt]
    frame #39: 0x000000010c9a291e libFairMQStateMachine.1.4.18.dylib`boost::detail::function::void_function_obj_invoker1<std::__1::function<void (fair::mq::State)>, void, fair::mq::State>::invoke(boost::detail::function::function_buffer&, fair::mq::State) [inlined] std::__1::function<void (fair::mq::State)>::operator(this=<unavailable>, __arg=ResettingDevice)(fair::mq::State) const at functional:2419 [opt]
    frame #40: 0x000000010c9a291e libFairMQStateMachine.1.4.18.dylib`boost::detail::function::void_function_obj_invoker1<std::__1::function<void (fair::mq::State)>, void, fair::mq::State>::invoke(function_obj_ptr=<unavailable>, a0=<unavailable>) at function_template.hpp:158 [opt]
    frame #41: 0x000000010c98f174 libFairMQStateMachine.1.4.18.dylib`boost::signals2::detail::slot_call_iterator_t<boost::signals2::detail::variadic_slot_invoker<boost::signals2::detail::void_type, fair::mq::State>, std::__1::__list_iterator<boost::shared_ptr<boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> >, void*>, boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> >::dereference() const [inlined] boost::function1<void, fair::mq::State>::operator(this=<unavailable>, a0=<unavailable>)(fair::mq::State) const at function_template.hpp:763:14 [opt]
    frame #42: 0x000000010c98f169 libFairMQStateMachine.1.4.18.dylib`boost::signals2::detail::slot_call_iterator_t<boost::signals2::detail::variadic_slot_invoker<boost::signals2::detail::void_type, fair::mq::State>, std::__1::__list_iterator<boost::shared_ptr<boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> >, void*>, boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> >::dereference() const [inlined] boost::signals2::detail::void_type boost::signals2::detail::call_with_tuple_args<boost::signals2::detail::void_type>::m_invoke<boost::function<void (this=<unavailable>, func=<unavailable>, args=<unavailable>, (null)=<unavailable>)>, 0u, fair::mq::State&>(boost::function<void (fair::mq::State)>&, boost::signals2::detail::unsigned_meta_array<0u>, std::__1::tuple<fair::mq::State&> const&, boost::enable_if<boost::is_void<boost::function<void (fair::mq::State)>::result_type>, void>::type*) const at variadic_slot_invoker.hpp:105 [opt]
    frame #43: 0x000000010c98f157 libFairMQStateMachine.1.4.18.dylib`boost::signals2::detail::slot_call_iterator_t<boost::signals2::detail::variadic_slot_invoker<boost::signals2::detail::void_type, fair::mq::State>, std::__1::__list_iterator<boost::shared_ptr<boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> >, void*>, boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> >::dereference() const [inlined] boost::signals2::detail::void_type boost::signals2::detail::call_with_tuple_args<boost::signals2::detail::void_type>::operator(this=<unavailable>, func=<unavailable>, args=<unavailable>)<boost::function<void (fair::mq::State)>, fair::mq::State&, 1ul>(boost::function<void (fair::mq::State)>&, std::__1::tuple<fair::mq::State&> const&, mpl_::size_t<1ul>) const at variadic_slot_invoker.hpp:90 [opt]
    frame #44: 0x000000010c98f157 libFairMQStateMachine.1.4.18.dylib`boost::signals2::detail::slot_call_iterator_t<boost::signals2::detail::variadic_slot_invoker<boost::signals2::detail::void_type, fair::mq::State>, std::__1::__list_iterator<boost::shared_ptr<boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> >, void*>, boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> >::dereference() const [inlined] boost::signals2::detail::void_type boost::signals2::detail::variadic_slot_invoker<boost::signals2::detail::void_type, fair::mq::State>::operator(this=<unavailable>, connectionBody=<unavailable>)<boost::shared_ptr<boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> > >(boost::shared_ptr<boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> > const&) const at variadic_slot_invoker.hpp:133 [opt]
    frame #45: 0x000000010c98f14f libFairMQStateMachine.1.4.18.dylib`boost::signals2::detail::slot_call_iterator_t<boost::signals2::detail::variadic_slot_invoker<boost::signals2::detail::void_type, fair::mq::State>, std::__1::__list_iterator<boost::shared_ptr<boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> >, void*>, boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> >::dereference(this=0x00007ffee4b03ce8) const at slot_call_iterator.hpp:110 [opt]
    frame #46: 0x000000010c98e541 libFairMQStateMachine.1.4.18.dylib`boost::signals2::detail::signal_impl<void (fair::mq::State), boost::signals2::optional_last_value<void>, int, std::__1::less<int>, boost::function<void (fair::mq::State)>, boost::function<void (boost::signals2::connection const&, fair::mq::State)>, boost::signals2::mutex>::operator()(fair::mq::State) [inlined] boost::signals2::detail::slot_call_iterator_t<boost::signals2::detail::variadic_slot_invoker<boost::signals2::detail::void_type, fair::mq::State>, std::__1::__list_iterator<boost::shared_ptr<boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> >, void*>, boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> >::reference boost::iterators::iterator_core_access::dereference<boost::signals2::detail::slot_call_iterator_t<boost::signals2::detail::variadic_slot_invoker<boost::signals2::detail::void_type, fair::mq::State>, std::__1::__list_iterator<boost::shared_ptr<boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (f=<unavailable>), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> >, void*>, boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> > >(boost::signals2::detail::slot_call_iterator_t<boost::signals2::detail::variadic_slot_invoker<boost::signals2::detail::void_type, fair::mq::State>, 
std::__1::__list_iterator<boost::shared_ptr<boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> >, void*>, boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> > const&) at iterator_facade.hpp:550:20 [opt]
    frame #47: 0x000000010c98e539 libFairMQStateMachine.1.4.18.dylib`boost::signals2::detail::signal_impl<void (fair::mq::State), boost::signals2::optional_last_value<void>, int, std::__1::less<int>, boost::function<void (fair::mq::State)>, boost::function<void (boost::signals2::connection const&, fair::mq::State)>, boost::signals2::mutex>::operator()(fair::mq::State) [inlined] boost::iterators::detail::iterator_facade_base<boost::signals2::detail::slot_call_iterator_t<boost::signals2::detail::variadic_slot_invoker<boost::signals2::detail::void_type, fair::mq::State>, std::__1::__list_iterator<boost::shared_ptr<boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> >, void*>, boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> >, boost::signals2::detail::void_type, boost::iterators::single_pass_traversal_tag, boost::signals2::detail::void_type const&, long, false, false>::operator*(this=<unavailable>) const at iterator_facade.hpp:656 [opt]
    frame #48: 0x000000010c98e539 libFairMQStateMachine.1.4.18.dylib`boost::signals2::detail::signal_impl<void (fair::mq::State), boost::signals2::optional_last_value<void>, int, std::__1::less<int>, boost::function<void (fair::mq::State)>, boost::function<void (boost::signals2::connection const&, fair::mq::State)>, boost::signals2::mutex>::operator()(fair::mq::State) at optional_last_value.hpp:57 [opt]
    frame #49: 0x000000010c98e4de libFairMQStateMachine.1.4.18.dylib`boost::signals2::detail::signal_impl<void (fair::mq::State), boost::signals2::optional_last_value<void>, int, std::__1::less<int>, boost::function<void (fair::mq::State)>, boost::function<void (boost::signals2::connection const&, fair::mq::State)>, boost::signals2::mutex>::operator()(fair::mq::State) [inlined] void boost::signals2::detail::combiner_invoker<void>::operator(this=<unavailable>, combiner=<unavailable>, last=slot_call_iterator_t<boost::signals2::detail::variadic_slot_invoker<boost::signals2::detail::void_type, fair::mq::State>, std::__1::__list_iterator<boost::shared_ptr<boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> >, void *>, boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> > @ 0x00007f975ccf8410)<boost::signals2::optional_last_value<void>, boost::signals2::detail::slot_call_iterator_t<boost::signals2::detail::variadic_slot_invoker<boost::signals2::detail::void_type, fair::mq::State>, std::__1::__list_iterator<boost::shared_ptr<boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> >, void*>, boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> > >(boost::signals2::optional_last_value<void>&, 
boost::signals2::detail::slot_call_iterator_t<boost::signals2::detail::variadic_slot_invoker<boost::signals2::detail::void_type, fair::mq::State>, std::__1::__list_iterator<boost::shared_ptr<boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> >, void*>, boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> >, boost::signals2::detail::slot_call_iterator_t<boost::signals2::detail::variadic_slot_invoker<boost::signals2::detail::void_type, fair::mq::State>, std::__1::__list_iterator<boost::shared_ptr<boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> >, void*>, boost::signals2::detail::connection_body<std::__1::pair<boost::signals2::detail::slot_meta_group, boost::optional<int> >, boost::signals2::slot<void (fair::mq::State), boost::function<void (fair::mq::State)> >, boost::signals2::mutex> >) const at result_type_wrapper.hpp:64 [opt]
    frame #50: 0x000000010c98e4de libFairMQStateMachine.1.4.18.dylib`boost::signals2::detail::signal_impl<void (fair::mq::State), boost::signals2::optional_last_value<void>, int, std::__1::less<int>, boost::function<void (fair::mq::State)>, boost::function<void (boost::signals2::connection const&, fair::mq::State)>, boost::signals2::mutex>::operator(this=0x00007fa3d1418740, args=<unavailable>)(fair::mq::State) at signal_template.hpp:242 [opt]
    frame #51: 0x000000010c98d94a libFairMQStateMachine.1.4.18.dylib`fair::mq::fsm::Machine_::ProcessWork() [inlined] boost::signals2::signal<void (fair::mq::State), boost::signals2::optional_last_value<void>, int, std::__1::less<int>, boost::function<void (fair::mq::State)>, boost::function<void (boost::signals2::connection const&, fair::mq::State)>, boost::signals2::mutex>::operator(this=0x00007fa3d1418548, args=ResettingDevice)(fair::mq::State) const at signal_template.hpp:726:16 [opt]
    frame #52: 0x000000010c98d93b libFairMQStateMachine.1.4.18.dylib`fair::mq::fsm::Machine_::ProcessWork() [inlined] fair::mq::fsm::Machine_::CallStateHandler(this=0x00007fa3d14184a0, state=ResettingDevice) const at StateMachine.cxx:158 [opt]
    frame #53: 0x000000010c98d927 libFairMQStateMachine.1.4.18.dylib`fair::mq::fsm::Machine_::ProcessWork(this=<unavailable>) at StateMachine.cxx:206 [opt]
    frame #54: 0x000000010c98cfb0 libFairMQStateMachine.1.4.18.dylib`fair::mq::StateMachine::ProcessWork(this=0x00007fa3d1812898) at StateMachine.cxx:375:14 [opt]
    frame #55: 0x000000010c3124d9 libFairMQ.1.4.18.dylib`fair::mq::DeviceRunner::Run() [inlined] FairMQDevice::RunStateMachine(this=<unavailable>) at FairMQDevice.h:402:23 [opt]
    frame #56: 0x000000010c3124cd libFairMQ.1.4.18.dylib`fair::mq::DeviceRunner::Run(this=0x00007ffee4b074f0) at DeviceRunner.cxx:174 [opt]
    frame #57: 0x000000010b3c876b libO2Framework.dylib`doChild(argc=35, argv=0x00007ffee4b090e0, spec=0x00007fa3d280d4d0, errorPolicy=<unavailable>) at runDataProcessing.cxx:764:19 [opt]
    frame #58: 0x000000010b3ccd8e libO2Framework.dylib`runStateMachine(workflow=size=6, workflowInfo=0x00007ffee4b08450, previousDataProcessorInfos=size=6, driverControl=0x00007ffee4b087e0, driverInfo=<unavailable>, metricsInfos=size=0, frameworkId="\x83???\0\0??\0\0\0(Hidden child options\0\0\0P\0\0\0(\0\0\0pq`ѣ\x7f\0\0pq`ѣ\x7f\0\0\x10r`ѣ\x7f\0\0`q`ѣ\x7f\0\0\n\0\0\0\0\0\0\0\x01\0\0\0\0\0\0\0\x10``ѣ\x7f\0\0\x10``ѣ\x7f\0\0 ``ѣ") at runDataProcessing.cxx:952:20 [opt]
    frame #59: 0x000000010b3d4fbe libO2Framework.dylib`doMain(argc=35, argv=0x00007ffee4b090e0, workflow=size=0, channelPolicies=size=1, completionPolicies=size=1, dispatchPolicies=size=1, currentWorkflowOptions=size=2, configContext=0x00007ffee4b089c0) at runDataProcessing.cxx:1671:10 [opt]
    frame #60: 0x000000010b104e61 o2-analysistutorial-mc-histograms`main(argc=35, argv=0x00007ffee4b090e0) at runDataProcessing.h:155:14 [opt]
    frame #61: 0x00007fff710e4cc9 libdyld.dylib`start + 1

rbx commented 4 years ago

Does it also happen on Linux?

I can reproduce a similar situation by sending SIGKILL to one device: another device then hangs on macOS, but not on Linux. I am still investigating whether I see the same thing as you on macOS.

I don't think it is related to the original issue; it looks like something more serious. At first sight it looks like a problem similar to https://github.com/FairRootGroup/DDS/commit/73b7209b3df4b7b3ec239334b1123e3c7735bd9c#diff-af3b638bc2a3e6c650974192a53c7291R134 Investigating.

rbx commented 4 years ago

Adding the workaround (BOOST_INTERPROCESS_ENABLE_TIMEOUT_WHEN_LOCKING + BOOST_INTERPROCESS_TIMEOUT_WHEN_LOCKING_DURATION_MS=3000) does "unhang" the remaining device after 3 seconds (sometimes earlier), but with a crash:

libc++abi.dylib: terminating with uncaught exception of type boost::interprocess::interprocess_exception: Interprocess mutex timeout when locking. Possible deadlock: owner died without unlocking?
fish: './examples/region/fairmq-ex-reg…' terminated by signal SIGABRT (Abort)

So it does look like it is trying to acquire a mutex that was locked by the dead process. It is a bit surprising that it hangs in Manager.h:349 (boost::interprocess::condition_variable.notify_all()), where our visible mutex has already been locked and unlocked. Perhaps it is trying to lock another, internal mutex. I'll investigate whether there is a better cure, or at least catch that unhandled exception.
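
The failure mode can be sketched with standard C++ only. This is an in-process analogy, not the Boost.Interprocess mechanism itself: a thread stands in for the dead peer that still owns the lock, and `std::timed_mutex::try_lock_for` stands in for the timeout that the Boost macros enable. `LockOrThrow` and `WaiterTimesOut` are hypothetical names for this example.

```cpp
#include <chrono>
#include <future>
#include <mutex>
#include <stdexcept>
#include <thread>

// Hypothetical helper: lock with a deadline and throw instead of waiting forever.
void LockOrThrow(std::timed_mutex& m, std::chrono::milliseconds timeout)
{
    if (!m.try_lock_for(timeout)) {
        throw std::runtime_error("mutex timeout when locking: owner may have died");
    }
}

// Simulates an owner that holds the mutex for the whole time the waiter is trying
// to lock it (standing in for a crashed peer), and returns true if the waiter
// gave up with an exception instead of hanging.
bool WaiterTimesOut(std::chrono::milliseconds timeout)
{
    std::timed_mutex m;
    std::promise<void> locked;
    std::promise<void> release;

    std::thread owner([&] {
        m.lock();
        locked.set_value();          // tell the waiter that the lock is taken
        release.get_future().wait(); // keep holding it until the waiter has given up
        m.unlock();
    });

    locked.get_future().wait();

    bool timedOut = false;
    try {
        LockOrThrow(m, timeout);
        m.unlock(); // only reached if the lock was unexpectedly acquired
    } catch (const std::runtime_error&) {
        timedOut = true; // catching here avoids the uncaught-exception abort
    }

    release.set_value();
    owner.join();
    return timedOut;
}
```

Calling `WaiterTimesOut(std::chrono::milliseconds(50))` exercises the timeout path. The `catch` is the part that corresponds to "at least catch that unhandled exception": without it the timeout only turns a hang into an abort.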

ktf commented 4 years ago

Thanks. Any progress on this?

ktf commented 4 years ago

Still using this issue because I am not sure if related or not. I see also some deadlock in:

* thread #1, queue = 'com.apple.main-thread', stop reason = instruction step over
    frame #0: 0x00007fff71246af0 libsystem_kernel.dylib`sem_wait + 12
    frame #1: 0x0000000110b044f3 libFairMQ.1.4.18.dylib`boost::interprocess::ipcdetail::posix_named_semaphore::wait() [inlined] boost::interprocess::ipcdetail::semaphore_wait(handle=<unavailable>) at semaphore_wrapper.hpp:176:14 [opt]
    frame #2: 0x0000000110b044ee libFairMQ.1.4.18.dylib`boost::interprocess::ipcdetail::posix_named_semaphore::wait(this=<unavailable>) at named_semaphore.hpp:63 [opt]
    frame #3: 0x0000000110b1fae0 libFairMQ.1.4.18.dylib`fair::mq::shmem::Manager::SubscribeToRegionEvents(std::__1::function<void (FairMQRegionInfo)>) [inlined] boost::interprocess::ipcdetail::posix_named_mutex::lock(this=0x00007ffd6bb09418) at named_mutex.hpp:92:10 [opt]
    frame #4: 0x0000000110b1fad1 libFairMQ.1.4.18.dylib`fair::mq::shmem::Manager::SubscribeToRegionEvents(std::__1::function<void (FairMQRegionInfo)>) [inlined] boost::interprocess::named_mutex::lock(this=0x00007ffd6bb09418) at named_mutex.hpp:154 [opt]
    frame #5: 0x0000000110b1fad1 libFairMQ.1.4.18.dylib`fair::mq::shmem::Manager::SubscribeToRegionEvents(std::__1::function<void (FairMQRegionInfo)>) [inlined] boost::interprocess::scoped_lock<boost::interprocess::named_mutex>::scoped_lock(this=<unavailable>, m=0x00007ffd6bb09418) at scoped_lock.hpp:81 [opt]
    frame #6: 0x0000000110b1fad1 libFairMQ.1.4.18.dylib`fair::mq::shmem::Manager::SubscribeToRegionEvents(std::__1::function<void (FairMQRegionInfo)>) [inlined] boost::interprocess::scoped_lock<boost::interprocess::named_mutex>::scoped_lock(this=<unavailable>, m=0x00007ffd6bb09418) at scoped_lock.hpp:81 [opt]
    frame #7: 0x0000000110b1fad1 libFairMQ.1.4.18.dylib`fair::mq::shmem::Manager::SubscribeToRegionEvents(this=0x00007ffd6bb09360, callback=fair::mq::RegionEventCallback @ 0x00007ffee02c5320)>) at Manager.h:329 [opt]
    frame #8: 0x0000000110ae2621 libFairMQ.1.4.18.dylib`fair::mq::shmem::TransportFactory::SubscribeToRegionEvents(this=<unavailable>, callback=<unavailable>)>) at TransportFactory.h:174:85 [opt]
  * frame #9: 0x000000010fa657c5 libO2Framework.dylib`o2::framework::DataProcessingDevice::InitTask(this=0x00007ffd78822200) at DataProcessingDevice.cxx:255:39 [opt]

not always, but often enough. Does it ring any bell?

rbx commented 4 years ago

I see also some deadlock in:

It rings a bell if that is also when a peer is crashed/killed. There are several places where a deadlock can occur in that case. Otherwise it could be something else.

ktf commented 4 years ago

mmm... indeed I might be screwing up something downstream with my libuv attempts...

rbx commented 4 years ago

I've been playing with BOOST_INTERPROCESS_ENABLE_TIMEOUT_WHEN_LOCKING. It adds timeouts for the boost internal shm locks. Then there are also my own locks that I can as well instruct to timeout.

But what is a good value for the timeout? Something could be blocking for a valid reason. In that case it is not reasonable for me to add a timeout for it.

I think if some device does crash or is killed, it is valid to let others that are waiting for it hang, at least from the perspective of those devices. The controller should detect a crash and decide how to handle it (which in the shmem case is to restart other shm users, assuming the crashed device corrupted the contents).

dennisklein commented 4 years ago

@ktf, @rbx: TBH, I am worried about adding timeout-locks, let me explain how I see it:

What happens, if an OS thread within an OS process crashes (uncatchable)? - All other threads within this process are killed preemptively, right? By whom - some external control entity (e.g. OS kernel).

If we attach shared memory to two OS processes, they essentially become tightly coupled, very similar to OS threads within the same process, but with a custom external control entity. Within this analogy, if one (shmem-enabled) FairMQ device fails in an uncontrolled manner, the control entity shall detect it and preemptively kill the other FairMQ devices within the same session.
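As a rough illustration of that control-entity idea, here is a minimal POSIX sketch. It is not FairMQ code: `spawn_device` and `supervise_session` are hypothetical names, and plain forked children stand in for devices. The supervisor waits until any child exits abnormally and then kills the rest of the session, much like the kernel tears down sibling threads.

```cpp
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <algorithm>
#include <cerrno>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for launching a device: the child either crashes
// right away or idles until it is killed.
pid_t spawn_device(bool crashes)
{
    const pid_t pid = fork();
    if (pid < 0) throw std::runtime_error("fork failed");
    if (pid == 0) {
        if (crashes) {
            raise(SIGSEGV);  // default disposition terminates the child
            _exit(1);        // safety net, normally not reached
        }
        for (;;) pause();    // healthy device: wait for signals forever
    }
    return pid;
}

// Hypothetical session controller: blocks until one device exits abnormally,
// then kills and reaps every remaining device of the session.
// Returns the number of surviving devices it had to kill.
int supervise_session(std::vector<pid_t> devices)
{
    while (!devices.empty()) {
        int status = 0;
        const pid_t pid = waitpid(-1, &status, 0);
        if (pid < 0) {
            if (errno == EINTR) continue;
            break;  // no children left to wait for
        }
        auto it = std::find(devices.begin(), devices.end(), pid);
        if (it == devices.end()) continue;  // not part of this session
        devices.erase(it);

        const bool crashed = WIFSIGNALED(status)
                             || (WIFEXITED(status) && WEXITSTATUS(status) != 0);
        if (!crashed) continue;  // clean exit: keep supervising the others

        const int killed = static_cast<int>(devices.size());
        for (pid_t p : devices) kill(p, SIGKILL);
        for (pid_t p : devices) waitpid(p, nullptr, 0);
        return killed;
    }
    return 0;
}
```

A real controller would also have to reset the shared memory before restarting the session; that part is left out here.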

So,

Using timeouts means we want to resolve the device failure cooperatively and distributively (no central controller). Since we cannot just assume timeout == failure, it means we are entering consensus algorithm land. And from what I remember, consensus is highly non-trivial and we most probably do not want to implement anything ourselves here.

ktf commented 4 years ago

@dennisklein I fully agree with your assessment, however this is basically saying that if we use the shared memory backend once a process dies all the ones connected to the same shared memory region should probably be killed, which currently would mean that the whole topology is killed.

We could think of mitigating the issue by having non-shared memory based transports in a few strategic places, but the problem remains, I guess.

ktf commented 4 years ago

@dennisklein I also agree that most likely timed locks are a cure which is worse than the disease.

dennisklein commented 4 years ago

I guess we should look in more detail at the failing device side and see if we can handle more of the common error cases with regard to the shmem transport. SIGSEGV can have a user handler, can it not? Etc.

ktf commented 4 years ago

Well, I would not trust anything in memory after a SIGSEGV, frankly, and I would just terminate and maybe dump a stacktrace. Anything beyond that is asking for trouble. I've seen even the stacktrace dumping resulting in locking issues in a remote past...
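For reference, a minimal sketch of what "just terminate" could look like, assuming POSIX: the handler only calls `write()` and `_exit()`, both of which are async-signal-safe, and touches no shared state. The names `on_sigsegv`, `install_minimal_sigsegv_handler`, `run_crashing_child` and the exit code are made up for this example; none of this is FairMQ code.

```cpp
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical exit code used to tell the parent that the handler ran.
constexpr int kCrashExitCode = 70;

// Restricted to async-signal-safe calls: no allocation, no locks, no stdio.
extern "C" void on_sigsegv(int)
{
    static const char msg[] = "SIGSEGV: terminating without touching shared state\n";
    const ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;
    _exit(kCrashExitCode);
}

bool install_minimal_sigsegv_handler()
{
    struct sigaction sa {};
    sa.sa_handler = on_sigsegv;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGSEGV, &sa, nullptr) == 0;
}

// Demonstration helper: forks a child that installs the handler and then
// raises SIGSEGV. Returns the child's exit code, or -1 on any error.
int run_crashing_child()
{
    const pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        install_minimal_sigsegv_handler();
        raise(SIGSEGV);
        _exit(0);  // not reached if the handler ran
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Anything more ambitious in the handler (unlocking mutexes, flushing queues) would run on possibly corrupted state, which is exactly the concern above.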

IMHO one thing which would improve the situation is the ability to connect "read-only" to a given shared memory region, so that a crash can only affect the devices downstream.

rbx commented 4 years ago

Declaring a device as read-only will surely give some assurance that it hasn't corrupted anything if it does crash, although I'm not sure how trustworthy that would be. To allow this I think a flag for the controller would be sufficient that marks a device as read-only. Not sure what guarantees we can give here from FairMQ side - once we give out the buffer pointer, it can get messed up. We could make the pointer const for a better contract.

Another issue that arises when a shmem-participating device crashes is that it is likely to contain meta-data messages in its ZeroMQ queue that become lost during the crash. The meta info is lost, but the shmem is still occupied by the buffers that the meta data points to. Which means a chunk of memory is now occupied by something without reference and it will not get cleaned until a full reset. In the worst case a large enough chunk is occupied that nothing can write to shmem anymore.

Two possible recovery solutions are - (1) do additional book-keeping for the in-flight messages, (2) replace the ZeroMQ queues with interprocess queues, where the queues themselves are located in the shared memory and thus are not lost when a crash occurs.

Both would have some performance drawbacks, but probably within acceptable limits. My latest test with a simple implementation for (2) showed something like 30% lower transfer rate (compared to 1MHz, so still well within requirements I imagine). Simple implementation for (1) would involve something like an interprocess queue in addition to zmq, so in my mind going straight for (2) is better.

Implementation of (2) would also, at least at first, have a smaller feature set (e.g. only PAIR sockets) and a higher cost of multiplexing between multiple channels, because interprocess queues from boost don't have anything like a file descriptor to integrate into asio or a different polling mechanism. But as a bonus, (2) would give better control over the queue capacity and current level compared to zmq, where there are at least 4 queues with different properties for each channel (although we don't provide a FairMQ API to check the current level).

This still doesn't solve the trustworthiness of the memory after a crash though: even if the code that caused the crash didn't hold meta data pointing to another buffer, it could still have corrupted that buffer via segmentation violation.
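To make the book-keeping option (1) a bit more concrete, here is a purely illustrative sketch. `InFlightLedger`, `BufferRef` and the method names are hypothetical and not part of FairMQ, and the ledger is an ordinary in-process object; to survive a crash of its owner it would itself have to live in shared memory or in a surviving process. It assumes messages to a peer are acknowledged in FIFO order.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <map>
#include <vector>

// Hypothetical description of a buffer inside the shared memory segment.
struct BufferRef
{
    std::size_t offset;
    std::size_t size;
};

// Hypothetical ledger of buffers whose meta data is queued for a peer
// but has not been acknowledged yet.
class InFlightLedger
{
  public:
    // Record that meta data pointing at `buf` was queued for `peer`.
    void OnSend(std::uint64_t peer, BufferRef buf) { fInFlight[peer].push_back(buf); }

    // The peer took over the oldest outstanding buffer.
    // Returns false if nothing was in flight for that peer.
    bool OnAck(std::uint64_t peer)
    {
        auto it = fInFlight.find(peer);
        if (it == fInFlight.end() || it->second.empty()) return false;
        it->second.pop_front();
        return true;
    }

    // The peer crashed: hand back every buffer that would otherwise stay
    // occupied without any reference, so that the caller can free it.
    std::vector<BufferRef> Reclaim(std::uint64_t peer)
    {
        std::vector<BufferRef> orphaned;
        auto it = fInFlight.find(peer);
        if (it != fInFlight.end()) {
            orphaned.assign(it->second.begin(), it->second.end());
            fInFlight.erase(it);
        }
        return orphaned;
    }

  private:
    std::map<std::uint64_t, std::deque<BufferRef>> fInFlight;
};
```

This only addresses the leaked buffers; it says nothing about whether the contents of the segment can still be trusted.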

ktf commented 4 years ago

By "read only" I actually meant protected memory pages, not simply "const". I think with mmap you could do it (e.g. map a file PROT_READ|PROT_WRITE on one side, PROT_READ on the receiving side), not sure if the shm* stuff allows for it.
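A small sketch of that mmap idea, assuming POSIX. For brevity both views live in one process here, whereas in the real case the writable view would belong to the producer and the read-only view to the consumer. The temporary file name and all helper names are made up for the example.

```cpp
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cstddef>
#include <cstdlib>
#include <cstring>

// Two views of the same backing file.
struct SharedViews
{
    void* writer = MAP_FAILED;  // PROT_READ | PROT_WRITE (producer side)
    void* reader = MAP_FAILED;  // PROT_READ only (consumer side)
    std::size_t size = 0;
    bool valid() const { return writer != MAP_FAILED && reader != MAP_FAILED; }
};

SharedViews map_views(std::size_t size)
{
    SharedViews v;
    v.size = size;
    char path[] = "/tmp/ro_view_sketch_XXXXXX";  // hypothetical file name
    const int fd = mkstemp(path);
    if (fd < 0) return v;
    unlink(path);  // the mappings keep the file contents alive
    if (ftruncate(fd, static_cast<off_t>(size)) == 0) {
        v.writer = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        v.reader = mmap(nullptr, size, PROT_READ, MAP_SHARED, fd, 0);
    }
    close(fd);
    return v;
}

void unmap_views(SharedViews& v)
{
    if (v.writer != MAP_FAILED) munmap(v.writer, v.size);
    if (v.reader != MAP_FAILED) munmap(v.reader, v.size);
    v.writer = v.reader = MAP_FAILED;
}

// Data written through the writable view is visible in the read-only view.
bool reader_sees_writer_data()
{
    SharedViews v = map_views(4096);
    if (!v.valid()) { unmap_views(v); return false; }
    std::strcpy(static_cast<char*>(v.writer), "payload");
    const bool same = std::strcmp(static_cast<const char*>(v.reader), "payload") == 0;
    unmap_views(v);
    return same;
}

// A child that writes through the read-only view is killed by the kernel
// (SIGSEGV on Linux, SIGBUS on some other systems).
bool write_through_reader_faults()
{
    SharedViews v = map_views(4096);
    if (!v.valid()) { unmap_views(v); return false; }
    const pid_t pid = fork();
    if (pid == 0) {
        static_cast<volatile char*>(v.reader)[0] = 'x';  // expected to fault
        _exit(0);  // reached only if the write was allowed
    }
    int status = 0;
    const bool reaped = pid > 0 && waitpid(pid, &status, 0) == pid;
    unmap_views(v);
    return reaped && WIFSIGNALED(status)
           && (WTERMSIG(status) == SIGSEGV || WTERMSIG(status) == SIGBUS);
}
```

The protection is enforced by the kernel per mapping, so a stray write in the consumer cannot reach pages it only mapped with PROT_READ.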

rbx commented 4 years ago

Yes, boost::interprocess allows read only access. It translates to mmap flags.

ktf commented 4 years ago

Can we somehow control that from the Channel specification? DPL does know which channels are for reading and which are for writing.

rbx commented 4 years ago

No, that is not implemented atm. Neither is "read only" in general. I was just saying that it is possible.

Does a per-channel flag make sense? If a device has an input reading channel and an output writing channel, the flag for the shared memory segment would have to be read_write. Also the channels are typically instantiated after the segment. So if we add such a setting, it would be for the transport factory.

ktf commented 4 years ago

If it's not per channel, then it's not very useful. So basically you are saying I should have one transport for the inputs and another one for the outputs of my device, in order to achieve what I want?

rbx commented 4 years ago

I'm saying if you have a device that is allowed to write to shared memory (via any one of the channels), and it crashes, then you cannot assume it didn't mess up anything, downstream or upstream. For this scenario I am only considering the case where there is a single segment for the entire topology.

ktf commented 4 years ago

I can assume that the shared pages which were read-only were not affected, no? So given that the communication graph (at least for DPL) is a DAG, I could safely assume that the devices upstream of the one which crashed were not affected.
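Under that assumption (each device maps the pages written by its upstream peers read-only, which is not implemented today), the set of devices that might need a restart is the crashed device plus everything downstream of it. A minimal sketch of that reachability computation follows; `Topology` and `affected_by_crash` are hypothetical names, and the device names in the usage example are placeholders.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Edges follow the data flow: producer -> list of consumers.
using Topology = std::map<std::string, std::vector<std::string>>;

// Returns the crashed device plus every device reachable from it along the
// data flow. Everything else (upstream devices, unrelated branches) is
// assumed untouched, but only if inputs really are mapped read-only.
std::set<std::string> affected_by_crash(const Topology& topology, const std::string& crashed)
{
    std::set<std::string> affected;
    std::vector<std::string> stack{crashed};
    while (!stack.empty()) {
        const std::string device = stack.back();
        stack.pop_back();
        if (!affected.insert(device).second) continue;  // already visited
        const auto it = topology.find(device);
        if (it == topology.end()) continue;             // no consumers
        for (const auto& consumer : it->second) stack.push_back(consumer);
    }
    return affected;
}
```

For a topology `A -> B -> C` with an extra branch `A -> D`, a crash of `B` yields `{B, C}`, while `A` and `D` would be left running. With a single read-write segment shared by the whole topology, this reasoning does not hold.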