ComputationalRadiationPhysics / graybat

Graph Approach for Highly Generic Communication Schemes Based on Adaptive Topologies :satellite:
Other
8 stars 4 forks source link

Reimplemented MultiKeyMap.hpp without boost::hana #115

Closed fabian-jung closed 7 years ago

erikzenker commented 7 years ago

Regarding my comment about the context test above: the test leads to a std::out_of_range: map::at exception when the context is split multiple times in a loop (see here). In particular, the test case runs the splitContext method 1000 times, but about 150 iterations are enough to trigger the exception.

Looks like a race condition to me e.g.: reading and writing concurrently from the MultiKeyMap with broken locking. I Need to ask the debugger for more details (debugging mpi with gdb).

erikzenker commented 7 years ago

Generate gdb trace:

make gbZMQSignaling && make checkFast && mpiexec -n 2 xterm -e gdb --args ./checkFast --log_level=test_suite --run_test=graybat_cp_point_to_point/context

gdb$ catch throw
gdb$ run
gdb$ c #loop until the out_of_range exception occurs
gdb$ bt

Resulting gdb trace:

#0  __cxxabiv1::__cxa_throw (obj=obj@entry=0x879e00, tinfo=0x7ffff6aa4a98 <typeinfo for std::out_of_range>, dest=0x7ffff67cedf0 <std::out_of_range::~out_of_range()>) at /build/gcc-multilib/src/gcc/libstdc++-v3/libsupc++/eh_throw.cc:62
#1  0x00007ffff67e2def in std::__throw_out_of_range (__s=0x486da1 "map::at") at /build/gcc-multilib/src/gcc/libstdc++-v3/src/c++11/functexcept.cc:90
#2  0x000000000047e6df in std::map<std::tuple<graybat::communicationPolicy::MsgTypeType, unsigned int, unsigned int, unsigned int>, std::queue<graybat::communicationPolicy::zmq::Message<graybat::communicationPolicy::ZMQ>, std::deque<graybat::communicationPolicy::zmq::Message<graybat::communicationPolicy::ZMQ>, std::allocator<graybat::communicationPolicy::zmq::Message<graybat::communicationPolicy::ZMQ> > > >, std::less<std::tuple<graybat::communicationPolicy::MsgTypeType, unsigned int, unsigned int, unsigned int> >, std::allocator<std::pair<std::tuple<graybat::communicationPolicy::MsgTypeType, unsigned int, unsigned int, unsigned int> const, std::queue<graybat::communicationPolicy::zmq::Message<graybat::communicationPolicy::ZMQ>, std::deque<graybat::communicationPolicy::zmq::Message<graybat::communicationPolicy::ZMQ>, std::allocator<graybat::communicationPolicy::zmq::Message<graybat::communicationPolicy::ZMQ> > > > > > >::at (this=0x6a2330 <zmqCP+400>, __k=std::tuple containing = {...}) at /usr/include/c++/6.2.1/bits/stl_map.h:523
#3  0x000000000047a3c7 in utils::MultiKeyMap<std::queue<graybat::communicationPolicy::zmq::Message<graybat::communicationPolicy::ZMQ>, std::deque<graybat::communicationPolicy::zmq::Message<graybat::communicationPolicy::ZMQ>, std::allocator<graybat::communicationPolicy::zmq::Message<graybat::communicationPolicy::ZMQ> > > >, graybat::communicationPolicy::MsgTypeType, unsigned int, unsigned int, unsigned int>::at (this=0x6a2330 <zmqCP+400>, keys#0=graybat::communicationPolicy::MsgTypeType::SPLIT, keys#1=0x340, keys#2=0x1, keys#3=0x0) at /home/erik/projects/graybat/include/graybat/utils/MultiKeyMap.hpp:139
#4  0x00000000004741e5 in utils::MessageBox<graybat::communicationPolicy::zmq::Message<graybat::communicationPolicy::ZMQ>, graybat::communicationPolicy::MsgTypeType, unsigned int, unsigned int, unsigned int>::waitDequeue (this=0x6a2248 <zmqCP+168>, keys#0=graybat::communicationPolicy::MsgTypeType::SPLIT, keys#1=0x340, keys#2=0x1, keys#3=0x0) at /home/erik/projects/graybat/include/graybat/utils/MultiKeyMap.hpp:253
#5  0x000000000047022a in graybat::communicationPolicy::socket::Base<graybat::communicationPolicy::ZMQ>::recvImpl<std::array<unsigned int, 0ul> > (this=0x6a21a0 <zmqCP>, msgType=graybat::communicationPolicy::MsgTypeType::SPLIT, context=..., srcVAddr=0x1, tag=0x0, recvData=...) at /home/erik/projects/graybat/include/graybat/communicationPolicy/socket/Base.hpp:683
#6  0x0000000000469c4b in graybat::communicationPolicy::socket::Base<graybat::communicationPolicy::ZMQ>::splitContext (this=0x6a21a0 <zmqCP>, isMember=0x1, oldContext=...) at /home/erik/projects/graybat/include/graybat/communicationPolicy/socket/Base.hpp:518
#7  0x0000000000458444 in graybat_cp_point_to_point::context::<lambda(auto:2)>::operator()<std::reference_wrapper<graybat::communicationPolicy::ZMQ> >(std::reference_wrapper<graybat::communicationPolicy::ZMQ>) const (__closure=0x7fffffffc13f, cpRef=...) at /home/erik/projects/graybat/test/CommunicationPolicyUT.cpp:132
#8  0x0000000000458527 in boost::hana::detail::on_each<graybat_cp_point_to_point::context::test_method()::<lambda(auto:2)>*>::operator()<std::reference_wrapper<graybat::communicationPolicy::ZMQ>&>(std::reference_wrapper<graybat::communicationPolicy::ZMQ> &) const (this=0x7fffffffc0d0, xs#0=...) at /usr/include/boost/hana/for_each.hpp:46
#9  0x0000000000458562 in boost::hana::unpack_impl<boost::hana::basic_tuple_tag, void>::apply<0ul, std::reference_wrapper<graybat::communicationPolicy::ZMQ>, boost::hana::detail::on_each<graybat_cp_point_to_point::context::test_method()::<lambda(auto:2)>*> >(boost::hana::detail::basic_tuple_impl<std::integer_sequence<unsigned long, 0ul>, std::reference_wrapper<graybat::communicationPolicy::ZMQ> > &, <unknown type in /home/erik/projects/graybat/build/checkFast, CU 0x11a36, DIE 0x871ba>) (xs=..., f=<unknown type in /home/erik/projects/graybat/build/checkFast, CU 0x11a36, DIE 0x871ba>) at /usr/include/boost/hana/basic_tuple.hpp:189
#10 0x000000000045858b in boost::hana::unpack_t::operator()<boost::hana::basic_tuple<std::reference_wrapper<graybat::communicationPolicy::ZMQ> >&, boost::hana::detail::on_each<graybat_cp_point_to_point::context::test_method()::<lambda(auto:2)>*> >(boost::hana::basic_tuple<std::reference_wrapper<graybat::communicationPolicy::ZMQ> > &, <unknown type in /home/erik/projects/graybat/build/checkFast, CU 0x11a36, DIE 0x871ba>) const (this=0x4868e8 <boost::hana::unpack>, xs=..., f=<unknown type in /home/erik/projects/graybat/build/checkFast, CU 0x11a36, DIE 0x871ba>) at /usr/include/boost/hana/unpack.hpp:47
#11 0x00000000004585b2 in boost::hana::unpack_impl<boost::hana::tuple_tag, void>::apply<boost::hana::tuple<std::reference_wrapper<graybat::communicationPolicy::ZMQ> >&, boost::hana::detail::on_each<graybat_cp_point_to_point::context::test_method()::<lambda(auto:2)>*> >(boost::hana::tuple<std::reference_wrapper<graybat::communicationPolicy::ZMQ> > &, <unknown type in /home/erik/projects/graybat/build/checkFast, CU 0x11a36, DIE 0x871ba>) (xs=..., f=<unknown type in /home/erik/projects/graybat/build/checkFast, CU 0x11a36, DIE 0x871ba>) at /usr/include/boost/hana/tuple.hpp:205
#12 0x00000000004585db in boost::hana::unpack_t::operator()<boost::hana::tuple<std::reference_wrapper<graybat::communicationPolicy::ZMQ> >&, boost::hana::detail::on_each<graybat_cp_point_to_point::context::test_method()::<lambda(auto:2)>*> >(boost::hana::tuple<std::reference_wrapper<graybat::communicationPolicy::ZMQ> > &, <unknown type in /home/erik/projects/graybat/build/checkFast, CU 0x11a36, DIE 0x871ba>) const (this=0x4868e8 <boost::hana::unpack>, xs=..., f=<unknown type in /home/erik/projects/graybat/build/checkFast, CU 0x11a36, DIE 0x871ba>) at /usr/include/boost/hana/unpack.hpp:47
#13 0x000000000045860a in boost::hana::for_each_impl<boost::hana::tuple_tag, boost::hana::when<true> >::apply<boost::hana::tuple<std::reference_wrapper<graybat::communicationPolicy::ZMQ> >&, graybat_cp_point_to_point::context::test_method()::<lambda(auto:2)> >(boost::hana::tuple<std::reference_wrapper<graybat::communicationPolicy::ZMQ> > &, <unknown type in /home/erik/projects/graybat/build/checkFast, CU 0x11a36, DIE 0x871ad>) (xs=..., f=<unknown type in /home/erik/projects/graybat/build/checkFast, CU 0x11a36, DIE 0x871ad>) at /usr/include/boost/hana/for_each.hpp:56
#14 0x0000000000458635 in boost::hana::for_each_t::operator()<boost::hana::tuple<std::reference_wrapper<graybat::communicationPolicy::ZMQ> >&, graybat_cp_point_to_point::context::test_method()::<lambda(auto:2)> >(boost::hana::tuple<std::reference_wrapper<graybat::communicationPolicy::ZMQ> > &, <unknown type in /home/erik/projects/graybat/build/checkFast, CU 0x11a36, DIE 0x87161>) const (this=0x486c18 <boost::hana::for_each>, xs=..., f=<unknown type in /home/erik/projects/graybat/build/checkFast, CU 0x11a36, DIE 0x87161>) at /usr/include/boost/hana/for_each.hpp:35
#15 0x000000000045865a in graybat_cp_point_to_point::context::test_method (this=0x7fffffffc15f) at /home/erik/projects/graybat/test/CommunicationPolicyUT.cpp:145
#16 0x000000000045826c in graybat_cp_point_to_point::context_invoker () at /home/erik/projects/graybat/test/CommunicationPolicyUT.cpp:117
#17 0x000000000047a23f in boost::detail::function::void_function_invoker0<void (*)(), void>::invoke (function_ptr=...) at /usr/include/boost/function/function_template.hpp:118
#18 0x00007ffff7b7027e in boost::detail::function::function_obj_invoker0<boost::detail::forward, int>::invoke(boost::detail::function::function_buffer&) () from /usr/lib/libboost_unit_test_framework.so.1.61.0
#19 0x00007ffff7b6f9ad in boost::execution_monitor::catch_signals(boost::function<int ()> const&) () from /usr/lib/libboost_unit_test_framework.so.1.61.0
#20 0x00007ffff7b6faa8 in boost::execution_monitor::execute(boost::function<int ()> const&) () from /usr/lib/libboost_unit_test_framework.so.1.61.0
#21 0x00007ffff7b700f6 in boost::execution_monitor::vexecute(boost::function<void ()> const&) () from /usr/lib/libboost_unit_test_framework.so.1.61.0
#22 0x00007ffff7b9998b in boost::unit_test::unit_test_monitor_t::execute_and_translate(boost::function<void ()> const&, unsigned int) () from /usr/lib/libboost_unit_test_framework.so.1.61.0
#23 0x00007ffff7b7b25b in boost::unit_test::framework::state::execute_test_tree(unsigned long, unsigned int, boost::unit_test::framework::state::random_generator_helper const*) () from /usr/lib/libboost_unit_test_framework.so.1.61.0
#24 0x00007ffff7b7b82a in boost::unit_test::framework::state::execute_test_tree(unsigned long, unsigned int, boost::unit_test::framework::state::random_generator_helper const*) () from /usr/lib/libboost_unit_test_framework.so.1.61.0
#25 0x00007ffff7b7403a in boost::unit_test::framework::run(unsigned long, bool) () from /usr/lib/libboost_unit_test_framework.so.1.61.0
#26 0x00007ffff7b9778f in boost::unit_test::unit_test_main(bool (*)(), int, char**) () from /usr/lib/libboost_unit_test_framework.so.1.61.0
#27 0x000000000045732b in main (argc=0x3, argv=0x7fffffffddd8) at /usr/include/boost/test/unit_test.hpp:63
fabian-jung commented 7 years ago

I did a rerun of the test suit and the the cp_point_to_point/context case. I get no failures if i restart the .gbZmqSignaling server between the sessions. But i do get them if the zmqSignalingServer is running.

I take a look at this.

fabian-jung commented 7 years ago

Good news everyone. I enabled the debug flag and could now reproduce your exception.

fabian-jung commented 7 years ago

It seems that under some circumstances the testcase tries to get an element from an empty multiKeyMap. I could not manage to find how or why the mkm is not initialised at this point in time. Very odd behaviour.

fabian-jung commented 7 years ago

I have traced the error a bit up the stack trace. It has nothing to do with the MultikeyMap. I traced the access of the MKM and there is one access of an Element which has not been inserted to the map. I also checked the locking in the messagebox itself and it seems correct. I think there is something off in the splitContext function of Socket::base()

One ugly fix might be to use the default constructor of the value_t in the MKM::at.

theZiz commented 7 years ago

Okay, I think, I was able to fix this bug. As it seemd, the problem is, that multiKeyMap.at(keys...) should not be called parallel to multiKeyMap.push(...). Although multiKeyMap.test(keys...) returns true. meaning the queue for keys... exists, multiKeyMap.at(keys...) fails. I could fix this on my system with replacing line 233

while(multiKeyMap.at(keys...).empty()){

with

while([&](){
    //The access of "multiKeyMap.at(keys...)" is a critical section
    bool empty = true;
    {
        std::lock_guard<std::mutex> accessLock(access);
        empty = multiKeyMap.at(keys...).empty();
    }
    return empty;
}()){

Imho as conditional variables cannot deadlock code because of the time outs and as every critical section consits of only a few lines of code without possible infinite loops, this should not cause deadlocks. But please have a look at this yourself, too.

fabian-jung commented 7 years ago

I checked the patch and it seemed to work.

erikzenker commented 7 years ago

@theZiz Thank you very much for providing this little fix. It looks weird, but I think that it does what it should and @fabian-jung is happy. I think about adding an additional dev branch dev-c++11, where we could merge this PR.

theZiz commented 7 years ago

@erikzenker Yes, it looks very weird, but tbh this is the first time I used a C++ lambda and found it useful. The problem is, I need to make sure, that the access mutex is locked while calling the atmethod. This is the most atomar solution without defing a new explicit helper function. Although I thought about defining a MACRO, which then itself defines a lambda function... :>

erikzenker commented 7 years ago

Can you please close this PR and open a new one for the dev-c++11 branch ?

erikzenker commented 7 years ago

@theZiz Never saw a lambda in a while loop statement. Thank you for sharing this first moment with me. I kind of like it !