Closed fabian-jung closed 8 years ago
Generate gdb trace:
make gbZMQSignaling && make checkFast && mpiexec -n 2 xterm -e gdb --args ./checkFast --log_level=test_suite --run_test=graybat_cp_point_to_point/context
gdb$ catch throw
gdb$ run
gdb$ c #loop until the out_of_range exception occurs
gdb$ bt
Resulting gdb trace:
#0 __cxxabiv1::__cxa_throw (obj=obj@entry=0x879e00, tinfo=0x7ffff6aa4a98 <typeinfo for std::out_of_range>, dest=0x7ffff67cedf0 <std::out_of_range::~out_of_range()>) at /build/gcc-multilib/src/gcc/libstdc++-v3/libsupc++/eh_throw.cc:62
#1 0x00007ffff67e2def in std::__throw_out_of_range (__s=0x486da1 "map::at") at /build/gcc-multilib/src/gcc/libstdc++-v3/src/c++11/functexcept.cc:90
#2 0x000000000047e6df in std::map<std::tuple<graybat::communicationPolicy::MsgTypeType, unsigned int, unsigned int, unsigned int>, std::queue<graybat::communicationPolicy::zmq::Message<graybat::communicationPolicy::ZMQ>, std::deque<graybat::communicationPolicy::zmq::Message<graybat::communicationPolicy::ZMQ>, std::allocator<graybat::communicationPolicy::zmq::Message<graybat::communicationPolicy::ZMQ> > > >, std::less<std::tuple<graybat::communicationPolicy::MsgTypeType, unsigned int, unsigned int, unsigned int> >, std::allocator<std::pair<std::tuple<graybat::communicationPolicy::MsgTypeType, unsigned int, unsigned int, unsigned int> const, std::queue<graybat::communicationPolicy::zmq::Message<graybat::communicationPolicy::ZMQ>, std::deque<graybat::communicationPolicy::zmq::Message<graybat::communicationPolicy::ZMQ>, std::allocator<graybat::communicationPolicy::zmq::Message<graybat::communicationPolicy::ZMQ> > > > > > >::at (this=0x6a2330 <zmqCP+400>, __k=std::tuple containing = {...}) at /usr/include/c++/6.2.1/bits/stl_map.h:523
#3 0x000000000047a3c7 in utils::MultiKeyMap<std::queue<graybat::communicationPolicy::zmq::Message<graybat::communicationPolicy::ZMQ>, std::deque<graybat::communicationPolicy::zmq::Message<graybat::communicationPolicy::ZMQ>, std::allocator<graybat::communicationPolicy::zmq::Message<graybat::communicationPolicy::ZMQ> > > >, graybat::communicationPolicy::MsgTypeType, unsigned int, unsigned int, unsigned int>::at (this=0x6a2330 <zmqCP+400>, keys#0=graybat::communicationPolicy::MsgTypeType::SPLIT, keys#1=0x340, keys#2=0x1, keys#3=0x0) at /home/erik/projects/graybat/include/graybat/utils/MultiKeyMap.hpp:139
#4 0x00000000004741e5 in utils::MessageBox<graybat::communicationPolicy::zmq::Message<graybat::communicationPolicy::ZMQ>, graybat::communicationPolicy::MsgTypeType, unsigned int, unsigned int, unsigned int>::waitDequeue (this=0x6a2248 <zmqCP+168>, keys#0=graybat::communicationPolicy::MsgTypeType::SPLIT, keys#1=0x340, keys#2=0x1, keys#3=0x0) at /home/erik/projects/graybat/include/graybat/utils/MultiKeyMap.hpp:253
#5 0x000000000047022a in graybat::communicationPolicy::socket::Base<graybat::communicationPolicy::ZMQ>::recvImpl<std::array<unsigned int, 0ul> > (this=0x6a21a0 <zmqCP>, msgType=graybat::communicationPolicy::MsgTypeType::SPLIT, context=..., srcVAddr=0x1, tag=0x0, recvData=...) at /home/erik/projects/graybat/include/graybat/communicationPolicy/socket/Base.hpp:683
#6 0x0000000000469c4b in graybat::communicationPolicy::socket::Base<graybat::communicationPolicy::ZMQ>::splitContext (this=0x6a21a0 <zmqCP>, isMember=0x1, oldContext=...) at /home/erik/projects/graybat/include/graybat/communicationPolicy/socket/Base.hpp:518
#7 0x0000000000458444 in graybat_cp_point_to_point::context::<lambda(auto:2)>::operator()<std::reference_wrapper<graybat::communicationPolicy::ZMQ> >(std::reference_wrapper<graybat::communicationPolicy::ZMQ>) const (__closure=0x7fffffffc13f, cpRef=...) at /home/erik/projects/graybat/test/CommunicationPolicyUT.cpp:132
#8 0x0000000000458527 in boost::hana::detail::on_each<graybat_cp_point_to_point::context::test_method()::<lambda(auto:2)>*>::operator()<std::reference_wrapper<graybat::communicationPolicy::ZMQ>&>(std::reference_wrapper<graybat::communicationPolicy::ZMQ> &) const (this=0x7fffffffc0d0, xs#0=...) at /usr/include/boost/hana/for_each.hpp:46
#9 0x0000000000458562 in boost::hana::unpack_impl<boost::hana::basic_tuple_tag, void>::apply<0ul, std::reference_wrapper<graybat::communicationPolicy::ZMQ>, boost::hana::detail::on_each<graybat_cp_point_to_point::context::test_method()::<lambda(auto:2)>*> >(boost::hana::detail::basic_tuple_impl<std::integer_sequence<unsigned long, 0ul>, std::reference_wrapper<graybat::communicationPolicy::ZMQ> > &, <unknown type in /home/erik/projects/graybat/build/checkFast, CU 0x11a36, DIE 0x871ba>) (xs=..., f=<unknown type in /home/erik/projects/graybat/build/checkFast, CU 0x11a36, DIE 0x871ba>) at /usr/include/boost/hana/basic_tuple.hpp:189
#10 0x000000000045858b in boost::hana::unpack_t::operator()<boost::hana::basic_tuple<std::reference_wrapper<graybat::communicationPolicy::ZMQ> >&, boost::hana::detail::on_each<graybat_cp_point_to_point::context::test_method()::<lambda(auto:2)>*> >(boost::hana::basic_tuple<std::reference_wrapper<graybat::communicationPolicy::ZMQ> > &, <unknown type in /home/erik/projects/graybat/build/checkFast, CU 0x11a36, DIE 0x871ba>) const (this=0x4868e8 <boost::hana::unpack>, xs=..., f=<unknown type in /home/erik/projects/graybat/build/checkFast, CU 0x11a36, DIE 0x871ba>) at /usr/include/boost/hana/unpack.hpp:47
#11 0x00000000004585b2 in boost::hana::unpack_impl<boost::hana::tuple_tag, void>::apply<boost::hana::tuple<std::reference_wrapper<graybat::communicationPolicy::ZMQ> >&, boost::hana::detail::on_each<graybat_cp_point_to_point::context::test_method()::<lambda(auto:2)>*> >(boost::hana::tuple<std::reference_wrapper<graybat::communicationPolicy::ZMQ> > &, <unknown type in /home/erik/projects/graybat/build/checkFast, CU 0x11a36, DIE 0x871ba>) (xs=..., f=<unknown type in /home/erik/projects/graybat/build/checkFast, CU 0x11a36, DIE 0x871ba>) at /usr/include/boost/hana/tuple.hpp:205
#12 0x00000000004585db in boost::hana::unpack_t::operator()<boost::hana::tuple<std::reference_wrapper<graybat::communicationPolicy::ZMQ> >&, boost::hana::detail::on_each<graybat_cp_point_to_point::context::test_method()::<lambda(auto:2)>*> >(boost::hana::tuple<std::reference_wrapper<graybat::communicationPolicy::ZMQ> > &, <unknown type in /home/erik/projects/graybat/build/checkFast, CU 0x11a36, DIE 0x871ba>) const (this=0x4868e8 <boost::hana::unpack>, xs=..., f=<unknown type in /home/erik/projects/graybat/build/checkFast, CU 0x11a36, DIE 0x871ba>) at /usr/include/boost/hana/unpack.hpp:47
#13 0x000000000045860a in boost::hana::for_each_impl<boost::hana::tuple_tag, boost::hana::when<true> >::apply<boost::hana::tuple<std::reference_wrapper<graybat::communicationPolicy::ZMQ> >&, graybat_cp_point_to_point::context::test_method()::<lambda(auto:2)> >(boost::hana::tuple<std::reference_wrapper<graybat::communicationPolicy::ZMQ> > &, <unknown type in /home/erik/projects/graybat/build/checkFast, CU 0x11a36, DIE 0x871ad>) (xs=..., f=<unknown type in /home/erik/projects/graybat/build/checkFast, CU 0x11a36, DIE 0x871ad>) at /usr/include/boost/hana/for_each.hpp:56
#14 0x0000000000458635 in boost::hana::for_each_t::operator()<boost::hana::tuple<std::reference_wrapper<graybat::communicationPolicy::ZMQ> >&, graybat_cp_point_to_point::context::test_method()::<lambda(auto:2)> >(boost::hana::tuple<std::reference_wrapper<graybat::communicationPolicy::ZMQ> > &, <unknown type in /home/erik/projects/graybat/build/checkFast, CU 0x11a36, DIE 0x87161>) const (this=0x486c18 <boost::hana::for_each>, xs=..., f=<unknown type in /home/erik/projects/graybat/build/checkFast, CU 0x11a36, DIE 0x87161>) at /usr/include/boost/hana/for_each.hpp:35
#15 0x000000000045865a in graybat_cp_point_to_point::context::test_method (this=0x7fffffffc15f) at /home/erik/projects/graybat/test/CommunicationPolicyUT.cpp:145
#16 0x000000000045826c in graybat_cp_point_to_point::context_invoker () at /home/erik/projects/graybat/test/CommunicationPolicyUT.cpp:117
#17 0x000000000047a23f in boost::detail::function::void_function_invoker0<void (*)(), void>::invoke (function_ptr=...) at /usr/include/boost/function/function_template.hpp:118
#18 0x00007ffff7b7027e in boost::detail::function::function_obj_invoker0<boost::detail::forward, int>::invoke(boost::detail::function::function_buffer&) () from /usr/lib/libboost_unit_test_framework.so.1.61.0
#19 0x00007ffff7b6f9ad in boost::execution_monitor::catch_signals(boost::function<int ()> const&) () from /usr/lib/libboost_unit_test_framework.so.1.61.0
#20 0x00007ffff7b6faa8 in boost::execution_monitor::execute(boost::function<int ()> const&) () from /usr/lib/libboost_unit_test_framework.so.1.61.0
#21 0x00007ffff7b700f6 in boost::execution_monitor::vexecute(boost::function<void ()> const&) () from /usr/lib/libboost_unit_test_framework.so.1.61.0
#22 0x00007ffff7b9998b in boost::unit_test::unit_test_monitor_t::execute_and_translate(boost::function<void ()> const&, unsigned int) () from /usr/lib/libboost_unit_test_framework.so.1.61.0
#23 0x00007ffff7b7b25b in boost::unit_test::framework::state::execute_test_tree(unsigned long, unsigned int, boost::unit_test::framework::state::random_generator_helper const*) () from /usr/lib/libboost_unit_test_framework.so.1.61.0
#24 0x00007ffff7b7b82a in boost::unit_test::framework::state::execute_test_tree(unsigned long, unsigned int, boost::unit_test::framework::state::random_generator_helper const*) () from /usr/lib/libboost_unit_test_framework.so.1.61.0
#25 0x00007ffff7b7403a in boost::unit_test::framework::run(unsigned long, bool) () from /usr/lib/libboost_unit_test_framework.so.1.61.0
#26 0x00007ffff7b9778f in boost::unit_test::unit_test_main(bool (*)(), int, char**) () from /usr/lib/libboost_unit_test_framework.so.1.61.0
#27 0x000000000045732b in main (argc=0x3, argv=0x7fffffffddd8) at /usr/include/boost/test/unit_test.hpp:63
I did a rerun of the test suit and the the cp_point_to_point/context case. I get no failures if i restart the .gbZmqSignaling server between the sessions. But i do get them if the zmqSignalingServer is running.
I take a look at this.
Good news everyone. I enabled the debug flag and could now reproduce your exception.
It seems that under some circumstances the testcase tries to get an element from an empty multiKeyMap. I could not manage to find how or why the mkm is not initialised at this point in time. Very odd behaviour.
I have traced the error a bit up the stack trace. It has nothing to do with the MultikeyMap. I traced the access of the MKM and there is one access of an Element which has not been inserted to the map. I also checked the locking in the messagebox itself and it seems correct. I think there is something off in the splitContext function of Socket::base()
One ugly fix might be to use the default constructor of the value_t in the MKM::at.
Okay, I think, I was able to fix this bug. As it seemd, the problem is, that multiKeyMap.at(keys...)
should not be called parallel to multiKeyMap.push(...)
. Although multiKeyMap.test(keys...)
returns true. meaning the queue for keys...
exists, multiKeyMap.at(keys...)
fails. I could fix this on my system with replacing line 233
while(multiKeyMap.at(keys...).empty()){
with
while([&](){
//The access of "multiKeyMap.at(keys...)" is a critical section
bool empty = true;
{
std::lock_guard<std::mutex> accessLock(access);
empty = multiKeyMap.at(keys...).empty();
}
return empty;
}()){
Imho as conditional variables cannot deadlock code because of the time outs and as every critical section consits of only a few lines of code without possible infinite loops, this should not cause deadlocks. But please have a look at this yourself, too.
I checked the patch and it seemed to work.
@theZiz Thank you very much for providing this little fix. It looks weird, but I think that it does what it should and @fabian-jung is happy. I think about adding an additional dev branch dev-c++11, where we could merge this PR.
@erikzenker Yes, it looks very weird, but tbh this is the first time I used a C++ lambda and found it useful. The problem is, I need to make sure, that the access mutex is locked while calling the at
method. This is the most atomar solution without defing a new explicit helper function. Although I thought about defining a MACRO, which then itself defines a lambda function... :>
Can you please close this PR and open a new one for the dev-c++11 branch ?
@theZiz Never saw a lambda in a while loop statement. Thank you for sharing this first moment with me. I kind of like it !
Regarding my comment about the context test above: the test leads to a
std::out_of_range: map::at
exception when the context is split multiple times in a loop (see here). In particular, the test case runs thesplitContext
method 1000 times, but about 150 iterations are enough to trigger the exception.Looks like a race condition to me e.g.: reading and writing concurrently from the
MultiKeyMap
with broken locking. I Need to ask the debugger for more details (debugging mpi with gdb).