Describe the bug
BE crashes with the following logs in be.out
start time: Tue May 19 18:15:02 CST 2020
*** Aborted at 1590397931 (unix time) try "date -d @1590397931" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 23260 (TID 0x7f0f49140700) from PID 0; stack trace: ***
@ 0x7f0faa6575e0 (unknown)
For unknown reason Glog fails to capture backtrace. Fortunately we have coredump enabled and got the following backtrace
#0 0x0000000000000000 in ?? ()
#1 0x0000000000ff40e2 in doris::MemTracker::~MemTracker (this=0x2543c3040, __in_chrg=<optimized out>) at /home/gaodayue/doris-git/be/src/runtime/mem_tracker.cpp:194
#2 0x00000000015c79cb in operator() (this=<optimized out>, __ptr=0x2543c3040) at /usr/local/include/c++/7.2.0/bits/unique_ptr.h:78
#3 ~unique_ptr (this=0x238bff9e0, __in_chrg=<optimized out>) at /usr/local/include/c++/7.2.0/bits/unique_ptr.h:268
#4 doris::DataStreamSender::~DataStreamSender (this=0x238bff860, __in_chrg=<optimized out>) at /home/gaodayue/doris-git/be/src/runtime/data_stream_sender.cpp:430
#5 0x00000000015c7c61 in doris::DataStreamSender::~DataStreamSender (this=0x238bff860, __in_chrg=<optimized out>) at /home/gaodayue/doris-git/be/src/runtime/data_stream_sender.cpp:434
#6 0x000000000102d96e in checked_delete<doris::DataSink> (x=<optimized out>) at /home/gaodayue/doris-git/thirdparty/installed/include/boost/core/checked_delete.hpp:34
#7 ~scoped_ptr (this=0x6da945260, __in_chrg=<optimized out>) at /home/gaodayue/doris-git/thirdparty/installed/include/boost/smart_ptr/scoped_ptr.hpp:89
#8 doris::PlanFragmentExecutor::~PlanFragmentExecutor (this=0x6da9450f0, __in_chrg=<optimized out>) at /home/gaodayue/doris-git/be/src/runtime/plan_fragment_executor.cpp:62
#9 0x0000000000fbbe2d in doris::FragmentExecState::~FragmentExecState (this=0x6da945080, __in_chrg=<optimized out>) at /home/gaodayue/doris-git/be/src/runtime/fragment_mgr.cpp:178
#10 0x0000000000fc34ea in std::_Sp_counted_ptr<doris::FragmentExecState*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=<optimized out>) at /usr/local/include/c++/7.2.0/bits/shared_ptr_base.h:376
#11 0x0000000000d72f76 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0xfa38b180) at /usr/local/include/c++/7.2.0/bits/shared_ptr_base.h:154
#12 0x0000000000fc3dca in ~list3 (this=0x43930d9b0, __in_chrg=<optimized out>) at /home/gaodayue/doris-git/thirdparty/installed/include/boost/bind/bind.hpp:358
#13 ~bind_t (this=0x43930d9a0, __in_chrg=<optimized out>) at /home/gaodayue/doris-git/thirdparty/installed/include/boost/bind/bind.hpp:1275
#14 manager (op=<optimized out>, out_buffer=..., in_buffer=...) at /home/gaodayue/doris-git/thirdparty/installed/include/boost/function/function_base.hpp:358
#15 manager (op=<optimized out>, out_buffer=..., in_buffer=...) at /home/gaodayue/doris-git/thirdparty/installed/include/boost/function/function_base.hpp:379
#16 boost::detail::function::functor_manager<boost::_bi::bind_t<void, boost::_mfi::mf2<void, doris::FragmentMgr, std::shared_ptr<doris::FragmentExecState>, std::function<void (doris::PlanFragmentExecutor*)> >
, boost::_bi::list3<boost::_bi::value<doris::FragmentMgr*>, boost::_bi::value<std::shared_ptr<doris::FragmentExecState> >, boost::_bi::value<std::function<void (doris::PlanFragmentExecutor*)> > > > >::manage(boost::detail::function::function_buffer const&, boost::detail::function::function_buffer&, boost::detail::function::functor_manager_operation_type) (in_buffer=..., out_buffer=..., op=<optimized out>)
at /home/gaodayue/doris-git/thirdparty/installed/include/boost/function/function_base.hpp:407
#17 0x0000000000ec23aa in clear (this=<optimized out>, functor=...) at /home/gaodayue/doris-git/thirdparty/installed/include/boost/function/function_template.hpp:510
#18 clear (this=0x7f0f4913dfc0) at /home/gaodayue/doris-git/thirdparty/installed/include/boost/function/function_template.hpp:868
#19 ~function0 (this=0x7f0f4913dfc0, __in_chrg=<optimized out>) at /home/gaodayue/doris-git/thirdparty/installed/include/boost/function/function_template.hpp:752
#20 ~function (this=0x7f0f4913dfc0, __in_chrg=<optimized out>) at /home/gaodayue/doris-git/thirdparty/installed/include/boost/function/function_template.hpp:1041
#21 doris::ThreadPool::work_thread (this=0x68d0b00, thread_id=<optimized out>) at /home/gaodayue/doris-git/be/src/util/thread_pool.hpp:117
#22 0x0000000001a0adfd in thread_proxy ()
#23 0x00007f0faa64fe25 in start_thread () from /lib64/libpthread.so.0
#24 0x00007f0fa9a5934d in clone () from /lib64/libc.so.6
The program crashes when destructing DataStreamSender's MemTracker. Let's look at its state
(gdb) f 1
(gdb) p this
$25 = (doris::MemTracker * const) 0x2543c3040
(gdb) p _all_trackers
$26 = std::vector of length 4, capacity 4 = {0x2543c3040, 0x181ea2b60, 0x12cdae000, 0x77f0d00}
From cores, we see that it has three parent trackers
instance tracker 0x181ea2b60
query tracker 0x12cdae000
process tracker 0x77f0d00
After looking into the state of the these trackers, I'm pretty sure the memory of instance and query tracker is corrupted.
For example
1) _type of instance tracker's _consumption field is invalid value 630994688.
So far the cause of problem is clear, DataSink uses MemTrackers from RuntimeState but the latter is destructed before the former. There is one last remaining question: What are the conditions that trigger the BUG?
We can get the fragment instance id 36bf368c3c0d4ce2-8034bef11abe15cf from the coredump. Grepping be.INFO, we found that the fragment is cancelled due to timeout.
Describe the bug BE crashes with the following logs in be.out
For unknown reason Glog fails to capture backtrace. Fortunately we have coredump enabled and got the following backtrace
The program crashes when destructing DataStreamSender's MemTracker. Let's look at its state
From cores, we see that it has three parent trackers
After looking into the state of the these trackers, I'm pretty sure the memory of instance and query tracker is corrupted.
For example 1)
_type
of instance tracker's_consumption
field is invalid value 630994688.2)
_child_trackers
of query tracker is corruptedSince instance and query tracker belongs to RuntimeState, we then look at the state of RuntimeState. It looks that it has already been destructed.
So far the cause of problem is clear, DataSink uses MemTrackers from RuntimeState but the latter is destructed before the former. There is one last remaining question: What are the conditions that trigger the BUG?
We can get the fragment instance id 36bf368c3c0d4ce2-8034bef11abe15cf from the coredump. Grepping be.INFO, we found that the fragment is cancelled due to timeout.
In this case
PlanFragmentExecutor::open_internal
will return early athttps://github.com/apache/incubator-doris/blob/bc35f3a31f7004fd548a12c8dee3ab2e8c9e4f22/be/src/runtime/plan_fragment_executor.cpp#L288
Then RuntimeState will be deallocated before DataSink according their declaration order.