apache / doris

Apache Doris is an easy-to-use, high performance and unified analytics database.
https://doris.apache.org
Apache License 2.0
12.66k stars 3.27k forks source link

BE cores because RuntimeState is deallocated before DataSink when fragment is cancelled #3706

Closed gaodayue closed 4 years ago

gaodayue commented 4 years ago

Describe the bug BE crashes with the following logs in be.out

start time: Tue May 19 18:15:02 CST 2020
*** Aborted at 1590397931 (unix time) try "date -d @1590397931" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 23260 (TID 0x7f0f49140700) from PID 0; stack trace: ***
    @     0x7f0faa6575e0 (unknown)

For unknown reason Glog fails to capture backtrace. Fortunately we have coredump enabled and got the following backtrace

#0  0x0000000000000000 in ?? ()
#1  0x0000000000ff40e2 in doris::MemTracker::~MemTracker (this=0x2543c3040, __in_chrg=<optimized out>) at /home/gaodayue/doris-git/be/src/runtime/mem_tracker.cpp:194
#2  0x00000000015c79cb in operator() (this=<optimized out>, __ptr=0x2543c3040) at /usr/local/include/c++/7.2.0/bits/unique_ptr.h:78
#3  ~unique_ptr (this=0x238bff9e0, __in_chrg=<optimized out>) at /usr/local/include/c++/7.2.0/bits/unique_ptr.h:268
#4  doris::DataStreamSender::~DataStreamSender (this=0x238bff860, __in_chrg=<optimized out>) at /home/gaodayue/doris-git/be/src/runtime/data_stream_sender.cpp:430
#5  0x00000000015c7c61 in doris::DataStreamSender::~DataStreamSender (this=0x238bff860, __in_chrg=<optimized out>) at /home/gaodayue/doris-git/be/src/runtime/data_stream_sender.cpp:434
#6  0x000000000102d96e in checked_delete<doris::DataSink> (x=<optimized out>) at /home/gaodayue/doris-git/thirdparty/installed/include/boost/core/checked_delete.hpp:34
#7  ~scoped_ptr (this=0x6da945260, __in_chrg=<optimized out>) at /home/gaodayue/doris-git/thirdparty/installed/include/boost/smart_ptr/scoped_ptr.hpp:89
#8  doris::PlanFragmentExecutor::~PlanFragmentExecutor (this=0x6da9450f0, __in_chrg=<optimized out>) at /home/gaodayue/doris-git/be/src/runtime/plan_fragment_executor.cpp:62
#9  0x0000000000fbbe2d in doris::FragmentExecState::~FragmentExecState (this=0x6da945080, __in_chrg=<optimized out>) at /home/gaodayue/doris-git/be/src/runtime/fragment_mgr.cpp:178
#10 0x0000000000fc34ea in std::_Sp_counted_ptr<doris::FragmentExecState*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=<optimized out>) at /usr/local/include/c++/7.2.0/bits/shared_ptr_base.h:376
#11 0x0000000000d72f76 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0xfa38b180) at /usr/local/include/c++/7.2.0/bits/shared_ptr_base.h:154
#12 0x0000000000fc3dca in ~list3 (this=0x43930d9b0, __in_chrg=<optimized out>) at /home/gaodayue/doris-git/thirdparty/installed/include/boost/bind/bind.hpp:358
#13 ~bind_t (this=0x43930d9a0, __in_chrg=<optimized out>) at /home/gaodayue/doris-git/thirdparty/installed/include/boost/bind/bind.hpp:1275
#14 manager (op=<optimized out>, out_buffer=..., in_buffer=...) at /home/gaodayue/doris-git/thirdparty/installed/include/boost/function/function_base.hpp:358
#15 manager (op=<optimized out>, out_buffer=..., in_buffer=...) at /home/gaodayue/doris-git/thirdparty/installed/include/boost/function/function_base.hpp:379
#16 boost::detail::function::functor_manager<boost::_bi::bind_t<void, boost::_mfi::mf2<void, doris::FragmentMgr, std::shared_ptr<doris::FragmentExecState>, std::function<void (doris::PlanFragmentExecutor*)> >
, boost::_bi::list3<boost::_bi::value<doris::FragmentMgr*>, boost::_bi::value<std::shared_ptr<doris::FragmentExecState> >, boost::_bi::value<std::function<void (doris::PlanFragmentExecutor*)> > > > >::manage(boost::detail::function::function_buffer const&, boost::detail::function::function_buffer&, boost::detail::function::functor_manager_operation_type) (in_buffer=..., out_buffer=..., op=<optimized out>)
    at /home/gaodayue/doris-git/thirdparty/installed/include/boost/function/function_base.hpp:407
#17 0x0000000000ec23aa in clear (this=<optimized out>, functor=...) at /home/gaodayue/doris-git/thirdparty/installed/include/boost/function/function_template.hpp:510
#18 clear (this=0x7f0f4913dfc0) at /home/gaodayue/doris-git/thirdparty/installed/include/boost/function/function_template.hpp:868
#19 ~function0 (this=0x7f0f4913dfc0, __in_chrg=<optimized out>) at /home/gaodayue/doris-git/thirdparty/installed/include/boost/function/function_template.hpp:752
#20 ~function (this=0x7f0f4913dfc0, __in_chrg=<optimized out>) at /home/gaodayue/doris-git/thirdparty/installed/include/boost/function/function_template.hpp:1041
#21 doris::ThreadPool::work_thread (this=0x68d0b00, thread_id=<optimized out>) at /home/gaodayue/doris-git/be/src/util/thread_pool.hpp:117
#22 0x0000000001a0adfd in thread_proxy ()
#23 0x00007f0faa64fe25 in start_thread () from /lib64/libpthread.so.0
#24 0x00007f0fa9a5934d in clone () from /lib64/libc.so.6

The program crashes when destructing DataStreamSender's MemTracker. Let's look at its state

(gdb) f 1
(gdb) p this
$25 = (doris::MemTracker * const) 0x2543c3040
(gdb) p _all_trackers
$26 = std::vector of length 4, capacity 4 = {0x2543c3040, 0x181ea2b60, 0x12cdae000, 0x77f0d00}

From cores, we see that it has three parent trackers

After looking into the state of the these trackers, I'm pretty sure the memory of instance and query tracker is corrupted.

For example 1) _type of instance tracker's _consumption field is invalid value 630994688.

(gdb) p *(((doris::MemTracker*) 0x181ea2b60)->_consumption)
$28 = {<doris::RuntimeProfile::Counter> = {_vptr.Counter = 0x406cadda0, _value = {_value = 4273231}, _type = 630994688}, current_value_ = {_value = 0}}

2) _child_trackers of query tracker is corrupted

(gdb) p ((doris::MemTracker*) 0x12cdae000)->_child_trackers
$29 = std::__cxx11::list = {[0] = 0x197756b60, [1] = 0x0, [2] = Cannot access memory at address 0x10

Since instance and query tracker belongs to RuntimeState, we then look at the state of RuntimeState. It looks that it has already been destructed.

(gdb) p _runtime_state
$30 = {px = 0x56fdb8a00}
(gdb) p ((doris::RuntimeState*) 0x56fdb8a00)->_block_mgr
$31 = {px = 0x0, pn = {pi_ = 0x0}}
(gdb) p ((doris::RuntimeState*) 0x56fdb8a00)->_block_mgr2
$32 = {px = 0x0, pn = {pi_ = 0x0}}

So far the cause of problem is clear, DataSink uses MemTrackers from RuntimeState but the latter is destructed before the former. There is one last remaining question: What are the conditions that trigger the BUG?

We can get the fragment instance id 36bf368c3c0d4ce2-8034bef11abe15cf from the coredump. Grepping be.INFO, we found that the fragment is cancelled due to timeout.

I0525 17:11:51.110321 24030 internal_service.cpp:149] exec plan fragment, fragment_instance_id=36bf368c3c0d4ce2-8034bef11abe15cf, coord=TNetworkAddress(hostname=xx.xx.xx.xx port=9020), backend=13
I0525 17:11:51.110373 24030 plan_fragment_executor.cpp:79] Prepare(): query_id=36bf368c3c0d4ce2-8034bef11abe15b2 fragment_instance_id=36bf368c3c0d4ce2-8034bef11abe15cf backend_num=13
I0525 17:11:51.111650 23886 plan_fragment_executor.cpp:272] Open(): fragment_instance_id=36bf368c3c0d4ce2-8034bef11abe15cf
I0525 17:12:11.104353 24028 internal_service.cpp:167] cancel framgent, fragment_instance_id=36bf368c3c0d4ce2-8034bef11abe15cf, reason: 3
I0525 17:12:11.104423 24028 plan_fragment_executor.cpp:536] cancel(): fragment_instance_id=36bf368c3c0d4ce2-8034bef11abe15cf
I0525 17:12:11.115552 23886 plan_fragment_executor.cpp:536] cancel(): fragment_instance_id=36bf368c3c0d4ce2-8034bef11abe15cf
I0525 17:12:11.116982 23886 plan_fragment_executor.cpp:594] Fragment 36bf368c3c0d4ce2-8034bef11abe15cf:(Active: 19s992ms, non-child: 0.00%)

In this case PlanFragmentExecutor::open_internal will return early at

https://github.com/apache/incubator-doris/blob/bc35f3a31f7004fd548a12c8dee3ab2e8c9e4f22/be/src/runtime/plan_fragment_executor.cpp#L288

Then RuntimeState will be deallocated before DataSink according their declaration order.

    boost::scoped_ptr<DataSink> _sink;
    boost::scoped_ptr<RuntimeState> _runtime_state;
    boost::scoped_ptr<RowBatch> _row_batch;
morningman commented 4 years ago

This is a great problem description and troubleshooting process!