
[Bug] BE crash when doing insert stmt #3871

Closed · morningman closed this issue 4 years ago

morningman commented 4 years ago

Describe the bug

The BE crashes with a coredump like:

(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x0000000001045667 in release (bytes=8192, this=0x633b7d40) at /var/doris/palo-core/be/src/runtime/mem_tracker.h:221
#2  doris::RowBatch::clear (this=0x8fc6bce0) at /var/doris/palo-core/be/src/runtime/row_batch.cpp:294
#3  0x00000000010459dd in clear (this=0x8fc6bce0) at /var/doris/palo-core/be/src/common/object_pool.h:54
#4  doris::RowBatch::~RowBatch (this=0x8fc6bce0, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/runtime/row_batch.cpp:301
#5  0x0000000001045a01 in doris::RowBatch::~RowBatch (this=0x8fc6bce0, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/runtime/row_batch.cpp:302
#6  0x00000000014a0299 in operator() (this=<optimized out>, __ptr=<optimized out>) at /var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/unique_ptr.h:78
#7  ~unique_ptr (this=0x7070ec98, __in_chrg=<optimized out>) at /var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/unique_ptr.h:268
#8  doris::stream_load::NodeChannel::~NodeChannel (this=0x7070ec00, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/exec/tablet_sink.cpp:41
#9  0x00000000014a53f2 in ~SpecificElement (this=0x77645500, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/common/object_pool.h:75
#10 doris::ObjectPool::SpecificElement<doris::stream_load::NodeChannel>::~SpecificElement (this=0x77645500, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/common/object_pool.h:76
#11 0x0000000001052efd in clear (this=0x6a213220) at /var/doris/palo-core/be/src/common/object_pool.h:54
#12 ~ObjectPool (this=0x6a213220, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/common/object_pool.h:37
#13 std::_Sp_counted_ptr<doris::ObjectPool*, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=<optimized out>) at /var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/shared_ptr_base.h:376
#14 0x000000000104b591 in _M_release (this=0x6a212fa0) at /var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/shared_ptr_base.h:154
#15 ~__shared_count (this=0x891de310, __in_chrg=<optimized out>) at /var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/shared_ptr_base.h:684
#16 ~__shared_ptr (this=0x891de308, __in_chrg=<optimized out>) at /var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/shared_ptr_base.h:1123
#17 ~shared_ptr (this=0x891de308, __in_chrg=<optimized out>) at /var/doris/doris-toolchain/gcc730/include/c++/7.3.0/bits/shared_ptr.h:93
#18 doris::RuntimeState::~RuntimeState (this=0x891de300, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/runtime/runtime_state.cpp:128
#19 0x000000000103dcc6 in checked_delete<doris::RuntimeState> (x=0x891de300) at /var/doris/thirdparty/installed/include/boost/core/checked_delete.hpp:34
#20 ~scoped_ptr (this=0x88082ce0, __in_chrg=<optimized out>) at /var/doris/thirdparty/installed/include/boost/smart_ptr/scoped_ptr.hpp:89
#21 doris::PlanFragmentExecutor::~PlanFragmentExecutor (this=0x88082b70, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/runtime/plan_fragment_executor.cpp:62
#22 0x0000000000fd39ed in doris::FragmentExecState::~FragmentExecState (this=0x88082b00, __in_chrg=<optimized out>) at /var/doris/palo-core/be/src/runtime/fragment_mgr.cpp:175

Debug

In some abnormal situations, performing an Insert load can cause the BE to crash. Troubleshooting turned up the following chain of causes:

https://github.com/apache/incubator-doris/blob/2211cb0ee0fcd23d4fd2445494aba6cf1a020987/fe/src/main/java/org/apache/doris/qe/Coordinator.java#L475-L489

During the execution of execState.execRemoteFragmentAsync(), if an RPC error occurs (for example, because the corresponding BE is down), an exception is thrown directly instead of the error being returned via Future<PExecPlanFragmentResult>. In this case, the Coordinator never proceeds with the subsequent Cancel operation.
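
A sketch of the safer pattern, written in C++ to match the BE-side examples below (the real method is Java in Coordinator.java, and all names here are illustrative): fold the RPC failure into the returned future instead of throwing, so the caller's cancel path still runs.

#include <future>
#include <string>

// Illustrative stand-in for the RPC result type; not the real
// PExecPlanFragmentResult.
struct ExecResult {
    int code;        // 0 = ok, non-zero = error
    std::string msg;
};

// Hypothetical analogue of execRemoteFragmentAsync(): on RPC failure,
// report the error through the future rather than throwing, so the
// coordinator can continue its normal flow and call Cancel().
std::future<ExecResult> exec_remote_fragment_async(bool be_reachable) {
    std::promise<ExecResult> p;
    if (!be_reachable) {
        p.set_value({1, "RPC failed: backend unreachable"});
    } else {
        p.set_value({0, "ok"});
    }
    return p.get_future();
}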

After the exception is thrown, the stack unwinds back to handleInsertStmt(), which simply returns the error message "Insert failed" to the user. The FE does no further processing after that.

Meanwhile, the BE has received the Insert execution plan and calls _sink->open in PlanFragmentExecutor::open_internal() to open the TabletSink.

https://github.com/apache/incubator-doris/blob/2211cb0ee0fcd23d4fd2445494aba6cf1a020987/be/src/runtime/plan_fragment_executor.cpp#L272-L292

TabletSink's Open method opens all related NodeChannels via RPC, but because some BEs have problems, some of the NodeChannels fail to open. An error message appears in the BE log:

tablet open failed, load_id=4e2f4d6076cc4692-b8f26cc924b6be0d, txn_id=144942945, node=10.xx.xx.xx:8060, errmsg=failed to open tablet writer

However, because the majority of NodeChannels opened successfully, the TabletSink Open operation still returned success.
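
That majority-tolerant acceptance can be sketched as follows; this is a minimal illustration under assumed names and an assumed quorum rule, not the actual TabletSink code:

#include <vector>

// A per-node open result; open_ok is the outcome of the open RPC.
struct Channel {
    bool open_ok;
};

// With 3 replicas the write quorum is 2, so a single failed NodeChannel
// is tolerated and Open() still reports success overall.
bool sink_open(const std::vector<Channel>& channels, int num_replicas) {
    int failed = 0;
    for (const auto& c : channels) {
        if (!c.open_ok) {
            ++failed;
        }
    }
    int quorum = num_replicas / 2 + 1;
    return failed <= num_replicas - quorum;
}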

Next, PlanFragmentExecutor::open_internal() calls get_next_internal() to start fetching data. Because the Insert has already failed at this point, the call returns failure, and the subsequent update_status(status) method contains a bug:

https://github.com/apache/incubator-doris/blob/2211cb0ee0fcd23d4fd2445494aba6cf1a020987/be/src/runtime/plan_fragment_executor.cpp#L485-L503

Line 493: if (_status.ok()) should be if (!_status.ok()). Because of this, the _status variable is never updated with the error, so when the TabletSink is finally closed, the NodeChannels are closed instead of canceled.
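
One plausible reading is that the check guards an early return; under that assumption, the intended "record only the first error" logic looks roughly like this self-contained sketch (Status and Executor are minimal stand-ins, not the real Doris source):

#include <mutex>
#include <string>

// Minimal stand-in: empty message means OK.
struct Status {
    std::string msg;
    bool ok() const { return msg.empty(); }
};

struct Executor {
    std::mutex _status_lock;
    Status _status;  // stays OK until the first error is recorded

    Status update_status(const Status& status) {
        if (status.ok()) {
            return status;           // nothing to record
        }
        std::lock_guard<std::mutex> l(_status_lock);
        if (!_status.ok()) {         // the buggy build checked _status.ok()
            return _status;          // here, so this early return fired while
        }                            // _status was still OK and the error was
        _status = status;            // never recorded; keep the first error
        return _status;
    }
};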

A normal NodeChannel close operation sends the last RowBatch the channel holds and then destroys the RowBatch object. But the NodeChannels that failed to open are never closed normally, so the RowBatches they hold are never destroyed.
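
The close-versus-cancel distinction matters because cancel frees the pending batch unconditionally, while close only flushes and frees it on a channel that actually opened; a rough sketch under hypothetical names:

#include <memory>

struct RowBatch {};  // placeholder for the real doris::RowBatch

struct NodeChannelLike {
    std::unique_ptr<RowBatch> cur_batch{new RowBatch};
    bool opened = false;

    // close(): flush the last batch and free it, but only on a channel
    // that actually opened; a failed channel keeps its batch alive.
    void close() {
        if (opened && cur_batch) {
            // send_batch(*cur_batch);  // flush to the remote node
            cur_batch.reset();
        }
    }

    // cancel(): drop the batch without sending, regardless of open state.
    void cancel() {
        cur_batch.reset();
    }
};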

After the entire plan finishes executing, the PlanFragmentExecutor's destruction process begins. The destructor call chain is as follows:

PlanFragmentExecutor
|--- RuntimeState
     |--- RuntimeProfile
     |--- ObjectPool
          |--- NodeChannel
               |--- RowBatch
                    |--- MemTracker->release()
                         |--- profile->_consumption->add(-bytes)

Note that the RuntimeProfile is destructed first within RuntimeState. As a result, when the RowBatch held by the NodeChannel is destructed later, it calls into the already-destructed _consumption object, which ultimately crashes the BE.
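
Stripped of the Doris specifics, the hazard is a non-owning pointer whose owner is destroyed first; a minimal, compilable reduction under hypothetical names:

#include <memory>

struct Counter {
    long v = 0;
    void add(long d) { v += d; }
};

// Plays the role of NodeChannel's RowBatch: its destructor writes through
// a non-owning pointer into the profile, like MemTracker::release(bytes).
struct Batch {
    Counter* consumption;
    long bytes = 8192;
    ~Batch() { consumption->add(-bytes); }
};

struct StateLike {
    // Members are destroyed in reverse declaration order, so `profile`
    // (declared last, like RuntimeProfile here) dies before `pool`, and
    // ~Batch then writes into freed memory: the use-after-free above.
    std::unique_ptr<Batch> pool;       // stands in for the ObjectPool contents
    std::unique_ptr<Counter> profile;  // stands in for the profile's counter
};

int main() {
    StateLike s;
    s.profile.reset(new Counter);
    s.pool.reset(new Batch{s.profile.get()});
}   // AddressSanitizer reports heap-use-after-free at scope exit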

The whole process has the following problems:

  1. The bug in update_status(status) caused the NodeChannel not to be canceled correctly, which led to problems during final destruction. (If NodeChannel's Cancel had been called in advance, the RowBatch would have been destructed in advance.)
  2. When the FE Coordinator executes execRemoteFragmentAsync() and encounters an RPC error, it should return a Future carrying the error code, continue the subsequent process, and actively call Cancel().
  3. The _status in RuntimeState has no lock protection, which may cause potential problems.

vagetablechicken commented 4 years ago

> The destructor call chain is as follows:
> PlanFragmentExecutor
> |--- RuntimeState
>      |--- RuntimeProfile
>      |--- ObjectPool
>           |--- NodeChannel
>                |--- RowBatch
>                     |--- MemTracker->release()
>                          |--- profile->_consumption->add(-bytes)
>
> Note that the RuntimeProfile is destructed first within RuntimeState. As a result, when the RowBatch held by the NodeChannel is destructed later, it calls into the already-destructed _consumption object, which ultimately crashes the BE.

The MemTracker is created at https://github.com/apache/incubator-doris/blob/c6f2b5ef0df61902360a3f3fcdfc00a6fff86dfe/be/src/exec/tablet_sink.cpp#L461. Is OlapTableSink::_mem_tracker actually related to RuntimeState::_profile? I think RuntimeState::_instance_mem_tracker is the relevant one, because it is OlapTableSink::_mem_tracker's parent.

vagetablechicken commented 4 years ago

My guess: we hit some errors in a NodeChannel, setting _cancelled == true, while the NodeChannel's _cur_batch or _pending_batches still held some bytes. A few internally cancelled NodeChannels are not intolerable, so OlapTableSink still calls NodeChannel::mark_close() on them. But I wrote this: https://github.com/apache/incubator-doris/blob/7591527977ea8b4184e45581007cc805c461b451/be/src/exec/tablet_sink.cpp#L218-L222 So when an internally cancelled NodeChannel is destructed, we encounter a non-empty RowBatch.

But at that time, RuntimeState::_instance_mem_tracker has already been deleted.

(RowBatch's mem_tracker is OlapTableSink::_mem_tracker, and RuntimeState::_instance_mem_tracker is OlapTableSink::_mem_tracker's parent.)
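
The parent/child relationship means a release on the sink's tracker also updates the instance tracker's counter, which lives in the profile; a rough sketch under hypothetical names:

// Minimal counter, standing in for the profile-owned _consumption.
struct Counter {
    long v = 0;
    void add(long d) { v += d; }
};

// Hypothetical reduction of the MemTracker hierarchy: release() walks up
// the parent chain, and every level updates a counter that the
// RuntimeProfile owns, so releasing after the profile is gone is fatal.
struct TrackerNode {
    TrackerNode* parent = nullptr;
    Counter* consumption = nullptr;
    void release(long bytes) {
        for (TrackerNode* t = this; t != nullptr; t = t->parent) {
            t->consumption->add(-bytes);
        }
    }
};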

It's my fault. I treated OlapTableSink::_mem_tracker and _channels as purely internal members of OlapTableSink.

morningman commented 4 years ago

> Is OlapTableSink::_mem_tracker actually related to RuntimeState::_profile? I think RuntimeState::_instance_mem_tracker is the relevant one, because it is OlapTableSink::_mem_tracker's parent.

Actually, I hadn't noticed the relationship between OlapTableSink::_mem_tracker and RuntimeState::_instance_mem_tracker. While investigating this problem, I found that all MemTrackers are related to RuntimeState::_profile, because they update Counters in the RuntimeProfile. For example, MemTracker->release() uses (*tracker)->_consumption, and _consumption is a Counter that belongs to the RuntimeProfile. But the profile had already been destructed, which results in a use-after-free error.