Describe the bug

BE crashes with the following stacktrace when loading into a table with a bitmap column.

To Reproduce

1. Change BE config load_process_max_memory_limit_bytes to a smaller value, e.g., 1073741824.
2. Keep submitting several streaming load jobs to that table so that load memory consumption exceeds the memory limit from time to time.

Note: the bug can't be reproduced on a small dataset. I was only able to reproduce it after loading billions of rows for several hours.
Root cause analysis

The bug was caused by a race condition between two threads, described below:
1. Thread T1 has closed TabletsChannel x in LoadChannel::add_batch and is waiting on the LoadChannel lock to remove x from _tablets_channels. At this point, all delta writers of x have already been closed.
2. Thread T2 has acquired the lock in LoadChannel::handle_mem_exceed_limit and asks x to reduce_mem_usage(). This calls DeltaWriter::flush_memtable_and_wait on an already closed delta writer, which causes the last memtable to be flushed twice (a minimal sketch of this interleaving follows the list).
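To make the interleaving concrete, here is a minimal, hypothetical C++ model of the two code paths. It is not Doris code: ChannelModel, its members, and the method names are invented to mirror the call sites named above. The point it illustrates is that x's writers are closed before x is unregistered under the LoadChannel lock, so whoever wins the lock next still reaches the closed writers.

#include <iostream>
#include <mutex>

// Hypothetical model of the race (not Doris code).
struct ChannelModel {
    std::mutex load_channel_lock;   // stands in for the LoadChannel lock
    bool writers_closed = false;    // state of x's delta writers
    bool registered = true;         // x is still listed in _tablets_channels

    void t1_close_tablets_channel() {   // first half of LoadChannel::add_batch
        writers_closed = true;          // all delta writers of x are closed here,
    }                                   // before T1 ever takes the lock

    void t2_handle_mem_exceed_limit() { // LoadChannel::handle_mem_exceed_limit
        std::lock_guard<std::mutex> g(load_channel_lock);
        if (registered) {               // x has not been unregistered yet, so it
                                        // is picked for reduce_mem_usage()
            std::cout << "flush x's writers, already closed = "
                      << std::boolalpha << writers_closed << "\n";
        }
    }

    void t1_remove_from_registry() {    // second half of LoadChannel::add_batch
        std::lock_guard<std::mutex> g(load_channel_lock);
        registered = false;             // too late: T2 already flushed x
    }
};

int main() {
    ChannelModel c;
    // Replay the interleaving from the core dump sequentially:
    c.t1_close_tablets_channel();    // T1 closes x and its writers
    c.t2_handle_mem_exceed_limit();  // T2 wins the lock, prints "already closed = true"
    c.t1_remove_from_registry();     // T1 finally removes x under the lock
    return 0;
}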
The race can be verified from the core dump:
Thread T1:
#7 0x00000000017f2803 in std::lock_guard<std::mutex>::lock_guard (this=0x7fda2bb3d220, __m=...) at /usr/local/include/c++/7.2.0/bits/std_mutex.h:162
#8 0x0000000002257302 in doris::LoadChannel::add_batch (this=0x610002104f40, request=..., tablet_vec=0x6070006f8fc8) at /home/gaodayue/doris-git/be/src/runtime/load_channel.cpp:100
#9 0x00000000022494a7 in doris::LoadChannelMgr::add_batch (this=0x60c0000a9900, request=..., tablet_vec=0x6070006f8fc8, wait_lock_time_ns=0x7fda2bb3dc40)
Thread T2:
#4 0x0000000001d6841b in doris::CounterCondVariable::block_wait (this=0x60e000f60dd8) at /home/gaodayue/doris-git/be/src/util/counter_cond_variable.hpp:83
#5 0x0000000001d5a470 in doris::FlushHandler::wait (this=0x60e000f60dc0) at /home/gaodayue/doris-git/be/src/olap/memtable_flush_executor.cpp:41
#6 0x00000000030108d6 in doris::DeltaWriter::flush_memtable_and_wait (this=0x61100134f380) at /home/gaodayue/doris-git/be/src/olap/delta_writer.cpp:193
#7 0x0000000002262012 in doris::TabletsChannel::reduce_mem_usage (this=0x6120003463c0) at /home/gaodayue/doris-git/be/src/runtime/tablets_channel.cpp:177
#8 0x0000000002257a94 in doris::LoadChannel::handle_mem_exceed_limit (this=0x610002104f40, force=true) at /home/gaodayue/doris-git/be/src/runtime/load_channel.cpp:120
#9 0x000000000224a2dc in doris::LoadChannelMgr::_handle_mem_exceed_limit (this=0x60c0000a9900) at /home/gaodayue/doris-git/be/src/runtime/load_channel_mgr.cpp:164
#10 0x0000000002249440 in doris::LoadChannelMgr::add_batch (this=0x60c0000a9900, request=..., tablet_vec=0x607000a5bf98, wait_lock_time_ns=0x7fda2c35cc40)
at /home/gaodayue/doris-git/be/src/runtime/load_channel_mgr.cpp:120
Why does calling DeltaWriter::flush_memtable_and_wait after DeltaWriter::close_wait cause the last memtable to be flushed twice?

Because _mem_table is not reset in DeltaWriter::close, so the later call still sees the memtable that was already flushed during close and submits it for flushing again.
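A minimal self-contained sketch of this behavior, assuming a simplified writer model (WriterModel and MemTable below are invented for illustration and are not the Doris types): because the memtable pointer survives close(), a later flush request operates on the same object again.

#include <iostream>
#include <memory>

// Hypothetical writer model (not the Doris DeltaWriter): only the part
// relevant to the bug is modeled, i.e. that _mem_table is not cleared in close().
struct MemTable { bool flushed = false; };

struct WriterModel {
    std::shared_ptr<MemTable> _mem_table = std::make_shared<MemTable>();

    void _flush_current_memtable() {
        if (!_mem_table) return;            // nothing buffered, nothing to flush
        if (_mem_table->flushed) {
            std::cout << "BUG: the same memtable is flushed a second time\n";
        }
        _mem_table->flushed = true;
    }

    void close() {
        _flush_current_memtable();
        // Missing step in the buggy code path: _mem_table.reset();
        // With the reset, a later flush_memtable_and_wait() would find no
        // memtable and return instead of re-flushing the old one.
    }

    void flush_memtable_and_wait() { _flush_current_memtable(); }
};

int main() {
    WriterModel w;
    w.close();                    // T1: the last memtable is flushed during close
    w.flush_memtable_and_wait();  // T2: reduce_mem_usage() flushes it again -> BUG
    return 0;
}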
Why does flushing the memtable twice cause the segmentation fault in roaring_bitmap_run_optimize?

Because after the first flush, slice->data of the bitmap value points to the serialized bitmap bytes rather than to a RoaringBitmap object, so the cast to RoaringBitmap* in the second flush produces a bogus pointer. The finalize code in question:
// The RoaringBitmap object memory will be released by ObjectPool
static void finalize(RowCursorCell* src, MemPool* mem_pool) {
    // On the first call, slice->data points at a RoaringBitmap object.
    auto *slice = reinterpret_cast<Slice*>(src->mutable_cell_ptr());
    auto *bitmap = reinterpret_cast<RoaringBitmap*>(slice->data);
    slice->size = bitmap->size();
    // After the next two lines, slice->data points at the serialized bytes,
    // so a second call casts serialized bytes to RoaringBitmap* and crashes.
    slice->data = (char*)mem_pool->allocate(slice->size);
    bitmap->serialize(slice->data);
}
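The following is a hypothetical, simplified re-enactment of that type confusion. FakeBitmap, finalize_demo, and the tag check are invented for the demo; the real finalize performs the cast unconditionally, which is why the second flush crashes inside roaring_bitmap_run_optimize instead of being caught.

#include <cstdint>
#include <cstring>
#include <iostream>

// FakeBitmap stands in for RoaringBitmap and Slice mirrors the {data, size}
// pair from the snippet above; both are simplified for this demo.
struct Slice { char* data; size_t size; };

struct FakeBitmap {
    uint32_t magic = 0xB17B17u;  // lets the demo detect a bogus cast
    uint32_t cardinality = 3;
    size_t size() const { return sizeof(cardinality); }
    void serialize(char* dst) const {
        std::memcpy(dst, &cardinality, sizeof(cardinality));
    }
};

// Same shape as the finalize above, plus a tag check that the real code lacks.
void finalize_demo(Slice* slice, char* out_buf) {
    uint32_t tag = 0;
    std::memcpy(&tag, slice->data, sizeof(tag));
    if (tag != 0xB17B17u) {
        std::cout << "slice->data now holds serialized bytes, not a bitmap object;\n"
                     "the real code casts it to RoaringBitmap* anyway and crashes\n"
                     "inside roaring_bitmap_run_optimize\n";
        return;
    }
    auto* bitmap = reinterpret_cast<FakeBitmap*>(slice->data);
    slice->size = bitmap->size();
    slice->data = out_buf;        // from here on, data points at serialized bytes
    bitmap->serialize(slice->data);
}

int main() {
    FakeBitmap bm;
    char buf[8];
    Slice slice{reinterpret_cast<char*>(&bm), sizeof(bm)};
    finalize_demo(&slice, buf);  // first flush: slice->data really is a bitmap
    finalize_demo(&slice, buf);  // second flush: the cast would be bogus
    return 0;
}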