apache / doris

Apache Doris is an easy-to-use, high performance and unified analytics database.
https://doris.apache.org
Apache License 2.0

[Bug] BE Nodes crash if VARIANT in key #36830

Open datenhahn opened 4 months ago

datenhahn commented 4 months ago

Version

2.1.3

What's Wrong?

If data is inserted into a table that has a VARIANT column as a DUPLICATE KEY column, the BE node crashes.

If no key is given in a CREATE TABLE statement, Apache Doris automatically just takes the first two columns as the key. In our case the second column was of data type VARIANT, which resulted in the BE nodes crashing as soon as we wanted to start our routine loads.

*** SIGABRT unknown detail explain (@0x2d3) received by PID 723 (TID 3297 OR 0x7fc79a149640) from PID 723; stack trace: ***
    @ 0x55d313b7ad56 google::LogMessage::SendToLog()
    @ 0x55fd9f2e9adc doris::MemTable::_sort_one_column()
    @ 0x55d313b777a0 google::LogMessage::Flush()
    @ 0x55fd9f2e965d doris::MemTable::_sort()
    @ 0x55d313b7b599 google::LogMessageFatal::~LogMessageFatal()
    @ 0x55fd9f2ea7d0 doris::MemTable::to_block()
    @ 0x55fd9f2f479c doris::FlushToken::_do_flush_memtable()
    @ 0x55d30cf8a221 doris::vectorized::ColumnObject::compare_at()
    @ 0x55d30925b083 pdqsort_detail::sort3<>()
    @ 0x55d30925a773 pdqsort_detail::pdqsort_loop<>()
    @ 0x55fd9f2f4cbf doris::FlushToken::_flush_memtable()
F20240625 12:01:09.655988 3298 column_object.h:458] should not call the method in column object
*** Check failure stack trace: ***
    @ 0x55d309256adc doris::MemTable::_sort_one_column()
    @ 0x55fd9f2f821d doris::MemtableFlushTask::run()
    @ 0x55d30925665d doris::MemTable::_sort()
F20240625 12:01:09.674875 12997 column_object.h:458] should not call the method in column object
*** Check failure stack trace: ***
    @ 0x55fda0170788 doris::ThreadPool::dispatch_thread()
    @ 0x55d3092577d0 doris::MemTable::to_block()
    @ 0x55d313b7ad56 google::LogMessage::SendToLog()
    @ 0x55fda0165b41 doris::Thread::supervise_thread()
    @ 0x7f99a2d9bac3 (unknown)
    @ 0x55d30926179c doris::FlushToken::_do_flush_memtable()
    @ 0x55d313b777a0 google::LogMessage::Flush()
    @ 0x7f99a2e2d850 (unknown)
    @ (nil) (unknown)
*** Query id: 49363926c8e94859-996892754b204864 ***

What You Expected?

How to Reproduce?

CREATE TABLE reproduce_bug
(
    some_value BOOLEAN,
    value1 VARIANT,
    value2 VARIANT,
    value3 VARIANT,
    value4 VARIANT,
    value5 VARIANT,
    value6 VARIANT,
    value7 VARIANT
)
DISTRIBUTED BY HASH(some_value) BUCKETS AUTO;
 show CREATE TABLE reproduce_bug;
| reproduce_bug | CREATE TABLE `reproduce_bug` (
  `some_value` BOOLEAN NULL,
  `value1` VARIANT NULL,
  `value2` VARIANT NULL,
  `value3` VARIANT NULL,
  `value4` VARIANT NULL,
  `value5` VARIANT NULL,
  `value6` VARIANT NULL,
  `value7` VARIANT NULL
) ENGINE=OLAP
DUPLICATE KEY(`some_value`, `value1`)
COMMENT 'OLAP'
DISTRIBUTED BY HASH(`some_value`) BUCKETS AUTO
PROPERTIES (
"replication_allocation" = "tag.location.default: 3",
"min_load_replica_num" = "-1",
"is_being_synced" = "false",
"storage_medium" = "hdd",
"storage_format" = "V2",
"inverted_index_storage_format" = "V1",
"light_schema_change" = "true",
"disable_auto_compaction" = "false",
"enable_single_replica_compaction" = "false",
"group_commit_interval_ms" = "10000",
"group_commit_data_bytes" = "134217728"
); |

1 row in set (0,01 sec)
INSERT INTO reproduce_bug
VALUES
(
    true,                                   -- some_value (BOOLEAN)
    JSON_OBJECT("key", "int_value", "value", 123),                     -- value1 (VARIANT, JSON object)
    JSON_OBJECT("key", "string_value", "value", "example string"),     -- value2 (VARIANT, JSON object)
    JSON_OBJECT("key", "object_value", "value", JSON_OBJECT("inner_key", "inner_value")),  -- value3 (VARIANT, JSON object)
    JSON_OBJECT("key", "array_value", "value", JSON_ARRAY("item1", "item2", "item3")),     -- value4 (VARIANT, JSON object)
    JSON_OBJECT("key", "float_value", "value", 45.67),                 -- value5 (VARIANT, JSON object)
    JSON_OBJECT("key", "json_string", "value", "{\"nestedKey\": \"nestedValue\"}"),       -- value6 (VARIANT, JSON object)
    JSON_OBJECT("key", "null_value", "value", NULL)                    -- value7 (VARIANT, JSON object)
);
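
For comparison, the same kind of schema can be written with the key declared explicitly, so that no VARIANT column ends up in it. This is only a sketch: the table name reproduce_bug_explicit_key is made up for illustration, the column list is shortened, and the expectation that this avoids the crash on 2.1.3 is an assumption based on the observation above that the implicit key included value1. It was not verified as part of this report.

```sql
-- Hypothetical variant of the reproduction table (name is illustrative).
-- The key is restricted to the BOOLEAN column, so no VARIANT column
-- is part of DUPLICATE KEY.
CREATE TABLE reproduce_bug_explicit_key
(
    some_value BOOLEAN,
    value1 VARIANT,
    value2 VARIANT
)
DUPLICATE KEY(some_value)
DISTRIBUTED BY HASH(some_value) BUCKETS AUTO;
```

Key columns have to come first in the column list, which is why some_value stays in the first position here.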

Anything Else?

No response

datenhahn commented 4 months ago

Compactions/WAL seem to replay the crashing insert, and we can't get our backends up and running anymore.

*** Aborted at 1719323232 (unix time) try "date -d @1719323232" if you are using GNU date ***
*** Current BE git commitID: 2dc65ce356 ***
*** SIGABRT unknown detail explain (@0x2d4) received by PID 724 (TID 1850 OR 0x7f29446be640) from PID 724; stack trace: ***
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_release/doris/be/src/common/signal_handler.h:421
 1# 0x00007F2C3AD48520 in /lib/x86_64-linux-gnu/libc.so.6
 2# pthread_kill in /lib/x86_64-linux-gnu/libc.so.6
 3# raise in /lib/x86_64-linux-gnu/libc.so.6
 4# abort in /lib/x86_64-linux-gnu/libc.so.6
 5# 0x00005592A79A356D in /opt/apache-doris/be/lib/doris_be
 6# 0x00005592A7995C6A in /opt/apache-doris/be/lib/doris_be
 7# google::LogMessage::SendToLog() in /opt/apache-doris/be/lib/doris_be
 8# google::LogMessage::Flush() in /opt/apache-doris/be/lib/doris_be
 9# google::LogMessageFatal::~LogMessageFatal() in /opt/apache-doris/be/lib/doris_be
10# doris::vectorized::ColumnObject::compare_at(unsigned long, unsigned long, doris::vectorized::IColumn const&, int) const in /opt/apache-doris/be/lib/doris_be
11# doris::vectorized::VerticalMergeIteratorContext::compare(doris::vectorized::VerticalMergeIteratorContext const&) const at /home/zcp/repo_center/doris_release/doris/be/src/vec/olap/vertical_merge_iterator.cpp:264
12# std::priority_queue<doris::vectorized::VerticalMergeIteratorContext*, std::vector<doris::vectorized::VerticalMergeIteratorContext*, std::allocator<doris::vectorized::VerticalMergeIteratorContext*> >, doris::vectorized::VerticalHeapMergeIterator::VerticalMergeContextComparator>::push(doris::vectorized::VerticalMergeIteratorContext*&&) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_queue.h:651
13# doris::vectorized::VerticalHeapMergeIterator::init(doris::StorageReadOptions const&) in /opt/apache-doris/be/lib/doris_be
14# doris::vectorized::VerticalBlockReader::_init_collect_iter(doris::TabletReader::ReaderParams const&) at /home/zcp/repo_center/doris_release/doris/be/src/vec/olap/vertical_block_reader.cpp:159
15# doris::vectorized::VerticalBlockReader::init(doris::TabletReader::ReaderParams const&) at /home/zcp/repo_center/doris_release/doris/be/src/vec/olap/vertical_block_reader.cpp:211
16# doris::Merger::vertical_compact_one_group(std::shared_ptr<doris::Tablet>, doris::ReaderType, std::shared_ptr<doris::TabletSchema>, bool, std::vector<unsigned int, std::allocator<unsigned int> > const&, doris::vectorized::RowSourcesBuffer*, std::vector<std::shared_ptr<doris::RowsetReader>, std::allocator<std::shared_ptr<doris::RowsetReader> > > const&, doris::RowsetWriter*, long, doris::Merger::Statistics*, std::vector<unsigned int, std::allocator<unsigned int> >) in /opt/apache-doris/be/lib/doris_be
17# doris::Merger::vertical_merge_rowsets(std::shared_ptr<doris::Tablet>, doris::ReaderType, std::shared_ptr<doris::TabletSchema>, std::vector<std::shared_ptr<doris::RowsetReader>, std::allocator<std::shared_ptr<doris::RowsetReader> > > const&, doris::RowsetWriter*, long, doris::Merger::Statistics*) at /home/zcp/repo_center/doris_release/doris/be/src/olap/merger.cpp:383
18# doris::Compaction::do_compaction_impl(long) at /home/zcp/repo_center/doris_release/doris/be/src/olap/compaction.cpp:371
19# doris::Compaction::do_compaction(long) at /home/zcp/repo_center/doris_release/doris/be/src/olap/compaction.cpp:136
20# doris::CumulativeCompaction::execute_compact_impl() at /home/zcp/repo_center/doris_release/doris/be/src/olap/cumulative_compaction.cpp:79
21# doris::Compaction::execute_compact() at /home/zcp/repo_center/doris_release/doris/be/src/olap/compaction.cpp:118
22# doris::Tablet::execute_compaction(doris::Compaction&) at /home/zcp/repo_center/doris_release/doris/be/src/olap/tablet.cpp:1947
23# std::_Function_handler<void (), doris::StorageEngine::_submit_compaction_task(std::shared_ptr<doris::Tablet>, doris::CompactionType, bool)::$_1>::_M_invoke(std::_Any_data const&) at /var/local/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_function.h:291
24# doris::ThreadPool::dispatch_thread() in /opt/apache-doris/be/lib/doris_be
25# doris::Thread::supervise_thread(void*) at /home/zcp/repo_center/doris_release/doris/be/src/util/thread.cpp:499
26# 0x00007F2C3AD9AAC3 in /lib/x86_64-linux-gnu/libc.so.6
27# 0x00007F2C3AE2C850 in /lib/x86_64-linux-gnu/libc.so.6

/opt/apache-doris/be/bin/start_be.sh: line 367:   724 Aborted                 (core dumped) ${LIMIT:+${LIMIT}} "${DORIS_HOME}/lib/doris_be" "$@" 2>&1 < /dev/null

datenhahn commented 4 months ago

I have more details:

A workaround to access the data is to disable automatic compaction by adding disable_auto_compaction = true to the BE config. This is of course no long-term solution, but it prevents any unpredictable read attempts on the poisoned tablet. It allows you to dump your data and recreate the cluster from scratch, or to somehow permanently get rid of the poisoned table and tablet (we have not yet figured out how to clear the trash so that the tablet really is gone).

eldenmoon commented 4 months ago

VARIANT is not supported as a key column, so we forbid it in PR https://github.com/apache/doris/pull/36555

eldenmoon commented 4 months ago

The ordering of VARIANT values is undefined, so we forbid it temporarily.

datenhahn commented 4 months ago

Ok, thanks :) Just a word of warning: I think it is already forbidden currently, but the unfortunate combination of a table schema with a BOOLEAN as the first column and no key given at all (in which case a duplicate key is determined automatically) enabled this schema to sneak through the validation.

Please make sure that this special case (no key given, and first field not usable as key) is also covered by the check. I have not worked my way into the codebase enough to determine whether your fix covers it or not.

Once the fix has made it into a SelectDB release we will retest it anyway and I will keep you posted, but maybe you can recheck whether your fix covers the special case as well.

CREATE TABLE reproduce_bug
(
    some_value BOOLEAN,
    value1 VARIANT,
    value2 VARIANT,
    value3 VARIANT,
    value4 VARIANT,
    value5 VARIANT,
    value6 VARIANT,
    value7 VARIANT
)
DISTRIBUTED BY HASH(some_value) BUCKETS AUTO;
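
To see which key columns such a statement ends up with on a given build, the schema can be inspected right after creation. This is a sketch that assumes the statement above is accepted at all. A fixed build may instead reject it or pick a different key, and the thread does not say which.

```sql
-- Shows the generated DDL, including the DUPLICATE KEY(...) clause
-- that was chosen when no key was given.
SHOW CREATE TABLE reproduce_bug;

-- Lists each column together with a flag telling whether it is a key column.
DESC reproduce_bug;
```

If a VARIANT column still shows up as a key column, the special case is not covered by the check.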