apache / doris

Apache Doris is an easy-to-use, high performance and unified analytics database.
https://doris.apache.org
Apache License 2.0

BE crashed when run TPC-DS test #3504

Open acelyc111 opened 4 years ago

acelyc111 commented 4 years ago

Describe the bug: BE crashed while running the TPC-DS test.

To Reproduce Steps to reproduce the behavior:

  1. load TPC-DS data
  2. Run the following SQL:
     SELECT asceding.rnk, i1.i_product_name best_performing, i2.i_product_name worst_performing
     FROM (SELECT *
           FROM (SELECT item_sk, rank() OVER (ORDER BY rank_col ASC) rnk
                 FROM (SELECT ss_item_sk item_sk, avg(ss_net_profit) rank_col
                       FROM store_sales ss1
                       WHERE ss_store_sk = 4
                       GROUP BY ss_item_sk
                       HAVING avg(ss_net_profit) > 0.9 * (SELECT avg(ss_net_profit) rank_col
                                                          FROM store_sales
                                                          WHERE ss_store_sk = 4 AND ss_addr_sk IS NULL
                                                          GROUP BY ss_store_sk)) V1) V11
           WHERE rnk < 11) asceding,
          (SELECT *
           FROM (SELECT item_sk, rank() OVER (ORDER BY rank_col DESC) rnk
                 FROM (SELECT ss_item_sk item_sk, avg(ss_net_profit) rank_col
                       FROM store_sales ss1
                       WHERE ss_store_sk = 4
                       GROUP BY ss_item_sk
                       HAVING avg(ss_net_profit) > 0.9 * (SELECT avg(ss_net_profit) rank_col
                                                          FROM store_sales
                                                          WHERE ss_store_sk = 4 AND ss_addr_sk IS NULL
                                                          GROUP BY ss_store_sk)) V2) V21
           WHERE rnk < 11) descending,
          item i1, item i2
     WHERE asceding.rnk = descending.rnk
       AND i1.i_item_sk = asceding.item_sk
       AND i2.i_item_sk = descending.item_sk
     ORDER BY asceding.rnk
     LIMIT 100;
  3. At least one BE will crash.

Expected behavior: The SQL finishes and returns its result normally, with no BE crash.

Screenshots backtrace:

(gdb) bt
#0  0x00007f8c0b99dc56 in __memcpy_ssse3_back () from /lib64/libc.so.6
#1  0x000000000102055b in doris::Tuple::deep_copy (this=<optimized out>, desc=..., data=data@entry=0x7f8bb6b54280, offset=offset@entry=0x7f8bb6b5427c, convert_ptrs=convert_ptrs@entry=true) at /root/doris/doris-xiaomi-0.12/be/src/runtime/tuple.cpp:135
#2  0x0000000001014365 in doris::RowBatch::serialize (this=this@entry=0xe603f8c0, output_batch=output_batch@entry=0xdd72b578) at /root/doris/doris-xiaomi-0.12/be/src/runtime/row_batch.cpp:402
#3  0x00000000015e3d4b in doris::DataStreamSender::serialize_batch<doris::PRowBatch> (this=this@entry=0xdd72b520, src=src@entry=0xe603f8c0, dest=0xdd72b578, num_receivers=14) at /root/doris/doris-xiaomi-0.12/be/src/runtime/data_stream_sender.cpp:630
#4  0x00000000015e2489 in doris::DataStreamSender::send (this=0xdd72b520, state=0xdeff9800, batch=0xe603f8c0) at /root/doris/doris-xiaomi-0.12/be/src/runtime/data_stream_sender.cpp:450
#5  0x000000000100b944 in doris::PlanFragmentExecutor::open_internal (this=this@entry=0xdfc4eb70) at /root/doris/doris-xiaomi-0.12/be/src/runtime/plan_fragment_executor.cpp:304
#6  0x000000000100bd9c in doris::PlanFragmentExecutor::open (this=this@entry=0xdfc4eb70) at /root/doris/doris-xiaomi-0.12/be/src/runtime/plan_fragment_executor.cpp:254
#7  0x0000000000fa4a97 in doris::FragmentExecState::execute (this=0xdfc4eb00) at /root/doris/doris-xiaomi-0.12/be/src/runtime/fragment_mgr.cpp:211
#8  0x0000000000fa62d6 in doris::FragmentMgr::exec_actual(std::shared_ptr<doris::FragmentExecState>, std::function<void (doris::PlanFragmentExecutor*)>) (this=0xb28f3c00, exec_state=..., cb=...) at /root/doris/doris-xiaomi-0.12/be/src/runtime/fragment_mgr.cpp:390
#9  0x0000000000face14 in operator() (a2=<error reading variable: access outside bounds of object referenced via synthetic pointer>, a1=..., p=<optimized out>, this=<optimized out>) at /var/local/thirdparty/installed/include/boost/bind/mem_fn_template.hpp:280
#10 operator()<boost::_mfi::mf2<void, doris::FragmentMgr, std::shared_ptr<doris::FragmentExecState>, std::function<void(doris::PlanFragmentExecutor*)> >, boost::_bi::list0> (a=<synthetic pointer>, f=..., this=<optimized out>)
    at /var/local/thirdparty/installed/include/boost/bind/bind.hpp:398
#11 operator() (this=<optimized out>) at /var/local/thirdparty/installed/include/boost/bind/bind.hpp:1294
#12 boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, boost::_mfi::mf2<void, doris::FragmentMgr, std::shared_ptr<doris::FragmentExecState>, std::function<void (doris::PlanFragmentExecutor*)> >, boost::_bi::list3<boost::_bi::value<doris::FragmentMgr*>, boost::_bi::value<std::shared_ptr<doris::FragmentExecState> >, boost::_bi::value<std::function<void (doris::PlanFragmentExecutor*)> > > >, void>::invoke(boost::detail::function::function_buffer&) (function_obj_ptr=...)
    at /var/local/thirdparty/installed/include/boost/function/function_template.hpp:159
#13 0x0000000000f79d45 in operator() (this=0x7f8bb6b547f8) at /var/local/thirdparty/installed/include/boost/function/function_template.hpp:759
#14 doris::PriorityThreadPool::work_thread (this=0xb28f3c80, thread_id=<optimized out>) at /root/doris/doris-xiaomi-0.12/be/src/util/priority_thread_pool.hpp:138
#15 0x0000000001a176cd in thread_proxy ()
#16 0x00007f8c0b63edc5 in start_thread () from /lib64/libpthread.so.0
#17 0x00007f8c0b94a73d in clone () from /lib64/libc.so.6


acelyc111 commented 4 years ago
(gdb) f 1
#1  0x000000000101f52b in doris::Tuple::deep_copy (this=<optimized out>, desc=..., data=data@entry=0x7efe86d80280, offset=offset@entry=0x7efe86d8027c, convert_ptrs=convert_ptrs@entry=true) at /home/laiyingchun/ap_doris/be/src/runtime/tuple.cpp:135
135        /home/laiyingchun/ap_doris/be/src/runtime/tuple.cpp: No such file or directory.
(gdb) p *string_v
$2 = {
  static MAX_LENGTH = 1073741824,
  ptr = 0xffffff40e12409c4 <Address 0xffffff40e12409c4 out of bounds>,
  len = 18446744073709551615
}
vagetablechicken commented 4 years ago

The log of the plan fragment that crashed:

plan_root=
 conjuncts=[] id=4 type=ASSERT_NUM_ROWS_NODE tuple_ids=[5, ]
  ExchangeNode(#senders=4 conjuncts=[] id=28 type=EXCHANGE_NODE tuple_ids=[6, ])

Then I printed the logs below, at this point in the code: https://github.com/apache/incubator-doris/blob/eefad13107ac74a406212e0f0f57181973ac9c1e/be/src/exec/exec_node.cpp#L340

Log when creating the ASSERT_NUM_ROWS_NODE:

TPlanNode {
  01: node_id (i32) = 4,
  02: node_type (i32) = 23,
  03: num_children (i32) = 1,
  04: limit (i64) = -1,
  05: row_tuples (list) = list<i32>[1] {
    [0] = 5,
  },
  06: nullable_tuples (list) = list<bool>[1] {
    [0] = false,
  },
  08: compact_data (bool) = false,
  32: assert_num_rows_node (struct) = TAssertNumRowsNode {
    01: desired_num_rows (i64) = 1,
    02: subquery_string (string) = "SELECT avg(`ss_net_profit`) AS `rank_col` FROM `default_cluster:tpcds`.`store_sales` WHERE (`ss_store_sk` = 4) AND (`ss_addr_sk` IS NULL) GROUP BY `ss_store_sk`",
  },
}

Log when creating the child ExchangeNode:

TPlanNode {
  01: node_id (i32) = 28,
  02: node_type (i32) = 9,
  03: num_children (i32) = 0,
  04: limit (i64) = -1,
  05: row_tuples (list) = list<i32>[1] {
    [0] = 6,
  },
  06: nullable_tuples (list) = list<bool>[1] {
    [0] = false,
  },
  08: compact_data (bool) = false,
  15: exchange_node (struct) = TExchangeNode {
    01: input_row_tuples (list) = list<i32>[1] {
      [0] = 6,
    },
  },
}

The AssertNumRowsNode gets its batch from its child, whose row_tuples id is 6. The fragment also has a sink node, which uses the same row_tuples id as the root plan node (AssertNumRowsNode), so it is 5.

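Thus, the sink node uses Tuple(id=5 size=24 slots=[Slot(id=12 type=INT col=-1 offset=4 null=(offset=0 mask=80)), Slot(id=13 type=VARCHAR col=-1 offset=8 null=(offset=0 mask=40))] has_varlen_slots=1) to parse data that was created with Tuple(id=6 size=24 slots=[Slot(id=14 type=INT col=-1 offset=4 null=(offset=0 mask=80)), Slot(id=15 type=DECIMALV2(9, 0) col=-1 offset=8 null=(offset=0 mask=40))] has_varlen_slots=0). The bytes at offset 8 are therefore DECIMALV2 data that gets read as a VARCHAR pointer and length, which would presumably explain the out-of-bounds ptr and the huge len that gdb shows for string_v in frame 1.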
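
To make the layout mismatch concrete, here is a minimal Python sketch (not Doris code). It writes a 24-byte row using tuple 6's slot offsets and reads it back using tuple 5's. The helper names `write_tuple6` and `read_as_tuple5` are hypothetical. The sketch assumes a little-endian x86-64 host, that a DECIMALV2 value is stored as a signed 128-bit integer scaled by 10^9, and that a VARCHAR slot holds an 8-byte pointer followed by an 8-byte length. The last assumption matches the 64-bit len printed by gdb.

```python
import struct

# Slot offsets copied from the two tuple descriptors in the log above.
TUPLE_SIZE = 24
INT_OFFSET = 4       # Slot 12 (tuple 5) and Slot 14 (tuple 6)
PAYLOAD_OFFSET = 8   # Slot 13 VARCHAR (tuple 5) and Slot 15 DECIMALV2 (tuple 6)

# Assumption: DECIMALV2 is a signed 128-bit integer scaled by 10^9.
DECIMAL_SCALE = 10**9


def write_tuple6(int_value, unscaled_decimal):
    """Hypothetical helper: lay out a row the way tuple 6 describes it."""
    buf = bytearray(TUPLE_SIZE)
    struct.pack_into("<i", buf, INT_OFFSET, int_value)
    buf[PAYLOAD_OFFSET:PAYLOAD_OFFSET + 16] = unscaled_decimal.to_bytes(
        16, "little", signed=True)
    return bytes(buf)


def read_as_tuple5(buf):
    """Hypothetical helper: read the same bytes as INT plus VARCHAR (ptr, len)."""
    (int_value,) = struct.unpack_from("<i", buf, INT_OFFSET)
    ptr, length = struct.unpack_from("<QQ", buf, PAYLOAD_OFFSET)
    return int_value, ptr, length


# ptr and len exactly as gdb printed them for *string_v in frame 1.
GDB_PTR = 0xFFFFFF40E12409C4
GDB_LEN = 18446744073709551615

# Reassemble those 16 bytes and reinterpret them as one signed 128-bit integer.
raw = struct.pack("<QQ", GDB_PTR, GDB_LEN)
unscaled = int.from_bytes(raw, "little", signed=True)

# Round trip: a decimal written via tuple 6 comes back as the bogus ptr/len.
row = write_tuple6(0, unscaled)
_, ptr, length = read_as_tuple5(row)

print("decimal (unscaled):", unscaled)
print("decimal (scaled, under the 10^9 assumption):", unscaled / DECIMAL_SCALE)
print("read back as VARCHAR: ptr=%#x len=%d" % (ptr, length))
```

Under these assumptions, the ptr and len from gdb decode to the unscaled integer -820856485436, about -820.86 as a decimal. That is a plausible negative avg(ss_net_profit), and it is consistent with the sink reading a DECIMALV2 slot as a string before memcpy is called with that pointer and length.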