facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0

TPCDS SF10K Query 6 failed #11632

Open minhancao opened 5 days ago

minhancao commented 5 days ago

Bug description

Ran TPCDS SF10K with 8 Velox workers (128 GB memory each), with AsyncDataCache enabled and CoW disabled. Query 6 failed on the check at https://github.com/facebookincubator/velox/blob/059337fca8170c2b361ea9d89d6c2cdd9e157c4a/velox/exec/HashProbe.cpp#L1142

Worker 3:

E20241122 21:26:29.133782   790 Exceptions.h:66] Line: /prestissimo/velox/velox/exec/HashProbe.cpp:1085, Function:getOutputInternal, Expression: numOut <= outputBatchSize (52 vs. 3), Source: RUNTIME, ErrorCode: INVALID_STATE

Worker 6:

E20241122 21:26:29.129307   195 Exceptions.h:66] Line: /prestissimo/velox/velox/exec/HashProbe.cpp:1085, Function:getOutputInternal, Expression: numOut <= outputBatchSize (69 vs. 3), Source: RUNTIME, ErrorCode: INVALID_STATE
E20241122 21:26:29.129565   782 Exceptions.h:66] Line: /prestissimo/velox/velox/exec/HashProbe.cpp:1085, Function:getOutputInternal, Expression: numOut <= outputBatchSize (13 vs. 3), Source: RUNTIME, ErrorCode: INVALID_STATE

Worker 7:

E20241122 21:26:29.128383   769 Exceptions.h:66] Line: /prestissimo/velox/velox/exec/HashProbe.cpp:1085, Function:getOutputInternal, Expression: numOut <= outputBatchSize (117 vs. 3), Source: RUNTIME, ErrorCode: INVALID_STATE

Worker 8:

E20241122 21:26:29.135933   772 Exceptions.h:66] Line: /prestissimo/velox/velox/exec/HashProbe.cpp:1085, Function:getOutputInternal, Expression: numOut <= outputBatchSize (64 vs. 3), Source: RUNTIME, ErrorCode: INVALID_STATE
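
For triage, here is a minimal sketch (not part of Velox or Prestissimo) that pulls the failed-check values out of log lines like the ones above, so the `numOut` and `outputBatchSize` pairs can be compared across workers. The regex, the function name `parse_check_failures`, and the `SAMPLE` text are assumptions based only on the line format shown in this report.

```python
import re

# Assumed format, taken from the log lines in this report:
#   Line: <file>:<line>, Function:<name>, Expression: <expr> (<lhs> vs. <rhs>)
CHECK_RE = re.compile(
    r"Line: (?P<file>\S+):(?P<line>\d+), "
    r"Function:(?P<func>\w+), "
    r"Expression: (?P<expr>.+?) \((?P<lhs>\d+) vs\. (?P<rhs>\d+)\)"
)


def parse_check_failures(log_text):
    """Return one dict per failed-check entry found in log_text."""
    failures = []
    for m in CHECK_RE.finditer(log_text):
        failures.append(
            {
                "file": m.group("file"),
                "line": int(m.group("line")),
                "func": m.group("func"),
                "expr": m.group("expr"),
                "lhs": int(m.group("lhs")),  # numOut in this report
                "rhs": int(m.group("rhs")),  # outputBatchSize in this report
            }
        )
    return failures


# Two lines copied from the worker logs above.
SAMPLE = (
    "E20241122 21:26:29.133782   790 Exceptions.h:66] Line: "
    "/prestissimo/velox/velox/exec/HashProbe.cpp:1085, "
    "Function:getOutputInternal, Expression: numOut <= outputBatchSize "
    "(52 vs. 3), Source: RUNTIME, ErrorCode: INVALID_STATE\n"
    "E20241122 21:26:29.129307   195 Exceptions.h:66] Line: "
    "/prestissimo/velox/velox/exec/HashProbe.cpp:1085, "
    "Function:getOutputInternal, Expression: numOut <= outputBatchSize "
    "(69 vs. 3), Source: RUNTIME, ErrorCode: INVALID_STATE\n"
)

if __name__ == "__main__":
    for f in parse_check_failures(SAMPLE):
        print(f["func"], f["line"], f["expr"], f["lhs"], f["rhs"])
```

In the lines posted here, the right-hand value is 3 on every worker while the left-hand value varies (13 to 117).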

Worker 3's stack trace:

VeloxRuntimeError: numOut <= outputBatchSize (52 vs. 3) Operator: HashProbe[1922] 1
    at Unknown.# 0  _ZN8facebook5velox7process10StackTraceC1Ei(Unknown Source)
    at Unknown.# 1  _ZN8facebook5velox14VeloxExceptionC2EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_(Unknown Source)
    at Unknown.# 2  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEvRKNS1_18VeloxCheckFailArgsET0_(Unknown Source)
    at Unknown.# 3  _ZN8facebook5velox4exec9HashProbe17getOutputInternalEb(Unknown Source)
    at Unknown.# 4  _ZN8facebook5velox4exec9HashProbe9getOutputEv(Unknown Source)
    at Unknown.# 5  _ZZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEEENKUlvE3_clEv(Unknown Source)
    at Unknown.# 6  _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE(Unknown Source)
    at Unknown.# 7  _ZN8facebook5velox4exec6Driver3runESt10shared_ptrIS2_E(Unknown Source)
    at Unknown.# 8  _ZN5folly6detail8function5call_IZN8facebook5velox4exec6Driver7enqueueESt10shared_ptrIS6_EEUlvE_Lb1ELb0EvJEEET2_DpT3_RNS1_4DataE(Unknown Source)
    at Unknown.# 9  _ZN5folly6detail8function14FunctionTraitsIFvvEEclEv(Unknown Source)
    at Unknown.# 10 _ZN5folly18ThreadPoolExecutor7runTaskERKSt10shared_ptrINS0_6ThreadEEONS0_4TaskE(Unknown Source)
    at Unknown.# 11 _ZN5folly21CPUThreadPoolExecutor9threadRunESt10shared_ptrINS_18ThreadPoolExecutor6ThreadEE(Unknown Source)
    at Unknown.# 12 _ZSt13__invoke_implIvRMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEERPS1_JRS4_EET_St21__invoke_memfun_derefOT0_OT1_DpOT2_(Unknown Source)
    at Unknown.# 13 _ZSt8__invokeIRMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEEJRPS1_RS4_EENSt15__invoke_resultIT_JDpT0_EE4typeEOSC_DpOSD_(Unknown Source)
    at Unknown.# 14 _ZNSt5_BindIFMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEEPS1_S4_EE6__callIvJEJLm0ELm1EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE(Unknown Source)
    at Unknown.# 15 _ZNSt5_BindIFMN5folly18ThreadPoolExecutorEFvSt10shared_ptrINS1_6ThreadEEEPS1_S4_EEclIJEvEET0_DpOT_(Unknown Source)
    at Unknown.# 16 _ZN5folly6detail8function5call_ISt5_BindIFMNS_18ThreadPoolExecutorEFvSt10shared_ptrINS4_6ThreadEEEPS4_S7_EELb1ELb0EvJEEET2_DpT3_RNS1_4DataE(Unknown Source)
    at Unknown.# 17 0x00000000000dbad4(Unknown Source)
    at Unknown.# 18 start_thread(Unknown Source)
    at Unknown.# 19 clone(Unknown Source)
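
The frames above are still mangled. Here is a rough sketch for making them readable: it extracts the symbol from each `at Unknown.# N symbol(Unknown Source)` line and, if the binutils `c++filt` tool is on `PATH`, pipes the symbols through it. The helper names `parse_frames` and `demangle` and the `SAMPLE_TRACE` text are hypothetical and assume the frame format shown in this report.

```python
import re
import shutil
import subprocess

# Assumed frame format, taken from the trace in this report:
#   at Unknown.# <index> <symbol>(Unknown Source)
FRAME_RE = re.compile(r"^\s*at Unknown\.# (\d+)\s+(\S+)\(Unknown Source\)\s*$")


def parse_frames(trace_text):
    """Return a list of (frame_index, symbol) tuples, in trace order."""
    frames = []
    for line in trace_text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            frames.append((int(m.group(1)), m.group(2)))
    return frames


def demangle(symbols):
    """Demangle symbols with c++filt when available; otherwise return them unchanged."""
    symbols = list(symbols)
    tool = shutil.which("c++filt")
    if tool is None or not symbols:
        return symbols
    proc = subprocess.run(
        [tool],
        input="\n".join(symbols) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout.splitlines()


# Three frames copied from the trace above.
SAMPLE_TRACE = """\
    at Unknown.# 3  _ZN8facebook5velox4exec9HashProbe17getOutputInternalEb(Unknown Source)
    at Unknown.# 4  _ZN8facebook5velox4exec9HashProbe9getOutputEv(Unknown Source)
    at Unknown.# 18 start_thread(Unknown Source)
"""

if __name__ == "__main__":
    frames = parse_frames(SAMPLE_TRACE)
    names = demangle(symbol for _, symbol in frames)
    for (index, _), name in zip(frames, names):
        print(index, name)
```

Symbols that are not mangled C++ names, such as `start_thread`, pass through `c++filt` unchanged.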

System information

N/A

Relevant logs

No response

Yuhta commented 1 day ago

Can you check if the issue is still there after #11659? CC: @zhli1142015