facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0
3.53k stars 1.16k forks source link

Join Fuzzer - left join returns incorrect data with spilling #9376

Open czentgr opened 7 months ago

czentgr commented 7 months ago

Description

The SEMI LEFT JOIN result for the row type column returns NULL value when run with spilling.

The job result: https://github.com/facebookincubator/velox/actions/runs/8562532207/job/23466827795?pr=9372

Excerpt:

Expected 285, got 285
3 extra rows, 3 missing rows
3 of extra rows:
    [null,null,null,null,null,null,null] | 844798144934762641 | 5400531624461704367
    [null,null,null,null,null,null,null] | 844798144934762641 | 7363836969548706077
    [null,null,null,null,null,null,null] | 6842666543040666692 | 8416365981880162796

3 of missing rows:
    ["1997-04-10T15:21:04.754893720",null,1511352858,1419745511,null,69,null] | 844798144934762641 | 5400531624461704367
    ["1997-04-10T15:21:04.754893720",null,1511352858,1419745511,null,69,null] | 844798144934762641 | 7363836969548706077
    ["1997-04-10T15:21:04.754893720",null,1511352858,1419745511,null,69,null] | 6842666543040666692 | 8416365981880162796

The iteration seed is not in the log (logtostderr is turned on, presumably LOG(INFO) messages should show up there?) Due to the missing seed it is not clear how to repro the issue.

Error Reproduction

./velox_join_fuzzer_test \ --seed ${RANDOM} \ --duration_sec $DURATION \ --logtostderr=1 \ --minloglevel=1 \

The problem is that the iteration seed was not logged and $RANDOM is not logged either.

Relevant logs

E20240405 00:08:00.807075    61 JoinFuzzer.cpp:408] Executing query plan with GROUPED strategy[5 groups]: 
-- HashJoin[LEFT SEMI (FILTER) t0=u0] -> tp1:ROW<f0:TIMESTAMP,f1:TIMESTAMP,f2:INTEGER,f3:INTEGER,f4:TIMESTAMP,f5:TINYINT,f6:DOUBLE>, t0:INTERVAL DAY TO SECOND, tp2:BIGINT
  -- TableScan[table: hive_table] -> t0:INTERVAL DAY TO SECOND, tp1:ROW<f0:TIMESTAMP,f1:TIMESTAMP,f2:INTEGER,f3:INTEGER,f4:TIMESTAMP,f5:TINYINT,f6:DOUBLE>, tp2:BIGINT, tp3:DATE
  -- TableScan[table: hive_table] -> u0:INTERVAL DAY TO SECOND
E20240405 00:08:00.856297    61 JoinFuzzer.cpp:408] Executing query plan with GROUPED strategy[5 groups] and spilling injection: 
-- HashJoin[LEFT SEMI (FILTER) t0=u0] -> tp1:ROW<f0:TIMESTAMP,f1:TIMESTAMP,f2:INTEGER,f3:INTEGER,f4:TIMESTAMP,f5:TINYINT,f6:DOUBLE>, t0:INTERVAL DAY TO SECOND, tp2:BIGINT
  -- TableScan[table: hive_table] -> t0:INTERVAL DAY TO SECOND, tp1:ROW<f0:TIMESTAMP,f1:TIMESTAMP,f2:INTEGER,f3:INTEGER,f4:TIMESTAMP,f5:TINYINT,f6:DOUBLE>, tp2:BIGINT, tp3:DATE
  -- TableScan[table: hive_table] -> u0:INTERVAL DAY TO SECOND
W20240405 00:08:00.907965 13564 Operator.cpp:645] Can't reclaim from memory pool op.2.1.7.HashBuild which is under non-reclaimable section, memory usage: 0B, reservation: 0B
W20240405 00:08:00.915889 13564 HashBuild.cpp:223] Exceeded spill level limit: 4, and disable spilling for memory pool: op.2.1.4.HashBuild
W20240405 00:08:00.917119 13565 HashProbe.cpp:1807] Exceeded spill level limit: 4, and disable spilling for memory pool: op.2.0.5.HashProbe
/__w/velox/velox/velox/velox/exec/tests/utils/QueryAssertions.cpp:1130: Failure
Failed
Expected 285, got 285
3 extra rows, 3 missing rows
3 of extra rows:
    [null,null,null,null,null,null,null] | 844798144934762641 | 5400531624461704367
    [null,null,null,null,null,null,null] | 844798144934762641 | 7363836969548706077
    [null,null,null,null,null,null,null] | 6842666543040666692 | 8416365981880162796

3 of missing rows:
    ["1997-04-10T15:21:04.754893720",null,1511352858,1419745511,null,69,null] | 844798144934762641 | 5400531624461704367
    ["1997-04-10T15:21:04.754893720",null,1511352858,1419745511,null,69,null] | 844798144934762641 | 7363836969548706077
    ["1997-04-10T15:21:04.754893720",null,1511352858,1419745511,null,69,null] | 6842666543040666692 | 8416365981880162796

Unexpected results
E20240405 00:08:00.962553    61 Exceptions.h:69] Line: /__w/velox/velox/velox/velox/exec/tests/JoinFuzzer.cpp:1020, Function:verify, Expression: assertEqualResults({expected}, {actual}) Logically equivalent plans produced different results, Source: RUNTIME, ErrorCode: INVALID_STATE
terminate called after throwing an instance of 'facebook::velox::VeloxRuntimeError'
  what():  Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Logically equivalent plans produced different results
Retriable: False
Expression: assertEqualResults({expected}, {actual})
Function: verify
File: /__w/velox/velox/velox/velox/exec/tests/JoinFuzzer.cpp
Line: 1020
Stack trace:
# 0  _ZN8facebook5velox7process10StackTraceC1Ei
# 1  _ZN8facebook5velox14VeloxException5State4makeIZNS1_C4EPKcmS5_St17basic_string_viewIcSt11char_traitsIcEES9_S9_S9_bNS1_4TypeES9_EUlRT_E_EESt10shared_ptrIKS2_ESA_SB_
# 2  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 3  _ZN8facebook5velox17VeloxRuntimeErrorC2EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bS7_
# 4  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorEPKcEEvRKNS1_18VeloxCheckFailArgsET0_
# 5  _ZN8facebook5velox4exec4test12_GLOBAL__N_110JoinFuzzer6verifyENS0_4core8JoinTypeE
# 6  _ZN8facebook5velox4exec4test12_GLOBAL__N_110JoinFuzzer2goEv
# 7  _ZN8facebook5velox4exec4test10joinFuzzerEm
# 8  _ZN16JoinFuzzerRunner3runEm
# 9  main
# 10 __libc_start_main
# 11 _start

*** Aborted at 1712275681 (Unix time, try 'date -d @1712275681') ***
*** Signal 6 (SIGABRT) (0x3d) received by PID 61 (pthread TID 0x7fdb6f61b540) (linux TID 61) (maybe from PID 61, UID 0) (code: -6), stack trace: ***
(error retrieving stack trace)
/__w/_temp/6ab225fb-5dab-44cd-b018-b7cdd56158af.sh: line 10:    61 Aborted                 (core dumped) ./velox_join_fuzzer_test --seed ${RANDOM} --duration_sec $DURATION --logtostderr=1 --minloglevel=1
mbasmanova commented 7 months ago

CC: @kgpai @kagamiori @xiaoxmeng

kgpai commented 7 months ago

@czentgr I can look into this if you havent started already.

czentgr commented 7 months ago

@kgpai Yes, I can take a look. Trying to make sense of things. The fact that --minloglevel=1 would exclude INFO messages. In the current code JoinFuzzer.cpp:408 is a LOG(INFO) message which appears in the above snipped but is tagged as an error message. For example, the log contains

E20240405 00:08:00.807075    61 JoinFuzzer.cpp:408] Executing query plan with GROUPED strategy[5 groups]:

while a local run with --minloglevel=1 does not contain the message and with --minloglevel=0 it does

I20240411 15:20:39.405054 8762369 JoinFuzzer.cpp:408] Executing query plan with UNGROUPED strategy[0 groups]:

Note the prefix E or I. For the same reason the iteration and seed messages are INFO messages and don't appear in the log output - which is sort of expected because --minloglevel=1 which excludes INFO messages.

I tried on Mac and Linux. Same output for the latest code. --minloglevel=0 would generated much more log but the important logs for repro-ing issues are missing. Am I overlooking something?

Anyway, I will try to repro it by running for a while and hopefully hitting an iteration that is causing the issue.

czentgr commented 7 months ago

No repro of the issue so far. I let this run on a Linux VM for the weekend without repro. Each iteration gets a new seed and runs for 100s. Changing it to 900s just like in the test.

kgpai commented 7 months ago

If you havent seen it for a bit, then this could have been fixed (inadvertently or otherwise) by some subsequent PR.

czentgr commented 7 months ago

Right, I've run now 90+ iterations (900s each). Could be a race somewhere that is just not visible on the machine I'm running it on. As for the seed messages that are missing from the container run log - I propose to make a PR to make them warnings (instead of info) in the Join Fuzzer to make sure they appear in the log. Does that sound ok?

kgpai commented 7 months ago

@czentgr That would be great , thank you !

czentgr commented 7 months ago

Created PR https://github.com/facebookincubator/velox/pull/9509

czentgr commented 7 months ago

I'm at 300 iterations. No repro. I suppose we should keep this issue open and with the PR, if it occurs again, we can continue the investigation.