facebookincubator / velox

A C++ vectorized database acceleration library aimed at optimizing query engines and data processing systems.
https://velox-lib.io/
Apache License 2.0

Window Variance Fuzzer Failure #4350

Open Yuhta opened 1 year ago

Yuhta commented 1 year ago

Description

https://app.circleci.com/pipelines/github/facebookincubator/velox/21749/workflows/8a11cc90-50b4-4fcc-83be-41b3d6b41824/jobs/137038/steps

Error Reproduction

No response

Relevant logs

I0319 04:05:44.611481  6197 AggregationFuzzer.cpp:532] ==============================> Started iteration 6650 (seed: 3070399165)
I0319 04:05:44.618933  6197 AggregationFuzzer.cpp:624] Executing query plan:
-- Window[partition by [p0] order by [s0 ASC NULLS LAST, s1 ASC NULLS LAST, s2 ASC NULLS LAST, s3 ASC NULLS LAST, s4 ASC NULLS LAST] w0 := variance(ROW["c0"]) RANGE between UNBOUNDED PRECEDING and CURRENT ROW] -> c0:INTEGER, p0:BOOLEAN, s0:VARCHAR, s1:SMALLINT, s2:BIGINT, s3:BOOLEAN, s4:BOOLEAN, row_number:BIGINT, w0:DOUBLE
  -- Values[1000 rows in 10 vectors] -> c0:INTEGER, p0:BOOLEAN, s0:VARCHAR, s1:SMALLINT, s2:BIGINT, s3:BOOLEAN, s4:BOOLEAN, row_number:BIGINT
I0319 04:05:44.632791 408456 Task.cpp:721] All drivers (1) finished for task test_cursor 164443 after running for 13 ms.
I0319 04:05:44.632814 408456 Task.cpp:1373] Terminating task test_cursor 164443 with state Finished after running for 13 ms.
I0319 04:05:44.634290  6197 AggregationFuzzer.cpp:639] [ROW ROW<c0:INTEGER,p0:BOOLEAN,s0:VARCHAR,s1:SMALLINT,s2:BIGINT,s3:BOOLEAN,s4:BOOLEAN,row_number:BIGINT,w0:DOUBLE>: 1000 elements, no nulls]
../../velox/exec/tests/utils/QueryAssertions.cpp:1046: Failure
Failed
Expected 1000, got 1000
1 extra rows, 1 missing rows
1 of extra rows:
        1125385961 | true | "s^B#PDzM(c/npuk,4V<^0>3~`,<@B4HUk}<uQ}.H4Mzc+|k}s9gVDI%~PQ]D[D" | 7829 | 3781926527568724721 | true | true | 528 | 374458249129469700

1 of missing rows:
        1125385961 | true | "s^B#PDzM(c/npuk,4V<^0>3~`,<@B4HUk}<uQ}.H4Mzc+|k}s9gVDI%~PQ]D[D" | 7829 | 3781926527568724721 | true | true | 528 | 377387356836932160

Unexpected results
E0319 04:05:44.938828  6197 Exceptions.h:68] Line: ../../velox/exec/tests/AggregationFuzzer.cpp:960, Function:verifyWindow, Expression: assertEqualResults(expectedResult.value(), {resultOrError.result}) Velox and DuckDB results don't match, Source: RUNTIME, ErrorCode: INVALID_STATE
I0319 04:05:44.940961  6197 AggregationFuzzer.cpp:452] Persisted input: /tmp/aggregate_fuzzer_repro/velox_vector_kDCdXD and plan: /tmp/aggregate_fuzzer_repro/velox_plan_ZJJhQb
terminate called after throwing an instance of 'facebook::velox::VeloxRuntimeError'
  what():  Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Velox and DuckDB results don't match
Retriable: False
Expression: assertEqualResults(expectedResult.value(), {resultOrError.result})
Function: verifyWindow
File: ../../velox/exec/tests/AggregationFuzzer.cpp
Line: 960
Stack trace:
# 0  _ZN8facebook5velox7process10StackTraceC1Ei
# 1  _ZN8facebook5velox14VeloxException5State4makeIZNS1_C4EPKcmS5_St17basic_string_viewIcSt11char_traitsIcEES9_S9_S9_bNS1_4TypeES9_EUlRT_E_EESt10shared_ptrIKS2_ESA_SB_
# 2  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 3  _ZN8facebook5velox17VeloxRuntimeErrorC2EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bS7_
# 4  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorEPKcEEvRKNS1_18VeloxCheckFailArgsET0_
# 5  _ZN8facebook5velox4exec4test12_GLOBAL__N_117AggregationFuzzer12verifyWindowERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISB_EESF_SF_RKS5_ISt10shared_ptrINS0_9RowVectorEESaISI_EEb
# 6  _ZN8facebook5velox4exec4test12_GLOBAL__N_117AggregationFuzzer2goEv
# 7  _ZN8facebook5velox4exec4test15aggregateFuzzerESt13unordered_mapINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt6vectorISt10shared_ptrINS1_26AggregateFunctionSignatureEESaISD_EESt4hashIS9_ESt8equal_toIS9_ESaISt4pairIKS9_SF_EEEmRKS3_IS9_S9_SH_SJ_SaISK_ISL_S9_EEE
# 8  _ZN23AggregationFuzzerRunner3runERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEmRKSt13unordered_setIS5_St4hashIS5_ESt8equal_toIS5_ESaIS5_EERKSt13unordered_mapIS5_S5_SA_SC_SaISt4pairIS6_S5_EEE
# 9  main
# 10 __libc_start_main
# 11 _start

kagamiori commented 1 year ago

Here is another instance, this time with the var_pop window function: https://app.circleci.com/pipelines/github/facebookincubator/velox/22264/workflows/f09ec64c-27f9-4e2a-b0cb-6467fb44c5ca/jobs/140580.

mbasmanova commented 1 year ago

CC: @isadikov

isadikov commented 1 year ago

Could you share how I can reproduce this fuzzer error locally?

kagamiori commented 1 year ago

> Could you share how I can reproduce this fuzzer error locally?

Hi @isadikov, thank you for helping! This error can be reproduced via https://github.com/facebookincubator/velox/pull/4514. Alternatively, you can try velox/exec/tests/velox_aggregation_fuzzer_test --seed 1515939423 on e3bd0d0. Please let me know if you run into any issues with the reproduction.

isadikov commented 1 year ago

I can consistently reproduce it locally. I will debug.

isadikov commented 1 year ago

I have a dataset that reproduces this. It is related to https://github.com/facebookincubator/velox/issues/4532 and is being addressed by https://github.com/facebookincubator/velox/pull/5082.

laithsakka commented 1 year ago

Another one: https://app.circleci.com/pipelines/github/facebookincubator/velox/28625/workflows/7a5b85d1-ecf6-4901-95a4-0d7aecd2eedc/jobs/180379

laithsakka commented 1 year ago

I see that fixing this depends on upgrading DuckDB. I have a couple of questions.

1) Is the DuckDB issue only in window aggregation, or in aggregations in general?
2) Is it related to a specific set of functions that we can skip for now?
3) Would it make sense to split the aggregation fuzzer from the window fuzzer tests, and focus internally only on aggregation as the high-priority signal that should be green?

laithsakka commented 1 year ago

Another one: https://github.com/facebookincubator/velox/issues/5724. I am not sure whether it is related; I would like to understand the DuckDB bug.

mbasmanova commented 1 year ago

@majetideepak Deepak has a good plan for upgrading DuckDB.

laithsakka commented 1 year ago

@majetideepak @aditi-pandit do you know whether the bug in DuckDB affects all window aggregations or just specific functions? I recently hit two window fuzzer issues, in bit_xor and stddev, and wonder whether those also fall under the same issue.

aditi-pandit commented 1 year ago

> I see that fixing this depends on upgrading DuckDB. I have a couple of questions.
>
>   1. Is the DuckDB issue only in window aggregation, or in aggregations in general?
>   2. Is it related to a specific set of functions that we can skip for now?
>   3. Would it make sense to split the aggregation fuzzer from the window fuzzer tests, and focus internally only on aggregation as the high-priority signal that should be green?

@laithsakka: For 1), every aggregate function can be used as a window function. In Velox, the window function that invokes aggregates is very generic and relies on the singleGroup aggregate APIs: https://github.com/facebookincubator/velox/blob/main/velox/exec/AggregateWindow.cpp#L267

On the DuckDB side, the code paths for regular and window aggregation are quite different, since the window implementation uses more advanced structures such as segment trees.

Are the Velox singleGroup code paths not tested along with the other aggregation code paths in the fuzzer? If they are and we have full coverage, then maybe we could disable the window aggregation side.

2) The previous issue was in the avg function, so I didn't want to skip it.

3) About splitting the tests, I think it depends on how you look at 1).

What do you think?
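The frame in the failing plan, RANGE between UNBOUNDED PRECEDING and CURRENT ROW, only ever grows, so a generic window-over-aggregate can in principle feed rows one at a time into a single accumulator. Below is a minimal sketch of that idea, assuming Welford's online algorithm for the sample variance and an ordering with no ties (so RANGE and ROWS frames coincide). `VarianceAccumulator` and `runningVariance` are hypothetical names, not Velox or DuckDB APIs, and this is not a diagnosis of the mismatch reported above.

```cpp
#include <cmath>
#include <vector>

// Hypothetical accumulator (not a Velox type): Welford's online algorithm
// for the sample variance.
struct VarianceAccumulator {
  long count = 0;
  double mean = 0.0;
  double m2 = 0.0; // Sum of squared deviations from the running mean.

  void add(double value) {
    ++count;
    const double delta = value - mean;
    mean += delta / count;
    m2 += delta * (value - mean);
  }

  // NaN stands in for SQL NULL when fewer than two rows have been seen.
  double sampleVariance() const {
    return count < 2 ? std::nan("") : m2 / (count - 1);
  }
};

// Evaluates variance over a frame that runs from the start of the partition
// to the current row. Assumes the partition is already sorted and that the
// ordering keys have no ties.
std::vector<double> runningVariance(const std::vector<double>& sortedPartition) {
  std::vector<double> results;
  results.reserve(sortedPartition.size());
  VarianceAccumulator accumulator;
  for (double value : sortedPartition) {
    accumulator.add(value); // The frame grows by exactly one row.
    results.push_back(accumulator.sampleVariance());
  }
  return results;
}
```

For the partition {2, 4, 4, 6} this yields NULL (NaN here), 2, 4/3 and 8/3. An engine that evaluates frames through a different structure, such as a segment tree, combines partial states in a different order, so small floating-point differences between two correct implementations are possible.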

laithsakka commented 1 year ago

@aditi-pandit I see your point. Now that I understand how they are related, I do not have a strong opinion. Splitting would also make sure we run the window fuzzer longer, though.

If the DuckDB bug applies to window functions in general, then I think we should disable window fuzzing until it is solved, and assert that single-group aggregation is tested. Do you have context on the DuckDB bug?

laithsakka commented 1 year ago

@aditi-pandit, answering your question:

> Are the Velox singleGroup code-paths not tested with the other aggregation code-paths in Fuzzer ?

I see that we do that 10% of the time right now:

    // 10% of times use global aggregation (no grouping keys).
    std::vector<std::string> groupingKeys;
    if (vectorFuzzer_.coinToss(0.1)) {
      ++stats_.numGlobal;
    } else {
      ++stats_.numGroupBy;
      groupingKeys = generateKeys("g", argNames, argTypes);
    }
    ......
    verifyAggregation(
        groupingKeys, {call}, masks, input, customVerification, projections);
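To get a feel for how much coverage a 10% coin toss gives global aggregation, here is a small standalone sketch. It assumes a Bernoulli draw as a stand-in for the fuzzer's coin toss; `coinToss`, `Stats`, and `simulate` are hypothetical helpers written for illustration and are not the fuzzer's actual code.

```cpp
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Hypothetical counters mirroring the two branches of the snippet above.
struct Stats {
  std::size_t numGlobal = 0;
  std::size_t numGroupBy = 0;
};

// Hypothetical stand-in for the fuzzer's coin toss: true with probability p.
bool coinToss(std::mt19937& rng, double p) {
  return std::bernoulli_distribution(p)(rng);
}

// Counts how many iterations would take the global-aggregation branch
// (no grouping keys) versus the group-by branch.
Stats simulate(unsigned seed, std::size_t iterations) {
  std::mt19937 rng(seed);
  Stats stats;
  for (std::size_t i = 0; i < iterations; ++i) {
    std::vector<std::string> groupingKeys;
    if (coinToss(rng, 0.1)) {
      ++stats.numGlobal; // Global aggregation: groupingKeys stays empty.
    } else {
      ++stats.numGroupBy;
      groupingKeys = {"g0"}; // Placeholder for generated grouping keys.
    }
  }
  return stats;
}
```

Calling `simulate(seed, 10000)` should put roughly a tenth of the iterations in `numGlobal`, so under this assumption the global path gets about one tenth of the aggregation iterations in a run.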

laithsakka commented 1 year ago

@aditi-pandit By singleGroup, do you mean global aggregation, or an actual single group?

aditi-pandit commented 1 year ago

> @aditi-pandit By singleGroup, do you mean global aggregation, or an actual single group?

@laithsakka: Global aggregation by singleGroup.

aditi-pandit commented 1 year ago

> @aditi-pandit I see your point. Now that I understand how they are related, I do not have a strong opinion. Splitting would also make sure we run the window fuzzer longer, though.
>
> If the DuckDB bug applies to window functions in general, then I think we should disable window fuzzing until it is solved, and assert that single-group aggregation is tested. Do you have context on the DuckDB bug?

We filed this issue with DuckDB: https://github.com/duckdb/duckdb/issues/6829