facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0

Spark fuzzer failure: Expression: Timepoint is outside of supported year range: [-32767, 32767], got 4325638 #11462

Open czentgr opened 2 weeks ago

czentgr commented 2 weeks ago

Description

A similar issue, https://github.com/facebookincubator/velox/issues/10989, was closed a while back, but the failure seems to have started recurring.

Error Reproduction

mkdir -p /tmp/spark_fuzzer_repro/logs/
chmod -R 777 /tmp/spark_fuzzer_repro
chmod +x spark_expression_fuzzer_test
./spark_expression_fuzzer_test \
  --seed ${RANDOM} \
  --enable_variadic_signatures \
  --lazy_vector_generation_ratio 0.2 \
  --velox_fuzzer_enable_column_reuse \
  --velox_fuzzer_enable_expression_reuse \
  --max_expression_trees_per_step 2 \
  --retry_with_try \
  --enable_dereference \
  --duration_sec $DURATION \
  --minloglevel=0 \
  --stderrthreshold=2 \
  --log_dir=/tmp/spark_fuzzer_repro/logs \
  --repro_persist_path=/tmp/spark_fuzzer_repro \
  && echo -e "\n\nSpark Fuzzer run finished successfully."
shell: bash --noprofile --norc -e -o pipefail {0}
env:
  DURATION: 900
  RETENTION: 1

E20241106 17:15:41.925009 40 Exceptions.h:66] Line: /w/velox/velox/velox/velox/type/tz/TimeZoneMap.cpp:259, Function:validateRangeImpl, Expression: Timepoint is outside of supported year range: [-32767, 32767], got 4325638, Source: RUNTIME, ErrorCode: UNSUPPORTED_INPUT_UNCATCHABLE
E20241106 17:15:41.933817 40 Exceptions.h:66] Line: /w/velox/velox/velox/velox/type/tz/TimeZoneMap.cpp:259, Function:validateRangeImpl, Expression: Timepoint is outside of supported year range: [-32767, 32767], got 4325638, Source: RUNTIME, ErrorCode: UNSUPPORTED_INPUT_UNCATCHABLE
Error: The operation was canceled.
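
For readers unfamiliar with the message, here is a minimal sketch of what a year-range check of this kind might look like. It is not the actual validateRangeImpl from TimeZoneMap.cpp. The bounds are copied from the error message, and every name below (floorDiv, yearFromEpochSeconds, isSupportedYear, validateRangeSketch) is hypothetical. It also assumes the timepoint is given as seconds since the Unix epoch, and derives the year with the well-known days-to-civil-date arithmetic.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Bounds copied from the error message in the log; all names here are
// hypothetical and do not come from the Velox code base.
constexpr int64_t kMinYear = -32767;
constexpr int64_t kMaxYear = 32767;

// Floor division for a positive divisor (C++ '/' truncates toward zero).
int64_t floorDiv(int64_t a, int64_t b) {
  assert(b > 0);
  const int64_t q = a / b;
  return (a % b < 0) ? q - 1 : q;
}

// Converts seconds since 1970-01-01 UTC to a proleptic Gregorian year,
// using 400-year eras that start on March 1st.
int64_t yearFromEpochSeconds(int64_t seconds) {
  const int64_t days = floorDiv(seconds, 86400);
  const int64_t z = days + 719468;            // shift epoch to 0000-03-01
  const int64_t era = floorDiv(z, 146097);    // 146097 days per 400 years
  const int64_t doe = z - era * 146097;       // day of era, [0, 146096]
  const int64_t yoe =
      (doe - doe / 1460 + doe / 36524 - doe / 146096) / 365;  // [0, 399]
  const int64_t doy = doe - (365 * yoe + yoe / 4 - yoe / 100);
  const int64_t mp = (5 * doy + 2) / 153;     // month index, March == 0
  const int64_t month = mp < 10 ? mp + 3 : mp - 9;
  // January and February belong to the next civil year.
  return yoe + era * 400 + (month <= 2 ? 1 : 0);
}

bool isSupportedYear(int64_t year) {
  return year >= kMinYear && year <= kMaxYear;
}

// Throws if the timepoint falls outside the supported year range.
void validateRangeSketch(int64_t epochSeconds) {
  const int64_t year = yearFromEpochSeconds(epochSeconds);
  if (!isSupportedYear(year)) {
    throw std::out_of_range(
        "Timepoint is outside of supported year range: [" +
        std::to_string(kMinYear) + ", " + std::to_string(kMaxYear) +
        "], got " + std::to_string(year));
  }
}
```

Under these assumptions, a randomly generated timestamp millions of years past the epoch, which a fuzzer can easily produce, is rejected in the same way as the value 4325638 in the log.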

Relevant logs

https://github.com/facebookincubator/velox/actions/runs/11707749878/job/32610598033?pr=10767

mbasmanova commented 2 weeks ago

CC: @pedroerp @kagamiori @kgpai

mbasmanova commented 2 weeks ago

CC: @rui-mo

pedroerp commented 2 weeks ago

This one got merged today and seems suspicious since it touches the validateRange() method:

https://github.com/facebookincubator/velox/pull/11447

@kevinwilfong could you check if this is related?

rui-mo commented 2 weeks ago

I tried the fuzzer test locally and found the exception message was printed out but did not lead to a failure. Thanks.

E1107 18:21:48.668174 3559404 Exceptions.h:66] Line: /home/sparkuser/velox/velox/type/tz/TimeZoneMap.cpp:251, Function:validateRangeImpl, Expression:  Timepoint is outside of supported year range: [-32767, 32767], got 3271991, Source: RUNTIME, ErrorCode: UNSUPPORTED_INPUT_UNCATCHABLE
E1107 18:21:48.808140 3559404 ExpressionFuzzerVerifier.cpp:408] Total iterations: 604141
E1107 18:21:48.808153 3559404 ExpressionFuzzerVerifier.cpp:409] Total failed: 103706
[==========] Running 0 tests from 0 test suites.
[==========] 0 tests from 0 test suites ran. (0 ms total)
[  PASSED  ] 0 tests.

kevinwilfong commented 2 weeks ago

> This one got merged today and seems suspicious since it touches the validateRange() method:
>
> https://github.com/facebookincubator/velox/pull/11447
>
> @kevinwilfong could you check if this is related?

I don't think this is related. It didn't touch the validateRange() method itself; it just moved the calls to validateRange() from getShortName and getLongName into a common getName function.
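
To illustrate the shape of that change, here is a rough sketch, assuming a simplified class. It is not the code from the PR: only the names validateRange, getShortName, getLongName, and getName come from this thread, while the class TimeZoneSketch, the NameKind enum, the year parameter, and the returned strings are made up for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical selector for which name to return.
enum class NameKind { kShort, kLong };

// Hypothetical stand-in for a time zone class; not the Velox type.
class TimeZoneSketch {
 public:
  // Both public entry points delegate to the shared helper, so the range
  // check is written in one place but still runs on every call.
  std::string getShortName(int64_t year) const {
    return getName(year, NameKind::kShort);
  }

  std::string getLongName(int64_t year) const {
    return getName(year, NameKind::kLong);
  }

 private:
  static constexpr int64_t kMinYear = -32767;
  static constexpr int64_t kMaxYear = 32767;

  // The check itself is untouched by the refactor.
  static void validateRange(int64_t year) {
    if (year < kMinYear || year > kMaxYear) {
      throw std::out_of_range(
          "Timepoint is outside of supported year range, got " +
          std::to_string(year));
    }
  }

  // Common helper that now owns the single call to validateRange().
  std::string getName(int64_t year, NameKind kind) const {
    validateRange(year);
    // validateRange() would have thrown for an out-of-range year.
    assert(year >= kMinYear && year <= kMaxYear);
    // Placeholder names; a real implementation would consult tz data.
    return kind == NameKind::kShort ? "PST" : "Pacific Standard Time";
  }
};
```

In a refactor of this shape, callers see the same exceptions for the same inputs as before, which is consistent with the change not being the cause of the new failure.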

pedroerp commented 2 weeks ago

> I tried the fuzzer test locally and found the exception message was printed out but did not lead to a failure.

So it looks like we used to catch this exception and now somehow it is bubbling up?

rui-mo commented 2 weeks ago

@pedroerp VELOX_FAIL_UNSUPPORTED_INPUT_UNCATCHABLE is designed to be allowed in the expression fuzzer test and should not cause a fuzzer failure (link: 8a6ab15). The error messages are printed because these errors are VeloxRuntimeError exceptions. It should be fine as long as we only see the error messages and no fuzzer failure.

https://github.com/facebookincubator/velox/blob/c5232cd3174998a7834551783ae95776949c9da8/velox/common/base/Exceptions.h#L65-L66
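
As a rough illustration of that behavior (a sketch only, not the actual ExpressionFuzzerVerifier logic), a fuzzer loop that tolerates one specific error code could look like the following. SketchError, FuzzerStats, and runIterations are hypothetical names; only the error code string is taken from the logs in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical error type carrying an error code; not VeloxRuntimeError.
struct SketchError : std::runtime_error {
  SketchError(std::string code, const std::string& message)
      : std::runtime_error(message), errorCode(std::move(code)) {}
  std::string errorCode;
};

struct FuzzerStats {
  std::size_t iterations = 0;
  std::size_t toleratedFailures = 0;
};

// Runs every step. Errors are always logged; only errors with the tolerated
// code are swallowed, anything else is rethrown and fails the run.
FuzzerStats runIterations(const std::vector<std::function<void()>>& steps) {
  FuzzerStats stats;
  for (const auto& step : steps) {
    assert(step != nullptr);  // each step must hold a callable
    ++stats.iterations;
    try {
      step();
    } catch (const SketchError& e) {
      std::cerr << "ErrorCode: " << e.errorCode << ", " << e.what() << '\n';
      if (e.errorCode != "UNSUPPORTED_INPUT_UNCATCHABLE") {
        throw;
      }
      ++stats.toleratedFailures;
    }
  }
  return stats;
}
```

With a loop like this, the log still fills up with error lines for tolerated inputs, yet the run as a whole is reported as successful, which matches what was observed locally.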

I noticed that the relevant log, https://github.com/facebookincubator/velox/actions/runs/11707749878/job/32610598033?pr=10767, shows the Spark fuzzer test timing out for some reason, so I wonder whether the failure is caused by a timeout rather than by the function itself. Thanks!