facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0
3.5k stars 1.15k forks source link

array_intersect: different element order in results of common vs. simplified expression evaluator #10740

Open kagamiori opened 2 months ago

kagamiori commented 2 months ago

Description

It seems that after the fix https://github.com/facebookincubator/velox/pull/10624 (context in https://github.com/facebookincubator/velox/issues/10561), array_intersect now produces result arrays with different element orders depending on the encoding of the input vectors. Since the simplified evaluator always flatten the input vectors first, the simplified evaluation result can have different element order from the common evaluation result when the first argument is a constant literal.

We would need to add the support for custom result verifier in expression fuzzer that sort the elements before comparing the result arrays from the common evaluator and simplified evaluator.

Error Reproduction

A fuzzer failure can be reproduced via presto-fuzzer-failure-artifacts.

Relevant logs

E20240813 19:27:57.539590    72 Exceptions.h:67] Line: /__w/velox/velox/velox/velox/expression/fuzzer/FuzzerToolkit.cpp:129, Function:operator(), Expression: left->equalValueAt(right.get(), row, row) Different values at idx '17': '[17->37] 2 elements starting at 29 {null, 30}' vs. '2 elements starting at 5 {30, null}', Source: RUNTIME, ErrorCode: INVALID_STATE
kagamiori commented 2 months ago

cc @kevinwilfong @DanielHunte

kevinwilfong commented 2 months ago

I think Daniel is planning to revert the change, it generally results in inconsistent ordering of the result arrays.

kagamiori commented 2 months ago

I think Daniel is planning to revert the change, it generally results in inconsistent ordering of the result arrays.

@kevinwilfong Got it. Curious what other problem does this inconsistent ordering cause (besides fuzzer)?

kevinwilfong commented 2 months ago

While I was going through another part of the code for array_intersect I noticed a comment that mentioned we only optimize for ConstantVectors if they're in the right hand side argument.

I can see how it might surprise users, if they call array_intersect and the order of the result arrays depends on the encoding of the argument Vectors or the size of other arrays in the same batch, it would seem pretty random to them, and given that someone took the time to leave that comment, it seems someone else felt that way and decided it was more important than performance.