Open Light-City opened 1 year ago
This is fundamentally caused by acero strictly evaluating argument expressions before calling a function on those arguments. Refactoring to support more intrusive/lazy evaluation semantics would be a significant change; certainly not one which should be handled as a special case in ExecuteScalarExpression
.
I'd recommend looking at the expression simplification passes (SimplifyWithGuarantee). There's machinery there to pattern match and modify expressions. Currently it is only used to produce more efficient expressions using partition information and other guarantees, but it could also be used to rewrite expressions for safe evaluation:
Expression unsafe = case_when({greater_than(field_ref("j"), literal(0))}, {
call("divide", {field_ref("i"), field_ref("j")}),
field_ref("i"),
});
//...
ARROW_ASSIGN_OR_RAISE(Expression safe, MakeSafe(unsafe));
assert(safe == call("divide", {
field_ref("i"),
call("max", {field_ref("j"), literal(1)}),
}))
Another way (much less intensive) to approach this problem would be writing a new option for the divide compute functions which produces null or zero when dividing by zero instead of raising an error. This could then be used explicitly in situations where division by zero is otherwise inevitable.
This is fundamentally caused by acero strictly evaluating argument expressions before calling a function on those arguments. Refactoring to support more intrusive/lazy evaluation semantics would be a significant change; certainly not one which should be handled as a special case in
ExecuteScalarExpression
.I'd recommend looking at the expression simplification passes (SimplifyWithGuarantee). There's machinery there to pattern match and modify expressions. Currently it is only used to produce more efficient expressions using partition information and other guarantees, but it could also be used to rewrite expressions for safe evaluation:
Expression unsafe = case_when({greater_than(field_ref("j"), literal(0))}, { call("divide", {field_ref("i"), field_ref("j")}), field_ref("i"), }); //... ARROW_ASSIGN_OR_RAISE(Expression safe, MakeSafe(unsafe)); assert(safe == call("divide", { field_ref("i"), call("max", {field_ref("j"), literal(1)}), }))
Another way (much less intensive) to approach this problem would be writing a new option for the divide compute functions which produces null or zero when dividing by zero instead of raising an error. This could then be used explicitly in situations where division by zero is otherwise inevitable.
Yes, the underlying operation needs to be changed. But adding options is a relatively big change.
Describe the bug, including details regarding any error messages, version, and platform.
when we execute casewhen expression, such as:
CASE WHEN i > 0 THEN j / i ELSE j END
If column i has values( 0 ), we will get division overflow problem, j / 0 -> overflow。
Why does this problem occur?
I read the source code and found that the masks for these conditions are calculated separately, and the previous mask is not used to participate in subsequent operations. This will cause the subsequent division to overflow.
Currently, we have two solutions.
Using the previous mask to participate in subsequent operations will cause performance degradation in all cases.
We currently apply this method to the application layer, which constructs a complex expression to avoid division overflow problems.
case when j > 0 then i / j else i end -> case when j > 0 then i / ((case when j > 0 then j else null end)) else i end
Based on the above background, does the community have any good ways to fix this problem?
Component(s)
C++