apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.5k stars 3.52k forks source link

[C++] case when expression divide overflow. #37765

Open Light-City opened 1 year ago

Light-City commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

when we execute casewhen expression, such as:

CASE WHEN i > 0 THEN j / i ELSE j END

If column i has values( 0 ), we will get division overflow problem, j / 0 -> overflow。

Why does this problem occur?

I read the source code and found that the masks for these conditions are calculated separately, and the previous mask is not used to participate in subsequent operations. This will cause the subsequent division to overflow.

Currently, we have two solutions.

  1. Regardless of performance

Using the previous mask to participate in subsequent operations will cause performance degradation in all cases.

for (size_t i = 0; i < arguments.size(); ++i) {
     // get SelectorVector from arguments before computed;
     std::shared_ptr<ArrayData> filtered_mask =
         GetExpressionMask(call->function_name, arguments, i);
     if (filtered_mask != nullptr) {
       auto values = input.values;
       // construct filtered batch
       for (auto& value : values) {
         if (value.is_array()) {
           std::vector<Datum> invert_args;
           invert_args.push_back(filtered_mask);
           ARROW_ASSIGN_OR_RAISE(auto null_mask,
                                 CallFunction("invert", invert_args, exec_context));
           ARROW_ASSIGN_OR_RAISE(
               value, ReplaceWithMask(value, null_mask, MakeNullScalar(value.type())));
         }
       }
       ARROW_ASSIGN_OR_RAISE(auto new_batch, ExecBatch::Make(std::move(values)));
       // execute and pass child expr
       ARROW_ASSIGN_OR_RAISE(
           arguments[i],
           ExecuteScalarExpression(call->arguments[i], new_batch, exec_context));
     } else {
       ARROW_ASSIGN_OR_RAISE(
           arguments[i], ExecuteScalarExpression(call->arguments[i], input, exec_context));
     }
     ARROW_ASSIGN_OR_RAISE(
         arguments[i], ExecuteScalarExpression(call->arguments[i], input, exec_context));
   }
  1. Considering performance, only division expressions are treated specially.

We currently apply this method to the application layer, which constructs a complex expression to avoid division overflow problems.

case when j > 0 then i / j else i end -> case when j > 0 then i / ((case when j > 0 then j else null end)) else i end

Based on the above background, does the community have any good ways to fix this problem?

Component(s)

C++

bkietz commented 1 year ago

This is fundamentally caused by acero strictly evaluating argument expressions before calling a function on those arguments. Refactoring to support more intrusive/lazy evaluation semantics would be a significant change; certainly not one which should be handled as a special case in ExecuteScalarExpression.

I'd recommend looking at the expression simplification passes (SimplifyWithGuarantee). There's machinery there to pattern match and modify expressions. Currently it is only used to produce more efficient expressions using partition information and other guarantees, but it could also be used to rewrite expressions for safe evaluation:

Expression unsafe = case_when({greater_than(field_ref("j"), literal(0))}, {
  call("divide", {field_ref("i"), field_ref("j")}),
  field_ref("i"),
});
//...
ARROW_ASSIGN_OR_RAISE(Expression safe, MakeSafe(unsafe));
assert(safe == call("divide", {
  field_ref("i"),
  call("max", {field_ref("j"), literal(1)}),
}))

Another way (much less intensive) to approach this problem would be writing a new option for the divide compute functions which produces null or zero when dividing by zero instead of raising an error. This could then be used explicitly in situations where division by zero is otherwise inevitable.

Light-City commented 1 year ago

This is fundamentally caused by acero strictly evaluating argument expressions before calling a function on those arguments. Refactoring to support more intrusive/lazy evaluation semantics would be a significant change; certainly not one which should be handled as a special case in ExecuteScalarExpression.

I'd recommend looking at the expression simplification passes (SimplifyWithGuarantee). There's machinery there to pattern match and modify expressions. Currently it is only used to produce more efficient expressions using partition information and other guarantees, but it could also be used to rewrite expressions for safe evaluation:

Expression unsafe = case_when({greater_than(field_ref("j"), literal(0))}, {
  call("divide", {field_ref("i"), field_ref("j")}),
  field_ref("i"),
});
//...
ARROW_ASSIGN_OR_RAISE(Expression safe, MakeSafe(unsafe));
assert(safe == call("divide", {
  field_ref("i"),
  call("max", {field_ref("j"), literal(1)}),
}))

Another way (much less intensive) to approach this problem would be writing a new option for the divide compute functions which produces null or zero when dividing by zero instead of raising an error. This could then be used explicitly in situations where division by zero is otherwise inevitable.

Yes, the underlying operation needs to be changed. But adding options is a relatively big change.