apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet

fix: making is_nullable ansi aware for sum_decimal and avg_decimal #981

Open vaibhawvipul opened 1 month ago

vaibhawvipul commented 1 month ago

Which issue does this PR close?

Closes #961.

Rationale for this change

This PR makes is_nullable ANSI-aware for sum_decimal and avg_decimal. When ANSI mode is enabled, overflow raises an exception instead of producing null, so the output's nullability can follow the input's nullability rather than being hardcoded to true.

What changes are included in this PR?

Updated planner.rs, sum_decimal.rs, and avg_decimal.rs so that nullability is derived from the ANSI mode setting (see the sketch below).
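
A minimal sketch of the rule being applied, in plain Rust; the names `ansi_mode` and `input_nullable` are illustrative, not Comet's actual fields:

```rust
// Illustrative sketch only, not the actual Comet implementation.
// Without ANSI mode, overflow produces null, so the result must be
// nullable. With ANSI mode, overflow raises an error instead, so the
// result only needs to be nullable when the input itself is nullable.
fn sum_decimal_nullable(ansi_mode: bool, input_nullable: bool) -> bool {
    if ansi_mode {
        input_nullable
    } else {
        true
    }
}
```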

How are these changes tested?

CI tests pass

viirya commented 1 month ago

Triggered CI.

vaibhawvipul commented 1 month ago

Where is the error thrown when ansi_mode is enabled and an overflow occurs?

As per the issue's description: SumDecimal currently hardcodes nullable=true, and this is correct when ANSI mode is not enabled, because overflows produce null values. However, in ANSI mode, overflows raise exceptions, so if the input is non-nullable then SumDecimal should probably also be non-nullable. The same is true for AvgDecimal.

I thought the scope of this work was only to make nullable either true or false based on ANSI mode. @parthchandra
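
To make the two overflow policies concrete, here is a language-level sketch (plain Rust, not Comet code) of null-on-overflow versus error-on-overflow for an i128 decimal accumulator; the error message is illustrative:

```rust
// Illustrative only: models the two overflow policies discussed above.

// Non-ANSI mode: overflow yields None, i.e. a SQL NULL.
fn add_non_ansi(acc: i128, v: i128) -> Option<i128> {
    acc.checked_add(v)
}

// ANSI mode: overflow is a hard error rather than a null.
fn add_ansi(acc: i128, v: i128) -> Result<i128, String> {
    acc.checked_add(v)
        .ok_or_else(|| "arithmetic overflow: decimal out of range".to_string())
}
```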

andygrove commented 1 month ago

Thanks for working on this @vaibhawvipul.

I checked Spark's source code on the master branch, and it also hardcodes nullable as true. This makes sense: if we perform a sum on a column that contains only nulls, we would expect a null output.
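
A small model of that case (plain Rust, not Spark or Comet code): a sum that skips nulls has no value to return when every input is null, so its result must stay nullable whenever the input is nullable, regardless of ANSI mode.

```rust
// Illustrative model: `None` plays the role of SQL NULL.
fn sum_skipping_nulls(values: &[Option<i64>]) -> Option<i64> {
    let valid: Vec<i64> = values.iter().flatten().copied().collect();
    if valid.is_empty() {
        None // all inputs null: the only sensible result is null
    } else {
        Some(valid.iter().sum())
    }
}

fn main() {
    assert_eq!(sum_skipping_nulls(&[None, None, None]), None);
    assert_eq!(sum_skipping_nulls(&[Some(1), None, Some(2)]), Some(3));
}
```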

It would be good to add a test in CometAggregateSuite to confirm that we have the same behavior as Spark in this case.

Parth raises a good point that we should verify that we have the correct behavior in the ANSI overflow case, but that can be a separate issue/PR.

parthchandra commented 1 month ago

Where is the error thrown when ansi_mode is enabled and an overflow occurs?

As per the issue's description: SumDecimal currently hardcodes nullable=true, and this is correct when ANSI mode is not enabled, because overflows produce null values. However, in ANSI mode, overflows raise exceptions, so if the input is non-nullable then SumDecimal should probably also be non-nullable. The same is true for AvgDecimal.

I thought the scope of this work was only to make nullable either true or false based on ANSI mode. @parthchandra

Oh, I didn't realize the scope of this PR was limited. I wonder if we might end up with incorrect results if we change the nullability for ANSI mode but do not throw an exception on overflow. If an expression result is nullable, then the corresponding vector has a validity bit, and downstream operations check it before accessing the value. If the expression is not nullable, a downstream operation can eliminate the nullability check and access the value directly. But in this case the value would be an overflowed value, which would be incorrect.
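
A schematic model of that hazard (plain Rust, not Arrow's actual layout or API): once a vector is declared non-nullable, readers can take a fast path that never consults a validity bitmap, so a silently wrapped overflow value would be returned as if it were valid.

```rust
// Schematic columnar vector; not Arrow's real representation.
struct Vector {
    values: Vec<i64>,
    // None models a vector declared non-nullable: no validity bitmap
    // exists, so there is nothing for readers to check.
    validity: Option<Vec<bool>>,
}

fn get(v: &Vector, i: usize) -> Option<i64> {
    match &v.validity {
        // Nullable vector: consult the validity bit first.
        Some(bits) if !bits[i] => None,
        // Non-nullable fast path: read the value directly. If overflow
        // silently wrapped values[i], this hands back the incorrect
        // wrapped value with no way to detect it downstream.
        _ => Some(v.values[i]),
    }
}
```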

It would be good to add a test in CometAggregateSuite to confirm that we have the same behavior as Spark in this case.

That's a good idea.

vaibhawvipul commented 1 month ago

Thanks for working on this @vaibhawvipul.

I checked Spark's source code on the master branch, and it also hardcodes nullable as true. This makes sense: if we perform a sum on a column that contains only nulls, we would expect a null output.

It would be good to add a test in CometAggregateSuite to confirm that we have the same behavior as Spark in this case.

Parth raises a good point that we should verify that we have the correct behavior in the ANSI overflow case, but that can be a separate issue/PR.

@andygrove I have added a test in CometAggregateSuite, and it confirms that we have behavior parity with Spark in this case. I can work on the issue Parth raised in a separate PR. Please let me know your thoughts.