Open · vaibhawvipul opened this pull request 1 month ago
Triggered CI.
Where is the error thrown when ansi_mode is enabled and an overflow occurs?
As per the issue's description: SumDecimal currently hardcodes nullable=true, and this is correct when ANSI mode is not enabled, because overflows produce null values. However, in ANSI mode, overflows cause exceptions. If the input is non-nullable then SumDecimal should probably also be non-nullable. The same is true for AvgDecimal.
I thought the scope of this work is only to make nullable either true or false based on ANSI mode. @parthchandra
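To make the proposed rule concrete, here is a minimal sketch (not the actual Comet or Spark code) of the nullability logic the issue describes, with hypothetical parameters childNullable and ansiEnabled:

```scala
// Sketch only, not the actual implementation: the nullability rule described in
// the issue, parameterized by hypothetical childNullable and ansiEnabled flags.
def sumDecimalNullable(childNullable: Boolean, ansiEnabled: Boolean): Boolean =
  if (ansiEnabled) {
    // ANSI mode: an overflow raises an exception instead of producing null, so the
    // result is nullable only if the input column itself can contain nulls.
    childNullable
  } else {
    // Non-ANSI mode: an overflow is represented as null, so the result must be nullable.
    true
  }
```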
Thanks for working on this @vaibhawvipul.
I checked Spark's source code in the master branch, and it also hardcodes nullable as true. This makes sense because if we perform a sum on a column that only contains nulls, then we would expect a null output.
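For example, a query like the following (a hypothetical spark-shell snippet, not taken from this PR) returns null when every input value is null, which is why a nullable result type is the safe choice:

```scala
// Hypothetical spark-shell snippet: SUM over a decimal column containing only
// nulls yields null, even if the column were declared non-nullable upstream.
val df = spark.sql(
  """SELECT SUM(x) AS s
    |FROM VALUES (CAST(NULL AS DECIMAL(10, 2))), (CAST(NULL AS DECIMAL(10, 2))) AS t(x)
    |""".stripMargin)
df.show() // expected: a single row whose value is NULL
```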
It would be good to add a test in CometAggregateSuite to confirm that we have the same behavior as Spark in this case.
Parth raises a good point that we should check that we have the same behavior for the ANSI overflow case, but that can be a separate issue/PR.
> Where is the error thrown when ansi_mode is enabled and an overflow occurs?
>
> As per the issue's description: SumDecimal currently hardcodes nullable=true, and this is correct when ANSI mode is not enabled, because overflows produce null values. However, in ANSI mode, overflows cause exceptions. If the input is non-nullable then SumDecimal should probably also be non-nullable. The same is true for AvgDecimal.
>
> I thought the scope of this work is only to make nullable either true or false based on ANSI mode. @parthchandra
Oh, I didn't realize the scope for this PR was limited. I wonder if we may end up with incorrect results if we change the nullability for ANSI mode but do not throw an exception on overflow. If an expression result is nullable, then the corresponding vector has a validity bit and downstream operations would check it before accessing the value. If the expression is not nullable, then a downstream operation can eliminate the check for nullability and access the value directly. But in this case the value would be an overflow value, which would be incorrect.
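To make the overflow scenario concrete, here is a rough illustration (a hypothetical spark-shell snippet, not taken from this PR) of the two Spark behaviors being discussed; exact error messages vary by Spark version:

```scala
// Hypothetical spark-shell snippet: a decimal SUM that overflows DECIMAL(38, 0).
// With ANSI mode off, Spark returns NULL; with ANSI mode on, Spark is expected
// to raise an overflow error when the result is materialized.
val overflowQuery =
  "SELECT SUM(CAST('99999999999999999999999999999999999999' AS DECIMAL(38, 0))) " +
    "FROM VALUES (1), (2) AS t(i)"

spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql(overflowQuery).show() // expected: NULL

spark.conf.set("spark.sql.ansi.enabled", "true")
// spark.sql(overflowQuery).collect() // expected: ArithmeticException / overflow error
```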
> It would be good to add a test in CometAggregateSuite to confirm that we have the same behavior as Spark in this case.
That's a good idea.
> Thanks for working on this @vaibhawvipul.
>
> I checked Spark's source code in the master branch, and it also hardcodes nullable as true. This makes sense because if we perform a sum on a column that only contains nulls, then we would expect a null output.
>
> It would be good to add a test in CometAggregateSuite to confirm that we have the same behavior as Spark in this case. Parth raises a good point that we should check that we have the same behavior for the ANSI overflow case, but that can be a separate issue/PR.
@andygrove I have added a test in CometAggregateSuite and we can observe that we have behavior parity with Spark. I can work on the issue raised by Parth in a separate PR. Please let me know your thoughts.
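For reference, this is roughly the kind of parity check being described (a hypothetical sketch, not the actual test added in this PR; checkSparkAnswerAndOperator is assumed to be the suite's existing helper for comparing Comet's result and plan against Spark's):

```scala
// Hypothetical sketch, not the test added in this PR: verify that SUM and AVG over
// an all-null decimal column produce the same result with Comet as with plain Spark.
test("SUM and AVG over an all-null decimal column match Spark") {
  withTable("t") {
    sql("CREATE TABLE t (x DECIMAL(10, 2)) USING parquet")
    sql("INSERT INTO t VALUES (NULL), (NULL)")
    checkSparkAnswerAndOperator("SELECT SUM(x), AVG(x) FROM t")
  }
}
```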
Which issue does this PR close?
Closes #961.
Rationale for this change
Makes is_nullable ANSI-aware for sum_decimal and avg_decimal.
What changes are included in this PR?
Updated planner.rs, sum_decimal.rs, and avg_decimal.rs.
How are these changes tested?
CI tests pass. A new test was added in CometAggregateSuite to confirm behavior parity with Spark.