wirable23 opened 1 year ago
Yikes. That's definitely a bug. I think the return type of sum should probably maximize the `P` parameter, e.g. `SUM(Decimal128<X,Y>) -> Decimal<38,Y>` and `SUM(Decimal256<X,Y>) -> Decimal<76,Y>`. This matches the rules for Substrait as well as what SQL Server does.
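As a sketch, the proposed rule could be expressed like this (a hypothetical helper, not Arrow's actual implementation; decimal types are modeled here as plain `(kind, precision, scale)` tuples):

```python
# Hypothetical sketch of the proposed promotion rule: a decimal sum keeps
# the input scale but widens the precision to the maximum the storage
# width allows (38 digits for decimal128, 76 for decimal256).
MAX_PRECISION = {"decimal128": 38, "decimal256": 76}

def sum_output_type(kind: str, precision: int, scale: int):
    """Return the (kind, precision, scale) of sum() over the input type."""
    return (kind, MAX_PRECISION[kind], scale)

print(sum_output_type("decimal128", 4, 3))   # ('decimal128', 38, 3)
print(sum_output_type("decimal256", 10, 2))  # ('decimal256', 76, 2)
```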
@westonpace thanks for taking a look. Is there documentation for the scale/precision rules for decimal128 across the different APIs? For example, I found that the rules at https://docs.aws.amazon.com/redshift/latest/dg/r_numeric_computations201.html are used for binary expressions, but I couldn't find the scale/precision rules for other APIs like `sum` or `concat_tables` publicly documented. Does this information exist somewhere, or does it need to be found by looking at the internal compute implementation?
I'm not sure there is any particular place that we document these rules. If it isn't found in the function description then it is likely missing.
Where would I go to find these rules? In the internal compute implementation? Should the decimal scale/precision semantics be publicly documented, since they seem fairly fundamental to the behavior of the API?
> Where would I go to find these rules? In the internal compute implementation?
Yes, unfortunately.
> Should the decimal scale/precision semantics be publicly documented as it seems fairly fundamental to the behavior of the API?
Yes, I'd definitely welcome more documentation on how this happens. There isn't a great spot for it today beyond the function doc string itself (e.g. what you get back from `help(pc.sum)`) and this page: https://arrow.apache.org/docs/cpp/compute.html#compute-function-list.
I know this is old, but I was curious about this. It seems that for binary operations, PyArrow follows the rules of Redshift (see #40123). However, for array aggregators (specifically, `sum`) it seems to retain the input type (which is definitely wrong). I was curious what some other databases do, so I attempted to put together a short list. The tl;dr is that @westonpace's suggestion to simply promote the sum to `decimal128(38, scale)` or `decimal256(76, scale)` seems as good a choice as any.
In each case, I ran approximately the following query in the database's dialect (this query is for Clickhouse):
```sql
with base_values as (
    select toDecimal128(1, 1) as foo
    union all
    select toDecimal128(1, 1) as foo
)
select toTypeName(sum(foo))
from base_values
```
**Clickhouse**

Like Arrow, the return type of `sum` on a column is the input type, but Clickhouse doesn't retain precision information in its decimal type. So this is basically the solution @westonpace suggests for Arrow.
**Postgres**

The return type of `sum` appears to be an "unconstrained numeric" type. This was verified by running
```sql
create table sum_columns as
select
    sum(foo) as the_sum
from base_values;

select
    table_name,
    column_name,
    numeric_precision,
    numeric_scale
from information_schema.columns
where table_name in ('sum_columns');
```
and noting that `numeric_precision` and `numeric_scale` are both `NULL`.
**MariaDB**

The return type of `sum` appears to be `numeric(min(65, 22 + precision), scale)`. This was verified in the same way as Postgres. N.B. 65 is the maximum width of a decimal in MariaDB.
The return type of `sum` appears to be `decimal(38, s)` no matter its internal representation.
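The survey can be condensed into one hypothetical helper. The rules below are transcribed from the observations in this thread, not from each engine's documentation, and the fallback branch stands for the engine above that returns `decimal(38, s)`:

```python
# Observed sum() result types for an input decimal(p, s), transcribed
# from the survey in this thread. None means the dimension is
# unconstrained; treat these as observations, not authoritative rules.
def observed_sum_type(engine: str, p: int, s: int):
    if engine == "clickhouse":
        return ("decimal", p, s)        # input type retained (precision not tracked)
    if engine == "postgres":
        return ("numeric", None, None)  # unconstrained numeric
    if engine == "mariadb":
        return ("numeric", min(65, 22 + p), s)
    return ("decimal", 38, s)           # the decimal(38, s) engine above

print(observed_sum_type("mariadb", 50, 2))  # ('numeric', 65, 2)
```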
@westonpace I have a PR that addresses this, but as I was working on it, I realized that Arrow also handles the product and mean in the same way that it does sums. While that might make sense for the mean, it definitely does not for the product.

duckdb handles this (for product and avg) by converting everything to doubles. Taking the same approach in Arrow would be quite a radical shift from the current behavior.
Before investing in updating all the tests and whatnot, is there a good place to have a discussion on what these aggregate types should be?
Describe the bug, including details regarding any error messages, version, and platform.
```python
import pyarrow as pa
import decimal
```
The decimal "13.701" cannot fit into a decimal with precision 4 and scale 3. An error is raised when trying to create the scalar directly:
So it seems the type of the returned scalar should be `Decimal128Type(decimal128(5, 3))`, not `Decimal128Type(decimal128(4, 3))`. It seems `sum` tries to preserve the original scale/precision of the array, when the sum will likely not fit within those bounds.
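A minimal sketch of why the overflow happens, using only the stdlib `decimal` module. The addends 4.567 and 9.134 are hypothetical values chosen to sum to 13.701; the report's actual inputs are not shown above:

```python
from decimal import Decimal

# Hypothetical precision-4, scale-3 inputs whose sum is the 13.701
# mentioned in the report; the original values are not shown above.
a = Decimal("4.567")
b = Decimal("9.134")
total = a + b

print(total)  # 13.701
# The sum has 5 significant digits, so it needs at least decimal128(5, 3)
# even though both inputs fit in decimal128(4, 3).
print(len(total.as_tuple().digits))  # 5
```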
Component(s)
Python