apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

pa.compute.sum result for decimal128 doesn't fit into precision/scale #35166

Open · wirable23 opened this issue 1 year ago

wirable23 commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

>>> import pyarrow as pa
>>> import decimal

>>> arr = pa.array([decimal.Decimal("9.999"), decimal.Decimal("1.234"), decimal.Decimal("1.234"), decimal.Decimal("1.234")], type=pa.decimal128(4,3))
>>> sum_scalar = pa.compute.sum(arr)
>>> sum_scalar
<pyarrow.Decimal128Scalar: Decimal('13.701')>
>>> sum_scalar.type
Decimal128Type(decimal128(4, 3))
>>>

The decimal "13.701" cannot fit into a decimal with scale/precision 4/3. An error is raised when trying to create the scalar directly:

>>> pa.scalar(decimal.Decimal("13.701"), type=pa.decimal128(4,3))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow\scalar.pxi", line 1100, in pyarrow.lib.scalar
  File "pyarrow\error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow\error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Decimal type with precision 5 does not fit into precision inferred from first array element: 4
>>>

So it seems the type of the returned scalar should be Decimal128Type(decimal128(5, 3)), not Decimal128Type(decimal128(4, 3)). sum appears to preserve the original precision/scale of the array, even though the sum will often not fit within those bounds.
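A possible workaround until the kernel's output type changes, shown here only as a sketch, is to widen the precision yourself before aggregating:

import decimal
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(
    [decimal.Decimal("9.999")] + [decimal.Decimal("1.234")] * 3,
    type=pa.decimal128(4, 3),
)

# Cast to a wider precision (same scale) before summing so that the result
# type can actually hold the aggregate value.
wide = arr.cast(pa.decimal128(38, 3))
print(pc.sum(wide))       # should print Decimal('13.701')
print(pc.sum(wide).type)  # decimal128(38, 3)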

Component(s)

Python

westonpace commented 1 year ago

Yikes. That's definitely a bug. I think the return type of sum should probably maximize the P parameter.

E.g. SUM(Decimal128<X,Y>) -> Decimal<38,Y> and SUM(Decimal256<X,Y>) -> Decimal<76,Y>. This matches the rules for Substrait and matches what SQL Server does.
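For illustration only, a hypothetical helper (not an existing Arrow API) that encodes that proposed rule might look like:

import pyarrow as pa

def proposed_sum_output_type(dtype):
    # Hypothetical illustration of the rule suggested above:
    #   SUM(decimal128(p, s)) -> decimal128(38, s)
    #   SUM(decimal256(p, s)) -> decimal256(76, s)
    if pa.types.is_decimal128(dtype):
        return pa.decimal128(38, dtype.scale)
    if pa.types.is_decimal256(dtype):
        return pa.decimal256(76, dtype.scale)
    return dtype

print(proposed_sum_output_type(pa.decimal128(4, 3)))  # decimal128(38, 3)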

rohanjain101 commented 1 year ago

@westonpace thanks for taking a look. Is there documentation of the precision/scale rules for decimal128 across the different APIs? For example, I found that addition follows these rules for binary expressions: https://docs.aws.amazon.com/redshift/latest/dg/r_numeric_computations201.html, but I couldn't find publicly documented precision/scale rules for other APIs like sum and concat_table. Does this information exist somewhere, or does it need to be found by looking at the internal compute implementation?
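In the meantime, one way to discover the rules empirically (a sketch, not official documentation) is to inspect the result type of a compute call directly:

import decimal
import pyarrow as pa
import pyarrow.compute as pc

a = pa.array([decimal.Decimal("1.234")], type=pa.decimal128(4, 3))

# The output type encodes whatever rule the kernel applies; for addition this
# should report decimal128(5, 3), i.e. precision widened by one, consistent
# with the Redshift-style rules linked above.
print(pc.add(a, a).type)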

westonpace commented 1 year ago

but I couldn't find publicly documented precision/scale rules for other APIs like sum and concat_table. Does this information exist somewhere, or does it need to be found by looking at the internal compute implementation?

I'm not sure there is any particular place that we document these rules. If it isn't found in the function description then it is likely missing.

wirable23 commented 1 year ago

Where would I go to find these rules? In the internal compute implementation? Should the decimal precision/scale semantics be publicly documented, since they seem fairly fundamental to the behavior of the API?

westonpace commented 1 year ago

Where would I go to find these rules? In the internal compute implementation?

Yes, unfortunately.

Should the decimal precision/scale semantics be publicly documented, since they seem fairly fundamental to the behavior of the API?

Yes, I'd definitely welcome more documentation on how this happens. There isn't a great spot for it today beyond the function doc string itself (e.g. what you get back from help(pc.sum)) and this page: https://arrow.apache.org/docs/cpp/compute.html#compute-function-list.
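For reference, a minimal sketch of looking at both of those places from Python, assuming a recent pyarrow:

import pyarrow.compute as pc

# The Python-level docstring:
help(pc.sum)

# The underlying compute function registered as "sum":
fn = pc.get_function("sum")
print(fn.name, fn.kind, fn.arity, fn.num_kernels)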

khwilson commented 1 month ago

I know this is old, but I was curious about this. It seems that for binary operations, PyArrow follows the rules of Redshift (see #40123). However, for array aggregators (specifically, sum) it seems to retain the input type (which is definitely wrong). I was curious what some other databases do, so I attempted to put together a short list. The tl;dr is that @westonpace's suggestion to simply promote the sum to decimal128(38, scale) or decimal256(76, scale) seems as good a choice as any.

Data

In each case, I ran approximately the following query in the database's dialect (this query is for ClickHouse):

with base_values as (
  select toDecimal128(1, 1) as foo
  union all
  select toDecimal128(1, 1) as foo
)
select toTypeName(sum(foo))
from base_values

ClickHouse

Like Arrow, the return type of sum on a column is the input type, but ClickHouse doesn't retain precision information in its decimal type, so this is effectively the behavior @westonpace suggests Arrow should adopt.

Postgres

The return type of sum appears to be an "unconstrained numeric" type. This was verified by running

create table sum_columns as
select 
  sum(foo) as the_sum
from base_values;

select
  table_name,
  column_name,
  numeric_precision,
  numeric_scale
from information_schema.columns
where table_name in ('sum_columns');

and noting that numeric_precision and numeric_scale are both NULL.

MariaDB

The return type of sum appears to be numeric(min(65, 22 + precision), scale). This was verified in the same way as for Postgres. N.B. 65 is the maximum width of a decimal in MariaDB.

DuckDB

The return type of sum appears to be decimal(38, s), regardless of the input's internal representation.
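If it helps to reproduce that check from Python, here is a sketch assuming the duckdb package is installed:

import duckdb

# Sum over a decimal(4, 3) value; the relation's result type should come back
# as DECIMAL(38, 3), matching the observation above.
rel = duckdb.sql("select sum(cast(1.234 as decimal(4, 3))) as the_sum")
print(rel.types)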

khwilson commented 3 weeks ago

@westonpace I have a PR that addresses this, but as I was working on it, I realized that Arrow also handles product and mean the same way it handles sum. While that might make sense for mean, it definitely does not for product.

DuckDB handles this (for product and avg) by converting everything to doubles. Taking the same approach in Arrow would be quite a radical shift from the current behavior.
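To make the scope concrete, here is a quick check of the aggregate output types under discussion; the expected results simply restate the behavior described in this thread and may differ between pyarrow versions:

import decimal
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([decimal.Decimal("9.999")] * 3, type=pa.decimal128(4, 3))

# Per the discussion above, all three currently keep the input type
# decimal128(4, 3), which overflows even faster for product than for sum.
print(pc.sum(arr).type)
print(pc.product(arr).type)
print(pc.mean(arr).type)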

Before investing in updating all the tests and whatnot, is there a good place to have a discussion on what these aggregate types should be?