apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
13.85k stars 3.37k forks source link

[C++] Overflow in `subtract_checked(timestamp, timestamp)` after casting to pandas and back. #43031

Open randolf-scholz opened 4 days ago

randolf-scholz commented 4 days ago

Describe the bug, including details regarding any error messages, version, and platform.

Cross post from https://github.com/pandas-dev/pandas/issues/59082

from datetime import datetime, timedelta
import pyarrow as pa

timestamps = [None, datetime(2022, 1, 1, 10, 0, 30), datetime(2022, 1, 1, 10, 1, 0)]
x = pa.array(timestamps, type=pa.timestamp("ms"))
pa.compute.subtract_checked(x, x)  # ✅

# convert to pandas and back
y = pa.Array.from_pandas(x.to_pandas())
assert x == y  # ✅
pa.compute.subtract_checked(y, x)  # ✅
pa.compute.subtract_checked(x, y)  # ❌ overflow

Component(s)

Python

jorisvandenbossche commented 3 days ago

@randolf-scholz thanks for the report.

The root cause is that subtract_checked kernel is not ignoring overflow errors for values masked by the validity bitmap (i.e. nulls). And the roundtrip to/from pandas changed this value. Using nanoarrow to quickly inspect the buffers of x and y:

In [17]: na.array(x).inspect()
<ArrowArray timestamp('ms', '')>
- length: 3
- offset: 0
- null_count: 1
- buffers[2]:
  - validity <bool[1 b] 01100000>
  - data <int64[24 b] 0 1641031230000 1641031260000>
- dictionary: NULL
- children[0]:

In [18]: na.array(y).inspect()
<ArrowArray timestamp('ms', '')>
- length: 3
- offset: 0
- null_count: 1
- buffers[2]:
  - validity <bool[1 b] 01100000>
  - data <int64[24 b] -9223372036854775808 1641031230000 1641031260000>
- dictionary: NULL
- children[0]:

So in the original array, pyarrow uses a default of 0 for the values behind the mask. But on roundtrip to pandas, pandas uses NaT as missing value sentinel, and when then converting back to pyarrow the integer representation of this (i.e. the smallest int64) is kept, as it is masked anyway)

jorisvandenbossche commented 3 days ago

And so this is a duplicate of https://github.com/apache/arrow/issues/35088

randolf-scholz commented 3 days ago

Interesting. Could it possibly be related to performance optimizations of nullable integer arithmetic? With the convention that nulls have value 0, subtraction $z←x-y$ of nullable integers can be implemented branchless as:

z.valid = x.valid & y.valid
z.value = z.valid * (x.value - y.value) 
        = -int(z.valid) & (x.value - y.value)  (2's complement -1 = 0b11111111)

We never have to worry about overflow of invalid values. I don't know if that's what arrow is doing, just a hypothesis.