Open randolf-scholz opened 4 days ago
@randolf-scholz thanks for the report.
The root cause is that subtract_checked
kernel is not ignoring overflow errors for values masked by the validity bitmap (i.e. nulls). And the roundtrip to/from pandas changed this value. Using nanoarrow to quickly inspect the buffers of x
and y
:
In [17]: na.array(x).inspect()
<ArrowArray timestamp('ms', '')>
- length: 3
- offset: 0
- null_count: 1
- buffers[2]:
- validity <bool[1 b] 01100000>
- data <int64[24 b] 0 1641031230000 1641031260000>
- dictionary: NULL
- children[0]:
In [18]: na.array(y).inspect()
<ArrowArray timestamp('ms', '')>
- length: 3
- offset: 0
- null_count: 1
- buffers[2]:
- validity <bool[1 b] 01100000>
- data <int64[24 b] -9223372036854775808 1641031230000 1641031260000>
- dictionary: NULL
- children[0]:
So in the original array, pyarrow uses a default of 0
for the values behind the mask. But on roundtrip to pandas, pandas uses NaT as missing value sentinel, and when then converting back to pyarrow the integer representation of this (i.e. the smallest int64) is kept, as it is masked anyway)
And so this is a duplicate of https://github.com/apache/arrow/issues/35088
Interesting. Could it possibly be related to performance optimizations of nullable integer arithmetic? With the convention that nulls have value 0, subtraction $z←x-y$ of nullable integers can be implemented branchless as:
z.valid = x.valid & y.valid
z.value = z.valid * (x.value - y.value)
= -int(z.valid) & (x.value - y.value) (2's complement -1 = 0b11111111)
We never have to worry about overflow of invalid values. I don't know if that's what arrow is doing, just a hypothesis.
Describe the bug, including details regarding any error messages, version, and platform.
Cross post from https://github.com/pandas-dev/pandas/issues/59082
Component(s)
Python