Open mroeschke opened 11 months ago
Thanks for the report @mroeschke. Looking into it, this might actually be a bug in pandas.
In all cases, we currently interpret this pd.Timedelta object as a datetime.timedelta in the type inference (so we currently actually ignore the nano part if you don't manually specify that as the data type; this is a known limitation, eg https://github.com/apache/arrow/issues/18099 and https://github.com/apache/arrow/issues/36035 for the equivalent Timestamp case).
But when putting a breakpoint in our conversion utility, it seems that when I pass a nanosecond-resolution Timedelta we properly get the number of days, but when it is a new seconds-resolution Timedelta, those attributes are not set:
Running arr = pa.array([pd.Timedelta(days=1)])
:
Thread 1 "python" hit Breakpoint 2, arrow::py::internal::PyDelta_to_s (pytimedelta=0x7fffd75d5120) at /home/joris/scipy/repos/arrow/python/pyarrow/src/arrow/python/datetime.h:149
149 return (PyDateTime_DELTA_GET_DAYS(pytimedelta) * 86400LL +
(gdb) p pytimedelta->seconds
$2 = 0
(gdb) p pytimedelta->microseconds
$3 = 0
(gdb) p pytimedelta->days
$4 = 1 # <---- properly set to 1
Running arr = pa.array([pd.Timedelta(days=1).as_unit('s')])
:
Thread 1 "python" hit Breakpoint 1, arrow::py::internal::PyDelta_to_s (pytimedelta=0x7fffd75d51c0) at /home/joris/scipy/repos/arrow/python/pyarrow/src/arrow/python/datetime.h:149
149 return (PyDateTime_DELTA_GET_DAYS(pytimedelta) * 86400LL +
(gdb) p pytimedelta->seconds
$1 = 0
(gdb) p pytimedelta->microseconds
$2 = 0
(gdb) p pytimedelta->days
$3 = 0 # <---- this is now 0 !
So this is actually documented in the pandas' source code:
In the Timedelta constructor (https://github.com/pandas-dev/pandas/blob/43691a2f5d235b08f0f3aa813d8fdcb7c4ce1e47/pandas/_libs/tslibs/timedeltas.pyx#L953-L964):
# For millisecond and second resos, we cannot actually pass int(value) because
# many cases would fall outside of the pytimedelta implementation bounds.
# We pass 0 instead, and override seconds, microseconds, days.
# In principle we could pass 0 for ns and us too.
if reso == NPY_FR_ns:
td_base = _Timedelta.__new__(cls, microseconds=int(value) // 1000)
elif reso == NPY_DATETIMEUNIT.NPY_FR_us:
td_base = _Timedelta.__new__(cls, microseconds=int(value))
elif reso == NPY_DATETIMEUNIT.NPY_FR_ms:
td_base = _Timedelta.__new__(cls, milliseconds=0)
elif reso == NPY_DATETIMEUNIT.NPY_FR_s:
td_base = _Timedelta.__new__(cls, seconds=0)
and then the relevant note in the seconds attribute (https://github.com/pandas-dev/pandas/blob/43691a2f5d235b08f0f3aa813d8fdcb7c4ce1e47/pandas/_libs/tslibs/timedeltas.pyx#L1103-L1104):
# NB: using the python C-API PyDateTime_DELTA_GET_SECONDS will fail
# (or be incorrect)
So pyarrow is exactly using those C APIs, which are now silently incorrect for second and millisecond resolution Timedeltas.
While pyarrow should certainly adapt to properly support pandas.Timedelta (especially for cases outside the range of datetime.timedelta), I think it might be safer for pandas to ensure those C APIs keep working (I think pyarrow will certainly not be the only one using those). It seems possible to me to at least set those for the cases where they are within the range of datetime.timedelta.
Opened https://github.com/pandas-dev/pandas/issues/54682 on the pandas side
(Posting here for reference.) From the Pandas side, it's been decided that we won't support the C APIs, since this isn't possible to support cleanly in the general case. So the fix needs to happen on the PyArrow side.
Describe the bug, including details regarding any error messages, version, and platform.
I would expect all 4 results to return
datetime.timedelta(days=1)
Component(s)
Python