apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.03k stars 3.43k forks source link

[Python] pa.scalar(pd.Timedelta) with second or millisecond unit performs invalid conversion #37291

Open mroeschke opened 11 months ago

mroeschke commented 11 months ago

Describe the bug, including details regarding any error messages, version, and platform.

In [1]: import pandas as pd; import pyarrow as pa

In [9]: pa.scalar(pd.Timedelta(days=1).as_unit("s"))
Out[9]: <pyarrow.DurationScalar: datetime.timedelta(0)>

In [10]: pa.scalar(pd.Timedelta(days=1).as_unit("ms"))
Out[10]: <pyarrow.DurationScalar: datetime.timedelta(0)>

In [11]: pa.scalar(pd.Timedelta(days=1).as_unit("us"))
Out[11]: <pyarrow.DurationScalar: datetime.timedelta(days=1)>

In [12]: pa.scalar(pd.Timedelta(days=1).as_unit("ns"))
Out[12]: <pyarrow.DurationScalar: datetime.timedelta(days=1)>

In [13]: pa.__version__
Out[13]: '12.0.0'  # same result with 13.0.0.dev715

I would expect all 4 results to return datetime.timedelta(days=1)

Component(s)

Python

jorisvandenbossche commented 11 months ago

Thanks for the report @mroeschke. Looking into it, this might actually be a bug in pandas.

In all cases, we currently interpret this pd.Timedelta object as a datetime.timedelta in the type inference (so we currently actually ignore the nano part if you don't manually specify that as the data type; this is a known limitation, eg https://github.com/apache/arrow/issues/18099 and https://github.com/apache/arrow/issues/36035 for the equivalent Timestamp case).

But when putting a breakpoint in our conversion utility, it seems that when I pass a nanosecond-resolution Timedelta we properly get the number of days, but when it is a new seconds-resolution Timedelta, those attributes are not set:

Running arr = pa.array([pd.Timedelta(days=1)]):

Thread 1 "python" hit Breakpoint 2, arrow::py::internal::PyDelta_to_s (pytimedelta=0x7fffd75d5120) at /home/joris/scipy/repos/arrow/python/pyarrow/src/arrow/python/datetime.h:149
149   return (PyDateTime_DELTA_GET_DAYS(pytimedelta) * 86400LL +
(gdb) p pytimedelta->seconds
$2 = 0
(gdb) p pytimedelta->microseconds
$3 = 0
(gdb) p pytimedelta->days
$4 = 1                             # <---- properly set to 1

Running arr = pa.array([pd.Timedelta(days=1).as_unit('s')]):

Thread 1 "python" hit Breakpoint 1, arrow::py::internal::PyDelta_to_s (pytimedelta=0x7fffd75d51c0) at /home/joris/scipy/repos/arrow/python/pyarrow/src/arrow/python/datetime.h:149
149   return (PyDateTime_DELTA_GET_DAYS(pytimedelta) * 86400LL +
(gdb) p pytimedelta->seconds
$1 = 0
(gdb) p pytimedelta->microseconds
$2 = 0
(gdb) p pytimedelta->days
$3 = 0                             # <---- this is now 0 !
jorisvandenbossche commented 11 months ago

So this is actually documented in the pandas' source code:

In the Timedelta constructor (https://github.com/pandas-dev/pandas/blob/43691a2f5d235b08f0f3aa813d8fdcb7c4ce1e47/pandas/_libs/tslibs/timedeltas.pyx#L953-L964):

    # For millisecond and second resos, we cannot actually pass int(value) because
    #  many cases would fall outside of the pytimedelta implementation bounds.
    #  We pass 0 instead, and override seconds, microseconds, days.
    #  In principle we could pass 0 for ns and us too.
    if reso == NPY_FR_ns:
        td_base = _Timedelta.__new__(cls, microseconds=int(value) // 1000)
    elif reso == NPY_DATETIMEUNIT.NPY_FR_us:
        td_base = _Timedelta.__new__(cls, microseconds=int(value))
    elif reso == NPY_DATETIMEUNIT.NPY_FR_ms:
        td_base = _Timedelta.__new__(cls, milliseconds=0)
    elif reso == NPY_DATETIMEUNIT.NPY_FR_s:
        td_base = _Timedelta.__new__(cls, seconds=0)

and then the relevant note in the seconds attribute (https://github.com/pandas-dev/pandas/blob/43691a2f5d235b08f0f3aa813d8fdcb7c4ce1e47/pandas/_libs/tslibs/timedeltas.pyx#L1103-L1104):

        # NB: using the python C-API PyDateTime_DELTA_GET_SECONDS will fail
        #  (or be incorrect)

So pyarrow is exactly using those C APIs, which are now silently incorrect for second and millisecond resolution Timedeltas.

While pyarrow should certainly adapt to properly support pandas.Timedelta (especially for cases outside the range of datetime.timedelta), I think it might be safer for pandas to ensure those C APIs keep working (I think pyarrow will certainly not be the only one using those). It seems possible to me to at least set those for the cases where they are within the range of datetime.timedelta.

jorisvandenbossche commented 11 months ago

Opened https://github.com/pandas-dev/pandas/issues/54682 on the pandas side

BenjaminHelyer commented 9 months ago

(Posting here for reference.) From the Pandas side, it's been decided that we won't support the C APIs, since this isn't possible to support cleanly in the general case. So the fix needs to happen on the PyArrow side.