apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] Add date32 support to __dataframe__ protocol #39539

Open WillAyd opened 8 months ago

WillAyd commented 8 months ago

Describe the enhancement requested

>>> pa.Table.from_pydict({"col": [datetime.date(2024, 1, 1)]}).__dataframe__().get_column(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/willayd/mambaforge/envs/pantab-dev/lib/python3.12/site-packages/pyarrow/interchange/dataframe.py", line 139, in get_column
    return _PyArrowColumn(self._df.column(i),
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willayd/mambaforge/envs/pantab-dev/lib/python3.12/site-packages/pyarrow/interchange/column.py", line 239, in __init__
    self._dtype = self._dtype_from_arrowdtype(dtype, bit_width)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/willayd/mambaforge/envs/pantab-dev/lib/python3.12/site-packages/pyarrow/interchange/column.py", line 322, in _dtype_from_arrowdtype
    raise ValueError(
ValueError: Data type date32[day] not supported by interchange protocol

Component(s)

Python

AlenkaF commented 8 months ago

It would also be good to add the 64-bit date type, the 32-bit and 64-bit time types, and the duration type.

Contributions are more than welcome as there is no immediate plan to work on this. I can guide anybody interested! ❤️

jorisvandenbossche commented 8 months ago

The interchange protocol currently doesn't define a date type AFAIK (https://data-apis.org/dataframe-protocol/latest/API.html#interface), so do you expect it to be written as DATETIME?

AlenkaF commented 8 months ago

Yes, that was my idea. Similar to what polars does: https://github.com/pola-rs/polars/blob/2b43fc1ac1af84ed118ff3f8840d328a12c35510/py-polars/polars/interchange/utils.py#L35-L54

AlenkaF commented 8 months ago

Date and Duration data type classes are added to the staging branch of the protocol: https://github.com/data-apis/dataframe-api/blob/c5f08352e0a1d25387fe1737ffe9cccb36f554f7/spec/API_specification/dataframe_api/dtypes.py#L50

which I guess should be the draft docs page? https://data-apis.org/dataframe-api/draft/API_specification/index.html

But I am not sure if this will move forward soon.

jorisvandenbossche commented 8 months ago

> Date and Duration data type classes are added to the staging branch of the protocol: https://github.com/data-apis/dataframe-api/blob/c5f08352e0a1d25387fe1737ffe9cccb36f554f7/spec/API_specification/dataframe_api/dtypes.py#L50

That's for the standard API, though, not for the interchange protocol (I was confused as well, and so wrote a wrong comment on the PR adding it asking for clarification ;) -> https://github.com/data-apis/dataframe-api/pull/197)

> Yes, that was my idea. Similar to what polars does: https://github.com/pola-rs/polars/blob/2b43fc1ac1af84ed118ff3f8840d328a12c35510/py-polars/polars/interchange/utils.py#L35-L54

Personally I think it would be better if this was first clarified or added in the interchange protocol. While for date it does make some sense (as you could just see it as a different resolution of datetime), duration is really different. And for example the pandas implementation also wouldn't support consuming duration. And pyarrow only supports consuming datetime as timestamp, not even date.

AlenkaF commented 8 months ago

> That's for the standard API, though, not for the interchange protocol (I was confused as well, and so wrote a wrong comment on the PR adding it asking for clarification ;) -> https://github.com/data-apis/dataframe-api/pull/197)

Oooh, sorry for sending you in the wrong direction! I didn't see it at the time.

> Personally I think it would be better if this was first clarified or added in the interchange protocol.

That does make sense 👍

WillAyd commented 8 months ago

@jorisvandenbossche my expectation was that the buffer would contain 32-bit integers (for date64 they would be 64-bit). The consumer would be responsible for interpreting those values as the appropriate dates, based on the precision defined in the format string.
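
To make that expectation concrete, here is a minimal sketch that uses only the Python standard library; it is not how pyarrow implements the protocol. The helper name `decode_dates` is hypothetical, and the buffers are assumed to be little-endian. The format strings `tdD` (date32, days since the Unix epoch) and `tdm` (date64, milliseconds since the Unix epoch) are the ones defined by the Arrow C Data Interface.

```python
import datetime
import struct

EPOCH = datetime.date(1970, 1, 1)
MS_PER_DAY = 86_400_000


def decode_dates(buffer: bytes, format_str: str) -> list[datetime.date]:
    """Interpret a raw little-endian buffer according to an Arrow C format string.

    Hypothetical helper for illustration only.
    """
    if format_str == "tdD":    # date32: int32 days since the epoch
        code, scale = "i", 1
    elif format_str == "tdm":  # date64: int64 milliseconds since the epoch
        code, scale = "q", MS_PER_DAY
    else:
        raise NotImplementedError(f"Unsupported format string: {format_str}")

    width = struct.calcsize("<" + code)
    count = len(buffer) // width
    values = struct.unpack(f"<{count}{code}", buffer)
    # Floor division converts milliseconds to whole days for date64
    return [EPOCH + datetime.timedelta(days=v // scale) for v in values]


# Build one-element buffers for the same calendar date in both encodings
days = (datetime.date(2024, 1, 1) - EPOCH).days
date32_buffer = struct.pack("<i", days)
date64_buffer = struct.pack("<q", days * MS_PER_DAY)

print(decode_dates(date32_buffer, "tdD"))
print(decode_dates(date64_buffer, "tdm"))
```

Both calls decode to the same date; only the integer width and the unit carried by the format string differ, which is why the consumer needs the format string to interpret the buffer.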

jorisvandenbossche commented 7 months ago

(existing upstream issue about duration/timedelta: https://github.com/data-apis/dataframe-api/issues/329)

jonmmease commented 5 months ago

Let me know if you think this is a distinct issue, but I ran into a different error message when converting a Date32 from Polars through the DataFrame interchange protocol.

import datetime
import polars as pl
from pyarrow.interchange import from_dataframe

from_dataframe(pl.DataFrame({"date": [datetime.date(2024, 3, 22)]}))
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
...
File .../envs/default/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:563, in validity_buffer_nan_sentinel(data_pa_buffer, data_type, describe_null, length, offset, allow_copy)
    537 """
    538 Build a PyArrow buffer from NaN or sentinel values.
    539 
   (...)
    560 pa.Buffer
    561 """
    562 kind, bit_width, _, _ = data_type
--> 563 data_dtype = map_date_type(data_type)
    564 null_kind, sentinel_val = describe_null
    566 # Check for float NaN values

File ...envs/default/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:332, in map_date_type(data_type)
    329 kind, bit_width, f_string, _ = data_type
    331 if kind == DtypeKind.DATETIME:
--> 332     unit, tz = parse_datetime_format_str(f_string)
    333     return pa.timestamp(unit, tz=tz)
    334 else:

File .../envs/default/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:324, in parse_datetime_format_str(format_str)
    320         unit += "s"
    322     return unit, tz
--> 324 raise NotImplementedError(f"DateTime kind is not supported: {format_str}")

NotImplementedError: DateTime kind is not supported: tdD

What's the current thinking on the best way forward to support this?

AlenkaF commented 5 months ago

Thank you for contributing to the discussion, @jonmmease. I see that libraries are working around this by exposing date and time types as the protocol's DATETIME dtype kind with an Apache Arrow C Data Interface format string (for example, tdD for date32 and tdm for date64; see the Polars code and pandas code).
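
As a rough illustration of what consuming those format strings could look like, the sketch below classifies a DATETIME format string as a date or a timestamp. It is a standalone example, not pyarrow's `parse_datetime_format_str`; the function name `classify_temporal_format` and its return shape are hypothetical. The format strings themselves (`tdD`, `tdm`, and `ts{s,m,u,n}:<timezone>`) come from the Arrow C Data Interface.

```python
import re

# Timestamp unit letters used in Arrow C format strings, mapped to unit names
_TIMESTAMP_UNITS = {"s": "s", "m": "ms", "u": "us", "n": "ns"}

# Date format strings: date32 counts days, date64 counts milliseconds
_DATE_TYPES = {"tdD": ("date32", "day"), "tdm": ("date64", "ms")}


def classify_temporal_format(format_str: str) -> tuple[str, str, str]:
    """Return (type_name, unit, timezone) for a DATETIME format string.

    Hypothetical helper for illustration only. Durations (tDs, tDm, ...)
    are deliberately rejected, matching the discussion in this thread.
    """
    if format_str in _DATE_TYPES:
        name, unit = _DATE_TYPES[format_str]
        return name, unit, ""

    match = re.fullmatch(r"ts([smun]):(.*)", format_str)
    if match:
        unit, tz = match.groups()
        return "timestamp", _TIMESTAMP_UNITS[unit], tz

    raise NotImplementedError(f"DateTime kind is not supported: {format_str}")


print(classify_temporal_format("tdD"))
print(classify_temporal_format("tsu:UTC"))
```

With a mapping like this, a consumer would no longer fail on `tdD` the way the traceback above does; whether that mapping belongs in the protocol spec is the open question here.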

I do not mind handling it in a similar way in PyArrow until date is added to the dataframe protocol spec, and also adding the option to consume this data type. Ideally, though, this would be clarified and settled in the protocol first.

@jorisvandenbossche, what do you think?

jorisvandenbossche commented 5 months ago

I would still prefer someone to first do a PR to the spec to add this. If it is just clarifying that the existing DATETIME dtype kind can also be used for other Arrow date and time dtypes, that should be relatively easy.

> I see that libraries are working around this by defining date and time types as protocol DATETIME data type with Apache Arrow C Data Interface format string (example tdD for date32, tdm for date64 etc, see Polars code and pandas code).

AFAIK pandas doesn't actually support this for duration, at least not for the default timedelta dtype (from testing with pandas main):

In [7]: from pyarrow.interchange import from_dataframe

In [8]: from_dataframe(pd.DataFrame({'a': pd.timedelta_range(0, "1 days", freq='s')}))
...
File ~/scipy/repos/pandas/pandas/core/interchange/utils.py:147, in dtype_to_arrow_c_fmt(dtype)
    144 elif isinstance(dtype, DatetimeTZDtype):
    145     return ArrowCTypes.TIMESTAMP.format(resolution=dtype.unit[0], tz=dtype.tz)
--> 147 raise NotImplementedError(
    148     f"Conversion of {dtype} to Arrow C format string is not implemented."
    149 )

NotImplementedError: Conversion of timedelta64[ns] to Arrow C format string is not implemented.

FWIW, my proposal to add support for the Arrow PyCapsule protocol to the interchange standard (https://github.com/data-apis/dataframe-api/pull/342) would also solve this for the case of polars and pyarrow: both are Arrow-memory based and could easily interchange those data types. (That of course requires polars to implement it, and based on https://github.com/pola-rs/polars/issues/12530 I think that is still WIP.)

We could start checking for that protocol in pyarrow.interchange.from_dataframe, although that would also be an extension not covered by the official spec.
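
A minimal sketch of what such a check might look like, written without pyarrow so it stays self-contained. The function `pick_import_path` and the two stand-in classes are hypothetical; only the dunder names `__arrow_c_stream__` (Arrow PyCapsule interface) and `__dataframe__` (interchange protocol) are real.

```python
def pick_import_path(obj) -> str:
    """Decide which exchange mechanism a consumer would try first.

    Hypothetical helper: prefers the Arrow PyCapsule stream export when
    the producer offers it, and falls back to the interchange protocol.
    """
    if hasattr(obj, "__arrow_c_stream__"):
        return "arrow_pycapsule"
    if hasattr(obj, "__dataframe__"):
        return "interchange_protocol"
    raise TypeError(f"{type(obj).__name__} supports neither protocol")


class ArrowBackedFrame:
    """Stand-in for an Arrow-memory producer that exposes both protocols."""

    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError("illustration only")

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        raise NotImplementedError("illustration only")


class InterchangeOnlyFrame:
    """Stand-in for a producer that only implements __dataframe__."""

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        raise NotImplementedError("illustration only")


print(pick_import_path(ArrowBackedFrame()))
print(pick_import_path(InterchangeOnlyFrame()))
```

Checking for the Arrow export first means date, time, and duration columns from Arrow-based producers would never pass through the DATETIME format-string mapping at all.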

AlenkaF commented 5 months ago

Thank you for the clarification, Joris!

I propose we start with a PR to the dataframe protocol specification stating that the existing DATETIME dtype kind can also be used for other Arrow date and time dtypes (not duration). I will do this today or tomorrow.

The proposal to add support for the Arrow PyCapsule protocol to the interchange standard would be great in my opinion. I hope it moves forward; otherwise the libraries involved will start checking for the protocol themselves, as you suggested.