Open WillAyd opened 11 months ago
It would also be good to add 64-bit date type, 32 and 64-bit time type plus duration type.
Contributions are more than welcome as there is no immediate plan to work on this. I can guide anybody interested! ❤️
The interchange protocol currently doesn't define a date type AFAIK (https://data-apis.org/dataframe-protocol/latest/API.html#interface), so do you expect it to be written as DATETIME?
Yes, that was my idea. Similar to what polars does: https://github.com/pola-rs/polars/blob/2b43fc1ac1af84ed118ff3f8840d328a12c35510/py-polars/polars/interchange/utils.py#L35-L54
Date and Duration data type classes are added to the staging branch of the protocol: https://github.com/data-apis/dataframe-api/blob/c5f08352e0a1d25387fe1737ffe9cccb36f554f7/spec/API_specification/dataframe_api/dtypes.py#L50
which I guess should be the draft docs page? https://data-apis.org/dataframe-api/draft/API_specification/index.html
But I am not sure if this will move forward soon.
> Date and Duration data type classes are added to the staging branch of the protocol: https://github.com/data-apis/dataframe-api/blob/c5f08352e0a1d25387fe1737ffe9cccb36f554f7/spec/API_specification/dataframe_api/dtypes.py#L50
That's for the standard API, though, not for the interchange protocol (I was confused as well, and so wrote a wrong comment on the PR adding it asking for clarification ;) -> https://github.com/data-apis/dataframe-api/pull/197)
> Yes, that was my idea. Similar to what polars does: https://github.com/pola-rs/polars/blob/2b43fc1ac1af84ed118ff3f8840d328a12c35510/py-polars/polars/interchange/utils.py#L35-L54
Personally I think it would be better if this was first clarified or added in the interchange protocol. While for date it does make some sense (as you could just see it as a different resolution of datetime), duration is really different. And for example the pandas implementation also wouldn't support consuming duration. And pyarrow only supports consuming datetime as timestamp, not even date.
> That's for the standard API, though, not for the interchange protocol (I was confused as well, and so wrote a wrong comment on the PR adding it asking for clarification ;) -> https://github.com/data-apis/dataframe-api/pull/197)
Oooh, sorry for taking you into the wrong direction! I didn't see it at the time.
> Personally I think it would be better if this was first clarified or added in the interchange protocol.
That does make sense 👍
@jorisvandenbossche my expectation was that the buffer would contain 32-bit integers (64-bit for date64). The consumer would be responsible for interpreting them as the appropriate dates, based on the precision defined in the format string.
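As a rough illustration of the consumer-side interpretation described above, here is a minimal sketch that decodes a raw date buffer according to its Arrow C Data Interface format string (`tdD` is date32 in days since the UNIX epoch, `tdm` is date64 in milliseconds since the epoch). The helper name `decode_date_buffer` is hypothetical, the buffer is assumed to be little-endian, and null handling is deliberately left out; this is not how any particular library implements it.

```python
import datetime
import struct

EPOCH = datetime.date(1970, 1, 1)
MS_PER_DAY = 86_400_000


def decode_date_buffer(raw: bytes, format_str: str) -> list:
    """Hypothetical helper: turn a raw date buffer into datetime.date objects.

    Assumes a little-endian buffer with no nulls (both are assumptions).
    """
    if format_str == "tdD":
        # date32: one 32-bit signed integer per value, counting days
        code, divisor = "i", 1
    elif format_str == "tdm":
        # date64: one 64-bit signed integer per value, counting milliseconds
        code, divisor = "q", MS_PER_DAY
    else:
        raise NotImplementedError(f"Not a date format string: {format_str}")

    count = len(raw) // struct.calcsize("<" + code)
    values = struct.unpack(f"<{count}{code}", raw)
    return [EPOCH + datetime.timedelta(days=v // divisor) for v in values]


# Usage: round-trip a single date through a date32 buffer.
days = (datetime.date(2024, 3, 22) - EPOCH).days
print(decode_date_buffer(struct.pack("<i", days), "tdD"))
```

The only thing that changes between date32 and date64 here is the element width and the unit, which is why treating date as "a different resolution of datetime" is plausible for the protocol.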
(existing upstream issue about duration/timedelta: https://github.com/data-apis/dataframe-api/issues/329)
Let me know if you think this is a distinct issue, but I ran into a different error message when converting a Date32 from Polars through the DataFrame interchange protocol.
import datetime
import polars as pl
from pyarrow.interchange import from_dataframe
from_dataframe(pl.DataFrame({"date": [datetime.date(2024, 3, 22)]}))
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
...
File .../envs/default/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:563, in validity_buffer_nan_sentinel(data_pa_buffer, data_type, describe_null, length, offset, allow_copy)
537 """
538 Build a PyArrow buffer from NaN or sentinel values.
539
(...)
560 pa.Buffer
561 """
562 kind, bit_width, _, _ = data_type
--> 563 data_dtype = map_date_type(data_type)
564 null_kind, sentinel_val = describe_null
566 # Check for float NaN values
File ...envs/default/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:332, in map_date_type(data_type)
329 kind, bit_width, f_string, _ = data_type
331 if kind == DtypeKind.DATETIME:
--> 332 unit, tz = parse_datetime_format_str(f_string)
333 return pa.timestamp(unit, tz=tz)
334 else:
File .../envs/default/lib/python3.11/site-packages/pyarrow/interchange/from_dataframe.py:324, in parse_datetime_format_str(format_str)
320 unit += "s"
322 return unit, tz
--> 324 raise NotImplementedError(f"DateTime kind is not supported: {format_str}")
NotImplementedError: DateTime kind is not supported: tdD
What's the current thinking on the best way forward to support this?
Thank you for contributing to the discussion @jonmmease. I see that libraries are working around this by defining date and time types as protocol DATETIME data type with Apache Arrow C Data Interface format string (for example tdD for date32, tdm for date64, etc.; see Polars code and pandas code).
I do not mind going about it in similar way in PyArrow until date is added to the dataframe protocol spec. Also adding the option to consume this data type. It would be ideal, though, that this is clarified and set in the protocol first.
@jorisvandenbossche, what do you think?
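To make the workaround above concrete, here is a hedged sketch of how a consumer could recognise date and time format strings alongside timestamps when the dtype kind is DATETIME. The format codes come from the Arrow C Data Interface; the names `TemporalType` and `parse_temporal_format_str` are hypothetical and this is not the actual PyArrow implementation. Duration strings (`tDs` and similar) are intentionally rejected.

```python
from typing import NamedTuple, Optional


class TemporalType(NamedTuple):
    """Hypothetical description of a temporal Arrow type."""
    name: str
    unit: str
    tz: Optional[str] = None


# Date and time format strings from the Arrow C Data Interface.
_FIXED_FORMATS = {
    "tdD": TemporalType("date32", "D"),
    "tdm": TemporalType("date64", "ms"),
    "tts": TemporalType("time32", "s"),
    "ttm": TemporalType("time32", "ms"),
    "ttu": TemporalType("time64", "us"),
    "ttn": TemporalType("time64", "ns"),
}

# Timestamp unit letters, as in "tsu:UTC".
_TIMESTAMP_UNITS = {"s": "s", "m": "ms", "u": "us", "n": "ns"}


def parse_temporal_format_str(format_str: str) -> TemporalType:
    """Map a format string to a temporal type description."""
    if format_str in _FIXED_FORMATS:
        return _FIXED_FORMATS[format_str]

    head, sep, tz = format_str.partition(":")
    if sep and len(head) == 3 and head.startswith("ts") and head[2] in _TIMESTAMP_UNITS:
        return TemporalType("timestamp", _TIMESTAMP_UNITS[head[2]], tz or None)

    # Durations ("tDs", ...) and anything else fall through here.
    raise NotImplementedError(f"DateTime kind is not supported: {format_str}")


print(parse_temporal_format_str("tdD"))
print(parse_temporal_format_str("tsu:UTC"))
```

A real consumer would return the corresponding library type instead of a tuple; the lookup itself is the part the spec clarification would need to sanction.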
I would still prefer someone to first do a PR to the spec to add this. If it is just clarifying that the existing DATETIME dtype kind can also be used for other Arrow date and time dtypes, that should be relatively easy.
> I see that libraries are working around this by defining date and time types as protocol DATETIME data type with Apache Arrow C Data Interface format string (for example tdD for date32, tdm for date64, etc.; see Polars code and pandas code).
AFAIK pandas doesn't actually support this for duration, at least not for the default timedelta dtype (from testing with pandas main):
In [7]: from pyarrow.interchange import from_dataframe
In [8]: from_dataframe(pd.DataFrame({'a': pd.timedelta_range(0, "1 days", freq='s')}))
...
File ~/scipy/repos/pandas/pandas/core/interchange/utils.py:147, in dtype_to_arrow_c_fmt(dtype)
144 elif isinstance(dtype, DatetimeTZDtype):
145 return ArrowCTypes.TIMESTAMP.format(resolution=dtype.unit[0], tz=dtype.tz)
--> 147 raise NotImplementedError(
148 f"Conversion of {dtype} to Arrow C format string is not implemented."
149 )
NotImplementedError: Conversion of timedelta64[ns] to Arrow C format string is not implemented.
FWIW, my proposal to add support for the Arrow PyCapsule protocol to the interchange standard (https://github.com/data-apis/dataframe-api/pull/342) would also solve this for the case of polars and pyarrow, as both are Arrow-memory based, and could interchange easily those data types. (although that of course requires polars to implement it, and based on https://github.com/pola-rs/polars/issues/12530 that is still WIP I think)
We could start checking for that protocol in pyarrow.interchange.from_dataframe, although that would also be an extension not covered by the official spec.
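A minimal sketch of what such a check could look like, assuming the consumer prefers the Arrow PyCapsule stream method `__arrow_c_stream__` when the object exposes it and otherwise falls back to the interchange protocol's `__dataframe__`. The function name `choose_import_path` and the stand-in classes are hypothetical; the sketch only shows the dispatch order, not the actual conversion.

```python
def choose_import_path(obj) -> str:
    """Hypothetical dispatch: decide which protocol to use to consume obj."""
    if hasattr(obj, "__arrow_c_stream__"):
        # Arrow-native producers can hand over date, time and duration
        # columns without going through the interchange dtype kinds.
        return "arrow_c_stream"
    if hasattr(obj, "__dataframe__"):
        # Fall back to the dataframe interchange protocol.
        return "interchange"
    raise TypeError(
        f"{type(obj).__name__} supports neither the Arrow PyCapsule "
        "interface nor the dataframe interchange protocol"
    )


# Usage with two stand-in producers (illustrative only).
class ArrowBacked:
    def __arrow_c_stream__(self, requested_schema=None):
        raise NotImplementedError("stand-in only")

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        raise NotImplementedError("stand-in only")


class InterchangeOnly:
    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        raise NotImplementedError("stand-in only")


print(choose_import_path(ArrowBacked()))
print(choose_import_path(InterchangeOnly()))
```

Checking the Arrow method first means Arrow-memory-based libraries never hit the unsupported-format path, while other producers keep the existing behaviour.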
Thank you for clarification Joris!
I propose we start with a PR to the dataframe protocol specification to state that the existing DATETIME dtype kind can also be used for other Arrow date and time dtypes (not duration). I will do this today/tomorrow.
The proposal to add support for the Arrow PyCapsule protocol to the interchange standard would be great in my opinion. I hope it will move forward; otherwise the libs involved will start checking for the protocol by themselves, like you have suggested.
Describe the enhancement requested
Component(s)
Python