[Python] Interchange object data buffer has the wrong dtype / `from_dataframe` incorrect

stinodego commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

Code example

from datetime import datetime
import pyarrow as pa

string = pa.array(["a", "bc", None])
datetime = pa.array([datetime(2022, 1, 1), datetime(2022, 1, 2), datetime(2022, 1, 3)])
categorical = pa.DictionaryArray.from_arrays(pa.array([0, 1, 0]), pa.array(["a", "b"]))

df = pa.Table.from_arrays(
    [string, datetime, categorical],
    names=["string", "datetime", "categorical"],
)

dfi = df.__dataframe__()

col = dfi.get_column_by_name("string")
print(col.dtype)
# (<DtypeKind.STRING: 21>, 8, 'u', '=')
print(col.get_buffers()["data"][1])
# (<DtypeKind.STRING: 21>, 8, 'u', '=') -> SHOULD  BE: (<DtypeKind.UINT: 1>, 8, 'C', '=')

col = dfi.get_column_by_name("datetime")
print(col.dtype)
# (<DtypeKind.DATETIME: 22>, 64, 'tsu:', '=')
print(col.get_buffers()["data"][1])
# (<DtypeKind.DATETIME: 22>, 64, 'tsu:', '=') -> SHOULD BE: (<DtypeKind.INT: 0>, 64, 'l', '=')

col = dfi.get_column_by_name("categorical")
print(col.dtype)
# (<DtypeKind.CATEGORICAL: 23>, 64, 'L', '=')
print(col.get_buffers()["data"][1])
# (<DtypeKind.INT: 0>, 64, 'l', '=') -> CORRECT!

Issue description

As you can see, the dtype of the data buffer is the same as the dtype of the column. This is only correct for integers and floats. Categoricals, strings, and datetime types have a some integer as their physical representation. The data buffer should have this physical data type associated with it.

The dtype of the Column object should provide information on how to interpret the various buffers. The dtype associated with each buffer should be the dtype of the actual data in that buffer. This is the second part of the issue: the implementation of from_dataframe is incorrect - it should use the column dtype rather than the data buffer dtype.

Fix

Fixing the get_buffers implementation should be relatively simple. However, this will break any from_dataframe implementation (also from other libraries) that rely on the data buffer having the column dtype.

So fixing this should ideally go in three steps:

[x] Fix the from_dataframe implementation to use the column dtype rather than the data buffer dtype to interpret the buffers.
[x] Make sure other libraries have also updated their from_dataframe implementation. See https://github.com/pandas-dev/pandas/issues/54781 for the pandas issue.
[ ] Fix the data buffer dtypes.

Tagging @AlenkaF as I know you've been working on the protocol for pyarrow.

Component(s)

Python

AlenkaF commented 1 year ago

Thank you for bringing this to our attention! And thanks for putting effort to update the behaviour across all the libraries implementing the protocol.

One thought that comes to mind is that this should be made more explicit in the __dataframe__ protocol spec. Is there and issue opened as I haven't found one?

stinodego commented 1 year ago

One thought that comes to mind is that this should be made more explicit in the dataframe protocol spec. Is there and issue opened as I haven't found one?

There is no issue on this. I thought about opening one, but I would say that the text for the get_buffers method is quite clear already (see below). And from_dataframe is not explicitly part of the API, so there is no real documentation on how that should work.

Thanks so much for picking this up!

    def get_buffers(self) -> ColumnBuffers:
        """
        Return a dictionary containing the underlying buffers.

        The returned dictionary has the following contents:

            - "data": a two-element tuple whose first element is a buffer
                      containing the data and whose second element is the data
                      buffer's associated dtype.
            - "validity": a two-element tuple whose first element is a buffer
                          containing mask values indicating missing data and
                          whose second element is the mask value buffer's
                          associated dtype. None if the null representation is
                          not a bit or byte mask.
            - "offsets": a two-element tuple whose first element is a buffer
                         containing the offset values for variable-size binary
                         data (e.g., variable-length strings) and whose second
                         element is the offsets buffer's associated dtype. None
                         if the data buffer does not have an associated offsets
                         buffer.
        """

pitrou commented 1 year ago

There is no issue on this. I thought about opening one, but I would say that the text for the get_buffers method is quite clear already (see below).

If several implementations got this wrong, then the doc is certainly not clear enough IMHO.

stinodego commented 1 year ago

There is no issue on this. I thought about opening one, but I would say that the text for the get_buffers method is quite clear already (see below).

If several implementations got this wrong, then the doc is certainly not clear enough IMHO.

Not necessarily. I think everyone just followed along with the first implementation (pandas?) without thinking too hard about it.

But of course you could open a PR at the DataFrame API repo with an improved docstring.

pitrou commented 1 year ago

But of course you could open a PR at the DataFrame API repo with an improved docstring.

I'll try to. Though I already opened 5 issues and never got a single response...

jorisvandenbossche commented 1 year ago

And I think also several people involved in the protocol spec repo itself (me included) are clearly confusing this (apart from the implementations). Seeing https://github.com/data-apis/dataframe-api/pull/87, where "I assume that the dtype should be provided as the same tuple returned from Column.dtype" was confirmed. And the PR referenced there to add the protocol to StaticFrame did exactly the same "mistake".

jorisvandenbossche commented 1 year ago

So yes, I think although strictly speaking the current docstring is clear and unambiguous, calling it out explicitly would clearly be worth it.

jorisvandenbossche commented 1 year ago

And every implementation will have to handle both ways of specifying the dtype on the short term ...

stinodego commented 1 year ago

@AlenkaF Thanks so much for addressing this issue! Should we perhaps create a follow-up issue for updating the data buffer dtype?

AlenkaF commented 1 year ago

Ah, I was thinking of keeping this one open until all 3 points from the description are resolved.

@jorisvandenbossche if you would rather see me opening a new one, I am fine with that. Otherwise I will reopen this one.

AlenkaF commented 1 year ago

@jorisvandenbossche do you know if the https://github.com/apache/arrow/pull/37986 PR will get into 14.0.0 if this issue is not closed? If not, then it may be better to close this issue and open a follow-up for the third item from the list ... 😬 As we need to sync these changes across multiple libraries, the third item might take a while.

jorisvandenbossche commented 1 year ago

Yes, the release branch is not yet made, so everything that is in main right now will be included in the release regardless of the issue.

AlenkaF commented 11 months ago

@stinodego looking at two last items from the list:

[ ] Make sure other libraries have also updated their from_dataframe implementation. See https://github.com/pandas-dev/pandas/issues/54781 for the pandas issue.

[ ] Fix the data buffer dtypes.

Do you think we could proceed and fix the data buffer dtypes in pyarrow and pandas implementation? Pandas already fixed the first item from the list (Fix the from_dataframe implementation to use the column dtype rather than the data buffer dtype to interpret the buffers.), the fix on our side was included in 14.0.0 release. I would like to see the last item added to 15.0.0 in pyarrow if possible.

stinodego commented 11 months ago

I plan to include the fix in the Polars 1.0.0 release which comes somewhere in January/February. So I'd say go ahead and get the fix done for pyarrow.

I know @MarcoGorelli has expressed some concern about compatibility though. He noted that this means people on older pandas versions can no longer longer use from_dataframe to read pyarrow frames created on newer pyarrow versions. That's a valid point. Decision is up to you!

apache / arrow