Open stinodego opened 1 year ago
Thank you for bringing this to our attention! And thanks for putting effort to update the behaviour across all the libraries implementing the protocol.
One thought that comes to mind is that this should be made more explicit in the __dataframe__
protocol spec. Is there and issue opened as I haven't found one?
One thought that comes to mind is that this should be made more explicit in the dataframe protocol spec. Is there and issue opened as I haven't found one?
There is no issue on this. I thought about opening one, but I would say that the text for the get_buffers
method is quite clear already (see below). And from_dataframe
is not explicitly part of the API, so there is no real documentation on how that should work.
Thanks so much for picking this up!
def get_buffers(self) -> ColumnBuffers:
"""
Return a dictionary containing the underlying buffers.
The returned dictionary has the following contents:
- "data": a two-element tuple whose first element is a buffer
containing the data and whose second element is the data
buffer's associated dtype.
- "validity": a two-element tuple whose first element is a buffer
containing mask values indicating missing data and
whose second element is the mask value buffer's
associated dtype. None if the null representation is
not a bit or byte mask.
- "offsets": a two-element tuple whose first element is a buffer
containing the offset values for variable-size binary
data (e.g., variable-length strings) and whose second
element is the offsets buffer's associated dtype. None
if the data buffer does not have an associated offsets
buffer.
"""
There is no issue on this. I thought about opening one, but I would say that the text for the
get_buffers
method is quite clear already (see below).
If several implementations got this wrong, then the doc is certainly not clear enough IMHO.
There is no issue on this. I thought about opening one, but I would say that the text for the
get_buffers
method is quite clear already (see below).If several implementations got this wrong, then the doc is certainly not clear enough IMHO.
Not necessarily. I think everyone just followed along with the first implementation (pandas?) without thinking too hard about it.
But of course you could open a PR at the DataFrame API repo with an improved docstring.
But of course you could open a PR at the DataFrame API repo with an improved docstring.
I'll try to. Though I already opened 5 issues and never got a single response...
And I think also several people involved in the protocol spec repo itself (me included) are clearly confusing this (apart from the implementations). Seeing https://github.com/data-apis/dataframe-api/pull/87, where "I assume that the dtype should be provided as the same tuple returned from Column.dtype
" was confirmed.
And the PR referenced there to add the protocol to StaticFrame did exactly the same "mistake".
So yes, I think although strictly speaking the current docstring is clear and unambiguous, calling it out explicitly would clearly be worth it.
And every implementation will have to handle both ways of specifying the dtype on the short term ...
@AlenkaF Thanks so much for addressing this issue! Should we perhaps create a follow-up issue for updating the data buffer dtype?
Ah, I was thinking of keeping this one open until all 3 points from the description are resolved.
@jorisvandenbossche if you would rather see me opening a new one, I am fine with that. Otherwise I will reopen this one.
@jorisvandenbossche do you know if the https://github.com/apache/arrow/pull/37986 PR will get into 14.0.0 if this issue is not closed? If not, then it may be better to close this issue and open a follow-up for the third item from the list ... 😬 As we need to sync these changes across multiple libraries, the third item might take a while.
Yes, the release branch is not yet made, so everything that is in main right now will be included in the release regardless of the issue.
@stinodego looking at two last items from the list:
- [ ] Make sure other libraries have also updated their from_dataframe implementation. See https://github.com/pandas-dev/pandas/issues/54781 for the pandas issue.
- [ ] Fix the data buffer dtypes.
Do you think we could proceed and fix the data buffer dtypes in pyarrow and pandas implementation? Pandas already fixed the first item from the list (Fix the from_dataframe implementation to use the column dtype rather than the data buffer dtype to interpret the buffers.), the fix on our side was included in 14.0.0 release. I would like to see the last item added to 15.0.0 in pyarrow if possible.
I plan to include the fix in the Polars 1.0.0 release which comes somewhere in January/February. So I'd say go ahead and get the fix done for pyarrow.
I know @MarcoGorelli has expressed some concern about compatibility though. He noted that this means people on older pandas versions can no longer longer use from_dataframe
to read pyarrow frames created on newer pyarrow versions. That's a valid point. Decision is up to you!
Describe the bug, including details regarding any error messages, version, and platform.
Code example
Issue description
As you can see, the dtype of the data buffer is the same as the dtype of the column. This is only correct for integers and floats. Categoricals, strings, and datetime types have a some integer as their physical representation. The data buffer should have this physical data type associated with it.
The dtype of the Column object should provide information on how to interpret the various buffers. The dtype associated with each buffer should be the dtype of the actual data in that buffer. This is the second part of the issue: the implementation of
from_dataframe
is incorrect - it should use the column dtype rather than the data buffer dtype.Fix
Fixing the
get_buffers
implementation should be relatively simple. However, this will break anyfrom_dataframe
implementation (also from other libraries) that rely on the data buffer having the column dtype.So fixing this should ideally go in three steps:
from_dataframe
implementation to use the column dtype rather than the data buffer dtype to interpret the buffers.from_dataframe
implementation. See https://github.com/pandas-dev/pandas/issues/54781 for the pandas issue.Tagging @AlenkaF as I know you've been working on the protocol for pyarrow.
Component(s)
Python