data-apis / dataframe-api

RFC document, tooling and other content related to the dataframe API standard
https://data-apis.org/dataframe-api/draft/index.html
MIT License

Pandas implementation and BooleanDtype #52

Open AlenkaF opened 2 years ago

AlenkaF commented 2 years ago

While researching all the dtypes that can hold missing values in Vaex, and looking at how they are handled in the pandas implementation, I found that pandas has a BooleanDtype that raises an error.

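import pandas as pd
import pandas._testing as tm

# from_dataframe comes from the pandas implementation under discussion.
def test_bool():
    df = pd.DataFrame({"A": [True, False, False, True]})
    # Nullable boolean column backed by an ExtensionArray
    df["B"] = pd.array([True, False, pd.NA, True], dtype="boolean")
    df2 = from_dataframe(df)
    tm.assert_frame_equal(df, df2)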

My question is: when considering every kind of data that can end up in a Vaex dataframe, should one stick to the common dtypes, or should one work through all the possibilities at this level?

rgommers commented 2 years ago

Ah, this is an ExtensionArray; those have not been tested. Third-party extension arrays are too hard to support, I'd say, but ones with built-in dtypes would be nice to support.

In this case it looks like BooleanDtype has itemsize and kind attributes, but not str. Hence the error:

AttributeError: 'BooleanDtype' object has no attribute 'str'

The dtype seems to wrap the underlying NumPy dtype which does have the needed attribute:

>>> df['B'].dtype.numpy_dtype.str
'|b1'
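A minimal sketch of that fallback, assuming only that masked pandas dtypes such as BooleanDtype expose numpy_dtype as shown above. The helper name numpy_format_str is hypothetical and is not part of pandas or of the prototype:

```python
import numpy as np
import pandas as pd


def numpy_format_str(dtype):
    """Return the NumPy format string for a NumPy dtype or a masked pandas dtype.

    Hypothetical helper for illustration only.
    """
    # Masked extension dtypes (e.g. BooleanDtype) carry the wrapped NumPy
    # dtype in `numpy_dtype`; plain NumPy dtypes do not have that attribute,
    # so fall back to the dtype itself.
    np_dtype = getattr(dtype, "numpy_dtype", dtype)
    return np.dtype(np_dtype).str


df = pd.DataFrame({"A": [True, False, False, True]})
df["B"] = pd.array([True, False, pd.NA, True], dtype="boolean")

fmt_a = numpy_format_str(df["A"].dtype)  # plain NumPy bool column
fmt_b = numpy_format_str(df["B"].dtype)  # nullable BooleanDtype column
```

Extension dtypes that wrap no NumPy dtype (categoricals, third-party arrays) would still fail here and need separate handling.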

The second thing I noticed in the code is that it uses NumPy format strings, while the docs for Column.dtype specify that it must use the format string from the Apache Arrow C Data Interface (similar but slightly different). So we need a utility to map NumPy format strings to Arrow ones here. This should say 'b', not '|b1':

>>> df.__dataframe__().get_column_by_name('A').dtype
(<_DtypeKind.BOOL: 20>, 8, '|b1', '|')
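Such a utility could look roughly like the sketch below. It covers only booleans, fixed-width integers and floats, using the single-character format strings from the Arrow C Data Interface; the names _NUMPY_TO_ARROW and numpy_to_arrow_c_format are hypothetical and not the prototype's API:

```python
import numpy as np

# (NumPy kind, itemsize in bytes) -> Arrow C Data Interface format string.
# Only primitive types are listed; strings, datetimes and categoricals
# would need their own handling.
_NUMPY_TO_ARROW = {
    ("b", 1): "b",  # boolean
    ("i", 1): "c",  # int8
    ("u", 1): "C",  # uint8
    ("i", 2): "s",  # int16
    ("u", 2): "S",  # uint16
    ("i", 4): "i",  # int32
    ("u", 4): "I",  # uint32
    ("i", 8): "l",  # int64
    ("u", 8): "L",  # uint64
    ("f", 2): "e",  # float16
    ("f", 4): "f",  # float32
    ("f", 8): "g",  # float64
}


def numpy_to_arrow_c_format(dtype):
    """Map a NumPy dtype to an Arrow C format string (hypothetical helper)."""
    dtype = np.dtype(dtype)
    try:
        return _NUMPY_TO_ARROW[(dtype.kind, dtype.itemsize)]
    except KeyError:
        raise NotImplementedError(
            f"Conversion of {dtype} to Arrow C format string is not implemented."
        ) from None


bool_fmt = numpy_to_arrow_c_format(np.dtype("bool"))
```

Keying on kind and itemsize rather than on the NumPy format string keeps the byte-order prefix out of the lookup; byte order is already reported separately in the dtype tuple.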

AlenkaF commented 2 years ago

Thank you for the clarification @rgommers. Should I make a separate issue for the mapping of format strings?

rgommers commented 2 years ago

> Should I make a separate issue for the mapping of format strings?

That would be helpful, thanks!

rgommers commented 1 year ago

Now that Pandas 1.5.0 has a release candidate, I checked this:

>>> import pandas as pd
>>> pd.__version__
'1.5.0rc0'
>>> df = pd.DataFrame({"A": [True, False, False, True]})
>>> df["B"] = pd.array([True, False, pd.NA, True], dtype="boolean")
>>> df2 = pd.core.interchange.from_dataframe.from_dataframe(df)
>>> df
       A      B
0   True   True
1  False  False
2  False   <NA>
3   True   True
>>> df2
       A      B
0   True   True
1  False  False
2  False   <NA>
3   True   True
>>> pd.testing.assert_frame_equal(df, df2)  # doesn't raise, so all good

So all good, let's close this. Thanks again @AlenkaF

jorisvandenbossche commented 1 year ago

@rgommers the reason this roundtrip works for pandas is that from_dataframe has a special case for pandas dataframes that skips __dataframe__ entirely (https://github.com/pandas-dev/pandas/blob/5514aa3713b66f531f3abfc9cfe726a1dac638ff/pandas/core/interchange/from_dataframe.py#L47-L48):

    if isinstance(df, pd.DataFrame):
        return df

Accessing the buffers doesn't actually work yet (and so neither would the roundtrip, I suppose):

In [5]: df.__dataframe__().get_column_by_name("B").get_buffers()
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Input In [5], in <cell line: 1>()
----> 1 df.__dataframe__().get_column_by_name("B").get_buffers()

File ~/scipy/pandas/pandas/core/interchange/column.py:239, in PandasColumn.get_buffers(self)
    219 def get_buffers(self) -> ColumnBuffers:
    220     """
    221     Return a dictionary containing the underlying buffers.
    222     The returned dictionary has the following contents:
   (...)
    236                      buffer.
    237     """
    238     buffers: ColumnBuffers = {
--> 239         "data": self._get_data_buffer(),
    240         "validity": None,
    241         "offsets": None,
    242     }
    244     try:
    245         buffers["validity"] = self._get_validity_buffer()

File ~/scipy/pandas/pandas/core/interchange/column.py:262, in PandasColumn._get_data_buffer(self)
    256 def _get_data_buffer(
    257     self,
    258 ) -> tuple[PandasBuffer, Any]:  # Any is for self.dtype tuple
    259     """
    260     Return the buffer containing the data and the buffer's associated dtype.
    261     """
--> 262     if self.dtype[0] in (
    263         DtypeKind.INT,
    264         DtypeKind.UINT,
    265         DtypeKind.FLOAT,
    266         DtypeKind.BOOL,
    267         DtypeKind.DATETIME,
    268     ):
    269         buffer = PandasBuffer(self._col.to_numpy(), allow_copy=self._allow_copy)
    270         dtype = self.dtype

File ~/scipy/pandas/pandas/_libs/properties.pyx:36, in pandas._libs.properties.CachedProperty.__get__()

File ~/scipy/pandas/pandas/core/interchange/column.py:126, in PandasColumn.dtype(self)
    124     raise NotImplementedError("Non-string object dtypes are not supported yet")
    125 else:
--> 126     return self._dtype_from_pandasdtype(dtype)

File ~/scipy/pandas/pandas/core/interchange/column.py:141, in PandasColumn._dtype_from_pandasdtype(self, dtype)
    137 if kind is None:
    138     # Not a NumPy dtype. Check if it's a categorical maybe
    139     raise ValueError(f"Data type {dtype} not supported by interchange protocol")
--> 141 return kind, dtype.itemsize * 8, dtype_to_arrow_c_fmt(dtype), dtype.byteorder

File ~/scipy/pandas/pandas/core/interchange/utils.py:89, in dtype_to_arrow_c_fmt(dtype)
     86     resolution = re.findall(r"\[(.*)\]", typing.cast(np.dtype, dtype).str)[0][:1]
     87     return ArrowCTypes.TIMESTAMP.format(resolution=resolution, tz="")
---> 89 raise NotImplementedError(
     90     f"Conversion of {dtype} to Arrow C format string is not implemented."
     91 )

NotImplementedError: Conversion of boolean to Arrow C format string is not implemented.
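For illustration, here is one way a nullable boolean array could be split into the two pieces the protocol asks for, a data buffer and a validity mask, using only public pandas methods. This is a sketch and not what PandasColumn does; the fill value for missing slots and the byte-per-value mask are assumptions:

```python
import pandas as pd

arr = pd.array([True, False, pd.NA, True], dtype="boolean")

# Data: a plain NumPy bool array. Missing slots get an arbitrary fill value
# (False here), since consumers are expected to consult the mask.
data = arr.to_numpy(dtype="bool", na_value=False)

# Validity: True where a value is present, False where it is missing.
valid = ~arr.isna()
```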

rgommers commented 1 year ago

Ah, thanks for pointing that out. Let's reopen this issue then.