Open AlenkaF opened 3 years ago
Ah this is an ExtensionArray, those have not been tested. Third-party extension arrays are too hard to support I'd say, but ones with builtin dtypes would be nice to support.
In this case it looks like BooleanDtype has itemsize and kind attributes, but not str. Hence the error:
AttributeError: 'BooleanDtype' object has no attribute 'str'
The dtype seems to wrap the underlying NumPy dtype which does have the needed attribute:
>>> df['B'].dtype.numpy_dtype.str
'|b1'
The second thing I noticed in the code is that it uses NumPy format strings, while the docs for Column.dtype specify it must use the format string from the Apache Arrow C Data Interface (similar but slightly different). So we need a utility to map NumPy to Arrow format here. This should say 'b', not '|b1':
>>> df.__dataframe__().get_column_by_name('A').dtype
(<_DtypeKind.BOOL: 20>, 8, '|b1', '|')
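A minimal sketch of the kind of mapping utility mentioned above, assuming only fixed-width boolean, integer and float dtypes need to be covered. The helper name `numpy_str_to_arrow_format` and the lookup table are hypothetical and not part of pandas or Vaex; the single-character codes follow the Arrow C Data Interface format strings.

```python
# Hypothetical helper, not an existing pandas or Vaex function: map a NumPy
# dtype string such as '|b1' or '<i8' to an Arrow C Data Interface format string.
_NUMPY_TO_ARROW = {
    "b1": "b",  # boolean
    "i1": "c", "i2": "s", "i4": "i", "i8": "l",  # signed integers
    "u1": "C", "u2": "S", "u4": "I", "u8": "L",  # unsigned integers
    "f2": "e", "f4": "f", "f8": "g",             # floating point
}


def numpy_str_to_arrow_format(numpy_str: str) -> str:
    # Drop the leading byte-order character ('<', '>', '=' or '|'), if present.
    code = numpy_str.lstrip("<>=|")
    try:
        return _NUMPY_TO_ARROW[code]
    except KeyError:
        raise NotImplementedError(
            f"Conversion of {numpy_str!r} to Arrow C format string is not implemented."
        ) from None


print(numpy_str_to_arrow_format("|b1"))  # b
print(numpy_str_to_arrow_format("<f8"))  # g
```

Dtypes outside the table (datetimes, strings, categoricals) raise NotImplementedError here rather than guessing a format.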
Thank you for the clarification @rgommers. Should I make a separate issue for the mapping of format strings?
> Should I make a separate issue for the mapping of format strings?
That would be helpful, thanks!
Now that Pandas 1.5.0 has a release candidate, I checked this:
>>> import pandas as pd
>>> pd.__version__
'1.5.0rc0'
>>> df = pd.DataFrame({"A": [True, False, False, True]})
>>> df["B"] = pd.array([True, False, pd.NA, True], dtype="boolean")
>>> df2 = pd.core.interchange.from_dataframe.from_dataframe(df)
>>> df
A B
0 True True
1 False False
2 False <NA>
3 True True
>>> df2
A B
0 True True
1 False False
2 False <NA>
3 True True
>>> pd.testing.assert_frame_equal(df, df2) # doesn't raise, so all good
So all good, let's close this. Thanks again @AlenkaF
@rgommers the reason this roundtrip works for pandas is that from_dataframe has a special case for pandas dataframes to not go through __dataframe__ (https://github.com/pandas-dev/pandas/blob/5514aa3713b66f531f3abfc9cfe726a1dac638ff/pandas/core/interchange/from_dataframe.py#L47-L48):
if isinstance(df, pd.DataFrame):
    return df
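To illustrate why such a shortcut hides protocol bugs, here is a toy sketch. `ToyFrame`, `ForeignFrame` and `toy_from_dataframe` are made-up stand-ins, not pandas code; the only thing borrowed from the snippet above is the shape of the isinstance check.

```python
# Toy stand-ins (hypothetical, not pandas code) showing that a same-library
# roundtrip never reaches __dataframe__ when native frames are returned as-is.
class ToyFrame:
    def __init__(self, columns):
        self.columns = dict(columns)
        self.protocol_calls = 0

    def __dataframe__(self):
        # Count how often the interchange entry point is actually used.
        self.protocol_calls += 1
        return self.columns


class ForeignFrame:
    """A frame from a different 'library' that must go through __dataframe__."""

    def __init__(self, columns):
        self.columns = dict(columns)
        self.protocol_calls = 0

    def __dataframe__(self):
        self.protocol_calls += 1
        return self.columns


def toy_from_dataframe(obj):
    # Same shape as the special case above: native frames skip the protocol.
    if isinstance(obj, ToyFrame):
        return obj
    return ToyFrame(obj.__dataframe__())


native = ToyFrame({"B": [True, False, None, True]})
assert toy_from_dataframe(native) is native
assert native.protocol_calls == 0  # the protocol path was never exercised

foreign = ForeignFrame({"B": [True, False, None, True]})
converted = toy_from_dataframe(foreign)
assert foreign.protocol_calls == 1  # only a foreign frame exercises it
```

So an equality check on a native roundtrip passes trivially; a meaningful test has to go through `__dataframe__()` explicitly.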
Accessing the buffers doesn't actually work yet (and so neither would a real roundtrip, I suppose):
In [5]: df.__dataframe__().get_column_by_name("B").get_buffers()
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
Input In [5], in <cell line: 1>()
----> 1 df.__dataframe__().get_column_by_name("B").get_buffers()
File ~/scipy/pandas/pandas/core/interchange/column.py:239, in PandasColumn.get_buffers(self)
219 def get_buffers(self) -> ColumnBuffers:
220 """
221 Return a dictionary containing the underlying buffers.
222 The returned dictionary has the following contents:
(...)
236 buffer.
237 """
238 buffers: ColumnBuffers = {
--> 239 "data": self._get_data_buffer(),
240 "validity": None,
241 "offsets": None,
242 }
244 try:
245 buffers["validity"] = self._get_validity_buffer()
File ~/scipy/pandas/pandas/core/interchange/column.py:262, in PandasColumn._get_data_buffer(self)
256 def _get_data_buffer(
257 self,
258 ) -> tuple[PandasBuffer, Any]: # Any is for self.dtype tuple
259 """
260 Return the buffer containing the data and the buffer's associated dtype.
261 """
--> 262 if self.dtype[0] in (
263 DtypeKind.INT,
264 DtypeKind.UINT,
265 DtypeKind.FLOAT,
266 DtypeKind.BOOL,
267 DtypeKind.DATETIME,
268 ):
269 buffer = PandasBuffer(self._col.to_numpy(), allow_copy=self._allow_copy)
270 dtype = self.dtype
File ~/scipy/pandas/pandas/_libs/properties.pyx:36, in pandas._libs.properties.CachedProperty.__get__()
File ~/scipy/pandas/pandas/core/interchange/column.py:126, in PandasColumn.dtype(self)
124 raise NotImplementedError("Non-string object dtypes are not supported yet")
125 else:
--> 126 return self._dtype_from_pandasdtype(dtype)
File ~/scipy/pandas/pandas/core/interchange/column.py:141, in PandasColumn._dtype_from_pandasdtype(self, dtype)
137 if kind is None:
138 # Not a NumPy dtype. Check if it's a categorical maybe
139 raise ValueError(f"Data type {dtype} not supported by interchange protocol")
--> 141 return kind, dtype.itemsize * 8, dtype_to_arrow_c_fmt(dtype), dtype.byteorder
File ~/scipy/pandas/pandas/core/interchange/utils.py:89, in dtype_to_arrow_c_fmt(dtype)
86 resolution = re.findall(r"\[(.*)\]", typing.cast(np.dtype, dtype).str)[0][:1]
87 return ArrowCTypes.TIMESTAMP.format(resolution=resolution, tz="")
---> 89 raise NotImplementedError(
90 f"Conversion of {dtype} to Arrow C format string is not implemented."
91 )
NotImplementedError: Conversion of boolean to Arrow C format string is not implemented.
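For reference, a rough sketch of what the "data" and "validity" entries for a nullable boolean column could hold once this is supported. The helper `split_nullable_bools` is hypothetical, and the layout (one byte per value, with 1 meaning "valid" in the mask) is an assumption made for illustration; the actual convention is whatever the column reports via the protocol.

```python
# Hypothetical illustration, not pandas code: split a nullable boolean column
# into a data buffer and a validity bytemask (assumed here: 1 = valid, 0 = missing).
from typing import Optional, Sequence, Tuple


def split_nullable_bools(values: Sequence[Optional[bool]]) -> Tuple[bytes, bytes]:
    data = bytearray()
    validity = bytearray()
    for value in values:
        if value is None:
            data.append(0)      # placeholder; consumers must ignore it
            validity.append(0)  # marks the slot as missing
        else:
            data.append(1 if value else 0)
            validity.append(1)
    return bytes(data), bytes(validity)


data, validity = split_nullable_bools([True, False, None, True])
print(list(data))      # [1, 0, 0, 1]
print(list(validity))  # [1, 1, 0, 1]
```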
Ah, thanks for pointing that out. Let's reopen this issue then.
While researching all the dtypes that can hold missing values in Vaex, and looking at how the pandas implementation handles them, I found that pandas has a BooleanDtype that raises an error.
My question is: when considering all the possible inputs to a Vaex dataframe, should one stick to the common dtypes, or work through every possibility at this level?