Closed honno closed 1 year ago
I think the reason for the failing test is a bug in the Vaex implementation. It seems to create a bitmask where a missing value should be defined with 0
but is in fact defined by 1
/True
:
>>> import pyarrow as pa
>>> table = pa.table([[False]], names=["A"])
>>> table
pyarrow.Table
A: bool
----
A: [[false]]
>>> import vaex
>>> from vaex.dataframe_protocol import from_dataframe_to_vaex
>>> vaex_df = from_dataframe_to_vaex(table)
>>> vaex_df
# A
0 False
>>> vaex_df_p = vaex_df.__dataframe__()
>>> vaex_col_p = vaex_df_p.get_column_by_name("A")
>>> vaex_col_p.describe_null
(3, 0)
>>> vaex_col_p.get_buffers()
{'data': (VaexBuffer({'bufsize': 1, 'ptr': 105553125834912, 'device': 'CPU'}), (<_DtypeKind.BOOL: 20>, 8, '|b1', '|')), 'validity': (VaexBuffer({'bufsize': 1, 'ptr': 105553125865376, 'device': 'CPU'}), (<_DtypeKind.BOOL: 20>, 8, '|b1', '|')), 'offsets': None}
>>> vaex_col_p._get_validity_buffer()[0]._x
array([False])
In the above example describe_null
gives (3, 0)
but the valid value has a validity buffer value equal to 0
/False
which would mean it is missing.
FWIW I am not sure it is very important to test both Table and RecordBatch (the implementation is almost fully shared under the hood, and certainly when creating the Table in a simple way, it will typically consist of a single batch, and then it is fully the same), in case that complicates the testing code
In the end it looks like pyarrow
has the most robust protocol implementation heh.
FWIW I am not sure it is very important to test both Table and RecordBatch (the implementation is almost fully shared under the hood, and certainly when creating the Table in a simple way, it will typically consist of a single batch, and then it is fully the same), in case that complicates the testing code
Good shout, fortunately its been pretty simple to test RecordBatch
with the existing machinery for Table
!
Great news! 🎉 And great work @honno, thank you!
Should resolve #20. Wrapped
pa.Table
but haven't gone through all the current failures, and will need to do the same withpa.RecordBatch
.