data-apis / dataframe-interchange-tests

Test suite for the dataframe interchange protocol
MIT License
3 stars 2 forks source link

Arrow support #21

Closed honno closed 1 year ago

honno commented 1 year ago

Should resolve #20. Wrapped pa.Table but haven't gone through all the current failures, and will need to do the same with pa.RecordBatch.

AlenkaF commented 1 year ago

I think the reason for the failing test is a bug in the Vaex implementation. It seems to create a bitmask where a missing value should be defined with 0 but is in fact defined by 1/True:

>>> import pyarrow as pa
>>> table = pa.table([[False]], names=["A"])
>>> table
pyarrow.Table
A: bool
----
A: [[false]]

>>> import vaex
>>> from vaex.dataframe_protocol import from_dataframe_to_vaex
>>> vaex_df = from_dataframe_to_vaex(table)
>>> vaex_df
  #  A
  0  False

>>> vaex_df_p = vaex_df.__dataframe__()
>>> vaex_col_p = vaex_df_p.get_column_by_name("A")           
>>> vaex_col_p.describe_null
(3, 0)

>>> vaex_col_p.get_buffers()
{'data': (VaexBuffer({'bufsize': 1, 'ptr': 105553125834912, 'device': 'CPU'}), (<_DtypeKind.BOOL: 20>, 8, '|b1', '|')), 'validity': (VaexBuffer({'bufsize': 1, 'ptr': 105553125865376, 'device': 'CPU'}), (<_DtypeKind.BOOL: 20>, 8, '|b1', '|')), 'offsets': None}
>>> vaex_col_p._get_validity_buffer()[0]._x
array([False])

In the above example describe_null gives (3, 0) but the valid value has a validity buffer value equal to 0/False which would mean it is missing.

jorisvandenbossche commented 1 year ago

FWIW I am not sure it is very important to test both Table and RecordBatch (the implementation is almost fully shared under the hood, and certainly when creating the Table in a simple way, it will typically consist of a single batch, and then it is fully the same), in case that complicates the testing code

honno commented 1 year ago

In the end it looks like pyarrow has the most robust protocol implementation heh.

FWIW I am not sure it is very important to test both Table and RecordBatch (the implementation is almost fully shared under the hood, and certainly when creating the Table in a simple way, it will typically consist of a single batch, and then it is fully the same), in case that complicates the testing code

Good shout, fortunately its been pretty simple to test RecordBatch with the existing machinery for Table!

AlenkaF commented 1 year ago

Great news! 🎉 And great work @honno, thank you!