data-apis / dataframe-api

RFC document, tooling and other content related to the dataframe API standard
https://data-apis.org/dataframe-api/draft/index.html
MIT License

Columns with bit/bytemask null representation should be able to return None for validity buffer when there are no missing values #90

Open AlenkaF opened 1 year ago

AlenkaF commented 1 year ago

I am currently working on the implementation of the dataframe interchange protocol for PyArrow. While testing the current PyArrow implementation, which produces a __dataframe__ object, against the pandas implementation that consumes it, I noticed that columns which use a bit/bytemask null representation but have no missing values raise an error.

The reason for this is that Apache Arrow does not create a mask buffer when there are no missing values present. Therefore, calling .get_buffers()["validity"] on a column of the PyArrow __dataframe__ object without missing values returns None, which the protocol specification currently does not handle. See: https://github.com/pandas-dev/pandas/blob/5c66e65d7b9fef47ccb585ce2fd0b3ea18dc82ea/pandas/core/interchange/from_dataframe.py#L502

For now we are checking for columns without missing values and in that case describe that column as non-nullable. But we think there should be an option for nullable columns with bit/bytemasks null representation to return None instead of a buffer.
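To make the proposal concrete, here is a minimal sketch of how a consumer could treat a validity buffer of None as "no missing values". It is not the pandas implementation: `apply_validity` is a hypothetical helper, and the mask layout (Arrow-style, least-significant bit first, 1 = valid) is an assumption made for illustration.

```python
from typing import List, Optional, Sequence


def apply_validity(
    values: Sequence[int], validity: Optional[bytes]
) -> List[Optional[int]]:
    """Return values with invalid slots replaced by None.

    Assumes an Arrow-style bitmask: least-significant bit first, 1 = valid.
    A validity of None is treated as "no missing values".
    """
    if validity is None:
        # No mask was allocated, so every slot is considered valid.
        return list(values)

    out: List[Optional[int]] = []
    for i, value in enumerate(values):
        # Pick bit i out of the packed mask.
        bit = (validity[i // 8] >> (i % 8)) & 1
        out.append(value if bit else None)
    return out


# Column without a mask buffer: all values pass through unchanged.
no_mask = apply_validity([1, 2, 3], None)

# Column with a mask where the second slot is marked invalid.
with_mask = apply_validity([1, 2, 3], bytes([0b00000101]))
```

With this shape, the producer would not need to re-describe a nullable column as non-nullable just because no mask was allocated.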

honno commented 1 year ago

> For now we are checking for columns without missing values and in that case describe that column as non-nullable. But we think there should be an option for nullable columns with bit/bytemasks null representation to return None instead of a buffer.

If a column ultimately doesn't have a mask when there are no missing values, I'm wondering if that's just fine? It may even be incorrect to describe an interchange column as having a bit/byte-mask when it doesn't have one.


For onlookers, these are the relevant docs for what buf, dtype = Column.get_buffers()["validity"] should currently contain:

https://github.com/data-apis/dataframe-api/blob/aa6fe7d7bc4fd6fd24b8dd6b4dfb8c58cac2d8b9/protocol/dataframe_protocol.py#L353-L357

jorisvandenbossche commented 1 year ago

> > For now we are checking for columns without missing values and in that case describe that column as non-nullable. But we think there should be an option for nullable columns with bit/bytemasks null representation to return None instead of a buffer.
>
> If a column ultimately doesn't have a mask when there are no missing values, I'm wondering if that's just fine? Like even it may be incorrect to describe an interchange column as having a bit/byte-mask when it doesn't have a bit/byte-mask.

That's certainly a possible solution, but I personally find that it feels a bit wrong. The column is nullable, in the sense that it "can" have nulls (that's typically how "nullable" is interpreted, I think). The null count just happens to be 0, in which case Arrow can optimize this by not allocating the bitmask. Similarly, for a datetime64 column, you probably wouldn't change the null type from USE_SENTINEL to NON_NULLABLE when no nulls (NaT) are present (although of course there it has no impact on the memory layout).

One corner case where this fallback to non-nullable doesn't necessarily work optimally is that a column can have multiple chunks, and in pyarrow, one chunk might have a null bitmap while the next chunk might not have one.
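The chunk corner case can be illustrated with a small sketch. It does not use pyarrow or the protocol classes. `ColumnNullType` below is a local stand-in for the two protocol enum members involved (values assumed to match the spec), and both `null_type_*` functions and the chunk data are hypothetical.

```python
from enum import IntEnum
from typing import List, Optional


class ColumnNullType(IntEnum):
    # Local stand-in for the two protocol values used in this sketch.
    NON_NULLABLE = 0
    USE_BITMASK = 3


def null_type_workaround(validity: Optional[bytes]) -> ColumnNullType:
    """Workaround: report a chunk as non-nullable when it has no mask."""
    if validity is None:
        return ColumnNullType.NON_NULLABLE
    return ColumnNullType.USE_BITMASK


def null_type_proposed(validity: Optional[bytes]) -> ColumnNullType:
    """Proposal: the null type follows the column's layout, so the
    validity buffer is ignored here and is simply allowed to be None."""
    return ColumnNullType.USE_BITMASK


# Hypothetical validity buffers for three chunks of one column:
# only the first chunk contains a null, so only it has a bitmap.
chunk_validity: List[Optional[bytes]] = [bytes([0b00000101]), None, None]

# Distinct null types reported across the chunks under each approach.
workaround = {null_type_workaround(v) for v in chunk_validity}
proposed = {null_type_proposed(v) for v in chunk_validity}
```

Under the workaround, chunks of the same column report two different null types. Under the proposal they all report USE_BITMASK, and only the validity buffer differs between chunks.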