Open zmoon opened 4 months ago
Thanks for notifying me, sounds like a metadata parsing thing. Whilst it should be easy to fix, I'm not sure when I will get to it.
Interestingly, with the fastparquet API, you can always assert that a given column should be a category type with categories=, but I don't think pyarrow can do that.
arrow produces
{'column_indexes': [{'field_name': None, 'metadata': {'encoding': 'UTF-8'}, 'name': None, 'numpy_type': 'object', 'pandas_type': 'unicode'}],
'columns': [{'field_name': 'cat', 'metadata': {'num_categories': 3, 'ordered': False}, 'name': 'cat', 'numpy_type': 'int8', 'pandas_type': 'categorical'}],
'creator': {'library': 'pyarrow', 'version': '11.0.0'},
'index_columns': [{'kind': 'range', 'name': None, 'start': 0, 'step': 1, 'stop': 3}],
'pandas_version': '2.1.4'}
fastparquet produces
{'column_indexes': [{'field_name': None, 'metadata': {'encoding': 'UTF-8'}, 'name': None, 'numpy_type': 'object', 'pandas_type': 'unicode'}],
'columns': [{'field_name': 'cat', 'metadata': {'num_categories': 3, 'ordered': False}, 'name': 'cat', 'numpy_type': 'int8', 'pandas_type': 'categorical'}],
'creator': {'library': 'pyarrow', 'version': '11.0.0'},
'index_columns': [{'kind': 'range', 'name': None, 'start': 0, 'step': 1, 'stop': 3}],
'pandas_version': '2.1.4'}
So I can only suppose arrow doesn't trust categories not made by arrow - it's their fault?
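Since the pandas metadata still marks the column as categorical, one possible workaround is to read that metadata and restore the dtype after loading. The sketch below is plain Python over the metadata dict shown above; categorical_columns is a hypothetical helper name, not part of pandas, pyarrow or fastparquet.

```python
def categorical_columns(pandas_meta):
    """Return the names of columns the pandas metadata marks as categorical."""
    return [
        col["name"]
        for col in pandas_meta.get("columns", [])
        if col.get("pandas_type") == "categorical"
    ]


# The 'columns' entry copied from the metadata dumps above; other keys omitted.
meta = {
    "columns": [
        {
            "field_name": "cat",
            "metadata": {"num_categories": 3, "ordered": False},
            "name": "cat",
            "numpy_type": "int8",
            "pandas_type": "categorical",
        }
    ],
}

print(categorical_columns(meta))
```

In practice the dict would presumably come from the file itself, for example json.loads(pyarrow.parquet.read_schema(path).metadata[b"pandas"]), and the returned names could be passed to DataFrame.astype with "category". Note that astype infers the categories from the values present, so the original category order is not guaranteed to survive.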
Describe the issue: Not sure if this is a fastparquet or pyarrow (or pandas) issue, but I noticed that a column with pandas categorical dtype is read as object dtype if the Parquet file is created by the fastparquet engine and then read by the pyarrow engine. The other three cases preserve the dtype.
Minimal Complete Verifiable Example:
Anything else we need to know?:
Environment: