Categorical dtype not preserved with fastparquet-write, pyarrow-read

zmoon commented 4 months ago

Describe the issue: Not sure if this is a fastparquet or pyarrow (or pandas) issue, but I noticed that a column with pandas categorical dtype is read as object dtype if the Parquet file is created by the fastparquet engine and then read by the pyarrow engine. The other three cases preserve the dtype.

Minimal Complete Verifiable Example:

import itertools

import pandas as pd

df = pd.Series(["a", "b", "c"]).rename("cat").astype("category").to_frame()

fn = "cat.parquet"
data = []
for write, read in itertools.product(["pyarrow", "fastparquet"], repeat=2):
    df.to_parquet(fn, engine=write)
    df_ = pd.read_parquet(fn, engine=read)
    data.append((write, read, df_["cat"].dtype))

res = pd.DataFrame(data, columns=["write", "read", "dtype"])
print(res)

         write         read     dtype
0      pyarrow      pyarrow  category
1      pyarrow  fastparquet  category
2  fastparquet      pyarrow    object
3  fastparquet  fastparquet  category

Anything else we need to know?:

Environment:

Dask version:
Python version: 3.11.3
Operating System:
Install method (conda, pip, source): pip
fastparquet 2024.2.0, pyarrow 15.0.0, pandas 2.2.0

martindurant commented 4 months ago

Thanks for notifying me, sounds like a metadata parsing thing. Whilst it should be easy to fix, I'm not sure when I will get to it.

Interestingly, with the fastparquet API, you can always assert that a given column should be a category type with categories=, but I don't think pyarrow can do that.

martindurant commented 4 months ago

arrow produces
{'column_indexes': [{'field_name': None, 'metadata': {'encoding': 'UTF-8'}, 'name': None, 'numpy_type': 'object', 'pandas_type': 'unicode'}],
 'columns': [{'field_name': 'cat', 'metadata': {'num_categories': 3, 'ordered': False}, 'name': 'cat', 'numpy_type': 'int8', 'pandas_type': 'categorical'}],
 'creator': {'library': 'pyarrow', 'version': '11.0.0'},
 'index_columns': [{'kind': 'range', 'name': None, 'start': 0, 'step': 1, 'stop': 3}],
 'pandas_version': '2.1.4'}

fastparquet produces
{'column_indexes': [{'field_name': None, 'metadata': {'encoding': 'UTF-8'}, 'name': None, 'numpy_type': 'object', 'pandas_type': 'unicode'}],
 'columns': [{'field_name': 'cat', 'metadata': {'num_categories': 3, 'ordered': False}, 'name': 'cat', 'numpy_type': 'int8', 'pandas_type': 'categorical'}],
 'creator': {'library': 'pyarrow', 'version': '11.0.0'},
 'index_columns': [{'kind': 'range', 'name': None, 'start': 0, 'step': 1, 'stop': 3}],
 'pandas_version': '2.1.4'}

So I can only suppose arrow doesn't trust categories not made by arrow - it's their fault?
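
A small sketch for checking what the pandas metadata claims about each column, using the pyarrow-written dict pasted above as input. The helper name categorical_columns is hypothetical and not part of either library; it only inspects the dict and says nothing about how a reader chooses to interpret it.

```python
# The metadata dict pasted above for the pyarrow-written file.
arrow_meta = {
    "column_indexes": [
        {"field_name": None, "metadata": {"encoding": "UTF-8"}, "name": None,
         "numpy_type": "object", "pandas_type": "unicode"}
    ],
    "columns": [
        {"field_name": "cat", "metadata": {"num_categories": 3, "ordered": False},
         "name": "cat", "numpy_type": "int8", "pandas_type": "categorical"}
    ],
    "creator": {"library": "pyarrow", "version": "11.0.0"},
    "index_columns": [{"kind": "range", "name": None, "start": 0, "step": 1, "stop": 3}],
    "pandas_version": "2.1.4",
}


def categorical_columns(meta):
    """Map column name -> column metadata for columns flagged as categorical."""
    return {
        col["name"]: col["metadata"]
        for col in meta["columns"]
        if col["pandas_type"] == "categorical"
    }


print(categorical_columns(arrow_meta))
```

Running the same helper on the metadata from a fastparquet-written file would show whether both writers flag the column the same way.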
