dask / fastparquet

python implementation of the parquet columnar file format.
Apache License 2.0

Categorical dtype not preserved with fastparquet-write, pyarrow-read #920

Open zmoon opened 4 months ago

zmoon commented 4 months ago

Describe the issue: Not sure if this is a fastparquet or pyarrow (or pandas) issue, but I noticed that a column with pandas categorical dtype is read as object dtype if the Parquet file is created by the fastparquet engine and then read by the pyarrow engine. The other three cases preserve the dtype.

Minimal Complete Verifiable Example:

import itertools

import pandas as pd

df = pd.Series(["a", "b", "c"]).rename("cat").astype("category").to_frame()

fn = "cat.parquet"
data = []
for write, read in itertools.product(["pyarrow", "fastparquet"], repeat=2):
    df.to_parquet(fn, engine=write)
    df_ = pd.read_parquet(fn, engine=read)
    data.append((write, read, df_["cat"].dtype))

res = pd.DataFrame(data, columns=["write", "read", "dtype"])
print(res)
         write         read     dtype
0      pyarrow      pyarrow  category
1      pyarrow  fastparquet  category
2  fastparquet      pyarrow    object
3  fastparquet  fastparquet  category

martindurant commented 4 months ago

Thanks for notifying me, sounds like a metadata parsing thing. Whilst it should be easy to fix, I'm not sure when I will get to it.

Interestingly, with the fastparquet API, you can always assert that a given column should be a category type with categories=, but I don't think pyarrow can do that.

martindurant commented 4 months ago
arrow produces
{'column_indexes': [{'field_name': None, 'metadata': {'encoding': 'UTF-8'}, 'name': None, 'numpy_type': 'object', 'pandas_type': 'unicode'}],
 'columns': [{'field_name': 'cat', 'metadata': {'num_categories': 3, 'ordered': False}, 'name': 'cat', 'numpy_type': 'int8', 'pandas_type': 'categorical'}],
 'creator': {'library': 'pyarrow', 'version': '11.0.0'},
 'index_columns': [{'kind': 'range', 'name': None, 'start': 0, 'step': 1, 'stop': 3}],
 'pandas_version': '2.1.4'}

fastparquet produces
{'column_indexes': [{'field_name': None, 'metadata': {'encoding': 'UTF-8'}, 'name': None, 'numpy_type': 'object', 'pandas_type': 'unicode'}],
 'columns': [{'field_name': 'cat', 'metadata': {'num_categories': 3, 'ordered': False}, 'name': 'cat', 'numpy_type': 'int8', 'pandas_type': 'categorical'}],
 'creator': {'library': 'pyarrow', 'version': '11.0.0'},
 'index_columns': [{'kind': 'range', 'name': None, 'start': 0, 'step': 1, 'stop': 3}],
 'pandas_version': '2.1.4'}

So I can only suppose arrow doesn't trust categories not made by arrow - it's their fault?
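
To compare the two metadata blobs field by field rather than by eye, a small recursive diff can help. This is only a sketch: `diff_metadata` is a hypothetical helper written for this thread, not part of fastparquet or pyarrow, and the dictionaries are abbreviated copies of the ones pasted above.

```python
def diff_metadata(a, b, path=""):
    """Return (path, a_value, b_value) for every leaf where a and b differ."""
    if isinstance(a, dict) and isinstance(b, dict):
        out = []
        for key in sorted(set(a) | set(b), key=str):
            # "<missing>" marks a key present on one side only.
            out += diff_metadata(
                a.get(key, "<missing>"), b.get(key, "<missing>"), f"{path}/{key}"
            )
        return out
    if isinstance(a, list) and isinstance(b, list) and len(a) == len(b):
        out = []
        for i, (x, y) in enumerate(zip(a, b)):
            out += diff_metadata(x, y, f"{path}[{i}]")
        return out
    return [] if a == b else [(path, a, b)]


# Abbreviated copies of the two blobs exactly as pasted in this comment.
arrow_meta = {
    "columns": [{"field_name": "cat",
                 "metadata": {"num_categories": 3, "ordered": False},
                 "name": "cat", "numpy_type": "int8",
                 "pandas_type": "categorical"}],
    "creator": {"library": "pyarrow", "version": "11.0.0"},
    "pandas_version": "2.1.4",
}
fastparquet_meta = {
    "columns": [{"field_name": "cat",
                 "metadata": {"num_categories": 3, "ordered": False},
                 "name": "cat", "numpy_type": "int8",
                 "pandas_type": "categorical"}],
    "creator": {"library": "pyarrow", "version": "11.0.0"},
    "pandas_version": "2.1.4",
}

differences = diff_metadata(arrow_meta, fastparquet_meta)
print(differences)
```

For the blobs as pasted, the list comes back empty, so any difference in how the reader treats the column would have to come from somewhere other than these fields.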