apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++][Parquet] Preserve the bitwidth of the integer dictionary indices on roundtrip to Parquet? #30302

Open asfimport opened 2 years ago

asfimport commented 2 years ago

When converting from a pandas DataFrame to a pyarrow Table, categorical columns are by default given a dictionary index type of int8 in the schema (presumably because there are fewer than 128 categories). When this is written to a Parquet file, the schema changes such that the index type is int32 instead. This causes an inconsistency between the schemas of tables derived from pandas and those read back from disk.

A minimal recreation of the issue is as follows:


import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.fs  # imported explicitly so pa.fs is guaranteed to be available
import pyarrow.parquet as pq

df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": ["a", "a", "b", "c", "b"]})
dtypes = {
    "A": np.dtype("int8"),
    "B": pd.CategoricalDtype(categories=["a", "b", "c"], ordered=None),
}
df = df.astype(dtypes)

tbl = pa.Table.from_pandas(df)
where = "tmp.parquet"
filesystem = pa.fs.LocalFileSystem()

pq.write_table(
    tbl,
    filesystem.open_output_stream(
        where,
        compression=None,
    ),
    version="2.0",
)

schema = tbl.schema

read_schema = pq.ParquetFile(
    filesystem.open_input_file(where),
).schema_arrow

By printing schema and read_schema, you can see the inconsistency.
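
For illustration, continuing the snippet above, the mismatch is visible directly on the field types of column "B" (the commented output is what I'd expect based on the described behavior, not verified output):

print(schema.field("B").type)       # dictionary<values=string, indices=int8, ordered=0>
print(read_schema.field("B").type)  # dictionary<values=string, indices=int32, ordered=0>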

I have workarounds in place for this, but am raising the issue anyway so that you can resolve it properly.

Environment: NAME="CentOS Linux" VERSION="7 (Core)" Reporter: Gavin

Note: This issue was originally created as ARROW-14767. Please see the migration documentation for further details.

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche: Thanks for the report!

I can reproduce your issue. Looking at the written Parquet file's schema, you can see that the column "B" is indicated to be of type "String":


In [29]: parquet_metadata = pq.ParquetFile(
    ...:     filesystem.open_input_file(where),
    ...: )

In [30]: parquet_metadata.schema
Out[30]: 
<pyarrow._parquet.ParquetSchema object at 0x7f60be97fe00>
required group field_id=-1 schema {
  optional int32 field_id=-1 A (Int(bitWidth=8, isSigned=true));
  optional binary field_id=-1 B (String);
}

Parquet still uses dictionary encoding for compression reasons, but it is not a separate type in their type system.
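
For context, dictionary encoding on the Parquet side is an encoding option chosen per column, not a type. A minimal sketch using the use_dictionary option of pq.write_table (the output file names here are purely illustrative); either way, the Parquet logical type of "B" stays "String":

# Dictionary encoding in Parquet is an encoding choice, not a separate type:
# both files should report column "B" as a String column in the Parquet schema.
pq.write_table(tbl, "tmp_plain.parquet", use_dictionary=False)
pq.write_table(tbl, "tmp_dict.parquet", use_dictionary=["B"])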

Arrow has the ability to read such dictionary encoded string columns directly into an Arrow dictionary type. But at that point it doesn't preserve the int8 type for the indices, since that is not part of the parquet spec, and rather uses Arrow's default of int32 indices.
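
For illustration, that reading path is exposed through the read_dictionary option of pq.read_table (and pq.ParquetFile); a minimal sketch reusing the `where` path from the reproduction above (the commented output is what I'd expect from the described behavior, with int32 indices regardless of the original int8):

# Force column "B" to be decoded directly into an Arrow dictionary array.
tbl_dict = pq.read_table(where, read_dictionary=["B"])
print(tbl_dict.schema.field("B").type)  # dictionary<values=string, indices=int32, ordered=0>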

Now, we do write the Arrow schema into the Parquet file metadata, so we can use that information to ensure a more faithful roundtrip of Arrow tables to Parquet and back.
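
As a hedged illustration of that stored schema (assuming it is kept base64-encoded under the b"ARROW:schema" key of the file's key/value metadata, which is how pyarrow stores it as far as I know), the original int8 index type should be visible there even though schema_arrow reports int32:

import base64

# Key/value metadata of the written Parquet file.
kv_meta = pq.read_metadata(where).metadata
# Deserialize the Arrow schema that pyarrow embedded at write time.
stored_schema = pa.ipc.read_schema(pa.BufferReader(base64.b64decode(kv_meta[b"ARROW:schema"])))
print(stored_schema.field("B").type)  # expected: dictionary<values=string, indices=int8, ordered=0>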

That doesn't happen here, but in principle it could be done (although that will probably require an additional cast after the fact).
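
Until that happens, a rough sketch of the manual workaround hinted at above (a cast after reading), assuming the installed pyarrow supports casts between dictionary types with different index widths; `schema` and `where` are the variables from the reproduction snippet:

# Read the file back (indices come back as int32 per the report) ...
roundtripped = pq.read_table(where)
# ... and cast to the original in-memory schema to restore the int8 indices.
restored = roundtripped.cast(schema)
assert restored.schema.field("B").type == schema.field("B").type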