apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++][Parquet] Preserve the bitwidth of the integer dictionary indices on roundtrip to Parquet? #30302

Open asfimport opened 2 years ago

asfimport commented 2 years ago

When converting from a pandas DataFrame to a pyarrow Table, categorical columns are by default given a dictionary index type of int8 in the schema (presumably because there are fewer than 128 categories). When this is written to a Parquet file, the schema changes such that the index type is int32 instead. This causes an inconsistency between the schemas of tables derived from pandas and those read back from disk.

A minimal recreation of the issue is as follows:


import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.fs  # imported explicitly so pa.fs is guaranteed to be available
import pyarrow.parquet as pq

df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": ["a", "a", "b", "c", "b"]})
dtypes = {
    "A": np.dtype("int8"),
    "B": pd.CategoricalDtype(categories=["a", "b", "c"], ordered=None),
}
df = df.astype(dtypes)

tbl = pa.Table.from_pandas(df)
where = "tmp.parquet"
filesystem = pa.fs.LocalFileSystem()

pq.write_table(
    tbl,
    filesystem.open_output_stream(
        where,
        compression=None,
    ),
    version="2.0",
)

schema = tbl.schema

read_schema = pq.ParquetFile(
    filesystem.open_input_file(where),
).schema_arrow

By printing schema and read_schema, you can see the inconsistency.
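
For illustration, continuing the snippet above, the mismatch is visible directly on the field types of column "B" (the commented output is what I'd expect based on the described behavior, not verified output):

print(schema.field("B").type)       # dictionary<values=string, indices=int8, ordered=0>
print(read_schema.field("B").type)  # dictionary<values=string, indices=int32, ordered=0>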

I have workarounds in place for this, but am raising the issue anyway so that you can resolve it properly.

Environment: NAME="CentOS Linux" VERSION="7 (Core)" Reporter: Gavin

Note: This issue was originally created as ARROW-14767. Please see the migration documentation for further details.

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche: Thanks for the report!

I can reproduce your issue. Looking at the written Parquet file's schema, you can see that the column "B" is indicated to be of type "String":


In [29]: parquet_metadata = pq.ParquetFile(
    ...:     filesystem.open_input_file(where),
    ...: )

In [30]: parquet_metadata.schema
Out[30]: 
<pyarrow._parquet.ParquetSchema object at 0x7f60be97fe00>
required group field_id=-1 schema {
  optional int32 field_id=-1 A (Int(bitWidth=8, isSigned=true));
  optional binary field_id=-1 B (String);
}

Parquet still uses dictionary encoding for compression reasons, but it is not a separate type in their type system.
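
For context, dictionary encoding on the Parquet side is an encoding option chosen per column, not a type. A minimal sketch using the use_dictionary option of pq.write_table (the output file names here are purely illustrative); either way, the Parquet logical type of "B" stays "String":

# Dictionary encoding in Parquet is an encoding choice, not a separate type:
# both files should report column "B" as a String column in the Parquet schema.
pq.write_table(tbl, "tmp_plain.parquet", use_dictionary=False)
pq.write_table(tbl, "tmp_dict.parquet", use_dictionary=["B"])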

Arrow has the ability to read such dictionary encoded string columns directly into an Arrow dictionary type. But at that point it doesn't preserve the int8 type for the indices, since that is not part of the parquet spec, and rather uses Arrow's default of int32 indices.
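
For illustration, that reading path is exposed through the read_dictionary option of pq.read_table (and pq.ParquetFile); a minimal sketch reusing the `where` path from the reproduction above (the commented output is what I'd expect from the described behavior, with int32 indices regardless of the original int8):

# Force column "B" to be decoded directly into an Arrow dictionary array.
tbl_dict = pq.read_table(where, read_dictionary=["B"])
print(tbl_dict.schema.field("B").type)  # dictionary<values=string, indices=int32, ordered=0>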

Now, we do write the Arrow schema into the Parquet file metadata, so we can use that information to ensure a more faithful roundtrip of Arrow tables to Parquet and back.
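
As a hedged illustration of that stored schema (assuming it is kept base64-encoded under the b"ARROW:schema" key of the file's key/value metadata, which is how pyarrow stores it as far as I know), the original int8 index type should be visible there even though schema_arrow reports int32:

import base64

# Key/value metadata of the written Parquet file.
kv_meta = pq.read_metadata(where).metadata
# Deserialize the Arrow schema that pyarrow embedded at write time.
stored_schema = pa.ipc.read_schema(pa.BufferReader(base64.b64decode(kv_meta[b"ARROW:schema"])))
print(stored_schema.field("B").type)  # expected: dictionary<values=string, indices=int8, ordered=0>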

That doesn't happen here, but in principle it could be done (although that will probably require an additional cast after the fact).
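
Until that happens, a rough sketch of the manual workaround hinted at above (a cast after reading), assuming the installed pyarrow supports casts between dictionary types with different index widths; `schema` and `where` are the variables from the reproduction snippet:

# Read the file back (indices come back as int32 per the report) ...
roundtripped = pq.read_table(where)
# ... and cast to the original in-memory schema to restore the int8 indices.
restored = roundtripped.cast(schema)
assert restored.schema.field("B").type == schema.field("B").type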