apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Add Option To Coerce Map Type on Parquet Write #6213


ion-elgreco commented 3 months ago

Describe the bug A record batch containing Arrow map types uses different field names than the Parquet spec requires. When you write Parquet with DataFusion, the spec is simply ignored and the data is written as-is, carrying the Arrow field names into the Parquet file, which violates the Parquet spec.

The parquet file has this schema:

<pyarrow._parquet.ParquetSchema object at 0x7f5393f683c0>
required group field_id=-1 arrow_schema {
  optional group field_id=-1 map_type (Map) {
    repeated group field_id=-1 entries {
      required binary field_id=-1 key (String);
      optional binary field_id=-1 value;
    }
  }
}

instead of

<pyarrow._parquet.ParquetSchema object at 0x7f5393f9cd40>
required group field_id=-1 arrow_schema {
  optional group field_id=-1 map_type (Map) {
    repeated group field_id=-1 key_value {
      required binary field_id=-1 key (String);
      optional binary field_id=-1 value;
    }
  }
}

The PyArrow Parquet writer doesn't do this; it follows the Parquet spec when writing. See here:

import pyarrow as pa
import pyarrow.parquet as pq

pylist = [{"map_type":{'1':b"M"}}]
schema = pa.schema(
    [
        pa.field("map_type", pa.map_(pa.large_string(), pa.large_binary())),
    ]
)
table = pa.Table.from_pylist(pylist, schema=schema)

# table.schema
#
# map_type: map<large_string, large_binary>
#   child 0, entries: struct<key: large_string not null, value: large_binary> not null
#       child 0, key: large_string not null
#       child 1, value: large_binary

pq.write_table(table, "test.parquet")
pq.read_metadata("test.parquet").schema

<pyarrow._parquet.ParquetSchema object at 0x7f53cc5116c0>
required group field_id=-1 schema {
  optional group field_id=-1 map_type (Map) {
    repeated group field_id=-1 key_value {
      required binary field_id=-1 key (String);
      optional binary field_id=-1 value;
    }
  }
}

You can see that entries got written as key_value, as the spec requires. It's also interesting to note that PyArrow uses "key"/"value" for the child fields, while arrow-rs uses "keys"/"values".

alamb commented 3 months ago

I marked this as an enhancement (rather than a bug), but the distinction is likely not all that useful.

It would be great to have the ArrowWriter / Reader follow the same convention as PyArrow when reading/writing maps (or the standard, if there is a standard that addresses this particular point).

tustvold commented 2 days ago

This boils down to the same issue as https://github.com/apache/arrow-rs/issues/6733, namely that Arrow has different naming conventions from Parquet. As stated on that linked ticket, the first step would be to add an option to coerce the schema on write. Once that is added we can have discussions about changing the default, but it must remain possible to keep the current behaviour.