Open ion-elgreco opened 3 months ago
I marked this as an enhancement (rather than a bug) but the distinction is likely not all that useful
It would be great to have the ArrowWriter/Reader follow the same convention as pyarrow when reading/writing maps (or the standard, if there is a standard that addresses this particular point).
This boils down to the same issue as https://github.com/apache/arrow-rs/issues/6733, namely that arrow has different naming conventions from parquet. As stated on that linked ticket, the first step would be to add an option to coerce the schema on write. Once that is added, we can have discussions about changing the default, but it must remain possible to keep the current behaviour.
Describe the bug
Creating a record batch with arrow map types produces different field names than the parquet spec wants. When you write a parquet file with datafusion, the parquet spec is simply ignored and the data is written as-is, with the same arrow field names in the parquet schema, which violates the parquet spec.
The parquet file has this schema:
instead of
Pyarrow parquet writer doesn't do this, and follows the parquet spec when writing. See here:
You can see that entries got written as key_value properly. It is also interesting to note that PyArrow uses "key" and "value", while arrow-rs uses "keys" and "values".
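
To make the naming difference concrete, here is a minimal sketch of what "coercing the schema on write" could look like for map fields. It is plain Python using only the standard library: the `Field` class and `coerce_map_names` function are hypothetical illustrations, not part of arrow-rs, pyarrow, or datafusion. It only assumes the names mentioned above (`entries`/`keys`/`values` on the arrow-rs side, `key_value`/`key`/`value` on the parquet side).

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field (not a real arrow type)."""
    name: str
    kind: str  # "map", "struct", or a primitive type name
    children: Tuple["Field", ...] = ()


def coerce_map_names(f: Field) -> Field:
    """Rename map internals to key_value / key / value, recursing into children."""
    children = tuple(coerce_map_names(c) for c in f.children)
    if f.kind == "map" and len(children) == 1 and len(children[0].children) == 2:
        key, value = children[0].children
        entries = replace(
            children[0],
            name="key_value",
            children=(replace(key, name="key"), replace(value, name="value")),
        )
        children = (entries,)
    return replace(f, children=children)


# A map field laid out with the arrow-rs style names from this issue.
arrow_rs_style = Field(
    "my_map",
    "map",
    (Field("entries", "struct", (Field("keys", "string"), Field("values", "int32"))),),
)

coerced = coerce_map_names(arrow_rs_style)
print(coerced.children[0].name)                         # inner group name
print([c.name for c in coerced.children[0].children])   # key/value names
```

Only the names change; the outer field name, the types, and the nesting are left alone, which is why such a coercion could be offered as an opt-in writer option while keeping the current behaviour as the default.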