Open ggreco opened 6 days ago
So I don't think this is a bug per-se, the parquet writer converts the arrow schema faithfully into parquet, preserving the field name of the list elements.
The problem arises because the default within the arrow ecosystem is to call this "item" and not "element".
>>> import pyarrow as pa
>>> pa.list_(pa.string())
ListType(list<item: string>)
>>> pa.list_(pa.string()).field(0).name
'item'
The reason this matters is because the parquet schema is authoritative, that is when reading back a parquet file with a field name of "element", the arrow schema should reflect this. Therefore if we coerced to "element" on write the schema would not roundtrip as people might expect.
In the short-term, if you adjust your arrow schema to use "element", the parquet writer will follow this.
Longer term I think the way to handle this is probably #1938, where we add an option to coerce arrow types to be more compatible with parquet's type system, with the understanding that things may not always roundtrip completely faithfully.
BTW I tried to generate an arrow document in C++ and:
simply publish the parquet file with c++
In the first path the lists contain struct objects called item
, while in the second path the lists contain struct objects called element
, so the C++ parquet writing logic already does this conversion to better support the parquet standard.
I've found that at least a major parquet reader created following the parquet specs and not the arrow ones ( https://aloneguid.github.io/parquet-dotnet/starter-topic.html ) cannot properly read the nested parquet files created with arrow-rs
, so I think the bug label should probably stay.
That is a bug in that parquet implementation, it should handle this case:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules
I agree the situation is unfortunate, parquet should have defined these things from the start, but it didn't, and efforts are in fact still ongoing to refine the specification w.r.t nested schemas.
Edit: TBC the reason the distinction matters is if this was a bug we could change this behaviour in a patch release, we cannot do this as it will break people's schemas
Describe the bug arrow-rs generated .parquet files where the schema implies a nested structure should call the list item
element
as of parquet specifications: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists... the files generated are instead using
item
, probably some legacy code was used to build the code.A similar issue has been recently fixed in polars-rs: https://github.com/pola-rs/polars/pull/17803
Pyarrow let you use
item
instead ofelement
(default) to support legacy files, but IMHO arrow-rs should not generate legacy parquet files ( https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html ):The code in arrow-rs that implement this is: https://github.com/apache/arrow-rs/blob/master/arrow-schema/src/field.rs#L147
IMHO the fix will just involve a single line change, I can create a PR, but I want to be sure I'm not reading the specs in the wrong way or there is a reason for hardcoding
item
since it seems too simple...To Reproduce Generate a nested parquet file, or use the one attached to this issue and verify (with an hex editor, parquet-schema from this REPO or with a GUI tool that shows the parquet schema like "parquet floor"), that the type name associated to the list item is always
item
instead ofelement
.Using example_parquet.zip the file attached to this ticket that follow the schema will be reported by
arrow-schema
:the expected value was:
I can get
parquet-schema
to outputelement
instead ofitem
when generating the parquet file from python or .net.In the hex editor you will see
students.list.item.name
instead of the expectedstudents.list.element.name
.