To Reproduce
Run parquet-rewrite on the file old_list_structure.parquet in parquet-testing/data and print the schema from the resulting file.
% parquet-rewrite -i old_list_structure.parquet -o old.pq
% parquet-schema old.pq
Metadata for file: old.pq
version: 1
num of rows: 1
created by: parquet-rs version 53.2.0
metadata:
parquet.avro.schema: {"type":"record","name":"my_record","fields":[{"name":"a","type":{"type":"array","items":{"type":"array","items":"int"}}}]}
writer.model.name: avro
ARROW:schema: /////wABAAAQAAAAAAAKAAwACgAJAAQACgAAABAAAAAAAQQACAAIAAAABAAIAAAABAAAAAEAAAAEAAAAnP///xgAAAAMAAAAAAAADLQAAAABAAAACAAAAMD///+8////GAAAAAwAAAAAAAANiAAAAAEAAAAIAAAA4P///9z///8cAAAADAAAAAAAAAxcAAAAAQAAABwAAAAEAAQABAAAABAAFAAQAAAADwAEAAAACAAQAAAAGAAAACAAAAAAAAACHAAAAAgADAAEAAsACAAAACAAAAAAAAABAAAAAAUAAABhcnJheQAAAAUAAABhcnJheQAAAAUAAABhcnJheQAAAAEAAABhAAAA
message arrow_schema {
  REQUIRED group a (LIST) {
    REPEATED group list {
      REQUIRED group array {
        REQUIRED group array (LIST) {
          REPEATED group list {
            REQUIRED INT32 array;
          }
        }
      }
    }
  }
}
Expected behavior
The test file should be read as nested lists and produce the following schema:
message arrow_schema {
  REQUIRED group a (LIST) {
    REPEATED group list {
      REQUIRED group array (LIST) {
        REPEATED group list {
          REQUIRED INT32 array;
        }
      }
    }
  }
}
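To make the difference between the two readings concrete, here is a minimal Rust sketch that models both shapes and prints them in a compact type notation. `SketchType`, `current_reading`, and `expected_reading` are hypothetical names invented for illustration; this is not arrow-rs's `DataType`, only a simplified stand-in.

```rust
use std::fmt;

/// Hypothetical, simplified stand-in for an Arrow data type
/// (an assumption for illustration, not arrow-rs's `DataType`).
#[derive(Debug, Clone, PartialEq)]
enum SketchType {
    Int32,
    List(Box<SketchType>),
    /// Struct with named fields.
    Struct(Vec<(String, SketchType)>),
}

impl fmt::Display for SketchType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchType::Int32 => write!(f, "int32"),
            SketchType::List(inner) => write!(f, "list<{inner}>"),
            SketchType::Struct(fields) => {
                write!(f, "struct<")?;
                for (i, (name, ty)) in fields.iter().enumerate() {
                    if i > 0 {
                        write!(f, ", ")?;
                    }
                    write!(f, "{name}: {ty}")?;
                }
                write!(f, ">")
            }
        }
    }
}

/// Shape corresponding to the schema written today: the inner list is
/// wrapped in a one-field struct named "array".
fn current_reading() -> SketchType {
    let inner = SketchType::List(Box::new(SketchType::Int32));
    SketchType::List(Box::new(SketchType::Struct(vec![(
        "array".to_string(),
        inner,
    )])))
}

/// Shape corresponding to the expected schema: a list of integer lists.
fn expected_reading() -> SketchType {
    SketchType::List(Box::new(SketchType::List(Box::new(SketchType::Int32))))
}
```

Printing `current_reading()` and `expected_reading()` with `{}` shows the extra struct layer that the expected schema should not contain.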
Additional context
The root cause is the naming of the repeated group as "array". This causes the code that handles legacy lists to use a rule which states:
If the repeated field is a group with one field and is named either array or uses the LIST-annotated group's name with _tuple appended then the repeated type is the element type and elements are required.
This rule should not apply here because a) the only child of the repeated group "array" is itself repeated, and b) the repeated group carries a LIST annotation.
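The decision described above can be sketched in Rust as follows. This is not the arrow-rs implementation: `Node`, `Repetition`, `Element`, and `classify` are hypothetical names over a simplified schema model, and treating the combination of conditions a) and b) as the trigger for a nested list is an assumption taken from this report.

```rust
/// Repetition of a schema node.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Repetition {
    Required,
    Optional,
    Repeated,
}

/// Hypothetical, simplified schema node (not the parquet crate's `Type`).
#[derive(Debug, Clone)]
struct Node {
    name: String,
    repetition: Repetition,
    is_list_annotated: bool,
    /// Empty for primitive fields.
    children: Vec<Node>,
}

impl Node {
    fn primitive(name: &str, repetition: Repetition) -> Self {
        Node { name: name.to_string(), repetition, is_list_annotated: false, children: vec![] }
    }

    fn group(name: &str, repetition: Repetition, is_list_annotated: bool, children: Vec<Node>) -> Self {
        Node { name: name.to_string(), repetition, is_list_annotated, children }
    }
}

/// How the repeated field under a LIST-annotated group is interpreted.
#[derive(Debug, PartialEq, Eq)]
enum Element {
    /// The repeated field itself is the required element (legacy forms).
    RepeatedField,
    /// The repeated field is itself a list, giving a nested list.
    NestedList,
    /// Standard three-level form: the element is the group's only child.
    OnlyChild,
}

/// `list_name` is the name of the enclosing LIST-annotated group.
fn classify(list_name: &str, repeated: &Node) -> Element {
    // Primitives and multi-field groups are the element themselves.
    if repeated.children.len() != 1 {
        return Element::RepeatedField;
    }
    let child = &repeated.children[0];
    // Assumption from this report: conditions a) and b) together mean the
    // repeated group is a nested list, so the name-based rule is skipped.
    if repeated.is_list_annotated && child.repetition == Repetition::Repeated {
        return Element::NestedList;
    }
    // Legacy name-based rule quoted above.
    if repeated.name == "array" || repeated.name == format!("{list_name}_tuple") {
        return Element::RepeatedField;
    }
    Element::OnlyChild
}
```

Under these assumptions, a LIST-annotated repeated group named "array" with a single repeated child is classified as `NestedList`, while a plain repeated group named "array" or "a_tuple" still falls under the legacy rule. Checking the structural conditions before the name check is what prevents the name "array" from forcing the one-tuple interpretation.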
Describe the bug
A file with the schema of old_list_structure.parquet, in which the repeated group of a LIST-annotated group is named "array", is itself annotated LIST, and has a repeated child, is currently read by arrow-rs as a list<struct<list<int32>>>, i.e. a list of one-tuples, each encapsulating a list of integers. Consensus is forming around the notion that this should instead be a nested list of integer lists (see https://github.com/apache/parquet-format/pull/466 and https://github.com/apache/arrow/pull/43995).