apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Parquet readers incorrectly interpret legacy nested lists #6756

Open etseidl opened 2 days ago

Describe the bug

A file with the schema

message my_record {
  REQUIRED group a (LIST) {
    REPEATED group array (LIST) {
      REPEATED INT32 array;
    }
  }
}

is currently read by arrow-rs as list<struct<list<int32>>>, i.e. a list of one-field structs, each wrapping a list of integers. Consensus is forming that it should instead be read as a list of integer lists, list<list<int32>> (see https://github.com/apache/parquet-format/pull/466 and https://github.com/apache/arrow/pull/43995).
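To make the difference between the two readings concrete, here is a minimal sketch that models both type shapes and prints them in the notation used above. The `Ty` enum and the two helper functions are hypothetical stand-ins written for this illustration; they are not arrow-rs's `DataType` or any API from the parquet crate.

```rust
use std::fmt;

/// Hypothetical miniature type model (NOT arrow-rs's `DataType`).
enum Ty {
    Int32,
    List(Box<Ty>),
    Struct(Vec<Ty>),
}

impl fmt::Display for Ty {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Ty::Int32 => write!(f, "int32"),
            Ty::List(elem) => write!(f, "list<{elem}>"),
            Ty::Struct(fields) => {
                // Field names are omitted to match the notation in the issue text.
                let inner: Vec<String> = fields.iter().map(|t| t.to_string()).collect();
                write!(f, "struct<{}>", inner.join(", "))
            }
        }
    }
}

/// The shape the issue reports: the inner list is wrapped in a one-field struct.
fn current_reading() -> Ty {
    Ty::List(Box::new(Ty::Struct(vec![Ty::List(Box::new(Ty::Int32))])))
}

/// The shape the issue argues for: a list whose elements are integer lists.
fn expected_reading() -> Ty {
    Ty::List(Box::new(Ty::List(Box::new(Ty::Int32))))
}

fn main() {
    println!("current:  {}", current_reading());
    println!("expected: {}", expected_reading());
}
```

The only difference is the extra struct layer between the outer and inner list, which is what the bug introduces.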

To Reproduce

Run parquet-rewrite on the file old_list_structure.parquet in parquet-testing/data and print the schema from the resulting file.

% parquet-rewrite -i old_list_structure.parquet -o old.pq
% parquet-schema old.pq
Metadata for file: old.pq

version: 1
num of rows: 1
created by: parquet-rs version 53.2.0
metadata:
  parquet.avro.schema: {"type":"record","name":"my_record","fields":[{"name":"a","type":{"type":"array","items":{"type":"array","items":"int"}}}]}
  writer.model.name: avro
  ARROW:schema: /////wABAAAQAAAAAAAKAAwACgAJAAQACgAAABAAAAAAAQQACAAIAAAABAAIAAAABAAAAAEAAAAEAAAAnP///xgAAAAMAAAAAAAADLQAAAABAAAACAAAAMD///+8////GAAAAAwAAAAAAAANiAAAAAEAAAAIAAAA4P///9z///8cAAAADAAAAAAAAAxcAAAAAQAAABwAAAAEAAQABAAAABAAFAAQAAAADwAEAAAACAAQAAAAGAAAACAAAAAAAAACHAAAAAgADAAEAAsACAAAACAAAAAAAAABAAAAAAUAAABhcnJheQAAAAUAAABhcnJheQAAAAUAAABhcnJheQAAAAEAAABhAAAA
message arrow_schema {
  REQUIRED group a (LIST) {
    REPEATED group list {
      REQUIRED group array {
        REQUIRED group array (LIST) {
          REPEATED group list {
            REQUIRED INT32 array;
          }
        }
      }
    }
  }
}

Expected behavior

The test file should be read as nested lists and produce the following schema:

message arrow_schema {
  REQUIRED group a (LIST) {
    REPEATED group list {
      REQUIRED group array (LIST) {
        REPEATED group list {
          REQUIRED INT32 array;
        }
      }
    }
  }
}

Additional context

The root cause is the naming of the repeated group as "array". This causes the code that handles legacy lists to use a rule which states:

If the repeated field is a group with one field and is named either array or uses the LIST-annotated group's name with _tuple appended then the repeated type is the element type and elements are required.

This rule should not apply here for two reasons: a) the single child of the repeated group "array" is itself a repeated field, and b) the repeated group "array" carries a LIST annotation.
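The check described above can be sketched as follows. This is a simplified model, not the arrow-rs implementation: `Node`, `Repetition`, and both functions are hypothetical names introduced for illustration. It also assumes that either of the two conditions (a) or (b) is enough on its own to skip the legacy name rule, which the issue text does not state explicitly.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Repetition {
    Required,
    Repeated,
}

/// Hypothetical, simplified schema node (NOT the parquet crate's schema type).
#[derive(Debug)]
struct Node {
    name: &'static str,
    repetition: Repetition,
    list_annotated: bool,
    children: Vec<Node>, // empty for primitive fields
}

/// The legacy rule as quoted: a repeated group with exactly one field, named
/// `array` or `<list name>_tuple`, is itself the (required) element type.
fn name_rule_matches(list_name: &str, repeated: &Node) -> bool {
    let legacy_name =
        repeated.name == "array" || repeated.name == format!("{list_name}_tuple").as_str();
    repeated.children.len() == 1 && legacy_name
}

/// Same rule with the two exclusions argued for in this issue.
/// Assumption: either exclusion alone is sufficient to skip the rule.
fn name_rule_matches_with_exclusions(list_name: &str, repeated: &Node) -> bool {
    name_rule_matches(list_name, repeated)
        && !repeated.list_annotated // (b) repeated group has a LIST annotation
        && repeated.children[0].repetition != Repetition::Repeated // (a) child is repeated
}

/// `REPEATED group array (LIST) { REPEATED INT32 array; }` from the test file.
fn issue_repeated_group() -> Node {
    Node {
        name: "array",
        repetition: Repetition::Repeated,
        list_annotated: true,
        children: vec![Node {
            name: "array",
            repetition: Repetition::Repeated,
            list_annotated: false,
            children: vec![],
        }],
    }
}

/// A genuine legacy one-tuple: `REPEATED group a_tuple { REQUIRED INT32 x; }`.
fn legacy_tuple_group() -> Node {
    Node {
        name: "a_tuple",
        repetition: Repetition::Repeated,
        list_annotated: false,
        children: vec![Node {
            name: "x",
            repetition: Repetition::Required,
            list_annotated: false,
            children: vec![],
        }],
    }
}

fn main() {
    let nested = issue_repeated_group();
    println!("{} is {:?}", nested.name, nested.repetition);
    println!("rule as quoted:  {}", name_rule_matches("a", &nested));
    println!("with exclusions: {}", name_rule_matches_with_exclusions("a", &nested));
    let tuple = legacy_tuple_group();
    println!("legacy tuple:    {}", name_rule_matches_with_exclusions("a", &tuple));
}
```

In this model the group from the test file matches the rule as quoted but is rejected once the exclusions are applied, while a true legacy one-tuple group still matches.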