Closed zeevm closed 2 weeks ago
repeated_no_list.parquet.gz The attached file has two REPEATED primitive fields that aren't read correctly with the current implementation: {Int32: 0, String: "foo"} {Int32: 1, String: "zero"} {Int32: 2, String: "one"} {Int32: 3, String: "two"}
The fix yields the proper result: {Int32: [0, 1, 2, 3], String: ["foo", "zero", "one", "two"]} {Int32: [], String: ["three"]} {Int32: [4], String: ["four"]} {Int32: [5, 6, 7, 8], String: ["five", "six", "seven", "eight"]}
TIL, thanks! One question, however. The spec actual says A repeated field that is neither contained by a LIST- or MAP-annotated group nor annotated by LIST or MAP should be interpreted as a required list of required elements where the element type is the type of the field. So is it allowable for the max def level to be 1 here? If the elements are required, shouldn't max_def be 0? TBH I'm not sure why the spec is worded that way...nulls are clearly detectable so I'd think the required vs optional could be deduced from the max def level. In fact, arrow-cpp/pyarrow seems to read the given file just fine (although the arrow schema seems do indicate all elements are not null).
>>> df = pq.read_table('repeated_no_list.parquet')
>>> df
pyarrow.Table
Int32: list<Int32: int32 not null> not null
child 0, Int32: int32 not null
String: list<String: string not null> not null
child 0, String: string not null
----
Int32: [[[0,1,2,3],[],[4],[5,6,7,8]]]
String: [[["foo","zero","one","two"],["three"],["four"],["five","six","seven","eight"]]]
Then again, lists confuse me. So even if required, is it that a REPEATED field has 0 to many entries?
FYI, We currently meet a similiar problem and issue a discussion here[1], also refer to [2]
[1] https://lists.apache.org/thread/s6b25j3x26009v054yqjov0f1z49ctqj [2] https://github.com/apache/parquet-format/pull/466
This is not a normal LIST type specified by the Parquet spec where a LIST annotation on a group type is required. Should we disable writing lists as this in parquet-rs? We should follow the three level list encoding.
But IIUC this form is permitted by the spec. It just can't be mixed with the LIST annotation.
I believe this is specific to the record reader, i.e. the non arrow reader, and so have updated the title to reflect this. Let me know if I am mistaken
https://github.com/apache/arrow-rs/issues/2394 is also related
@etseidl
So even if required, is it that a REPEATED field has 0 to many entries?
I think so, it aligns with the definition of REPEATED in the thrift:
" /* The field is repeated and can contain 0 or more values / REPEATED = 2;"
@tustvold
I believe this is specific to the record reader, i.e. the non arrow reader, and so have updated the title to reflect this. Let me know if I am mistaken
2394 is also related
Correct, this is only for the record reader. TBH I totally forgot I've opened the same issue you've mentioned from two years ago :)
Every time we encounter such a file with one of our customers I'm looking into it again, finally decided to just fix it.
@wgtmac
This is not a normal LIST type specified by the Parquet spec where a LIST annotation on a group type is required. Should we disable writing lists as this in parquet-rs? We should follow the three level list encoding.
This is allowed by the format, other writers sometimes generate such files (e.g. the customer file that tripped our code was created with parquet-mr)
Thanks @zeevm! I think the provided file is worth adding to https://github.com/apache/parquet-testing as a good example for interoperability test.
It seems that parquet-cli (backed by parquet-mr) cannot read it:
> parquet-cli meta repeated_no_list.parquet
File path: repeated_no_list.parquet
Created by: parquet-rs version 53.2.0
Properties: (none)
Schema:
message schema {
repeated int32 Int32;
repeated binary String (STRING);
}
Row group 0: count: 4 65.75 B records start: 4 total(compressed): 263 B total(uncompressed):263 B
--------------------------------------------------------------------------------
type encodings count avg size nulls min / max
Int32 INT32 _ RR_ 10 10.90 B 1 "0" / "8"
String BINARY _ RR_ 10 15.40 B 0 "eight" / "zero"
> parquet-cli cat repeated_no_list.parquet
Unknown error
java.lang.RuntimeException: Failed on record 0 in file repeated_no_list.parquet
at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:89)
at org.apache.parquet.cli.Main.run(Main.java:163)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
at org.apache.parquet.cli.Main.main(Main.java:191)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/Users/gangwu/Downloads/repeated_no_list.parquet
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:280)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:140)
at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:356)
at org.apache.parquet.cli.BaseCommand$1$1.<init>(BaseCommand.java:337)
at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:335)
at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:76)
... 3 more
Caused by: org.apache.parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: required group Int32 (LIST) {
repeated int32 array;
} != repeated int32 Int32
at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:104)
at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:81)
at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
at org.apache.parquet.schema.MessageType.accept(MessageType.java:52)
at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:167)
at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:155)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:245)
... 9 more
It seems that parquet-cli (backed by parquet-mr) cannot read it:
Strange. An old parquet-tools 1.11.1 jar I have can read it, however.
row group 0
--------------------------------------------------------------------------------
Int32: INT32 UNCOMPRESSED DO:4 FPO:54 SZ:109/109/1.00 VC:10 ENC:PLAIN,RLE,RLE_DICTIONARY ST:[min: 0, max: 8, num_nulls: 1]
String: BINARY UNCOMPRESSED DO:113 FPO:208 SZ:154/154/1.00 VC:10 ENC:PLAIN,RLE,RLE_DICTIONARY ST:[min: eight, max: zero, num_nulls: 0]
Int32 TV=10 RL=1 DL=1 DS: 9 DE:PLAIN
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:RLE_DICTIONARY ST:[min: 0, max: 8, num_nulls: 1] CRC:[none] SZ:24 VC:10
String TV=10 RL=1 DL=1 DS: 10 DE:PLAIN
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:RLE VLE:RLE_DICTIONARY ST:[min: eight, max: zero, num_nulls: 0] CRC:[none] SZ:23 VC:10
INT32 Int32
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 10 ***
value 1: R:0 D:1 V:0
value 2: R:1 D:1 V:1
value 3: R:1 D:1 V:2
value 4: R:1 D:1 V:3
value 5: R:0 D:0 V:<null>
value 6: R:0 D:1 V:4
value 7: R:0 D:1 V:5
value 8: R:1 D:1 V:6
value 9: R:1 D:1 V:7
value 10: R:1 D:1 V:8
BINARY String
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 10 ***
value 1: R:0 D:1 V:foo
value 2: R:1 D:1 V:zero
value 3: R:1 D:1 V:one
value 4: R:1 D:1 V:two
value 5: R:0 D:1 V:three
value 6: R:0 D:1 V:four
value 7: R:0 D:1 V:five
value 8: R:1 D:1 V:six
value 9: R:1 D:1 V:seven
value 10: R:1 D:1 V:eight
label_issue.py
automatically added labels {'parquet'} from #6649
Describe the bug Primitive REPEATED fields not contained in LIST annotated groups should be read as lists according to the format but aren't.
To Reproduce
Expected behavior
Additional context