apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.62k stars 802 forks source link

Primitive REPEATED fields not contained in LIST annotated groups aren't read as lists by record reader #6648

Closed zeevm closed 2 weeks ago

zeevm commented 3 weeks ago

Describe the bug Primitive REPEATED fields not contained in LIST annotated groups should be read as lists according to the format but aren't.

To Reproduce

Expected behavior

Additional context

zeevm commented 3 weeks ago

repeated_no_list.parquet.gz The attached file has two REPEATED primitive fields that aren't read correctly with the current implementation: {Int32: 0, String: "foo"} {Int32: 1, String: "zero"} {Int32: 2, String: "one"} {Int32: 3, String: "two"}

The fix yields the proper result: {Int32: [0, 1, 2, 3], String: ["foo", "zero", "one", "two"]} {Int32: [], String: ["three"]} {Int32: [4], String: ["four"]} {Int32: [5, 6, 7, 8], String: ["five", "six", "seven", "eight"]}

etseidl commented 3 weeks ago

TIL, thanks! One question, however. The spec actual says A repeated field that is neither contained by a LIST- or MAP-annotated group nor annotated by LIST or MAP should be interpreted as a required list of required elements where the element type is the type of the field. So is it allowable for the max def level to be 1 here? If the elements are required, shouldn't max_def be 0? TBH I'm not sure why the spec is worded that way...nulls are clearly detectable so I'd think the required vs optional could be deduced from the max def level. In fact, arrow-cpp/pyarrow seems to read the given file just fine (although the arrow schema seems do indicate all elements are not null).

>>> df = pq.read_table('repeated_no_list.parquet')
>>> df
pyarrow.Table
Int32: list<Int32: int32 not null> not null
  child 0, Int32: int32 not null
String: list<String: string not null> not null
  child 0, String: string not null
----
Int32: [[[0,1,2,3],[],[4],[5,6,7,8]]]
String: [[["foo","zero","one","two"],["three"],["four"],["five","six","seven","eight"]]]

Then again, lists confuse me. So even if required, is it that a REPEATED field has 0 to many entries?

mapleFU commented 3 weeks ago

FYI, We currently meet a similiar problem and issue a discussion here[1], also refer to [2]

[1] https://lists.apache.org/thread/s6b25j3x26009v054yqjov0f1z49ctqj [2] https://github.com/apache/parquet-format/pull/466

wgtmac commented 3 weeks ago

This is not a normal LIST type specified by the Parquet spec where a LIST annotation on a group type is required. Should we disable writing lists as this in parquet-rs? We should follow the three level list encoding.

etseidl commented 3 weeks ago

But IIUC this form is permitted by the spec. It just can't be mixed with the LIST annotation.

tustvold commented 3 weeks ago

I believe this is specific to the record reader, i.e. the non arrow reader, and so have updated the title to reflect this. Let me know if I am mistaken

https://github.com/apache/arrow-rs/issues/2394 is also related

zeevm commented 3 weeks ago

@etseidl

So even if required, is it that a REPEATED field has 0 to many entries?

I think so, it aligns with the definition of REPEATED in the thrift:

" /* The field is repeated and can contain 0 or more values / REPEATED = 2;"

zeevm commented 3 weeks ago

@tustvold

I believe this is specific to the record reader, i.e. the non arrow reader, and so have updated the title to reflect this. Let me know if I am mistaken

2394 is also related

Correct, this is only for the record reader. TBH I totally forgot I've opened the same issue you've mentioned from two years ago :)

Every time we encounter such a file with one of our customers I'm looking into it again, finally decided to just fix it.

zeevm commented 3 weeks ago

@wgtmac

This is not a normal LIST type specified by the Parquet spec where a LIST annotation on a group type is required. Should we disable writing lists as this in parquet-rs? We should follow the three level list encoding.

This is allowed by the format, other writers sometimes generate such files (e.g. the customer file that tripped our code was created with parquet-mr)

wgtmac commented 3 weeks ago

Thanks @zeevm! I think the provided file is worth adding to https://github.com/apache/parquet-testing as a good example for interoperability test.

wgtmac commented 3 weeks ago

It seems that parquet-cli (backed by parquet-mr) cannot read it:

> parquet-cli meta repeated_no_list.parquet

File path:  repeated_no_list.parquet
Created by: parquet-rs version 53.2.0
Properties: (none)
Schema:
message schema {
  repeated int32 Int32;
  repeated binary String (STRING);
}

Row group 0:  count: 4  65.75 B records  start: 4  total(compressed): 263 B total(uncompressed):263 B
--------------------------------------------------------------------------------
        type      encodings count     avg size   nulls   min / max
Int32   INT32     _ RR_     10        10.90 B    1       "0" / "8"
String  BINARY    _ RR_     10        15.40 B    0       "eight" / "zero"

> parquet-cli cat repeated_no_list.parquet
Unknown error
java.lang.RuntimeException: Failed on record 0 in file repeated_no_list.parquet
    at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:89)
    at org.apache.parquet.cli.Main.run(Main.java:163)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
    at org.apache.parquet.cli.Main.main(Main.java:191)
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/Users/gangwu/Downloads/repeated_no_list.parquet
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:280)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
    at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:140)
    at org.apache.parquet.cli.BaseCommand$1$1.advance(BaseCommand.java:356)
    at org.apache.parquet.cli.BaseCommand$1$1.<init>(BaseCommand.java:337)
    at org.apache.parquet.cli.BaseCommand$1.iterator(BaseCommand.java:335)
    at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:76)
    ... 3 more
Caused by: org.apache.parquet.io.ParquetDecodingException: The requested schema is not compatible with the file schema. incompatible types: required group Int32 (LIST) {
  repeated int32 array;
} != repeated int32 Int32
    at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.incompatibleSchema(ColumnIOFactory.java:104)
    at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visitChildren(ColumnIOFactory.java:81)
    at org.apache.parquet.io.ColumnIOFactory$ColumnIOCreatorVisitor.visit(ColumnIOFactory.java:57)
    at org.apache.parquet.schema.MessageType.accept(MessageType.java:52)
    at org.apache.parquet.io.ColumnIOFactory.getColumnIO(ColumnIOFactory.java:167)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:155)
    at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:245)
    ... 9 more
etseidl commented 3 weeks ago

It seems that parquet-cli (backed by parquet-mr) cannot read it:

Strange. An old parquet-tools 1.11.1 jar I have can read it, however.

row group 0 
--------------------------------------------------------------------------------
Int32:   INT32 UNCOMPRESSED DO:4 FPO:54 SZ:109/109/1.00 VC:10 ENC:PLAIN,RLE,RLE_DICTIONARY ST:[min: 0, max: 8, num_nulls: 1]
String:  BINARY UNCOMPRESSED DO:113 FPO:208 SZ:154/154/1.00 VC:10 ENC:PLAIN,RLE,RLE_DICTIONARY ST:[min: eight, max: zero, num_nulls: 0]

    Int32 TV=10 RL=1 DL=1 DS:  9 DE:PLAIN
    ----------------------------------------------------------------------------
    page 0:                     DLE:RLE RLE:RLE VLE:RLE_DICTIONARY ST:[min: 0, max: 8, num_nulls: 1] CRC:[none] SZ:24 VC:10

    String TV=10 RL=1 DL=1 DS: 10 DE:PLAIN
    ----------------------------------------------------------------------------
    page 0:                     DLE:RLE RLE:RLE VLE:RLE_DICTIONARY ST:[min: eight, max: zero, num_nulls: 0] CRC:[none] SZ:23 VC:10

INT32 Int32 
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 10 *** 
value 1:  R:0 D:1 V:0
value 2:  R:1 D:1 V:1
value 3:  R:1 D:1 V:2
value 4:  R:1 D:1 V:3
value 5:  R:0 D:0 V:<null>
value 6:  R:0 D:1 V:4
value 7:  R:0 D:1 V:5
value 8:  R:1 D:1 V:6
value 9:  R:1 D:1 V:7
value 10: R:1 D:1 V:8

BINARY String 
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 10 *** 
value 1:  R:0 D:1 V:foo
value 2:  R:1 D:1 V:zero
value 3:  R:1 D:1 V:one
value 4:  R:1 D:1 V:two
value 5:  R:0 D:1 V:three
value 6:  R:0 D:1 V:four
value 7:  R:0 D:1 V:five
value 8:  R:1 D:1 V:six
value 9:  R:1 D:1 V:seven
value 10: R:1 D:1 V:eight
alamb commented 6 days ago

label_issue.py automatically added labels {'parquet'} from #6649