apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Parquet reader throws error "Reading past RLE/BitPacking stream" for parquet file with null values #1555

Open asfimport opened 5 years ago

asfimport commented 5 years ago

Recently moved from parquet 1.8.x to 1.12.

The dataset has more than 20k null values to be written to a complex type. With 1.8.x this produced a single page, but with 1.12 it produces 20 pages (PARQUET-1414). Writing nulls to complex types is optimised: the nulls are cached (null cache) and flushed on the next non-null value or on an explicit flush/close. With 1.8, the writer would reach the explicit close, flush the null cache, and then write the page. With 1.12, the page is written prematurely after 20k values have been encountered.
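To make the suspected sequence concrete, here is a toy model of the behaviour described above. It is only a sketch under stated assumptions: `NullCacheSketch`, `writeNullRow`, and `rowLimit` are hypothetical names and not parquet-java classes, and the model ignores real byte sizes and level encodings. It only shows how a page cut that does not flush the null cache first could leave pages with a value count of 0.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these names come from parquet-java.
class NullCacheSketch {
    private final int rowLimit;                                   // hypothetical per-page row limit
    private final List<Integer> pageValueCounts = new ArrayList<>();
    private int cachedNulls = 0;   // nulls seen but not yet written into the page
    private int valuesInPage = 0;  // values actually written into the current page
    private int rowsInPage = 0;    // rows seen since the last page was written

    NullCacheSketch(int rowLimit) {
        this.rowLimit = rowLimit;
    }

    // A null is only counted in the cache; nothing reaches the page yet.
    void writeNullRow() {
        cachedNulls++;
        rowsInPage++;
        if (rowsInPage >= rowLimit) {
            // Assumed bug pattern: the page is cut without flushing the cache.
            writePage();
        }
    }

    private void flushNullCache() {
        valuesInPage += cachedNulls;
        cachedNulls = 0;
    }

    private void writePage() {
        pageValueCounts.add(valuesInPage);
        valuesInPage = 0;
        rowsInPage = 0;
    }

    // Explicit close: flush the cached nulls first, then write the last page.
    void close() {
        flushNullCache();
        writePage();
    }

    List<Integer> pageValueCounts() {
        return pageValueCounts;
    }
}
```

Feeding 111396 null rows into `new NullCacheSketch(20000)` and then calling `close()` yields page value counts of `[0, 0, 0, 0, 0, 111396]`, while a limit of `Integer.MAX_VALUE` yields the single page `[111396]`. The number of pages in the toy does not match the 20 pages in the dump, since the real page-cut conditions are not modelled here.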

 

Below is the metadata dump in both cases.

1.8:

index._id TV=111396 RL=0 DL=2
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[num_nulls: 111396, min/max not defined] SZ:8 VC:111396

1.12:

index._index TV=111396 RL=0 DL=2
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this column] SZ:4 VC:0
......
page 19: DLE:RLE RLE:BIT_PACKED VLE:PLAIN ST:[no stats for this column] SZ:8 VC:111396

All pages in 1.12 except the last one have the same metadata. The problem shows up when the Parquet reader kicks in: it sees RLE:BIT_PACKED and tries to read 8 bytes, which goes beyond the end of the stream because the page size is only 4 (hence "Reading past RLE/BitPacking stream").
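The reader-side failure can be illustrated with a small bounds check. This is a minimal sketch and not the actual parquet-java decoder: `BoundedLevelReader` and `readExactly` are hypothetical names, and the 4-byte and 8-byte figures are simply taken from the dump and the description above.

```java
import java.io.ByteArrayInputStream;

// Hypothetical helper, not a parquet-java class.
class BoundedLevelReader {
    private final ByteArrayInputStream in;

    BoundedLevelReader(byte[] pageBytes) {
        this.in = new ByteArrayInputStream(pageBytes);
    }

    // Returns exactly n bytes, or fails if the page holds fewer than n.
    byte[] readExactly(int n) {
        if (in.available() < n) {
            throw new IllegalArgumentException(
                "Reading past stream: wanted " + n + " bytes, only "
                    + in.available() + " available");
        }
        byte[] out = new byte[n];
        in.read(out, 0, n);
        return out;
    }
}
```

With a 4-byte page, `new BoundedLevelReader(new byte[4]).readExactly(8)` throws, which mirrors a reader that expects 8 bytes from a page whose size is only 4.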

Reporter: shyam narayan singh / @shyambits2004

Note: This issue was originally created as PARQUET-1575. Please see the migration documentation for further details.

asfimport commented 5 years ago

Gabor Szadovszky / @gszadovszky: parquet-mr 1.11 is not released yet so 1.12 is not even planned. Could you please provide the exact commit id you have tested with? I was not able to reproduce the issue. Could you provide more details (e.g. the schema of the file, exact number of records etc.) or a unit test for reproduction?

asfimport commented 5 years ago

shyam narayan singh / @shyambits2004: I actually tested with the master branch, which gives me 1.12.0-SNAPSHOT jars; that is why I said 1.12. Let me test it with 1.10 too. Looking at the code (1.10.x), the issue should reproduce there as well. I will provide a unit test case once I have tested with 1.10.