apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Old Parquet files with wrong Compressed Size not Readable #2926

Open pyckle opened 1 week ago

pyckle commented 1 week ago

In certain circumstances, the CLI fails to read old (perhaps ancient) Parquet files whose column metadata contains an incorrect compressed_size field, one that does not include the dictionary page (at least according to the comment in the code). The code that is supposed to handle this case reads the extra bytes into a byte buffer but never flips that buffer afterwards. It appears to have been broken for a few years now.
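For anyone unfamiliar with why the missing flip matters, here is a minimal sketch of the general `java.nio.ByteBuffer` behavior. It is not the actual parquet-java code: the class and method names (`FlipSketch`, `fillWithoutFlip`, `fillWithFlip`) and the sample bytes are made up for illustration. After relative `put` calls, the buffer's position sits at the end of the written data, so a reader sees nothing remaining until `flip()` resets the position to zero and sets the limit to the end of the data.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; these names do not exist in parquet-java.
class FlipSketch {

    // Writes the bytes into a fresh buffer and returns it as-is.
    // Position is left at the end of the data, so nothing is readable.
    static ByteBuffer fillWithoutFlip(byte[] extra) {
        ByteBuffer buf = ByteBuffer.allocate(extra.length);
        buf.put(extra);
        return buf;
    }

    // Same as above, but flips the buffer before handing it to a reader:
    // limit becomes the old position and position is reset to 0.
    static ByteBuffer fillWithFlip(byte[] extra) {
        ByteBuffer buf = ByteBuffer.allocate(extra.length);
        buf.put(extra);
        buf.flip();
        return buf;
    }

    public static void main(String[] args) {
        byte[] extra = {1, 2, 3, 4}; // stand-in for the extra bytes read from the file

        System.out.println("without flip, remaining = " + fillWithoutFlip(extra).remaining());
        System.out.println("with flip, remaining    = " + fillWithFlip(extra).remaining());
    }
}
```

Any consumer that reads from the unflipped buffer starts at the end of the data, which is consistent with the read failing for these files.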

I have written a PR that includes a defective Parquet file exhibiting this issue and a unit test that fails without the additional flip, and I have validated that the code works once the flip is added.

This is a minor issue that I came across while learning the code rather than while addressing a production problem, so there's no urgency.