Closed ConeyLiu closed 2 months ago
I have the same issue with a file reported as corrupted due to overflow in this field; it was created using the Rust parquet crate, which uses unsigned ints for this field (link). Also, the file is usable with pyarrow. I'm wondering if this specific field could be treated as unsigned in Java as well, since it doesn't seem to be referenced as i32 in the format specification.
Using parquet-cli 1.14.1:
$ tail -c 64 ~/Downloads/enwiki/20240620/enwiki_20240620.parquet | xxd -g 4
00000000: 41414141 41414141 41454141 41414141 AAAAAAAAAEAAAAAA
00000010: 67414141 476c6b41 41413d00 18197061 gAAAGlkAAA=...pa
00000020: 72717565 742d7273 20766572 73696f6e rquet-rs version
00000030: 2033342e 302e3000 e755eb8a 50415231 34.0.0..U..PAR1
$ parquet pages ~/Downloads/enwiki/20240620/enwiki_20240620.parquet
Unknown error
java.lang.RuntimeException: corrupted file: the footer index is not within the file: 39975304334
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:608)
at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:902)
at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:659)
at org.apache.parquet.cli.commands.ShowPagesCommand.run(ShowPagesCommand.java:93)
at org.apache.parquet.cli.Main.run(Main.java:163)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
at org.apache.parquet.cli.Main.main(Main.java:191)
$ python -c "print($(stat -c %s ~/Downloads/enwiki/20240620/enwiki_20240620.parquet) - 8 - (-0x10000_0000 + 0x8aeb_55e7))"
39975304334
$ python -c 'import pyarrow.parquet as pq; f = pq.ParquetFile("~/Downloads/enwiki/20240620/enwiki_20240620.parquet"); print(f.metadata)'
<pyarrow._parquet.FileMetaData object at 0x729a06892a70>
created_by: parquet-rs version 34.0.0
num_columns: 6
num_rows: 23802888
num_row_groups: 238062
format_version: 1.0
serialized_size: 2330678759
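To make the arithmetic in the transcript above easier to follow, here is a minimal Python sketch that decodes the last 8 bytes shown in the xxd dump both ways. The helper names `footer_length` and `footer_start` are made up for illustration, and the file size is not printed anywhere in the transcript; it is inferred from the reported index (39975304334 + 8 + the signed length), so treat it as an assumption.

```python
import struct

# Last 8 bytes of the file, copied from the xxd dump above: a 4-byte
# little-endian footer length followed by the "PAR1" magic.
TAIL = bytes.fromhex("e755eb8a" "50415231")

def footer_length(tail: bytes, signed: bool) -> int:
    """Decode the footer length from the 8-byte file tail (hypothetical helper)."""
    if tail[4:] != b"PAR1":
        raise ValueError("missing PAR1 magic")
    fmt = "<i" if signed else "<I"  # signed vs. unsigned 32-bit little-endian
    return struct.unpack(fmt, tail[:4])[0]

def footer_start(file_size: int, length: int) -> int:
    """Offset where the footer begins: file size minus the 8-byte tail minus the footer."""
    return file_size - 8 - length

signed_len = footer_length(TAIL, signed=True)
unsigned_len = footer_length(TAIL, signed=False)
print(signed_len)    # -1964288537
print(unsigned_len)  # 2330678759, the same value as serialized_size above

# Assumption: file size inferred from the error message, not read from disk.
FILE_SIZE = 38011015805
print(footer_start(FILE_SIZE, signed_len))    # 39975304334, past the end of the file
print(footer_start(FILE_SIZE, unsigned_len))  # 35680337038, inside the file
```

The two candidate offsets differ by exactly 2^32, which is consistent with the length being written as unsigned and read back as signed.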
Sounds reasonable, let me investigate it.
Describe the bug, including details regarding any error messages, version, and platform.
The footer size is assumed to be an int:
This forced cast is not safe. For example, we could write out a file whose footer size exceeds Integer.MAX_VALUE and end up with a corrupted file:
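The original snippets are not reproduced here. As a rough illustration only, the following Python sketch emulates what a narrowing cast to a 32-bit signed int does to an oversized footer length, and shows one possible guard on the write side. Both `java_int_cast` and `encode_footer_length` are hypothetical names, not parquet-java APIs, and rejecting the write is just one option (treating the field as unsigned on read is the other one discussed in this thread).

```python
import struct

INT_MAX = 2**31 - 1  # Java's Integer.MAX_VALUE

def java_int_cast(value: int) -> int:
    """Emulate a Java (int) narrowing cast: keep the low 32 bits, two's complement."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value > INT_MAX else value

def encode_footer_length(size: int) -> bytes:
    """Hypothetical checked writer: refuse sizes a signed 32-bit field cannot hold."""
    if not 0 <= size <= INT_MAX:
        raise OverflowError(f"footer size {size} does not fit in a signed 32-bit int")
    return struct.pack("<i", size)

# An unchecked cast silently wraps to a negative length.
print(java_int_cast(INT_MAX + 1))  # -2147483648
print(java_int_cast(2330678759))   # -1964288537

# The checked variant fails loudly instead of writing a bad length.
try:
    encode_footer_length(2330678759)
except OverflowError as exc:
    print(exc)
```

Failing at write time keeps the file from being produced with a length that a signed reader will misinterpret.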