apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Prevent writing a file with a footer size exceeding the Int max value #2986

Closed ConeyLiu closed 2 months ago

ConeyLiu commented 3 months ago

Describe the bug, including details regarding any error messages, version, and platform.

The footer length is written as a signed 32-bit int:

BytesUtils.writeIntLittleEndian(out, (int) (out.getPos() - footerIndex));

This forced cast is not safe: when the serialized footer exceeds Integer.MAX_VALUE bytes, the value wraps around and we write out a corrupted file:

java.lang.RuntimeException: corrupted file: the footer index is not within the file: 10200584257
        at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:571)
        at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:799)
        at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:666)
        at org.apache.iceberg.parquet.ReadConf.newReader(ReadConf.java:240)
        at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:81)
        at org.apache.iceberg.parquet.VectorizedParquetReader.init(VectorizedParquetReader.java:90)
        at org.apache.iceberg.parquet.VectorizedParquetReader.iterator(VectorizedParquetReader.java:99)
        at org.apache.iceberg.spark.source.BatchDataReader.open(BatchDataReader.java:195)
        at org.apache.iceberg.spark.source.BatchDataReader.open(BatchDataReader.java:49)
        at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:150)
        at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:119)
        at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:156)
        at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:63)
        at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:63)
        at scala.Option.exists(Option.scala:376)
        at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
        at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:97)
        at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:63)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
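A write-side bounds check would fail fast at write time instead of producing a file that later fails with the trace above. A minimal sketch (the class, method, and argument names here are illustrative, not parquet-java API):

```java
import java.io.IOException;

public class FooterLengthGuard {
    // Validate that the footer length fits in a signed 32-bit int
    // before performing the cast that parquet-java does today.
    static int checkedFooterLength(long pos, long footerIndex) throws IOException {
        long footerLength = pos - footerIndex;
        if (footerLength < 0 || footerLength > Integer.MAX_VALUE) {
            throw new IOException(
                "Footer size " + footerLength + " does not fit in a signed 32-bit int");
        }
        return (int) footerLength; // safe: bounds checked above
    }

    public static void main(String[] args) throws IOException {
        System.out.println(checkedFooterLength(1_000_000L, 0L));
        try {
            checkedFooterLength(3_000_000_000L, 0L);
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing the write is strictly better than the current behavior, where the truncated length is only discovered when a reader tries to locate the footer.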

Component(s)

No response

dylanburati commented 2 months ago

I hit the same issue: a file corrupted by overflow in this field. It was written by the Rust parquet crate, which stores this field as an unsigned int (link), and the file is still readable with pyarrow. I'm wondering whether this field could be treated as unsigned on the Java side as well, since the format specification doesn't appear to define it as i32.

using parquet-cli 1.14.1:

$ tail -c 64 ~/Downloads/enwiki/20240620/enwiki_20240620.parquet | xxd -g 4
00000000: 41414141 41414141 41454141 41414141  AAAAAAAAAEAAAAAA
00000010: 67414141 476c6b41 41413d00 18197061  gAAAGlkAAA=...pa
00000020: 72717565 742d7273 20766572 73696f6e  rquet-rs version
00000030: 2033342e 302e3000 e755eb8a 50415231   34.0.0..U..PAR1

$ parquet pages ~/Downloads/enwiki/20240620/enwiki_20240620.parquet
Unknown error
java.lang.RuntimeException: corrupted file: the footer index is not within the file: 39975304334
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:608)
        at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:902)
        at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:659)
        at org.apache.parquet.cli.commands.ShowPagesCommand.run(ShowPagesCommand.java:93)
        at org.apache.parquet.cli.Main.run(Main.java:163)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
        at org.apache.parquet.cli.Main.main(Main.java:191)

$ python -c "print($(stat -c %s ~/Downloads/enwiki/20240620/enwiki_20240620.parquet) - 8 - (-0x10000_0000 + 0x8aeb_55e7))"
39975304334

$ python -c 'import pyarrow.parquet as pq; f = pq.ParquetFile("~/Downloads/enwiki/20240620/enwiki_20240620.parquet"); print(f.metadata)'
<pyarrow._parquet.FileMetaData object at 0x729a06892a70>
  created_by: parquet-rs version 34.0.0
  num_columns: 6
  num_rows: 23802888
  num_row_groups: 238062
  format_version: 1.0
  serialized_size: 2330678759
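For what it's worth, decoding the four length bytes from the hexdump above (`e7 55 eb 8a`, little-endian) as unsigned reproduces exactly pyarrow's `serialized_size` of 2330678759, which suggests an unsigned read would recover this file. A sketch of that interpretation (class and method names are illustrative):

```java
public class UnsignedFooterLength {
    // Decode the 4-byte little-endian footer length, then widen it
    // to a long without sign extension.
    static long readUnsignedFooterLength(byte[] b) {
        int signed = (b[0] & 0xFF)
                   | (b[1] & 0xFF) << 8
                   | (b[2] & 0xFF) << 16
                   | (b[3] & 0xFF) << 24;
        return Integer.toUnsignedLong(signed);
    }

    public static void main(String[] args) {
        // The length bytes immediately preceding the PAR1 magic above.
        byte[] lenBytes = {(byte) 0xE7, (byte) 0x55, (byte) 0xEB, (byte) 0x8A};
        // Signed interpretation wraps to a negative value...
        System.out.println((int) readUnsignedFooterLength(lenBytes)); // -1964288537
        // ...while the unsigned interpretation matches pyarrow's serialized_size.
        System.out.println(readUnsignedFooterLength(lenBytes)); // 2330678759
    }
}
```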
ConeyLiu commented 2 months ago

Sounds reasonable, let me investigate it.