apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Footer parsing fails for very large parquet file. #4592

Closed Berrysoft closed 1 year ago

Berrysoft commented 1 year ago

Describe the bug: The parquet reader's footer parser incorrectly uses i32 to read the size of the footer.

To Reproduce: It's not easy to generate such a large parquet file. I have a 44G parquet file generated by pyarrow.

Expected behavior: The file should open successfully, because it was generated by pyarrow :)

Additional context: https://github.com/apache/arrow-rs/blob/16744e5ac08d9ead6c51ff6e08d8b91e87460c52/parquet/src/file/footer.rs#L106-L112

It should be u32 here, and we don't need the error below.
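To make the difference concrete, here is a minimal sketch, not the actual arrow-rs code. It assumes the usual Parquet footer layout of a 4-byte little-endian metadata length followed by the `PAR1` magic. The function names `decode_len_signed` and `decode_len_unsigned` are hypothetical and exist only for this illustration.

```rust
/// Footer layout assumed here: 4-byte little-endian metadata length + 4-byte magic.
const FOOTER_SIZE: usize = 8;
const PARQUET_MAGIC: [u8; 4] = *b"PAR1";

/// Hypothetical helper: interprets the length bytes as i32 and rejects negatives.
fn decode_len_signed(footer: &[u8; FOOTER_SIZE]) -> Result<usize, String> {
    if footer[4..] != PARQUET_MAGIC {
        return Err("corrupt footer: bad magic".to_string());
    }
    let len = i32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]);
    // Any length with the top bit set (>= 2^31) shows up as a negative number here.
    usize::try_from(len).map_err(|_| format!("metadata length is less than zero ({})", len))
}

/// Hypothetical helper: interprets the same bytes as u32, so no sign check is needed.
fn decode_len_unsigned(footer: &[u8; FOOTER_SIZE]) -> Result<usize, String> {
    if footer[4..] != PARQUET_MAGIC {
        return Err("corrupt footer: bad magic".to_string());
    }
    let len = u32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]);
    // Assumes usize is at least 32 bits wide, which holds on common targets.
    Ok(len as usize)
}
```

With a footer whose length bytes encode 2^31, the signed variant returns an error, while the unsigned variant returns the length unchanged.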

tustvold commented 1 year ago

Looking at the parquet-mr repository, the closest thing to an authoritative parquet implementation, the footer length is an i32 - https://github.com/apache/parquet-mr/blob/d2c3c6d2e761a17b75d56fd356a37a2f754072f7/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L1342C30-L1342C30

This is consistent with parquet in general, which, as a result of its Java pedigree, tends to use signed quantities everywhere. I suspect arrow-cpp should probably not let you write such a file.

Berrysoft commented 1 year ago

parquet-cpp uses u32, thus pyarrow also uses u32.

https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/src/parquet/file_reader.cc#L188-L189

https://github.com/apache/parquet-cpp/blob/642da055adf009652689b20e68a198cffb857651/src/parquet/file_writer.cc#L368

I didn't face any warnings or errors when writing such a file with pyarrow.

mapleFU commented 1 year ago

parquet-cpp has been moved into arrow, and pyarrow shares the same underlying implementation with parquet-cpp.

I'm OK with changing it to u32, but I guess 44 GiB is too large for a parquet file; you may also need a huge thrift container for that.

Berrysoft commented 1 year ago

It's really a huge file 🤣, and we have hundreds of such files to query.

pyarrow needs a larger thrift container buffer size to be specified. Both pyarrow and arrow-rs need a long time to parse the metadata.

tustvold commented 1 year ago

label_issue.py automatically added labels {'parquet'} from #4599