Closed Berrysoft closed 1 year ago
Looking at the parquet-mr repository, the closest thing to an authoritative parquet implementation, the footer length is an i32 - https://github.com/apache/parquet-mr/blob/d2c3c6d2e761a17b75d56fd356a37a2f754072f7/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java#L1342C30-L1342C30
This is consistent with parquet in general, which as a result of its Java pedigree tends to use signed quantities everywhere. I suspect arrow-cpp should probably not let you write such a file.
parquet-cpp uses u32, thus pyarrow also uses u32.
I didn't face any warnings or errors when writing such a file with pyarrow.
parquet-cpp has been moved into arrow, and pyarrow shares the same underlying implementation as parquet-cpp.
I'm OK with changing it to u32, but I guess 44 GiB is too large for a parquet file; you may also need a huge thrift container for that.
It's really a huge file 🤣, and we have hundreds of such files to query.
pyarrow needs to specify a larger thrift container buffer size. Both pyarrow and arrow-rs need a long time to parse the metadata.
label_issue.py
automatically added labels {'parquet'} from #4599
Describe the bug The footer parser of the parquet reader incorrectly uses i32 to read the size of the footer.
To Reproduce It's not that easy to generate such a large parquet file. I have a 44G parquet file generated by pyarrow.
Expected behavior It should be opened successfully because it is generated by pyarrow:)
Additional context https://github.com/apache/arrow-rs/blob/16744e5ac08d9ead6c51ff6e08d8b91e87460c52/parquet/src/file/footer.rs#L106-L112
It should be u32 here, and we don't need the error below.
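To make the difference concrete, here is a minimal sketch of the two ways to decode the footer length. It relies only on the Parquet file layout, where the last 8 bytes of a file are a 4-byte little-endian metadata length followed by the magic bytes PAR1. The helper names `metadata_len_signed` and `metadata_len_unsigned` are hypothetical and are not the actual arrow-rs functions; the error messages are likewise made up for illustration.

```rust
/// Size of the Parquet footer: 4-byte little-endian metadata length + 4-byte magic.
const FOOTER_SIZE: usize = 8;
const MAGIC: [u8; 4] = *b"PAR1";

/// Hypothetical helper: check the trailing magic and return the 4 length bytes.
fn length_bytes(footer: &[u8; FOOTER_SIZE]) -> Result<[u8; 4], String> {
    if footer[4..] != MAGIC[..] {
        return Err("missing PAR1 magic".to_string());
    }
    Ok([footer[0], footer[1], footer[2], footer[3]])
}

/// Hypothetical helper: interpret the length as i32 and reject negative values.
/// Any metadata length of 2 GiB or more has the top bit set and fails here.
fn metadata_len_signed(footer: &[u8; FOOTER_SIZE]) -> Result<usize, String> {
    let len = i32::from_le_bytes(length_bytes(footer)?);
    if len < 0 {
        return Err(format!("metadata length is negative: {len}"));
    }
    Ok(len as usize)
}

/// Hypothetical helper: interpret the same bytes as u32.
/// No sign check is needed, so lengths up to 4 GiB - 1 are accepted.
fn metadata_len_unsigned(footer: &[u8; FOOTER_SIZE]) -> Result<usize, String> {
    let len = u32::from_le_bytes(length_bytes(footer)?);
    Ok(len as usize)
}
```

For a small footer both helpers return the same value. They only disagree once the metadata length reaches 2^31 bytes, which is presumably the situation with the 44G file described above: the signed variant reports an error, while the unsigned variant returns the length that the writer stored.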