apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Unable to set dictionary_page_offset when encoding_stats are missing #2962

Status: Open (mothukur opened 1 month ago)

mothukur commented 1 month ago

Describe the bug, including details regarding any error messages, version, and platform.

I am hitting an issue when splitting a Parquet file into multiple files with the ParquetFileWriter.appendRowGroups API: the dictionary page offsets are not set correctly in the new files. Investigating further, I found that ParquetMetadataConverter.addRowGroup assumes EncodingStats is always available. According to the format specification, encoding_stats is optional, so a valid file may omit it. Is it possible to remove this requirement?

https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L559

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L826
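To make the request concrete, here is a minimal, self-contained sketch of the kind of fallback I have in mind. It does not use the real parquet-java classes: `DictionaryOffsetSketch`, `resolveDictionaryPageOffset`, and the local `Encoding` enum are hypothetical stand-ins, and the fallback rule (positive offset plus a dictionary encoding in the column's encoding list) is only an assumption about what could work, not the project's actual fix.

```java
import java.util.EnumSet;
import java.util.OptionalLong;
import java.util.Set;

// Hypothetical stand-in; not part of parquet-java.
final class DictionaryOffsetSketch {

    // Local copy of a few encoding names from the Parquet format, for illustration only.
    enum Encoding { PLAIN, PLAIN_DICTIONARY, RLE, RLE_DICTIONARY }

    /**
     * Decides whether a dictionary page offset should be written to the footer.
     *
     * @param statsSayDictionaryPages null when encoding_stats is absent, otherwise
     *                                whether the stats report any dictionary pages
     * @param encodings               encodings listed for the column chunk
     * @param dictionaryPageOffset    offset recorded for the chunk (assumed 0 when unset)
     */
    static OptionalLong resolveDictionaryPageOffset(
            Boolean statsSayDictionaryPages, Set<Encoding> encodings, long dictionaryPageOffset) {
        boolean hasDictionary;
        if (statsSayDictionaryPages != null) {
            // encoding_stats present: trust it.
            hasDictionary = statsSayDictionaryPages;
        } else {
            // Assumed fallback when encoding_stats is missing: a positive offset
            // together with a dictionary encoding suggests a dictionary page.
            hasDictionary = dictionaryPageOffset > 0
                    && (encodings.contains(Encoding.PLAIN_DICTIONARY)
                        || encodings.contains(Encoding.RLE_DICTIONARY));
        }
        return hasDictionary ? OptionalLong.of(dictionaryPageOffset) : OptionalLong.empty();
    }

    public static void main(String[] args) {
        // Stats missing, but the chunk lists a dictionary encoding and has an offset.
        System.out.println(resolveDictionaryPageOffset(
                null, EnumSet.of(Encoding.RLE_DICTIONARY, Encoding.RLE), 4L));
        // Stats missing and only plain encoding: no offset is emitted.
        System.out.println(resolveDictionaryPageOffset(
                null, EnumSet.of(Encoding.PLAIN), 0L));
    }
}
```

The point of the sketch is only that the missing-stats case gets its own branch instead of silently dropping the offset; whether the encoding list alone is a reliable enough signal is an open question.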

wgtmac commented 1 month ago

Thanks for reporting the issue! I think there was a similar effort to resolve this, but the fix looks more complicated than it first appears: https://github.com/apache/parquet-java/pull/1340