apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Unable to set dictionary_page_offset when encoding_stats are missing #2962

Closed mothukur closed 2 months ago

mothukur commented 4 months ago

Describe the bug, including details regarding any error messages, version, and platform.

I ran into a problem while splitting a Parquet file into multiple files with the ParquetFileWriter.appendRowGroups API: the dictionary page offsets are not set correctly in the new files. On further investigation, I found that ParquetMetadataConverter.addRowGroup assumes that EncodingStats is always available. According to the format specification, encoding_stats is optional. Would it be possible to remove this requirement?

https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L559

https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L826
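
To illustrate the behavior described above, here is a minimal, self-contained sketch. It does not use parquet-java: `ChunkInfo`, `Encoding`, `strictOffset`, and `lenientOffset` are hypothetical names invented for this example. The fallback in `lenientOffset` (trust the recorded offset when a dictionary encoding is listed) is only an assumption about what a relaxed check could look like, not the actual fix.

```java
import java.util.EnumSet;
import java.util.OptionalLong;
import java.util.Set;

class DictionaryOffsetSketch {

    // Hypothetical mirror of a few Parquet encoding names.
    enum Encoding { PLAIN, PLAIN_DICTIONARY, RLE, RLE_DICTIONARY }

    // Hypothetical stand-in for column chunk metadata; not a parquet-java type.
    static final class ChunkInfo {
        final Boolean statsSayDictionaryPages; // null models missing encoding_stats
        final Set<Encoding> encodings;
        final long dictionaryPageOffset;

        ChunkInfo(Boolean statsSayDictionaryPages, Set<Encoding> encodings, long dictionaryPageOffset) {
            this.statsSayDictionaryPages = statsSayDictionaryPages;
            this.encodings = encodings;
            this.dictionaryPageOffset = dictionaryPageOffset;
        }
    }

    // Strict check: the offset is written only when encoding stats exist
    // and report dictionary pages. Missing stats means no offset at all.
    static OptionalLong strictOffset(ChunkInfo c) {
        if (c.statsSayDictionaryPages != null && c.statsSayDictionaryPages) {
            return OptionalLong.of(c.dictionaryPageOffset);
        }
        return OptionalLong.empty();
    }

    // Relaxed check (assumption): when stats are missing, fall back to the
    // encodings list and a positive recorded offset.
    static OptionalLong lenientOffset(ChunkInfo c) {
        if (c.statsSayDictionaryPages != null) {
            return strictOffset(c);
        }
        boolean usesDictionary = c.encodings.contains(Encoding.PLAIN_DICTIONARY)
                || c.encodings.contains(Encoding.RLE_DICTIONARY);
        if (usesDictionary && c.dictionaryPageOffset > 0) {
            return OptionalLong.of(c.dictionaryPageOffset);
        }
        return OptionalLong.empty();
    }

    public static void main(String[] args) {
        // A dictionary-encoded chunk written without encoding_stats.
        ChunkInfo chunk = new ChunkInfo(null, EnumSet.of(Encoding.PLAIN, Encoding.RLE_DICTIONARY), 4L);
        System.out.println("strict:  " + strictOffset(chunk));
        System.out.println("lenient: " + lenientOffset(chunk));
    }
}
```

With the strict check, a chunk that has a dictionary page but no encoding stats loses its dictionary page offset, which matches the symptom I see in the split files.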


wgtmac commented 4 months ago

Thanks for reporting the issue! I think there is a similar effort to resolve this, but the problem is more complicated than it first appears: https://github.com/apache/parquet-java/pull/1340

mothukur commented 2 months ago

I've submitted a PR with the fix. Could you please review it?