The toParquetMetadata method converts org.apache.parquet.hadoop.metadata.ParquetMetadata to org.apache.parquet.format.FileMetaData, but it does not always set the dictionary page offset bit in FileMetaData.
When a FileMetaData object is serialized into the footer and then deserialized, the dictionary page offset is lost because the dictionary page offset bit was never set.
PARQUET-1850 tried to fix this, but the fix was only partial.
It calls setDictionary_page_offset only if getEncodingStats() is present:
if (columnMetaData.getEncodingStats() != null
    && columnMetaData.getEncodingStats().hasDictionaryPages()) {
  metaData.setDictionary_page_offset(columnMetaData.getDictionaryPageOffset());
}
However, it should also call setDictionary_page_offset when getEncodingStats() is not present but encodings are present.
It should use the implementation in ColumnChunkMetaData below:
So the new change in ParquetMetadataConverter should look like:
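The following is a minimal, self-contained sketch of the behavior described above, not the actual Parquet code. The classes Encoding, EncodingStats, ColumnChunk and ThriftColumnMetaData are hypothetical stand-ins for the real Parquet and Thrift types, and the method names hasDictionaryPage and toThrift are assumptions made for illustration. It shows the check falling back to the encodings when no encoding stats are available, so the offset is still set for dictionary-encoded columns.

```java
import java.util.EnumSet;
import java.util.Set;

public class DictionaryOffsetSketch {

  // Stand-in for the Parquet encoding enum; only the values needed here.
  enum Encoding { PLAIN, RLE, PLAIN_DICTIONARY, RLE_DICTIONARY }

  // Hypothetical stand-in for the encoding statistics of a column chunk.
  static final class EncodingStats {
    private final boolean dictionaryPages;
    EncodingStats(boolean dictionaryPages) { this.dictionaryPages = dictionaryPages; }
    boolean hasDictionaryPages() { return dictionaryPages; }
  }

  // Hypothetical stand-in for the column chunk metadata.
  static final class ColumnChunk {
    final EncodingStats encodingStats; // may be null, e.g. for older files
    final Set<Encoding> encodings;
    final long dictionaryPageOffset;

    ColumnChunk(EncodingStats encodingStats, Set<Encoding> encodings, long dictionaryPageOffset) {
      this.encodingStats = encodingStats;
      this.encodings = encodings;
      this.dictionaryPageOffset = dictionaryPageOffset;
    }

    // Assumed check: prefer the stats when present, otherwise inspect the encodings.
    boolean hasDictionaryPage() {
      if (encodingStats != null) {
        return encodingStats.hasDictionaryPages();
      }
      return encodings.contains(Encoding.PLAIN_DICTIONARY)
          || encodings.contains(Encoding.RLE_DICTIONARY);
    }
  }

  // Stand-in for the Thrift struct: an optional field plus its "is set" bit.
  static final class ThriftColumnMetaData {
    private long dictionaryPageOffset;
    private boolean dictionaryPageOffsetSet;

    void setDictionary_page_offset(long offset) {
      this.dictionaryPageOffset = offset;
      this.dictionaryPageOffsetSet = true;
    }
    boolean isSetDictionary_page_offset() { return dictionaryPageOffsetSet; }
    long getDictionary_page_offset() { return dictionaryPageOffset; }
  }

  // Sketch of the proposed conversion: set the offset whenever a dictionary page exists.
  static ThriftColumnMetaData toThrift(ColumnChunk column) {
    ThriftColumnMetaData metaData = new ThriftColumnMetaData();
    if (column.hasDictionaryPage()) {
      metaData.setDictionary_page_offset(column.dictionaryPageOffset);
    }
    return metaData;
  }

  public static void main(String[] args) {
    // No encoding stats, but a dictionary encoding is listed.
    ColumnChunk noStats =
        new ColumnChunk(null, EnumSet.of(Encoding.PLAIN_DICTIONARY, Encoding.RLE), 4L);
    // No encoding stats and no dictionary encoding.
    ColumnChunk plainOnly = new ColumnChunk(null, EnumSet.of(Encoding.PLAIN), 0L);

    System.out.println(toThrift(noStats).isSetDictionary_page_offset());
    System.out.println(toThrift(plainOnly).isSetDictionary_page_offset());
  }
}
```

With the stats-only condition from PARQUET-1850, the first column in main would lose its offset because its stats are null; with the fallback, the offset survives the conversion.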
Reporter: Abhishek Dixit
PRs and other links:
Note: This issue was originally created as PARQUET-2464. Please see the migration documentation for further details.