apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0
2.49k stars 1.37k forks source link

toParquetMetadata method in ParquetMetadataConverter does not set dictionary page offset bit #2901

Open asfimport opened 2 months ago

asfimport commented 2 months ago

toParquetMetadata method converts org.apache.parquet.hadoop.metadata.ParquetMetadata to org.apache.parquet.format.FileMetaData but this does not set the dictionary page offset bit in FileMetaData.

When a FileMetaData object is serialized while writing to the footer and then deserialized, the dictionary offset is lost as the dictionary page offset bit was never set.

PARQUET-1850  tried to fix this but it did only a partial fix.

It sets setDictionary_page_offset only if getEncodingStats are present


if (columnMetaData.getEncodingStats() != null
&& columnMetaData.getEncodingStats().hasDictionaryPages())
{ metaData.setDictionary_page_offset(columnMetaData.getDictionaryPageOffset()); } 

However, it should setDictionary_page_offset even when getEncodingStats are not present but encodings are present.

It should use the implementation in ColumnChunkMetatdata below:


public boolean hasDictionaryPage() {
EncodingStats stats = getEncodingStats();
if (stats != null) { 
return stats.hasDictionaryPages() && stats.hasDictionaryEncodedPages(); 
}

Set<Encoding> encodings = getEncodings();
return (encodings.contains(PLAIN_DICTIONARY) || encodings.contains(RLE_DICTIONARY));
} 

So new change in ParquetMetadataCOnvertor should be like:

 


if (columnMetaData.hasDictionaryPage()) { metaData.setDictionary_page_offset(columnMetaData.getDictionaryPageOffset()); } 

Reporter: Abhishek Dixit

PRs and other links:

Note: This issue was originally created as PARQUET-2464. Please see the migration documentation for further details.