toParquetMetadata method in ParquetMetadataConverter does not set dictionary page offset bit

The toParquetMetadata method converts org.apache.parquet.hadoop.metadata.ParquetMetadata to org.apache.parquet.format.FileMetaData, but it does not set the dictionary page offset bit in FileMetaData.

When a FileMetaData object is serialized into the footer and later deserialized, the dictionary page offset is lost because the dictionary page offset bit was never set.
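The loss on a round trip can be illustrated with a minimal sketch of how an optional field guarded by an "is set" bit behaves. This is not the Thrift-generated Parquet code: ColumnMetaDataSketch, its methods, and the OptionalLong "wire" representation are hypothetical stand-ins, assuming only that an optional field is written when its bit is set and skipped otherwise.

```java
import java.util.OptionalLong;

// Hypothetical stand-in for a generated struct with one optional long field.
final class ColumnMetaDataSketch {
  private long dictionaryPageOffset;        // raw value, only meaningful when the bit is set
  private boolean dictionaryPageOffsetSet;  // models the "is set" bit

  void setDictionaryPageOffset(long offset) {
    this.dictionaryPageOffset = offset;
    this.dictionaryPageOffsetSet = true;    // the setter is what flips the bit
  }

  boolean isSetDictionaryPageOffset() {
    return dictionaryPageOffsetSet;
  }

  // Models serialization: the optional field is emitted only when its bit is set.
  OptionalLong serialize() {
    return dictionaryPageOffsetSet
        ? OptionalLong.of(dictionaryPageOffset)
        : OptionalLong.empty();
  }

  // Models deserialization: the field is restored only if it was on the wire.
  static ColumnMetaDataSketch deserialize(OptionalLong wire) {
    ColumnMetaDataSketch result = new ColumnMetaDataSketch();
    wire.ifPresent(result::setDictionaryPageOffset);
    return result;
  }

  public static void main(String[] args) {
    // The converter never called the setter, so nothing survives the round trip.
    ColumnMetaDataSketch neverSet = new ColumnMetaDataSketch();
    System.out.println(deserialize(neverSet.serialize()).isSetDictionaryPageOffset()); // false

    // Once the setter has been called, the offset survives.
    ColumnMetaDataSketch wasSet = new ColumnMetaDataSketch();
    wasSet.setDictionaryPageOffset(4L);
    System.out.println(deserialize(wasSet.serialize()).isSetDictionaryPageOffset());   // true
  }
}
```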

PARQUET-1850 attempted to fix this, but the fix is only partial.

It calls setDictionary_page_offset only when getEncodingStats() is non-null and reports dictionary pages:


    if (columnMetaData.getEncodingStats() != null
        && columnMetaData.getEncodingStats().hasDictionaryPages()) {
      metaData.setDictionary_page_offset(columnMetaData.getDictionaryPageOffset());
    }

However, it should also call setDictionary_page_offset when getEncodingStats() returns null but the column's encodings include a dictionary encoding.

It should reuse the hasDictionaryPage() implementation in ColumnChunkMetaData, shown below:


    public boolean hasDictionaryPage() {
      EncodingStats stats = getEncodingStats();
      if (stats != null) {
        return stats.hasDictionaryPages() && stats.hasDictionaryEncodedPages();
      }

      Set<Encoding> encodings = getEncodings();
      return (encodings.contains(PLAIN_DICTIONARY) || encodings.contains(RLE_DICTIONARY));
    }

So the new check in ParquetMetadataConverter should look like:


    if (columnMetaData.hasDictionaryPage()) {
      metaData.setDictionary_page_offset(columnMetaData.getDictionaryPageOffset());
    }
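To make the difference between the two conditions concrete, here is a small self-contained sketch that evaluates both for a column chunk that has no encoding stats but does list a dictionary encoding. The Encoding enum and Stats class are simplified, hypothetical stand-ins for the Parquet types, and currentCheck and proposedCheck are illustrative names that only mirror the conditions quoted in this issue.

```java
import java.util.EnumSet;
import java.util.Set;

final class DictionaryOffsetCheckSketch {
  // Hypothetical stand-in for Parquet's Encoding enum (only the values needed here).
  enum Encoding { PLAIN, PLAIN_DICTIONARY, RLE_DICTIONARY }

  // Hypothetical stand-in for EncodingStats, reduced to the two flags used below.
  static final class Stats {
    final boolean hasDictionaryPages;
    final boolean hasDictionaryEncodedPages;

    Stats(boolean hasDictionaryPages, boolean hasDictionaryEncodedPages) {
      this.hasDictionaryPages = hasDictionaryPages;
      this.hasDictionaryEncodedPages = hasDictionaryEncodedPages;
    }
  }

  // Mirrors the condition quoted above from the partial fix.
  static boolean currentCheck(Stats stats) {
    return stats != null && stats.hasDictionaryPages;
  }

  // Mirrors the hasDictionaryPage() logic quoted above.
  static boolean proposedCheck(Stats stats, Set<Encoding> encodings) {
    if (stats != null) {
      return stats.hasDictionaryPages && stats.hasDictionaryEncodedPages;
    }
    // No stats available: fall back to the list of encodings.
    return encodings.contains(Encoding.PLAIN_DICTIONARY)
        || encodings.contains(Encoding.RLE_DICTIONARY);
  }

  public static void main(String[] args) {
    // A column chunk without encoding stats that still uses a dictionary encoding.
    Set<Encoding> encodings = EnumSet.of(Encoding.PLAIN, Encoding.RLE_DICTIONARY);
    System.out.println(currentCheck(null));             // false: offset would not be set
    System.out.println(proposedCheck(null, encodings)); // true: offset would be set
  }
}
```

Note that, as quoted, the two conditions also differ when stats are present: the proposed check additionally requires hasDictionaryEncodedPages(), so it is not a strict superset of the current one.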

Reporter: Abhishek Dixit

PRs and other links:

GitHub Pull Request #1340

Note: This issue was originally created as PARQUET-2464. Please see the migration documentation for further details.
