apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0
2.49k stars 1.37k forks source link

PARQUET-2464: Fix dictionaryPageOffset flag setting in toParquetMetadata method #1340

Open abhishekd0907 opened 2 months ago

abhishekd0907 commented 2 months ago

Issue

toParquetMetadata method converts org.apache.parquet.hadoop.metadata.ParquetMetadata to org.apache.parquet.format.FileMetaData but this does not set the dictionary page offset bit in FileMetaData.

When a FileMetaData object is serialized while writing to the footer and then deserialized, the dictionary offset is lost as the dictionary page offset bit was never set.

PARQUET-1850  tried to fix this but it did only a partial fix.

It sets setDictionary_page_offset only if getEncodingStats are present

if (columnMetaData.getEncodingStats() != null
&& columnMetaData.getEncodingStats().hasDictionaryPages())
{ metaData.setDictionary_page_offset(columnMetaData.getDictionaryPageOffset()); } 

Fix

However, it should setDictionary_page_offset even when getEncodingStats are not present but encodings are present.

It should use the implementation in ColumnChunkMetatdata below:

public boolean hasDictionaryPage() {
EncodingStats stats = getEncodingStats();
if (stats != null) { 
return stats.hasDictionaryPages() && stats.hasDictionaryEncodedPages(); 
}

Set<Encoding> encodings = getEncodings();
return (encodings.contains(PLAIN_DICTIONARY) || encodings.contains(RLE_DICTIONARY));
} 

So new change in ParquetMetadataCOnvertor should be like:

if (columnMetaData.hasDictionaryPage()) { metaData.setDictionary_page_offset(columnMetaData.getDictionaryPageOffset()); }

Test

Added a new UT for this scenario. All existing UTs should also pass.

wgtmac commented 1 month ago
Error:  Failed to execute goal com.diffplug.spotless:spotless-maven-plugin:2.30.0:check (default) on project parquet-hadoop: The following files had format violations:
Error:      src/test/java/org/apache/parquet/format/converter/TestParquetMetadataConverter.java
Error:          @@ -1277,9 +1277,8 @@
Error:           ····return·createParquetMetaData(dicEncoding,·dataEncoding,·true);
Error:           ··}
Error:           
Error:          -
Error:          -··private·static·ParquetMetadata·createParquetMetaData(Encoding·dicEncoding,·Encoding·dataEncoding,
Error:          -·······················································boolean·includeDicStats)·{
Error:          +··private·static·ParquetMetadata·createParquetMetaData(
Error:          +······Encoding·dicEncoding,·Encoding·dataEncoding,·boolean·includeDicStats)·{
Error:           ····MessageType·schema·=·parseMessageType("message·schema·{·optional·int32·col·(INT_32);·}");
Error:           ····org.apache.parquet.hadoop.metadata.FileMetaData·fileMetaData·=
Error:           ········new·org.apache.parquet.hadoop.metadata.FileMetaData(schema,·new·HashMap<String,·String>(),·null);
Error:  Run 'mvn spotless:apply' to fix these violations.
Error:  -> [Help 1]

Please make the CI happy.