Open abhishekd0907 opened 2 months ago
Error: Failed to execute goal com.diffplug.spotless:spotless-maven-plugin:2.30.0:check (default) on project parquet-hadoop: The following files had format violations:
Error: src/test/java/org/apache/parquet/format/converter/TestParquetMetadataConverter.java
Error: @@ -1277,9 +1277,8 @@
Error: ····return·createParquetMetaData(dicEncoding,·dataEncoding,·true);
Error: ··}
Error:
Error: -
Error: -··private·static·ParquetMetadata·createParquetMetaData(Encoding·dicEncoding,·Encoding·dataEncoding,
Error: -·······················································boolean·includeDicStats)·{
Error: +··private·static·ParquetMetadata·createParquetMetaData(
Error: +······Encoding·dicEncoding,·Encoding·dataEncoding,·boolean·includeDicStats)·{
Error: ····MessageType·schema·=·parseMessageType("message·schema·{·optional·int32·col·(INT_32);·}");
Error: ····org.apache.parquet.hadoop.metadata.FileMetaData·fileMetaData·=
Error: ········new·org.apache.parquet.hadoop.metadata.FileMetaData(schema,·new·HashMap<String,·String>(),·null);
Error: Run 'mvn spotless:apply' to fix these violations.
Error: -> [Help 1]
Please make the CI happy.
Issue
toParquetMetadata method converts org.apache.parquet.hadoop.metadata.ParquetMetadata to org.apache.parquet.format.FileMetaData but this does not set the dictionary page offset bit in FileMetaData.
When a FileMetaData object is serialized while writing to the footer and then deserialized, the dictionary offset is lost as the dictionary page offset bit was never set.
PARQUET-1850 tried to fix this but it did only a partial fix.
It sets setDictionary_page_offset only if getEncodingStats are present
Fix
However, it should setDictionary_page_offset even when getEncodingStats are not present but encodings are present.
It should use the implementation in ColumnChunkMetatdata below:
So new change in ParquetMetadataCOnvertor should be like:
Test
Added a new UT for this scenario. All existing UTs should also pass.