apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

GH-3026: ParquetRewriter fails when you try to nullify and encrypt 2 different columns #3027

Closed MaxNevermind closed 1 month ago

MaxNevermind commented 1 month ago

GitHub issue: ParquetRewriter fails when you try to nullify and encrypt 2 different columns #3026. This issue was previously reported in PR: PARQUET-2430: Add parquet joiner v2 #1335.

Issue description

When you try to nullify and encrypt different columns using ParquetRewriter, it fails. There is a related test, but it nullifies and encrypts the same column, which does not reproduce the bug. The bug can be reproduced by changing a single line in the testNullifyAndEncryptColumn() method of ParquetRewriterTest, from maskColumns.put("DocId", MaskMode.NULLIFY); to maskColumns.put("Links.Forward", MaskMode.NULLIFY);. With that change, the test starts to fail with the exception below:

org.apache.parquet.crypto.ParquetCryptoRuntimeException: Column ordinal doesnt match [Links, Forward]: 0, 6

    at org.apache.parquet.crypto.InternalFileEncryptor.getColumnSetup(InternalFileEncryptor.java:92)
    at org.apache.parquet.hadoop.ColumnChunkPageWriteStore.<init>(ColumnChunkPageWriteStore.java:634)
    at org.apache.parquet.hadoop.rewrite.ParquetRewriter.nullifyColumn(ParquetRewriter.java:889)
    at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlock(ParquetRewriter.java:445)
    at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlocks(ParquetRewriter.java:395)
    at org.apache.parquet.hadoop.rewrite.ParquetRewriterTest.testNullifyAndEncryptColumn(ParquetRewriterTest.java:474)
    at org.apache.parquet.hadoop.rewrite.ParquetRewriterTest.testNullifyEncryptSingleFile(ParquetRewriterTest.java:521)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

Issue root cause

The failure occurs because during nullification we create a single-column schema, MessageType newSchema = newSchema(schema, descriptor). This is needed because only the specified column must be nullified, so we create a custom schema for that purpose. However, we can't reuse the default encryptor created during ParquetRewriter construction with that new custom schema: the default encryptor checks the metadata of encrypted columns internally, and that check fails because of the schema discrepancy.
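To illustrate the mismatch, here is a minimal, self-contained sketch. It is a toy model, not the parquet-java code: OrdinalCheckSketch, SketchEncryptor, and the col1..col5 placeholder names are hypothetical, and the positions 0 for DocId and 6 for Links.Forward are assumed from the ordinals in the exception message. The sketch only shows why a column that sits at position 0 in the full schema passes the check, while any other column fails once it is re-registered through a single-column schema.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OrdinalCheckSketch {

    // Assumed layout: only DocId and Links.Forward come from the issue text;
    // col1..col5 are placeholders so that Links.Forward lands at ordinal 6.
    static final List<String> FULL_SCHEMA = Arrays.asList(
            "DocId", "col1", "col2", "col3", "col4", "col5", "Links.Forward");

    // Hypothetical stand-in for an encryptor that remembers each column's ordinal.
    static final class SketchEncryptor {
        private final Map<String, Integer> ordinals = new HashMap<>();

        // Remember the ordinal on first sight; reject a different ordinal later.
        void checkColumn(String path, int ordinal) {
            Integer known = ordinals.putIfAbsent(path, ordinal);
            if (known != null && known != ordinal) {
                throw new IllegalStateException(
                        "Column ordinal doesnt match " + path + ": " + ordinal + ", " + known);
            }
        }
    }

    // Registers every column of a schema using its position as the ordinal.
    static void registerAll(SketchEncryptor encryptor, List<String> schema) {
        for (int i = 0; i < schema.size(); i++) {
            encryptor.checkColumn(schema.get(i), i);
        }
    }

    // Models nullification: the column is re-registered through a
    // single-column schema, so its ordinal is always 0.
    static boolean nullifySucceeds(SketchEncryptor encryptor, String column) {
        try {
            registerAll(encryptor, Collections.singletonList(column));
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        SketchEncryptor shared = new SketchEncryptor();
        registerAll(shared, FULL_SCHEMA);

        System.out.println("DocId: " + nullifySucceeds(shared, "DocId"));
        System.out.println("Links.Forward: " + nullifySucceeds(shared, "Links.Forward"));
    }
}
```

In this model the existing test passes only because DocId happens to keep the same ordinal in both schemas, which would explain why nullifying and encrypting the same column does not reproduce the bug.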

Close #3026

MaxNevermind commented 1 month ago

@wgtmac This is a fix you asked for here: https://github.com/apache/parquet-java/pull/1335#issuecomment-2331821689