When you try to nullify and encrypt different columns using ParquetRewriter it fails. There is a related test but it nullifies and encrypts the same column which doesn't reproduce a bug. The bug can be reproduced by changing a single line in ParquetRewriterTest from maskColumns.put("DocId", MaskMode.NULLIFY); to maskColumns.put("Links.Forward", MaskMode.NULLIFY); in testNullifyAndEncryptColumn() method, If you do that the test start to fail with bellow exception:
org.apache.parquet.crypto.ParquetCryptoRuntimeException: Column ordinal doesnt match [Links, Forward]: 0, 6
at org.apache.parquet.crypto.InternalFileEncryptor.getColumnSetup(InternalFileEncryptor.java:92)
at org.apache.parquet.hadoop.ColumnChunkPageWriteStore.<init>(ColumnChunkPageWriteStore.java:634)
at org.apache.parquet.hadoop.rewrite.ParquetRewriter.nullifyColumn(ParquetRewriter.java:889)
at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlock(ParquetRewriter.java:445)
at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlocks(ParquetRewriter.java:395)
at org.apache.parquet.hadoop.rewrite.ParquetRewriterTest.testNullifyAndEncryptColumn(ParquetRewriterTest.java:474)
at org.apache.parquet.hadoop.rewrite.ParquetRewriterTest.testNullifyEncryptSingleFile(ParquetRewriterTest.java:521)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Issue root cause
The reason of a failure is that during the nullification we create a single column schema MessageType newSchema = newSchema(schema, descriptor), this is needed because we need to nullify only a specified column, so we create a custom schema for that purpose. But we can't reuse a default encryptor created during ParquetRewriter construction with that new custom schema because default encryptor performs encrypted columns metadata checks internally and when it does it fails because of schema discrepancy.
GitHub issue: ParquetRewriter fails when you try to nullify and encrypt 2 different columns #3026 This issue was previously reported in PR: PARQUET-2430: Add parquet joiner v2 #1335
Issue description
When you try to nullify and encrypt different columns using ParquetRewriter it fails. There is a related test but it nullifies and encrypts the same column which doesn't reproduce a bug. The bug can be reproduced by changing a single line in
ParquetRewriterTest
frommaskColumns.put("DocId", MaskMode.NULLIFY);
tomaskColumns.put("Links.Forward", MaskMode.NULLIFY);
intestNullifyAndEncryptColumn()
method, If you do that the test start to fail with bellow exception:Issue root cause
The reason of a failure is that during the nullification we create a single column schema
MessageType newSchema = newSchema(schema, descriptor)
, this is needed because we need to nullify only a specified column, so we create a custom schema for that purpose. But we can't reuse a default encryptor created during ParquetRewriter construction with that new custom schema because default encryptor performs encrypted columns metadata checks internally and when it does it fails because of schema discrepancy.Close #3026