Nullifying one column while encrypting a different column with ParquetRewriter fails. A related test exists, but it nullifies and encrypts the same column, so it does not reproduce the bug. To reproduce it, change one line in the testNullifyAndEncryptColumn() method of ParquetRewriterTest, replacing maskColumns.put("DocId", MaskMode.NULLIFY); with maskColumns.put("Links.Forward", MaskMode.NULLIFY); and run the test. It then fails with the exception below:
org.apache.parquet.crypto.ParquetCryptoRuntimeException: Column ordinal doesnt match [Links, Forward]: 0, 6
at org.apache.parquet.crypto.InternalFileEncryptor.getColumnSetup(InternalFileEncryptor.java:92)
at org.apache.parquet.hadoop.ColumnChunkPageWriteStore.<init>(ColumnChunkPageWriteStore.java:634)
at org.apache.parquet.hadoop.rewrite.ParquetRewriter.nullifyColumn(ParquetRewriter.java:889)
at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlock(ParquetRewriter.java:445)
at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlocks(ParquetRewriter.java:395)
at org.apache.parquet.hadoop.rewrite.ParquetRewriterTest.testNullifyAndEncryptColumn(ParquetRewriterTest.java:474)
at org.apache.parquet.hadoop.rewrite.ParquetRewriterTest.testNullifyEncryptSingleFile(ParquetRewriterTest.java:521)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
Issue root cause
The failure occurs because nullification creates a single-column schema, MessageType newSchema = newSchema(schema, descriptor). This is needed because only the specified column should be nullified, so a custom schema is built for that purpose. However, the default encryptor created during ParquetRewriter construction cannot be reused with this custom schema: it expects the main output schema that was used when the ParquetRewriter was constructed. InternalFileEncryptor performs schema checks, and they fail because of the schema discrepancy.
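The schema discrepancy described above can be illustrated with a small, dependency-free toy model. This is only a sketch and not the Parquet implementation: ToyEncryptor, openWriteStore and nullify are hypothetical names, the seven-column layout is an assumption chosen so that Links.Forward lands at ordinal 6 as in the reported message, and the assumption that the encryptor has already seen the full schema before nullification starts is a simplification.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of an ordinal check; these are NOT the real Parquet classes.
final class OrdinalMismatchSketch {

    /** Hypothetical stand-in for an encryptor that is bound to one schema. */
    static final class ToyEncryptor {
        private final Map<String, Integer> ordinalByPath = new HashMap<>();

        /** Registers a column on first use; later lookups must pass the same ordinal. */
        void getColumnSetup(String path, int ordinal) {
            Integer known = ordinalByPath.putIfAbsent(path, ordinal);
            if (known != null && known != ordinal) {
                throw new IllegalStateException(
                        "Column ordinal doesnt match [" + path + "]: " + ordinal + ", " + known);
            }
        }
    }

    /** Mimics a write store: looks up every leaf column of the schema by its position. */
    static void openWriteStore(ToyEncryptor encryptor, List<String> schema) {
        for (int ordinal = 0; ordinal < schema.size(); ordinal++) {
            encryptor.getColumnSetup(schema.get(ordinal), ordinal);
        }
    }

    /** Returns the mismatch message, or null when the shared encryptor accepts the column. */
    static String nullify(List<String> fullSchema, String column) {
        ToyEncryptor encryptor = new ToyEncryptor();
        openWriteStore(encryptor, fullSchema); // assumed: encryptor first sees the main output schema
        try {
            // Single-column schema: the nullified column is always at ordinal 0 here.
            openWriteStore(encryptor, List.of(column));
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Assumed leaf-column order; the real test schema may differ.
        List<String> fullSchema = List.of(
                "DocId", "Name", "Gender", "FloatFraction",
                "DoubleFraction", "Links.Backward", "Links.Forward");

        System.out.println(nullify(fullSchema, "DocId"));         // prints "null"
        System.out.println(nullify(fullSchema, "Links.Forward")); // prints the mismatch message
    }
}
```

In this model, DocId passes only because it sits at position 0 in both the full schema and the single-column schema, while Links.Forward is at position 6 in one and 0 in the other. That would be consistent with the "0, 6" in the reported exception, though whether the real check behaves exactly this way is not confirmed here.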
Describe the bug, including details regarding any error messages, version, and platform.
This issue was previously reported in PR: PARQUET-2430: Add parquet joiner v2 #1335
Component(s)
Core