apache / parquet-java

ParquetRewriter fails when you try to nullify and encrypt 2 different columns #3026

Closed MaxNevermind closed 1 month ago

MaxNevermind commented 1 month ago

Describe the bug, including details regarding any error messages, version, and platform.

This issue was previously reported in PR: PARQUET-2430: Add parquet joiner v2 #1335

Issue description

Nullifying one column and encrypting a different column with ParquetRewriter fails. There is a related test, but it nullifies and encrypts the same column, so it does not reproduce the bug. To reproduce it, change a single line in the testNullifyAndEncryptColumn() method of ParquetRewriterTest from maskColumns.put("DocId", MaskMode.NULLIFY); to maskColumns.put("Links.Forward", MaskMode.NULLIFY);. The test then fails with the exception below:

org.apache.parquet.crypto.ParquetCryptoRuntimeException: Column ordinal doesnt match [Links, Forward]: 0, 6

    at org.apache.parquet.crypto.InternalFileEncryptor.getColumnSetup(InternalFileEncryptor.java:92)
    at org.apache.parquet.hadoop.ColumnChunkPageWriteStore.<init>(ColumnChunkPageWriteStore.java:634)
    at org.apache.parquet.hadoop.rewrite.ParquetRewriter.nullifyColumn(ParquetRewriter.java:889)
    at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlock(ParquetRewriter.java:445)
    at org.apache.parquet.hadoop.rewrite.ParquetRewriter.processBlocks(ParquetRewriter.java:395)
    at org.apache.parquet.hadoop.rewrite.ParquetRewriterTest.testNullifyAndEncryptColumn(ParquetRewriterTest.java:474)
    at org.apache.parquet.hadoop.rewrite.ParquetRewriterTest.testNullifyEncryptSingleFile(ParquetRewriterTest.java:521)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

Issue root cause

The failure occurs because nullification creates a single-column schema, MessageType newSchema = newSchema(schema, descriptor). This custom schema is needed because only the specified column should be nullified. However, the default encryptor created during ParquetRewriter construction cannot be reused with that custom schema: it expects the main output schema that was used during ParquetRewriter construction. InternalFileEncryptor performs schema checks, and they fail because of the schema discrepancy.
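
To illustrate the kind of check that trips here, below is a minimal toy model in plain Java. It is not the library's code: ToyEncryptor, openWriteStore, tryNullify and the FULL_SCHEMA column list are hypothetical names invented for this sketch, and the column order is an assumption chosen so that Links.Forward sits at position 6, matching the "0, 6" in the exception message. The sketch only assumes that an encryptor remembers the ordinal first seen for each column path and rejects a different ordinal later.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OrdinalMismatchSketch {

    // Hypothetical column order of the main output schema (an assumption for this sketch).
    static final List<String> FULL_SCHEMA = Arrays.asList(
            "DocId", "Name", "Gender", "FloatFraction", "DoubleFraction",
            "Links.Backward", "Links.Forward");

    /** Toy stand-in for an encryptor that remembers the first ordinal seen per column path. */
    static final class ToyEncryptor {
        private final Map<String, Integer> ordinals = new HashMap<>();

        void registerColumn(String path, int ordinal) {
            Integer existing = ordinals.putIfAbsent(path, ordinal);
            if (existing != null && existing != ordinal) {
                // Message format mimics the one in the report: requested ordinal, then stored one.
                throw new IllegalStateException(
                        "Column ordinal doesnt match " + path + ": " + ordinal + ", " + existing);
            }
        }
    }

    /** Registers every column of a schema under its position within that schema. */
    static void openWriteStore(ToyEncryptor encryptor, List<String> schemaColumns) {
        for (int i = 0; i < schemaColumns.size(); i++) {
            encryptor.registerColumn(schemaColumns.get(i), i);
        }
    }

    /** Reuses the given encryptor with a single-column schema, as nullification does in this model. */
    static String tryNullify(ToyEncryptor encryptor, String column) {
        try {
            openWriteStore(encryptor, Collections.singletonList(column));
            return "ok";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        ToyEncryptor shared = new ToyEncryptor();
        openWriteStore(shared, FULL_SCHEMA); // encryptor first sees the main output schema

        // DocId is at position 0 in both schemas, so the check passes.
        System.out.println(tryNullify(shared, "DocId"));
        // Links.Forward is at position 6 in the full schema but 0 in the single-column one.
        System.out.println(tryNullify(shared, "Links.Forward"));
    }
}
```

In this model, nullifying DocId passes because its position is 0 in both the full schema and the single-column schema, while Links.Forward is rejected. That is consistent with the existing test not reproducing the bug, though the exact reason in the real code path is an inference from the exception message rather than something confirmed here.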

Component(s)

Core