apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Encrypted parquet files can't have more than 32767 pages per chunk: 32768 #1687

Open asfimport opened 7 months ago

asfimport commented 7 months ago

When we were writing an encrypted file, we encountered the following error:


Encrypted parquet files can't have more than 32767 pages per chunk: 32768

 

Error Stack:


org.apache.parquet.crypto.ParquetCryptoRuntimeException: Encrypted parquet files can't have more than 32767 pages per chunk: 32768

        at org.apache.parquet.crypto.AesCipher.quickUpdatePageAAD(AesCipher.java:131)
        at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writePage(ColumnChunkPageWriteStore.java:178)
        at org.apache.parquet.column.impl.ColumnWriterV1.writePage(ColumnWriterV1.java:67)
        at org.apache.parquet.column.impl.ColumnWriterBase.writePage(ColumnWriterBase.java:392)
        at org.apache.parquet.column.impl.ColumnWriteStoreBase.sizeCheck(ColumnWriteStoreBase.java:231)
        at org.apache.parquet.column.impl.ColumnWriteStoreBase.endRecord(ColumnWriteStoreBase.java:216)
        at org.apache.parquet.column.impl.ColumnWriteStoreV1.endRecord(ColumnWriteStoreV1.java:29)
        at org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endMessage(MessageColumnIO.java:295)

 

Reason: The getBufferedSize method of FallbackValuesWriter returns the raw (plain-encoded) data size when deciding whether to flush a page, so the actual size of the written page can be much smaller due to dictionary encoding. This prevents pages from becoming too big when a fallback happens, but it can also produce far too many pages in a single column chunk. On the other hand, the encryption module only supports up to 32767 pages per chunk, because the page ordinal is stored as a Short in the AAD.

Reproduce: reproduce.zip
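
For context, here is a minimal sketch of the kind of check behind this error, assuming (as the report itself describes) that the page ordinal occupies the trailing two bytes of the page AAD. This is an illustration, not the actual parquet-java source:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class PageOrdinalAadSketch {
  // Illustrative only: write the page ordinal into the trailing 2 bytes of a
  // page AAD. Because the ordinal is stored as a short, ordinals above 32767
  // (Short.MAX_VALUE) cannot be represented and must be rejected.
  public static void quickUpdatePageAAD(byte[] pageAAD, int newPageOrdinal) {
    if (newPageOrdinal > Short.MAX_VALUE) {
      throw new IllegalArgumentException(
          "Encrypted parquet files can't have more than "
              + Short.MAX_VALUE + " pages per chunk: " + newPageOrdinal);
    }
    byte[] ordinalBytes = ByteBuffer.allocate(2)
        .order(ByteOrder.LITTLE_ENDIAN)
        .putShort((short) newPageOrdinal)
        .array();
    System.arraycopy(ordinalBytes, 0, pageAAD, pageAAD.length - 2, 2);
  }
}

Page 32768 in the error above is the first ordinal that no longer fits into a short, which is exactly where the writer fails.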

Reporter: Ence Wang

Original Issue Attachments: reproduce.zip

Note: This issue was originally created as PARQUET-2424. Please see the migration documentation for further details.

asfimport commented 7 months ago

Gang Wu / @wgtmac: This is by design. Could you try to limit the max size or the number of rows per row group?
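
As an illustration of this suggestion, here is a hedged sketch of capping the row-group size through the parquet-hadoop writer builder; the output path and the single-column schema are placeholders, and the actual values depend on the workload:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class SmallRowGroupWriterSketch {
  public static void main(String[] args) throws Exception {
    // Toy single-column schema, mirroring the task_log column from the report.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message test { optional binary task_log (UTF8); }");

    // Smaller row groups mean fewer pages per column chunk, which keeps the
    // page ordinal below the 32767 limit imposed by the encryption AAD.
    try (ParquetWriter<Group> writer =
        ExampleParquetWriter.builder(new Path("/tmp/test-output.parquet")) // placeholder path
            .withConf(new Configuration())
            .withType(schema)
            .withRowGroupSize(16 * 1024 * 1024) // e.g. 16 MB row groups instead of the 128 MB default
            .build()) {
      // write records here
    }
  }
}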

asfimport commented 7 months ago

Ence Wang: Yes, tuning parameters can be a solution; I'm just wondering if we can avoid it in advance. It happens occasionally, depending on the data distribution.

asfimport commented 7 months ago

Gang Wu / @wgtmac: Do you have any suggestions? @ggershinsky

asfimport commented 7 months ago

Gidon Gershinsky / @ggershinsky: Sure, I'll have a look.

asfimport commented 7 months ago

Gidon Gershinsky / @ggershinsky: We might be able to double the limit to 64K (will need to check), but the question is whether that will be sufficient for your use case [~encewang]. Can you find / estimate the max number of pages per column chunk in your data (without encryption)?

asfimport commented 7 months ago

Ence Wang: @ggershinsky For my case, 64K is still not sufficient; there are 102K pages in a single column chunk.

# reproduce.zip/test-input.parquet

row group 0 
--------------------------------------------------------------------------------
task_log:  BINARY UNCOMPRESSED DO:4 FPO:30669 SZ:3399502/3399502/1.00  [more]... ST:[no stats for this column]    task_log TV=10208503 RL=0 DL=1 DS: 2 DE:PLAIN_DICTIONARY
    ----------------------------------------------------------------------------
    page 0:                             DLE:RLE RLE:BIT_PACKED VLE:PLA [more]... CRC:[verified] VC:100
    page 1:                             DLE:RLE RLE:BIT_PACKED VLE:PLA [more]... CRC:[verified] VC:100
    page 2:                             DLE:RLE RLE:BIT_PACKED VLE:PLA [more]... CRC:[verified] VC:100
    page 3:                             DLE:RLE RLE:BIT_PACKED VLE:PLA [more]... CRC:[verified] VC:100    

    ....

    page 102080:                        DLE:RLE RLE:BIT_PACKED VLE:PLA [more]... CRC:[verified] VC:100
    page 102081:                        DLE:RLE RLE:BIT_PACKED VLE:PLA [more]... CRC:[verified] VC:100
    page 102082:                        DLE:RLE RLE:BIT_PACKED VLE:PLA [more]... CRC:[verified] VC:100
    page 102083:                        DLE:RLE RLE:BIT_PACKED VLE:PLA [more]... CRC:[verified] VC:100
    page 102084:                        DLE:RLE RLE:BIT_PACKED VLE:PLA [more]... CRC:[verified] VC:100
    page 102085:                        DLE:RLE RLE:BIT_PACKED VLE:PLA [more]... CRC:[verified] VC:3 


asfimport commented 7 months ago

Gidon Gershinsky / @ggershinsky: Thank you, I see. What is the average size of a page?

asfimport commented 7 months ago

Ence Wang: The average physical size of a page is about 40 bytes. (The final file is 4.5M, contains only one column, and is divided into 102K pages.)

asfimport commented 7 months ago

Gidon Gershinsky / @ggershinsky: Yep, makes sense. The encryption performance depends on the page size. It runs fastest with pages of 100KB or larger, at a few gigabytes / sec. When a page is a few dozen bytes, the encryption throughput drops to something like 70 megabytes / sec - two orders of magnitude slower.

Additionally, there are size implications, unrelated to encryption. Each page has a page header, which is also a few dozen bytes. Plus some column index metadata. So using very small pages basically means doubling the file size.

Supporting 100K+ pages per column chunk would require changing the Parquet format specification and updating the code in multiple implementations. Not impossible, but still challenging. Given the performance issues triggered by using this many pages, I think a better course of action would be to configure the workload to create larger / fewer pages.

asfimport commented 7 months ago

Gidon Gershinsky / @ggershinsky: Btw, there could be a workaround that enables encryption with this page size - using multiple row groups, so that each column chunk has fewer than 32K pages. But again, I'd recommend enlarging the pages instead.

asfimport commented 7 months ago

Ence Wang: Yes, I increased parquet.page.size from 1M to 10M for this case and the error is gone.

Additionally, I think FallbackValuesWriter is flawed to some extent, because it makes the pages too small just to prevent them from becoming too big in case a fallback happens.

If we temporarily ignore the fallback concern and change FallbackValuesWriter::getBufferedSize to return the actual encoded size, the resulting file was tested to have larger pages and a much smaller file size (192K) compared to the current one (4.5M).

But if we take fallback into account, it is difficult to satisfy both concerns under the current writing framework.
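
For reference, a minimal sketch of this tuning, assuming the job reads its writer settings from a Hadoop Configuration; the 10M value mirrors what was used above, and parquet.page.size is the standard parquet-hadoop key:

import org.apache.hadoop.conf.Configuration;

public class PageSizeTuningSketch {
  public static Configuration withLargerPages() {
    Configuration conf = new Configuration();
    // Raise the target page size from the 1 MB default to 10 MB, so the
    // over-estimated (plain-encoded) size check flushes pages far less often
    // and the per-chunk page count stays below 32767.
    conf.setInt("parquet.page.size", 10 * 1024 * 1024);
    return conf;
  }
}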

asfimport commented 7 months ago

Gidon Gershinsky / @ggershinsky: If the file size becomes much smaller, then probably most of the 40 bytes is taken by the page header, and the page itself is only a few bytes.

As for the fallback - I'm less familiar with this mechanism. Cc @wgtmac

asfimport commented 7 months ago

Gang Wu / @wgtmac: I'm not sure if I understand correctly. The FallbackValuesWriter accumulates rawDataByteSize as if it were PLAIN-encoded, which looks correct to me. Did you mean that FallbackValuesWriter uses the plain-encoded size to estimate the page size even if the pages end up dictionary-encoded?

asfimport commented 7 months ago

Ence Wang: Yes, that's why the pages are so small: the page size is over-estimated.

asfimport commented 7 months ago

Gang Wu / @wgtmac: Thanks for confirming! I'm not sure if we can fix this by using min(dict_encoded_size, fallback_to_plain_encoded_size) for each page limit check. Do you want to try to fix this? [~encewang]  
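
To illustrate the idea, here is a hedged sketch of what such an estimate could look like; the field names are assumptions for illustration and do not necessarily match the real FallbackValuesWriter members:

import org.apache.parquet.column.values.ValuesWriter;

// Hypothetical wrapper, for illustration only: it mirrors the shape of a
// fallback writer and reports the smaller of the two size estimates from
// getBufferedSize(), so tiny dictionary-encoded pages are not flushed early.
public abstract class MinSizeFallbackWriterSketch extends ValuesWriter {
  protected ValuesWriter currentWriter; // the writer currently in use (e.g. dictionary)
  protected long rawDataByteSize;       // running estimate of the PLAIN-encoded size

  @Override
  public long getBufferedSize() {
    // min(dict_encoded_size, fallback_to_plain_encoded_size)
    return Math.min(currentWriter.getBufferedSize(), rawDataByteSize);
  }
}

As the next comment points out, the tradeoff is that buffering by the smaller estimate admits more raw values, which is exactly the OOM concern when a fallback forces re-encoding to PLAIN.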

asfimport commented 7 months ago

Ence Wang: If we use min(dict_encoded_size, fallback_to_plain_encoded_size) for each page limit check, it should work fine when no fallback happens.

But if the fallback actually happens, it brings a risk of OOM, because the values encoded with the dictionary will be re-encoded to plain and the in-memory buffer might expand significantly. That's why the current design chooses to over-estimate the page size: it is a preventive strategy to avoid OOM when a fallback happens.

To solve this issue completely, I think we need to redesign the current fallback mechanism, to estimate the page size precisely while getting rid of the OOM risk.

I will try to find a quick fix first that avoids this error without requiring user awareness.