asfimport opened 9 months ago
Gang Wu / @wgtmac: This is by design. Could you try to limit max size or number of rows per row group?
Ence Wang: Yes, tuning parameters can be a solution, just wondering if we can avoid it in advance. It happens occasionally, depends on the data distribution.
Gidon Gershinsky / @ggershinsky: Sure, I'll have a look.
Gidon Gershinsky / @ggershinsky: We might be able to double the limit to 64K (will need to check), but the question is whether that would be sufficient for your use case [~encewang]? Can you find or estimate the maximum number of pages per column chunk in your data (without encryption)?
Ence Wang: @ggershinsky For my case, 64K is still not sufficient; there are 102K pages in a single column chunk.
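The page count follows directly from the dump above: with TV=10208503 values in the chunk and VC=100 values per page, a ceiling division reproduces the 102086 pages (ordinals 0..102085, the last page holding the remaining 3 values). Illustrative arithmetic only:

```java
public class PageCount {
    // Ceiling division: number of pages needed to hold totalValues
    // when each page holds at most valuesPerPage values.
    static long pagesNeeded(long totalValues, long valuesPerPage) {
        return (totalValues + valuesPerPage - 1) / valuesPerPage;
    }

    public static void main(String[] args) {
        long total = 10_208_503L;   // TV from the column-chunk dump
        long perPage = 100L;        // VC per page
        long pages = pagesNeeded(total, perPage);
        long lastPageValues = total - (pages - 1) * perPage;
        System.out.println(pages);          // 102086 pages: ordinals 0..102085
        System.out.println(lastPageValues); // 3 values in the final page, as in the dump
    }
}
```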
# reproduce.zip/test-input.parquet
row group 0
--------------------------------------------------------------------------------
task_log: BINARY UNCOMPRESSED DO:4 FPO:30669 SZ:3399502/3399502/1.00 [more]... ST:[no stats for this column] task_log TV=10208503 RL=0 DL=1 DS: 2 DE:PLAIN_DICTIONARY
----------------------------------------------------------------------------
page 0: DLE:RLE RLE:BIT_PACKED VLE:PLA [more]... CRC:[verified] VC:100
page 1: DLE:RLE RLE:BIT_PACKED VLE:PLA [more]... CRC:[verified] VC:100
page 2: DLE:RLE RLE:BIT_PACKED VLE:PLA [more]... CRC:[verified] VC:100
page 3: DLE:RLE RLE:BIT_PACKED VLE:PLA [more]... CRC:[verified] VC:100
....
page 102080: DLE:RLE RLE:BIT_PACKED VLE:PLA [more]... CRC:[verified] VC:100
page 102081: DLE:RLE RLE:BIT_PACKED VLE:PLA [more]... CRC:[verified] VC:100
page 102082: DLE:RLE RLE:BIT_PACKED VLE:PLA [more]... CRC:[verified] VC:100
page 102083: DLE:RLE RLE:BIT_PACKED VLE:PLA [more]... CRC:[verified] VC:100
page 102084: DLE:RLE RLE:BIT_PACKED VLE:PLA [more]... CRC:[verified] VC:100
page 102085: DLE:RLE RLE:BIT_PACKED VLE:PLA [more]... CRC:[verified] VC:3
Gidon Gershinsky / @ggershinsky: Thank you, I see. What is the average size of a page?
Ence Wang: The average physical size of a page is about 40 bytes. (The final file size is 4.5M, which contains only one column, divided into 102K pages)
Gidon Gershinsky / @ggershinsky: Yep, makes sense. The encryption performance is dependent on the page size. It runs the fastest with 100KB or larger pages, at a few gigabytes / sec speeds. When a page is a few dozen bytes, the encryption throughput drops to something like 70 megabytes / sec - two orders of magnitude slower.
Additionally, there are size implications, unrelated to encryption. Each page has a page header, which is also a few dozen bytes. Plus some column index metadata. So using very small pages basically means doubling the file size.
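The size implication is easy to bound with the numbers from this thread. Assuming the roughly 40-byte page header mentioned above (an estimate from this discussion, not a measured constant), the headers alone account for about 4 MB of the 4.5 MB file:

```java
public class HeaderOverhead {
    // Rough per-chunk overhead of page headers alone (excludes
    // column index metadata, which adds further per-page cost).
    static long headerOverheadBytes(long pageCount, long headerBytes) {
        return pageCount * headerBytes;
    }

    public static void main(String[] args) {
        long pages = 102_086L; // pages in the dumped column chunk
        long header = 40L;     // approximate page-header size (assumption)
        // ~4 MB of header overhead in a 4.5 MB single-column file
        System.out.println(headerOverheadBytes(pages, header));
    }
}
```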
Supporting 100K+ pages per column chunk would require changing the Parquet format specification and updating the code in multiple implementations. Not impossible, but still challenging. Given the performance issues triggered by using this number of pages, I think a better course of action would be to configure the workload to create larger / fewer pages.
Gidon Gershinsky / @ggershinsky: Btw, there could be a workaround that enables encryption with this page size: using multiple row groups, so that each column chunk has fewer than 32K pages. But again, I'd recommend enlarging the pages instead.
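A sketch of that workaround, assuming the parquet-java ExampleParquetWriter builder API (the path and the schema variable here are placeholders): capping the row group size forces more, smaller row groups, so each column chunk holds fewer pages and the per-chunk page ordinal stays small.

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;

// Sketch only: smaller row groups mean fewer pages per column chunk,
// keeping each chunk's page count under the 32767 encryption limit.
ParquetWriter<Group> writer = ExampleParquetWriter
    .builder(new Path("/tmp/encrypted.parquet")) // placeholder output path
    .withType(schema)                            // hypothetical schema variable
    .withRowGroupSize(16 * 1024 * 1024L)         // e.g. 16 MB row groups instead of the default
    .build();
```

This trades away some of the scan efficiency of large row groups, which is why enlarging the pages is the preferred fix.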
Ence Wang: Yes, I increased parquet.page.size from 1M to 10M for this case and the error is gone.
Additionally, I think FallbackValuesWriter is flawed to some extent, because it makes the pages too small just to prevent them from becoming too big in case a fallback happens. If we temporarily ignore the fallback concern and change FallbackValuesWriter::getBufferedSize to return the actual encoded size, the resulting file (in my test) has larger pages and a much smaller size (192K) compared to the current one (4.5M). But if we take fallback into account, it is difficult to satisfy both sides under the current writing framework.
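To make the over-estimation concrete, here is a simplified, hypothetical model of the accounting (not the actual parquet-java code): the page-flush check charges every value at its plain-encoded width, while the bytes physically written are the dictionary-encoded ids, so highly repetitive wide values produce tiny pages.

```java
public class SizeEstimation {
    // Values accepted before the raw-size estimate hits the page limit.
    static int valuesPerPage(int pageSizeLimit, int rawBytesPerValue) {
        return pageSizeLimit / rawBytesPerValue;
    }

    // Physical size of the page actually flushed after dictionary encoding.
    static int actualPageBytes(int valuesPerPage, int dictBytesPerValue) {
        return valuesPerPage * dictBytesPerValue;
    }

    public static void main(String[] args) {
        int pageSizeLimit = 1024 * 1024; // parquet.page.size = 1M
        int rawBytesPerValue = 10_000;   // e.g. a repetitive 10 KB string (assumed)
        int dictBytesPerValue = 1;       // dictionary id per value (assumed)

        int values = valuesPerPage(pageSizeLimit, rawBytesPerValue);
        System.out.println(values);                              // ~104 values per page
        System.out.println(actualPageBytes(values, dictBytesPerValue)); // ~104 bytes: a tiny page
    }
}
```

This matches the dump above qualitatively: pages of ~100 values, each only a few dozen physical bytes.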
Gidon Gershinsky / @ggershinsky: If the file size becomes much smaller, then probably most of the 40 bytes is taken by the page header, and the page itself is only a few bytes.
As for the fallback - I'm less familiar with this mechanism. Cc @wgtmac
Gang Wu / @wgtmac: I'm not sure if I understand correctly. The FallbackValuesWriter accumulates rawDataByteSize as if it is PLAIN-encoded, which looks correct to me. Did you mean that FallbackValuesWriter uses the plain-encoded size to estimate the page size even if the pages are ultimately dictionary-encoded?
Ence Wang: Yes, that's why the pages are so small, because the page size is over-estimated.
Gang Wu / @wgtmac: Thanks for confirming! I'm not sure if we can fix this by using min(dict_encoded_size, fallback_to_plain_encoded_size) for each page limit check. Do you want to try to fix this? [~encewang]
Ence Wang: If we use min(dict_encoded_size, fallback_to_plain_encoded_size) for each page limit check, it should work fine when no fallback happens. But if a fallback actually happens, it brings a risk of OOM, because the values encoded with the dictionary will be re-encoded as plain, and the in-memory buffer might expand significantly. That's why the current design chooses to over-estimate the page size: a preventive strategy to avoid OOM when fallback happens.
To solve this issue completely, I think we need to redesign the current fallback mechanism to estimate the page size precisely while getting rid of the OOM risk. I will try to find a quick fix first to avoid this error without user awareness.
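A minimal sketch of both estimates (hypothetical names; the real FallbackValuesWriter API differs) shows the tradeoff: the min() estimate flushes pages at the right physical size when the dictionary holds, but the raw-size estimate is what bounds memory if a fallback forces re-encoding to plain.

```java
public class PageLimitCheck {
    // Current behavior (modeled): flush based on the raw, plain-encoded size.
    // Conservative against fallback-induced OOM, but produces tiny pages.
    static boolean shouldFlushCurrent(long rawSize, long limit) {
        return rawSize >= limit;
    }

    // Proposed (modeled): flush based on min(dictionary size, plain size).
    // Pages reach the intended physical size, but if a fallback later
    // re-encodes the buffered values to plain, memory can grow to rawSize.
    static boolean shouldFlushProposed(long dictSize, long rawSize, long limit) {
        return Math.min(dictSize, rawSize) >= limit;
    }

    public static void main(String[] args) {
        long limit = 1_048_576L;   // 1 MB page limit
        long rawSize = 1_050_000L; // plain-encoded size of buffered values (assumed)
        long dictSize = 4_200L;    // same values, dictionary-encoded (assumed)
        System.out.println(shouldFlushCurrent(rawSize, limit));            // true: flushes a tiny page
        System.out.println(shouldFlushProposed(dictSize, rawSize, limit)); // false: keeps buffering
    }
}
```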
When we were writing an encrypted file, we encountered the following error:
Error Stack:
Reasons: The getBufferedSize method of FallbackValuesWriter returns the raw data size to decide whether to flush the page, so the actual size of the written page can be much smaller due to dictionary encoding. This prevents the page from being too big when a fallback happens, but it can also produce too many pages in a single column chunk. On the other side, the encryption module only supports up to 32767 pages per chunk, as we use Short to store the page ordinal as part of the AAD.
Reproduce: reproduce.zip
Reporter: Ence Wang
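The 32767 cap follows from storing the page ordinal in a two-byte short within the AAD suffix. A self-contained sketch of such a bound check (modeled on the idea described here, not copied from parquet-java):

```java
public class PageOrdinalLimit {
    // The AAD suffix stores the page ordinal as a 2-byte short, so the
    // largest representable ordinal is Short.MAX_VALUE = 32767.
    static short toPageOrdinal(int pageOrdinal) {
        if (pageOrdinal > Short.MAX_VALUE) {
            throw new IllegalArgumentException(
                "Encrypted column chunks support at most " + Short.MAX_VALUE
                    + " pages, got ordinal " + pageOrdinal);
        }
        return (short) pageOrdinal;
    }

    public static void main(String[] args) {
        System.out.println(toPageOrdinal(32767)); // fits in a short
        try {
            toPageOrdinal(102085); // the last page ordinal from the dump above
        } catch (IllegalArgumentException e) {
            System.out.println("rejected"); // overflows the 2-byte field
        }
    }
}
```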
Original Issue Attachments:
Note: This issue was originally created as PARQUET-2424. Please see the migration documentation for further details.