apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[Feature] Optimize heap memory usage during full compaction of manifest files #3590

Closed. codeTai closed this issue 1 week ago

codeTai commented 1 week ago

Search before asking

Motivation

When committing a snapshot triggers a full compaction of the manifest files, we want to reduce the TaskManager's heap memory usage.

Solution

Writing files to HDFS is slow, but reading them is fast, so entries read from several manifest files at once pile up in memory faster than they can be written out. The code is optimized to avoid reading multiple manifest files at the same time, so that data no longer accumulates in memory.
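The idea can be illustrated with a minimal Java sketch. It is not Paimon code: the class `BatchedManifestRead`, the method `batchIterable`, and the `reader` function are hypothetical names chosen for illustration, and manifest files are modeled as plain strings. The sketch assumes that reading a file returns a list of entries, and it only loads the next batch of files once the consumer (the slow writer) has drained the previous one, so at most `batchSize` files' worth of entries sit on the heap at any time.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical sketch (not Paimon code): read inputs lazily in bounded batches. */
final class BatchedManifestRead {

    /**
     * Returns an Iterable that applies {@code reader} to at most {@code batchSize}
     * inputs at a time. A new batch is read only after the previous batch has been
     * fully consumed, which bounds how many entries are held in memory.
     */
    static <U, T> Iterable<T> batchIterable(
            List<U> inputs, int batchSize, Function<U, List<T>> reader) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        return () -> new Iterator<T>() {
            private int position = 0; // index of the next input that has not been read
            private Iterator<T> current = Collections.emptyIterator();

            @Override
            public boolean hasNext() {
                // Load the next batch only when the current one is exhausted.
                while (!current.hasNext() && position < inputs.size()) {
                    int end = Math.min(position + batchSize, inputs.size());
                    // Files inside one batch are read in parallel; order is preserved.
                    List<T> batch = inputs.subList(position, end).parallelStream()
                            .map(reader)
                            .flatMap(List::stream)
                            .collect(Collectors.toList());
                    position = end;
                    current = batch.iterator();
                }
                return current.hasNext();
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return current.next();
            }
        };
    }

    public static void main(String[] args) {
        // Stand-ins for manifest file names; the reader fakes two entries per file.
        List<String> files = Arrays.asList("manifest-0", "manifest-1", "manifest-2");
        Function<String, List<String>> reader =
                file -> Arrays.asList(file + "/entry-a", file + "/entry-b");

        // The loop body plays the role of the slow writer.
        for (String entry : batchIterable(files, 2, reader)) {
            System.out.println(entry);
        }
    }
}
```

A smaller `batchSize` lowers peak heap usage at the cost of less read parallelism; with `batchSize = 1` the files are read strictly one after another.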

Anything else?

Part of the debug log: (image)

Heap memory usage before optimization: (image)

Heap memory usage after optimization: (image)

Are you willing to submit a PR?

JingsongLi commented 1 week ago

Hi @codeTai, I created #3598, which uses ScanParallelExecutor.parallelismBatchIterable. Can you validate this PR in your testing environment?

JingsongLi commented 1 week ago

@codeTai If #3598 cannot solve your problem, please create a new pull request.

codeTai commented 1 week ago

Ok, I'll test it later.

codeTai commented 1 week ago

ScanParallelExecutor.parallelismBatchIterable can solve my problem, thanks.