[Open] VitoMakarevich opened this issue 5 months ago
@VitoMakarevich Just checking: if you have lots of file groups impacted in each batch, then why not use a MERGE_ON_READ table? In your current setup, you can only try to optimize the parquet writes, which you have already tried.
Adding @xushiyan to suggest more ways to optimize performance.
+1 to use MOR to balance the ingestion speed and merge cost through compaction. There is also a new feature in 0.13.x https://hudi.apache.org/releases/release-0.13.0#simple-write-executor-as-default to improve write handle performance.
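For reference, a minimal sketch of what that combination could look like as Spark datasource options; the DataFrame, table name, path, and key/precombine fields are placeholders, and the values are assumptions for illustration rather than tuning advice:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical upsert into a MERGE_ON_READ table with the 0.13.x SIMPLE write executor.
def upsertMor(df: DataFrame): Unit = {
  df.write.format("hudi")
    .option("hoodie.table.name", "my_table")                        // placeholder table name
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")  // updates go to log files instead of rewriting base files
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "id")        // placeholder record key
    .option("hoodie.datasource.write.precombine.field", "ts")       // placeholder precombine field
    // amortize the merge cost across commits via compaction instead of per-commit rewrites
    .option("hoodie.compact.inline", "true")
    .option("hoodie.compact.inline.max.delta.commits", "5")
    // 0.13.x feature from the linked release notes: SIMPLE executor skips the bounded in-memory queue
    .option("hoodie.write.executor.type", "SIMPLE")
    .mode(SaveMode.Append)
    .save("s3://bucket/warehouse/my_table")                         // placeholder path
}
```

The trade-off is that the merge cost moves from write time to compaction (and partly to snapshot reads), which is the "balance ingestion speed and merge cost" point above.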
Hello, thanks for the suggestions! As I said, I'd like to know how I can speed up this individual part. I know MOR is an option in theory, but it's not feasible for our use case.
> We have clustering to group rows together, but it's still thousands of files affected. The 75th percentile of an individual file overwrite (a task in the `Doing partition and writing data` stage) takes ~40-60 seconds.
Based on this, I think clustering can be tuned further to rewrite files such that more updates can be targeted at the same file, reducing write amplification. Make sure your number of clustering groups is not limited to the default of 30, otherwise you miss a lot of files to cluster (see the sketch below). COW is expected to have high write amplification with heavy updates, especially if the updates are spread out over a lot of files. Also consider a better partitioning scheme so that updates are concentrated in a few partitions, if possible. Upgrade to a newer version to try configuring the executor type too.
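To make that concrete, here is a rough sketch of the clustering knobs this refers to, expressed as Hudi writer options (the values and the `customer_id` sort column are illustrative assumptions, not recommendations for this dataset):

```scala
// Illustrative clustering configuration; every value here is a placeholder to tune.
val clusteringOpts = Map(
  "hoodie.clustering.inline" -> "true",
  "hoodie.clustering.inline.max.commits" -> "4",
  // the plan strategy caps the number of file groups per clustering run at 30 by default;
  // raising it lets a single run cover far more of the table
  "hoodie.clustering.plan.strategy.max.num.groups" -> "300",
  // files below this size are candidates for clustering
  "hoodie.clustering.plan.strategy.small.file.limit" -> (160L * 1024 * 1024).toString,
  // target size of the files produced by clustering
  "hoodie.clustering.plan.strategy.target.file.max.bytes" -> (160L * 1024 * 1024).toString,
  // sort by the key the CDC updates arrive on, so each batch touches fewer files
  "hoodie.clustering.plan.strategy.sort.columns" -> "customer_id" // hypothetical column
)
// passed alongside the regular write options, e.g. df.write.format("hudi").options(clusteringOpts)...
```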
Thanks! Yeah, we are certain we run clustering for all partitions that are big enough. Analyzing the data for optimal clustering settings is a big effort, which is why I'm asking about write optimization.
Describe the problem you faced
We are using Spark 3.3 and Hudi 0.12.2. I need your assistance in improving the `Doing partition and writing data` stage; for us, it looks to be the most time consuming. We are using `snappy` compression (the most lightweight of the available codecs, as far as I know), and the file size is ~160 MB, which is effectively 80-90 MB with GZIP (the default codec in Hudi) for our workload. The files themselves consist of 1.5-2M rows. Our problem is that, due to our partitioning plus the CDC nature of the workload, we unfortunately must update a lot of files at peak hours. We have clustering to group rows together, but there are still thousands of files affected. The 75th percentile of an individual file overwrite (a task in the `Doing partition and writing data` stage) takes ~40-60 seconds, and it does not correlate with the number of rows updated inside (for the 75th percentile it's < 100 rows changed in each file). Also, the payload class is almost the default (minor changes which do not affect performance, IMO).

Q: `snappy` looks to be the best compression codec among `zstd` (which has a memory leak in Spark 3.3, BTW) and `gzip`. We also tried `hoodie.write.buffer.limit.bytes`, raising it to 32 MB, unfortunately with no visible difference. Is there any other setting that can speed up the write (`MergeHandle`) task?
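For reference, this is roughly how the settings mentioned above would be passed as writer options (the values just mirror what is described here and are not tuning advice):

```scala
// Settings discussed above, expressed as Hudi writer options (illustrative only).
val writeOpts = Map(
  // parquet codec for base files; Hudi defaults to gzip, snappy trades file size for CPU
  "hoodie.parquet.compression.codec" -> "snappy",
  // ~160 MB base files, as described above
  "hoodie.parquet.max.file.size" -> (160L * 1024 * 1024).toString,
  // in-memory buffer on the write path; raised to 32 MB with no visible difference
  "hoodie.write.buffer.limit.bytes" -> (32L * 1024 * 1024).toString
)
```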
Environment Description
Hudi version : 0.12.2
Spark version : 3.3.0