[Open] VitoMakarevich opened this issue 5 months ago
@VitoMakarevich Just checking: if you have lots of file groups impacted in each batch, then why not use a MERGE_ON_READ table? In your current setup, you can only try to optimize the parquet writes, which you have already tried.
Adding @xushiyan to suggest more ways to optimize performance.
+1 to use MOR to balance the ingestion speed and merge cost through compaction. There is also a new feature in 0.13.x https://hudi.apache.org/releases/release-0.13.0#simple-write-executor-as-default to improve write handle performance.
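For reference, a minimal sketch of what that combination could look like as Spark datasource options; the DataFrame, table name, path, and key/precombine fields are placeholders, and the values are assumptions for illustration rather than tuning advice:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical upsert into a MERGE_ON_READ table with the 0.13.x SIMPLE write executor.
def upsertMor(df: DataFrame): Unit = {
  df.write.format("hudi")
    .option("hoodie.table.name", "my_table")                        // placeholder table name
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")  // updates go to log files instead of rewriting base files
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "id")        // placeholder record key
    .option("hoodie.datasource.write.precombine.field", "ts")       // placeholder precombine field
    // amortize the merge cost across commits via compaction instead of per-commit rewrites
    .option("hoodie.compact.inline", "true")
    .option("hoodie.compact.inline.max.delta.commits", "5")
    // 0.13.x feature from the linked release notes: SIMPLE executor skips the bounded in-memory queue
    .option("hoodie.write.executor.type", "SIMPLE")
    .mode(SaveMode.Append)
    .save("s3://bucket/warehouse/my_table")                         // placeholder path
}
```

The trade-off is that the merge cost moves from write time to compaction (and partly to snapshot reads), which is the "balance ingestion speed and merge cost" point above.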
Hello, thanks for the suggestions! As I said, I'd like to know how I can speed up this individual part. I know MOR is an option in theory, but it's not feasible for our use case.
> We have clustering to group rows together, but it's still thousands of files affected. The 75th percentile of an individual file overwrite (a task in the `Doing partition and writing data` stage) takes ~40-60 seconds.
Based on this, I think clustering can be tuned further to rewrite files such that more updates can be targeted at the same file, reducing write amplification. Make sure your number of clustering groups is not limited to the default of 30, otherwise you miss a lot of files to cluster (see the sketch below). COW is expected to have high write amplification with heavy updates, especially if the updates are spread out over a lot of files. Also consider a better partitioning scheme so that updates are concentrated in a few partitions, if possible. Upgrade to a newer version to try configuring the executor type too.
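To make that concrete, here is a rough sketch of the clustering knobs this refers to, expressed as Hudi writer options (the values and the `customer_id` sort column are illustrative assumptions, not recommendations for this dataset):

```scala
// Illustrative clustering configuration; every value here is a placeholder to tune.
val clusteringOpts = Map(
  "hoodie.clustering.inline" -> "true",
  "hoodie.clustering.inline.max.commits" -> "4",
  // the plan strategy caps the number of file groups per clustering run at 30 by default;
  // raising it lets a single run cover far more of the table
  "hoodie.clustering.plan.strategy.max.num.groups" -> "300",
  // files below this size are candidates for clustering
  "hoodie.clustering.plan.strategy.small.file.limit" -> (160L * 1024 * 1024).toString,
  // target size of the files produced by clustering
  "hoodie.clustering.plan.strategy.target.file.max.bytes" -> (160L * 1024 * 1024).toString,
  // sort by the key the CDC updates arrive on, so each batch touches fewer files
  "hoodie.clustering.plan.strategy.sort.columns" -> "customer_id" // hypothetical column
)
// passed alongside the regular write options, e.g. df.write.format("hudi").options(clusteringOpts)...
```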
Thanks! Yeah, we are certain we run clustering for all partitions that are big enough. Analyzing the data for optimal clustering settings is a big effort, which is why I'm asking about write optimization.
Describe the problem you faced
We are using Spark 3.3 and Hudi 0.12.2. I need your assistance in improving the `Doing partition and writing data` stage; for us, it looks to be the most time consuming. We are using `snappy` compression (the most lightweight of the available codecs, as far as I know), and the file size is ~160 MB, which is effectively 80-90 MB with GZIP (the default codec in Hudi) for our workload. The files themselves consist of 1.5-2M rows. Our problem is that, due to our partitioning plus the CDC nature of the workload, we unfortunately must update a lot of files at peak hours. We have clustering to group rows together, but there are still thousands of files affected. The 75th percentile of an individual file overwrite (a task in the `Doing partition and writing data` stage) takes ~40-60 seconds, and it does not correlate with the number of rows updated inside (for the 75th percentile it's < 100 rows changed in each file). Also, the payload class is almost the default (minor changes which do not affect performance, IMO).

Q: `snappy` looks to be the best compression codec among `zstd` (which has a memory leak in Spark 3.3, BTW) and `gzip`. We also tried `hoodie.write.buffer.limit.bytes`, raising it to 32 MB, unfortunately with no visible difference. Is there any other setting that can speed up the write (`MergeHandle`) task?
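For reference, this is roughly how the settings mentioned above would be passed as writer options (the values just mirror what is described here and are not tuning advice):

```scala
// Settings discussed above, expressed as Hudi writer options (illustrative only).
val writeOpts = Map(
  // parquet codec for base files; Hudi defaults to gzip, snappy trades file size for CPU
  "hoodie.parquet.compression.codec" -> "snappy",
  // ~160 MB base files, as described above
  "hoodie.parquet.max.file.size" -> (160L * 1024 * 1024).toString,
  // in-memory buffer on the write path; raised to 32 MB with no visible difference
  "hoodie.write.buffer.limit.bytes" -> (32L * 1024 * 1024).toString
)
```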
Environment Description
Hudi version : 0.12.2
Spark version : 3.3.0