apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[HUDI-8393] Introduce multiple file-slice based partition for HoodieBaseRelation #12134

Open TheR1sing3un opened 1 month ago

TheR1sing3un commented 1 month ago

Currently, when Hudi runs a snapshot query on a MOR table, each required file slice is handed to Spark as its own partition. In some scenarios, for example when each file slice holds only a small amount of data, this generates a large number of Spark tasks, and much of the cost goes to task scheduling and resource allocation. I think we can add a step with pluggable strategies (file-size based, hash-value based, custom, ...) that combines multiple file slices into one partition, reducing the total number of tasks for better performance.
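To make the file-size based strategy concrete, here is a minimal sketch of one way it could work. It is not the code in this PR: `FileSlicePacker`, `SliceInfo`, and `packBySize` are hypothetical names, the size limit is an assumed parameter, and the greedy largest-first grouping is just one simple choice among many.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: none of these names come from Hudi or from this PR.
final class FileSlicePacker {

    // Stand-in for a file slice: an id plus the estimated bytes to read
    // (base file plus log files).
    static final class SliceInfo {
        final String fileId;
        final long sizeBytes;

        SliceInfo(String fileId, long sizeBytes) {
            this.fileId = fileId;
            this.sizeBytes = sizeBytes;
        }
    }

    // Greedily groups slices so that each group stays within
    // maxBytesPerPartition. A slice larger than the limit gets its own group.
    static List<List<SliceInfo>> packBySize(List<SliceInfo> slices, long maxBytesPerPartition) {
        List<SliceInfo> sorted = new ArrayList<>(slices);
        // Largest first, so oversized slices end up alone and the small
        // slices are grouped together afterwards.
        sorted.sort(Comparator.comparingLong((SliceInfo s) -> s.sizeBytes).reversed());

        List<List<SliceInfo>> partitions = new ArrayList<>();
        List<SliceInfo> current = new ArrayList<>();
        long currentBytes = 0L;

        for (SliceInfo slice : sorted) {
            // Close the current group if adding this slice would exceed the limit.
            if (!current.isEmpty() && currentBytes + slice.sizeBytes > maxBytesPerPartition) {
                partitions.add(current);
                current = new ArrayList<>();
                currentBytes = 0L;
            }
            current.add(slice);
            currentBytes += slice.sizeBytes;
        }
        if (!current.isEmpty()) {
            partitions.add(current);
        }
        return partitions;
    }
}
```

With slices of 100, 30, 20, 10 and 5 bytes and a limit of 64, this produces three groups (100 alone, then 30 + 20 + 10, then 5), so Spark would schedule three tasks instead of five. A hash-value based strategy could keep the same return shape and only change how slices are assigned to groups.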

Change Logs

  1. Refactor HoodieBaseRelation and its subclasses to support multiple file slices per task.

Impact

none

Risk level (write none, low medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

Contributor's checklist

hudi-bot commented 4 weeks ago

CI report:

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build