Currently, when Hudi runs a snapshot query on a MOR table, each required file slice is handed to Spark as its own partition.
In some scenarios, for example when each file slice holds only a small amount of data, this generates a large number of Spark tasks, and much of the cost goes to task scheduling and resource allocation.
I propose adding a step with pluggable strategies (file-size based, hash-value based, custom, ...) that combines multiple file slices into one partition, reducing the total number of tasks and improving performance.
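To make the idea concrete, here is a minimal sketch of what such strategies could look like. It is not the code in this PR: `SliceInfo`, `SliceCombineStrategy`, `SizeBasedCombine`, and `HashBasedCombine` are hypothetical names invented for illustration, and a real implementation would work on Hudi's own file-slice types inside `HoodieBaseRelation` rather than on this stand-in class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a file slice: just an id and an estimated size in bytes.
final class SliceInfo {
    final String fileId;
    final long sizeBytes;

    SliceInfo(String fileId, long sizeBytes) {
        this.fileId = fileId;
        this.sizeBytes = sizeBytes;
    }
}

// Hypothetical strategy interface: each returned inner list would become one Spark partition.
interface SliceCombineStrategy {
    List<List<SliceInfo>> combine(List<SliceInfo> slices);
}

// File-size based: greedily fill a group, and start a new one when adding
// the next slice would push the group past the target size.
final class SizeBasedCombine implements SliceCombineStrategy {
    private final long targetBytes;

    SizeBasedCombine(long targetBytes) {
        this.targetBytes = targetBytes;
    }

    @Override
    public List<List<SliceInfo>> combine(List<SliceInfo> slices) {
        List<List<SliceInfo>> groups = new ArrayList<>();
        List<SliceInfo> current = new ArrayList<>();
        long currentBytes = 0L;
        for (SliceInfo slice : slices) {
            if (!current.isEmpty() && currentBytes + slice.sizeBytes > targetBytes) {
                groups.add(current);
                current = new ArrayList<>();
                currentBytes = 0L;
            }
            // A slice larger than the target still gets a group of its own.
            current.add(slice);
            currentBytes += slice.sizeBytes;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}

// Hash-value based: assign each slice to one of numGroups buckets by the hash of its id.
final class HashBasedCombine implements SliceCombineStrategy {
    private final int numGroups;

    HashBasedCombine(int numGroups) {
        this.numGroups = numGroups;
    }

    @Override
    public List<List<SliceInfo>> combine(List<SliceInfo> slices) {
        List<List<SliceInfo>> buckets = new ArrayList<>();
        for (int i = 0; i < numGroups; i++) {
            buckets.add(new ArrayList<>());
        }
        for (SliceInfo slice : slices) {
            // floorMod keeps the bucket index non-negative for negative hash codes.
            buckets.get(Math.floorMod(slice.fileId.hashCode(), numGroups)).add(slice);
        }
        // Drop empty buckets so no task is scheduled with nothing to read.
        buckets.removeIf(List::isEmpty);
        return buckets;
    }
}
```

With a 100-byte target, slices of 40, 50, 30, 100, and 10 bytes would be grouped by the size-based sketch as [40, 50], [30], [100], [10], so five tasks become four. The greedy pass keeps the input order and is not an optimal bin packing, which is part of why a custom strategy hook seems useful.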
Change Logs
Refactor HoodieBaseRelation and its subclasses so that a single task can read multiple file slices.
Describe context and summary for this change. Highlight if any code was copied.
Impact
Describe any public API or user-facing feature change or any performance impact.
none
Risk level (write none, low medium or high below)
low
If medium or high, explain what verification was done to mitigate the risks.
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".
The config description must be updated if new configs are added or the default value of the configs are changed
Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.
Contributor's checklist