
[HUDI-8447] Claiming RFC-83 for Hudi data is ordered by primary key #12170

Closed. usberkeley closed this issue 3 weeks ago.

usberkeley commented 4 weeks ago

Change Logs

The process of sorting Hudi's base file and log file data by primary key is as follows:

  1. During the delta commit phase, the data is sorted.
  2. During the compaction phase, the data is sorted and merged using merge sort.

- Delta Commit: the written batch data is sorted by primary key.
- Compaction: the sorted data is merged with merge sort (implementation link). A minimal sketch of such a merge follows.
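
The merge step can be illustrated with a minimal, self-contained sketch. The class and record names below are illustrative, not Hudi's actual APIs; it only shows how key-sorted base-file and log-file inputs can be combined with a heap-based k-way merge, so no per-key map has to be held in memory (each input is assumed to have unique keys):

```java
import java.util.*;

public class SortedMergeSketch {

    // Simplified record: a primary key plus an opaque payload.
    record Rec(String key, String payload) {}

    // Heap entry remembers which input a record came from, so that on equal
    // keys the record from the later input (the newer log file) wins.
    record Entry(Rec rec, int source, Iterator<Rec> it) {}

    // Merges any number of key-sorted inputs into one key-sorted stream,
    // keeping a single record per key.
    static List<Rec> merge(List<Iterator<Rec>> sortedInputs) {
        PriorityQueue<Entry> heap = new PriorityQueue<>(
            Comparator.comparing((Entry e) -> e.rec().key())
                      .thenComparingInt(Entry::source));
        for (int i = 0; i < sortedInputs.size(); i++) {
            Iterator<Rec> it = sortedInputs.get(i);
            if (it.hasNext()) heap.add(new Entry(it.next(), i, it));
        }
        List<Rec> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Entry e = heap.poll();
            Rec winner = e.rec();
            if (e.it().hasNext()) heap.add(new Entry(e.it().next(), e.source(), e.it()));
            // Entries with the same key surface in input order, so the last
            // one polled comes from the newest input and replaces the winner.
            while (!heap.isEmpty() && heap.peek().rec().key().equals(winner.key())) {
                Entry dup = heap.poll();
                winner = dup.rec();
                if (dup.it().hasNext()) heap.add(new Entry(dup.it().next(), dup.source(), dup.it()));
            }
            out.add(winner);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> base = List.of(new Rec("k1", "base"), new Rec("k3", "base"));
        List<Rec> log1 = List.of(new Rec("k2", "log1"), new Rec("k3", "log1"));
        merge(List.of(base.iterator(), log1.iterator()))
            .forEach(r -> System.out.println(r.key() + " -> " + r.payload()));
        // Prints: k1 -> base, k2 -> log1, k3 -> log1
    }
}
```

Because all inputs are consumed in key order, memory usage stays bounded by the number of open inputs rather than by the number of distinct keys.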

Impact

Sorting the base file and log file data in Hudi offers the following benefits:

  1. Enhancing Hudi Reader Performance: once the data is sorted, it can be read with a merge sort, avoiding the large memory usage or disk IO overhead of Map-based methods.
  2. Improving Compaction Performance: sorted data can also be merged with merge sort, reducing the memory usage and disk IO overhead required by Map-based methods.
  3. Supporting the Introduction of a Primary Key Index in the MDT: sorted data makes it possible to introduce a sparse index similar to ClickHouse's, improving the query efficiency of the Hudi Reader (see the sketch after this list).
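
To illustrate benefit 3, here is a minimal sketch of a ClickHouse-style sparse primary-key index over a key-sorted file, assuming fixed-size blocks. The class and method names are hypothetical and are not part of Hudi or the MDT:

```java
import java.util.*;

public class SparseIndexSketch {

    private final List<String> blockFirstKeys = new ArrayList<>(); // one mark per block
    private final List<List<String>> blocks = new ArrayList<>();   // stand-in for on-disk blocks

    // Builds marks from a key-sorted file: only the first key of every
    // fixed-size block is kept in memory.
    SparseIndexSketch(List<String> sortedKeys, int blockSize) {
        for (int i = 0; i < sortedKeys.size(); i += blockSize) {
            List<String> block = sortedKeys.subList(i, Math.min(i + blockSize, sortedKeys.size()));
            blockFirstKeys.add(block.get(0));
            blocks.add(block);
        }
    }

    // Point lookup that touches at most one block.
    boolean contains(String key) {
        int pos = Collections.binarySearch(blockFirstKeys, key);
        int blockIdx = pos >= 0 ? pos : -pos - 2; // last block whose first key <= key
        if (blockIdx < 0) return false;           // key is smaller than every mark
        return Collections.binarySearch(blocks.get(blockIdx), key) >= 0;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) keys.add(String.format("key-%05d", i));
        SparseIndexSketch idx = new SparseIndexSketch(keys, 128);
        System.out.println(idx.contains("key-00777")); // true
        System.out.println(idx.contains("key-99999")); // false
    }
}
```

The index is only possible because the file is sorted: a binary search over the in-memory marks narrows a lookup to one block, so the reader never scans the whole file.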

Performance comparison after introducing ordered Hudi data (including Compaction with Merge Sort and Log Compaction with Merge Sort), a Primary Key Index, and a Secondary Index: Performance Comparison link

Risk level (write none, low medium or high below)

medium

Documentation Update

none

Contributor's checklist

hudi-bot commented 4 weeks ago

CI report:

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure`: re-run the last Azure build
usberkeley commented 3 weeks ago

The PR design content has been submitted to https://github.com/apache/hudi/pull/11793.