Sorting the base file and log file data in Hudi offers the following benefits:
Enhancing Hudi Reader Performance: After sorting the data, merge sort can be used to read the data, thereby avoiding the large memory usage or disk IO overhead associated with Map-based methods.
Improving Compaction Performance: Sorted data can also utilize merge sort during the merging process, reducing the large memory usage and disk IO overhead required by Map-based methods.
Supporting a Primary Key Index in the MDT: Sorted data makes it possible to introduce a sparse index similar to the one in ClickHouse, improving the query efficiency of the Hudi Reader.
Performance Comparison After Introducing Ordered Hudi Data (Including Compaction with Merge Sort and Log Compaction with Merge Sort), Primary Key Index and Secondary Index: Performance Comparison link
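To illustrate why sorted data makes a ClickHouse-style sparse index possible, here is a minimal sketch. It is not the implementation in this PR; the class `SparseIndexSketch`, the `granuleSize` parameter, and the method `candidateRange` are hypothetical names chosen for illustration. The idea is to keep only the first key of every block of rows, then binary-search those keys to find the single block that could contain a given primary key.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of a sparse primary key index over key-sorted rows.
final class SparseIndexSketch {
    private final List<String> marks = new ArrayList<>(); // first key of each granule
    private final int granuleSize;                        // rows per granule (assumed parameter)
    private final int totalRows;

    // sortedKeys must already be sorted ascending, as the sorted file layout guarantees.
    SparseIndexSketch(List<String> sortedKeys, int granuleSize) {
        this.granuleSize = granuleSize;
        this.totalRows = sortedKeys.size();
        for (int i = 0; i < sortedKeys.size(); i += granuleSize) {
            marks.add(sortedKeys.get(i));
        }
    }

    // Returns {startRow, endRowExclusive} of the only granule that may hold the key,
    // or null when the key sorts before the first row in the file.
    int[] candidateRange(String key) {
        int pos = Collections.binarySearch(marks, key);
        // When the key is not a mark, binarySearch returns -(insertionPoint) - 1;
        // the granule to scan is the one just before the insertion point.
        int granule = pos >= 0 ? pos : -pos - 2;
        if (granule < 0) {
            return null;
        }
        int start = granule * granuleSize;
        return new int[] {start, Math.min(start + granuleSize, totalRows)};
    }
}
```

With seven sorted keys and a granule size of 3, the index stores three marks, and a point lookup scans at most three rows instead of the whole file.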
Change Logs
The process of sorting Hudi's base file and log file data by primary key is as follows:
Delta Commit: The batch data written is sorted by primary key
Compaction: Compaction with Merge Sort Implementation link
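The merge step that sorted inputs enable can be sketched as a two-way merge. This is a simplified illustration and not the code in this PR: the class `SortedMergeSketch`, the helper `rec`, and the use of string key/value pairs are assumptions, and the sketch assumes each input has unique keys and that the log record replaces the base record on a key match.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: merge key-sorted base file records with key-sorted log records.
final class SortedMergeSketch {
    // Convenience constructor for a (primaryKey, payload) record.
    static Map.Entry<String, String> rec(String key, String value) {
        return new SimpleEntry<>(key, value);
    }

    // Single pass over both inputs; only the current record of each side is held
    // during the merge, so no map of all log records is built.
    // The result is collected into a list here only to keep the sketch testable;
    // a real reader would hand each record downstream as it is produced.
    static List<Map.Entry<String, String>> merge(
            Iterator<Map.Entry<String, String>> base,
            Iterator<Map.Entry<String, String>> log) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        Map.Entry<String, String> b = base.hasNext() ? base.next() : null;
        Map.Entry<String, String> l = log.hasNext() ? log.next() : null;
        while (b != null || l != null) {
            // An exhausted side compares as "greater" so the other side drains.
            int cmp = (b == null) ? 1 : (l == null) ? -1 : b.getKey().compareTo(l.getKey());
            if (cmp < 0) {
                out.add(b);
                b = base.hasNext() ? base.next() : null;
            } else if (cmp > 0) {
                out.add(l);
                l = log.hasNext() ? log.next() : null;
            } else {
                // Same primary key: the log record is assumed to be the newer version.
                out.add(l);
                b = base.hasNext() ? base.next() : null;
                l = log.hasNext() ? log.next() : null;
            }
        }
        return out;
    }
}
```

The output is itself sorted by primary key, so a compaction built this way can write the new base file in order without a separate sort.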
Impact
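Sorting the base file and log file data in Hudi by primary key lets the Hudi Reader and compaction use merge sort instead of Map-based methods, reducing memory usage and disk IO overhead, and makes a sparse primary key index possible.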
Performance Comparison After Introducing Ordered Hudi Data (Including Compaction with Merge Sort and Log Compaction with Merge Sort), Primary Key Index and Secondary Index: Performance Comparison link
Risk level (write none, low, medium or high below)
medium
Documentation Update
none
Contributor's checklist