
[HUDI-8447] Claiming RFC-83 for Hudi data is ordered by primary key #12170

Closed. usberkeley closed this issue 3 weeks ago.

usberkeley commented 4 weeks ago

Change Logs

The process of sorting Hudi's base file and log file data by primary key is as follows:

  1. During the delta commit phase, the data is sorted.
  2. During the compaction phase, the data is sorted and merged using merge sort.

- Delta Commit: the written batch data is sorted by primary key.
- Compaction: the sorted data is merged with merge sort (implementation link). A minimal sketch of such a merge follows.
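
The merge step can be illustrated with a minimal, self-contained sketch. The class and record names below are illustrative, not Hudi's actual APIs; it only shows how key-sorted base-file and log-file inputs can be combined with a heap-based k-way merge, so no per-key map has to be held in memory (each input is assumed to have unique keys):

```java
import java.util.*;

public class SortedMergeSketch {

    // Simplified record: a primary key plus an opaque payload.
    record Rec(String key, String payload) {}

    // Heap entry remembers which input a record came from, so that on equal
    // keys the record from the later input (the newer log file) wins.
    record Entry(Rec rec, int source, Iterator<Rec> it) {}

    // Merges any number of key-sorted inputs into one key-sorted stream,
    // keeping a single record per key.
    static List<Rec> merge(List<Iterator<Rec>> sortedInputs) {
        PriorityQueue<Entry> heap = new PriorityQueue<>(
            Comparator.comparing((Entry e) -> e.rec().key())
                      .thenComparingInt(Entry::source));
        for (int i = 0; i < sortedInputs.size(); i++) {
            Iterator<Rec> it = sortedInputs.get(i);
            if (it.hasNext()) heap.add(new Entry(it.next(), i, it));
        }
        List<Rec> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Entry e = heap.poll();
            Rec winner = e.rec();
            if (e.it().hasNext()) heap.add(new Entry(e.it().next(), e.source(), e.it()));
            // Entries with the same key surface in input order, so the last
            // one polled comes from the newest input and replaces the winner.
            while (!heap.isEmpty() && heap.peek().rec().key().equals(winner.key())) {
                Entry dup = heap.poll();
                winner = dup.rec();
                if (dup.it().hasNext()) heap.add(new Entry(dup.it().next(), dup.source(), dup.it()));
            }
            out.add(winner);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> base = List.of(new Rec("k1", "base"), new Rec("k3", "base"));
        List<Rec> log1 = List.of(new Rec("k2", "log1"), new Rec("k3", "log1"));
        merge(List.of(base.iterator(), log1.iterator()))
            .forEach(r -> System.out.println(r.key() + " -> " + r.payload()));
        // Prints: k1 -> base, k2 -> log1, k3 -> log1
    }
}
```

Because all inputs are consumed in key order, memory usage stays bounded by the number of open inputs rather than by the number of distinct keys.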

Impact

Sorting the base file and log file data in Hudi offers the following benefits:

  1. Enhancing Hudi Reader Performance: once the data is sorted, it can be read with a merge sort, avoiding the large memory usage or disk IO overhead of Map-based methods.
  2. Improving Compaction Performance: sorted data can also be merged with merge sort, reducing the memory usage and disk IO overhead required by Map-based methods.
  3. Supporting the Introduction of a Primary Key Index in the MDT: sorted data makes it possible to introduce a sparse index similar to ClickHouse's, improving the query efficiency of the Hudi Reader (see the sketch after this list).
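
To illustrate benefit 3, here is a minimal sketch of a ClickHouse-style sparse primary-key index over a key-sorted file, assuming fixed-size blocks. The class and method names are hypothetical and are not part of Hudi or the MDT:

```java
import java.util.*;

public class SparseIndexSketch {

    private final List<String> blockFirstKeys = new ArrayList<>(); // one mark per block
    private final List<List<String>> blocks = new ArrayList<>();   // stand-in for on-disk blocks

    // Builds marks from a key-sorted file: only the first key of every
    // fixed-size block is kept in memory.
    SparseIndexSketch(List<String> sortedKeys, int blockSize) {
        for (int i = 0; i < sortedKeys.size(); i += blockSize) {
            List<String> block = sortedKeys.subList(i, Math.min(i + blockSize, sortedKeys.size()));
            blockFirstKeys.add(block.get(0));
            blocks.add(block);
        }
    }

    // Point lookup that touches at most one block.
    boolean contains(String key) {
        int pos = Collections.binarySearch(blockFirstKeys, key);
        int blockIdx = pos >= 0 ? pos : -pos - 2; // last block whose first key <= key
        if (blockIdx < 0) return false;           // key is smaller than every mark
        return Collections.binarySearch(blocks.get(blockIdx), key) >= 0;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) keys.add(String.format("key-%05d", i));
        SparseIndexSketch idx = new SparseIndexSketch(keys, 128);
        System.out.println(idx.contains("key-00777")); // true
        System.out.println(idx.contains("key-99999")); // false
    }
}
```

The index is only possible because the file is sorted: a binary search over the in-memory marks narrows a lookup to one block, so the reader never scans the whole file.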

Performance comparison after introducing ordered Hudi data (including Compaction with Merge Sort and Log Compaction with Merge Sort), a Primary Key Index, and a Secondary Index: Performance Comparison link

Risk level (write none, low medium or high below)

medium

Documentation Update

none

Contributor's checklist

hudi-bot commented 4 weeks ago

CI report:

Bot commands: @hudi-bot supports the following commands:
- `@hudi-bot run azure`: re-run the last Azure build
usberkeley commented 3 weeks ago

The PR design content has been submitted to https://github.com/apache/hudi/pull/11793.