apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[HUDI-8543] Fixing SI MDT record generation in MDT to not rely on RDD<WriteStatus> #12291

Closed nsivabalan closed 3 days ago

nsivabalan commented 5 days ago

Change Logs

Fixing SI MDT record generation in MDT to not rely on RDD<WriteStatus>. This is a stacked patch over https://github.com/apache/hudi/pull/12269

This is the 2nd patch in a series. In this patch, we make SI not rely on RDD<WriteStatus>. SI needs to know which record keys have been deleted or updated so that it can delete the corresponding entries from SI. We are only adjusting this so that it does not depend on the WriteStatus.

We had to make some changes to how RLI and SI records are generated. We do not want to read from data files repeatedly (once for RLI and again for SI), so here is what we are doing.

Step1: For every HoodieWriteStat in the HoodieCommitMetadata, we are generating PerFileGroupRecordKeyInfos which contains

where RecordStatus is an enum with INSERT or UPDATE or DELETE
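As a rough illustration of the shape described in Step1, the sketch below models a per-file-group holder in plain Java. Only the name `PerFileGroupRecordKeyInfos` and the three `RecordStatus` values come from this description. The fields (`partitionPath`, `fileId`, the key-to-status map) and the helpers `add` and `keysWithStatus` are assumptions made for illustration, not the actual class in the patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Status values named in the PR description.
enum RecordStatus { INSERT, UPDATE, DELETE }

// Hypothetical shape: one instance per (partition, fileId) combination.
final class PerFileGroupRecordKeyInfosSketch {
  final String partitionPath; // assumed field
  final String fileId;        // assumed field
  // Assumed field: record key -> what happened to that key in this commit.
  private final Map<String, RecordStatus> statusByKey = new LinkedHashMap<>();

  PerFileGroupRecordKeyInfosSketch(String partitionPath, String fileId) {
    this.partitionPath = partitionPath;
    this.fileId = fileId;
  }

  // Records the status of one key; a later call for the same key overwrites it.
  void add(String recordKey, RecordStatus status) {
    statusByKey.put(recordKey, status);
  }

  // Returns the keys with the given status, in the order they were first added.
  List<String> keysWithStatus(RecordStatus wanted) {
    List<String> keys = new ArrayList<>();
    for (Map.Entry<String, RecordStatus> entry : statusByKey.entrySet()) {
      if (entry.getValue() == wanted) {
        keys.add(entry.getKey());
      }
    }
    return keys;
  }
}
```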

In case of a parquet file:

Step2: Persist the output from Step1. (PairRDD)

Step3:

Step4:

Essentially, at a high level, we process the data files once to populate PerFileGroupRecordKeyInfos for every HoodieWriteStat (i.e. every partition, fileId combination), and then use the paired RDD to compute both RLI and SI records.
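The "read once, derive twice" idea can be sketched without Spark. In the example below, a plain `List` stands in for the persisted paired RDD, and two independent functions consume it. The class `SinglePassIndexSketch`, the `KeyInfo` type, and both method names are hypothetical. The RLI rule shown (only inserted keys get a new key-to-fileId mapping) is an assumption for the sketch. The SI rule follows the description above: keys that were updated or deleted need their existing SI entries removed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SinglePassIndexSketch {
  enum RecordStatus { INSERT, UPDATE, DELETE }

  // Hypothetical flattened entry: one record key seen in one file group.
  static final class KeyInfo {
    final String fileId;
    final String recordKey;
    final RecordStatus status;

    KeyInfo(String fileId, String recordKey, RecordStatus status) {
      this.fileId = fileId;
      this.recordKey = recordKey;
      this.status = status;
    }
  }

  // Assumed RLI rule for this sketch: inserted keys get a key -> fileId mapping.
  static Map<String, String> rliMappingsToAdd(List<KeyInfo> infos) {
    Map<String, String> mappings = new LinkedHashMap<>();
    for (KeyInfo info : infos) {
      if (info.status == RecordStatus.INSERT) {
        mappings.put(info.recordKey, info.fileId);
      }
    }
    return mappings;
  }

  // From the description: SI drops entries for keys that were updated or deleted.
  static List<String> siKeysToDelete(List<KeyInfo> infos) {
    List<String> keys = new ArrayList<>();
    for (KeyInfo info : infos) {
      if (info.status == RecordStatus.UPDATE || info.status == RecordStatus.DELETE) {
        keys.add(info.recordKey);
      }
    }
    return keys;
  }
}
```

Both methods take the same `infos` list, so neither has to go back to the data files. In the patch, that role is played by the persisted output of Step1.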

Pending: I need to write more tests. Most of the existing tests for RLI and SI pass, except 1 which I am investigating.

Impact

Fixes non-determinism and undefined behavior in RLI and SI record generation.

Risk level (write none, low medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

Contributor's checklist

nsivabalan commented 3 days ago

Closing this in favor of https://github.com/apache/hudi/pull/12313