apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[HUDI-8543] Fixing Secondary Index Record generation to not rely on WriteStatus #12313

Open nsivabalan opened 9 hours ago

nsivabalan commented 9 hours ago

Change Logs

SI record generation takes two steps:

a. Find the record keys that were updated or deleted in the commit. We look these keys up in the SI to find the existing (SI value, record key) combinations and prepare delete records for them.

b. For the latest data (inserted or updated), read the records to find each (SI value, record key) combination and generate new insert records to ingest into the SI.
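The two steps above can be sketched in a few lines. This is a minimal illustration, not Hudi's implementation: the secondary index is modelled as an in-memory set of (SI value, record key) pairs, and `generate_si_records`, `si_entries`, `changed_keys`, and `latest_records` are hypothetical names introduced only for this example.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical model: one SI entry is a (SI value, record key) pair.
SiEntry = Tuple[str, str]


def generate_si_records(
    si_entries: Set[SiEntry],
    changed_keys: Iterable[str],
    latest_records: Dict[str, str],
) -> Tuple[List[SiEntry], List[SiEntry]]:
    """Return (deletes, inserts) to apply to the secondary index.

    changed_keys:   record keys updated or deleted in the commit.
    latest_records: record key -> SI value for records inserted or
                    updated in the commit (deleted keys are absent).
    """
    changed = set(changed_keys)
    # Step (a): look up the existing SI entries of the changed keys
    # and emit a delete record for each of them.
    deletes = sorted(entry for entry in si_entries if entry[1] in changed)
    # Step (b): read the latest data and emit an insert record per row.
    inserts = sorted((value, key) for key, value in latest_records.items())
    return deletes, inserts


si = {("nyc", "k1"), ("sf", "k2"), ("la", "k3")}
# k1 was updated (nyc -> boston), k3 was deleted, k4 was inserted.
deletes, inserts = generate_si_records(
    si, ["k1", "k3"], {"k1": "boston", "k4": "austin"}
)
print(deletes)  # [('la', 'k3'), ('nyc', 'k1')]
print(inserts)  # [('austin', 'k4'), ('boston', 'k1')]
```

Only the input `changed_keys` corresponds to step (a)'s dependency on WriteStatus; the rest of the flow is unchanged regardless of where that list comes from.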

Among the above steps, (a) is the one which was relying on WriteStatus.

In this patch, we are only fixing (a): finding the list of record keys that were updated or deleted in the commit of interest no longer relies on WriteStatus and instead does an on-demand read from the data files. Time permitting for the 1.x release, we might have a follow-up patch that unifies steps (a) and (b) into a single step.
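As a rough illustration of deriving the changed keys from data rather than from WriteStatus, the sketch below compares the contents of a file group before and after the commit. This is an assumption about one possible approach, not the logic in this patch: `find_changed_keys`, `previous`, and `current` are hypothetical names, and each snapshot is modelled as a plain dict from record key to row content.

```python
from typing import Dict, Set, Tuple


def find_changed_keys(
    previous: Dict[str, str], current: Dict[str, str]
) -> Tuple[Set[str], Set[str]]:
    """Return (updated, deleted) record keys for one file group.

    previous: record key -> row content before the commit (hypothetical).
    current:  record key -> row content after the commit (hypothetical).
    """
    # Keys present before the commit but missing afterwards were deleted.
    deleted = set(previous.keys() - current.keys())
    # Keys present in both snapshots with different content were updated.
    updated = {
        key
        for key in previous.keys() & current.keys()
        if previous[key] != current[key]
    }
    return updated, deleted


before = {"k1": "nyc", "k2": "sf", "k3": "la"}
after = {"k1": "boston", "k2": "sf", "k4": "austin"}
updated, deleted = find_changed_keys(before, after)
print(updated)  # {'k1'}
print(deleted)  # {'k3'}
```

Newly inserted keys (`k4` here) are deliberately not reported, since step (a) only needs keys that may already have entries in the SI.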

This patch definitely has to go in to remove the dependency on WriteStatus. Merging steps (a) and (b) will be followed up based on available bandwidth and the time left before we wrap up 1.0. It is an optimization and does not affect correctness. Since Secondary Index is itself a new feature that we are introducing in 1.x, we wanted to take care of correctness and reliability first.

Impact

Risk level (write none, low, medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

Contributor's checklist

hudi-bot commented 8 hours ago

CI report:

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build