Secondary Index (SI) record generation consists of two steps:
a. Find the record keys that were updated or deleted, look them up in the SI to find the existing (SI value, record key) combinations, and prepare delete records for those entries to add to the SI.
b. For the latest data (inserted or updated), read the records to find the (SI value, record key) combinations and generate new insert records to ingest into the SI.
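The two steps above can be sketched as follows. This is only an illustration under stated assumptions, not Hudi's implementation: the class `SiRecordSketch`, its method names, and the plain in-memory maps standing in for the secondary index and the data files are all hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: plain maps stand in for the secondary index and the data files.
// Each emitted entry is a (SI value, record key) pair.
class SiRecordSketch {

  // Step (a): for every record key that was updated or deleted, look up its
  // existing entry in the SI and emit a delete for that (SI value, record key) pair.
  static List<SimpleEntry<String, String>> prepareDeletes(
      Set<String> updatedOrDeletedKeys, Map<String, String> siValueByRecordKey) {
    List<SimpleEntry<String, String>> deletes = new ArrayList<>();
    // Iterate in sorted order so the output is deterministic.
    for (String recordKey : new TreeSet<>(updatedOrDeletedKeys)) {
      String oldSiValue = siValueByRecordKey.get(recordKey);
      if (oldSiValue != null) { // keys with no SI entry need no delete
        deletes.add(new SimpleEntry<>(oldSiValue, recordKey));
      }
    }
    return deletes;
  }

  // Step (b): for the latest data (inserted or updated), take the SI value of
  // each record and emit an insert for the (SI value, record key) pair.
  static List<SimpleEntry<String, String>> prepareInserts(
      Map<String, String> latestSiValueByRecordKey) {
    List<SimpleEntry<String, String>> inserts = new ArrayList<>();
    for (Map.Entry<String, String> e : new TreeMap<>(latestSiValueByRecordKey).entrySet()) {
      inserts.add(new SimpleEntry<>(e.getValue(), e.getKey()));
    }
    return inserts;
  }
}
```

In this sketch an updated record produces both a delete for its old pair and an insert for its new pair, which is why the two steps are candidates for being merged.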
Of these steps, (a) is the one that relied on WriteStatus.
This patch fixes only (a): finding the list of record keys that were updated or deleted in the commit of interest no longer relies on WriteStatus and instead does an on-demand read from the data files.
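One way to picture deriving the changed keys from the data files instead of from WriteStatus is a comparison of what a file held before and after the commit. The sketch below assumes exactly that; the class `ChangedKeysSketch`, its method, and the two maps modelling file contents are hypothetical and do not reflect Hudi's reader API. It also assumes that an update matters here only when the SI value changed.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: each map models the records (record key -> SI value)
// read on demand from one version of a data file.
class ChangedKeysSketch {

  // Returns the keys that existed before the commit and are now either
  // gone (deleted) or carry a different SI value (updated).
  static Set<String> updatedOrDeletedKeys(
      Map<String, String> before, Map<String, String> after) {
    Set<String> changed = new TreeSet<>();
    for (Map.Entry<String, String> e : before.entrySet()) {
      String key = e.getKey();
      boolean deleted = !after.containsKey(key);
      boolean updated = !deleted && !Objects.equals(e.getValue(), after.get(key));
      if (deleted || updated) {
        changed.add(key);
      }
    }
    return changed;
  }
}
```

Keys that appear only in the newer version are pure inserts, so they are left to step (b) and are not returned here.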
Time permitting for the 1.x release, we might have a follow-up patch that unifies steps (a) and (b) into a single step.
This patch has to go in to remove the dependency on WriteStatus. Merging steps (a) and (b) will be followed up depending on available bandwidth and the time left before we wrap up 1.0. It is an optimization and does not affect correctness. Since Secondary Index is itself a new feature introduced in 1.x, we wanted to take care of correctness and reliability first.
Impact
Fixes Secondary Index record generation so that it no longer relies on WriteStatus.
Risk level (write none, low medium or high below)
low
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".
The config description must be updated if new configs are added or the default values of configs are changed.
Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.
Contributor's checklist