apache / hudi

Upserts, Deletes And Incremental Processing on Big Data.
https://hudi.apache.org/
Apache License 2.0

[HUDI-8542] Updating how RLI records are generated for MDT updates #12269

Open nsivabalan opened 5 days ago

nsivabalan commented 5 days ago

Change Logs

RLI record preparation for MDT has been relying on an RDD. This patch removes that dependency and generates the records on the fly, so that RLI updates are resilient to task/stage retries with Spark. Will update the PR description with more details on the design shortly.

Current design of RLI prior to this fix:

[Images: mdt_dag1_1, mdt_dag1_2 — DAG of the current RLI record generation]

Root cause of the potential inconsistencies: The major concern is that, to prepare MDT records for some of the partitions, we rely on an RDD. All other MDT partitions rely on HoodieCommitMetadata, which lives in the driver's memory, but RLI relies on an RDD, which could be recomputed when a subset of spark partitions goes missing.
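To make that hazard concrete, here is a minimal, self-contained Spark sketch (hypothetical code, not from the Hudi codebase): an uncached RDD with any non-determinism upstream yields different results each time it is dereferenced, which is exactly what a task/stage retry amounts to.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddRetryHazard {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext("local[2]", "rli-hazard-sketch");
    List<String> recordKeys = Arrays.asList("key1", "key2", "key3");

    // Stand-in for the write-status RDD produced by the data table write. The
    // map simulates any non-deterministic step in the upstream dag (e.g. a
    // re-attempted write) that can yield a different result on recompute.
    JavaRDD<String> rliRecords = jsc.parallelize(recordKeys)
        .map(key -> key + "@" + System.nanoTime());

    // First dereference: what the data table commit saw.
    System.out.println(rliRecords.collect());

    // Second dereference: what the MDT/RLI path would see. Without a reliable
    // cache or checkpoint the lineage is recomputed, the suffixes differ, and
    // the RLI records no longer match what the data table actually committed.
    System.out.println(rliRecords.collect());

    jsc.stop();
  }
}
```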

Proposal to fix the inconsistencies: At a high level, the proposal avoids the reliance on the RDD altogether. We are choosing consistency over performance for the time being. Eventually we want to move to a fully streaming way of generating MDT records (ref draft patch here). We might come back post 1.0 release to revisit the full dag rewrite, since it is more streaming friendly and we will have to take that route eventually to support minute-level commits with numerous indexes being added to our indexing subsystem.

Design

Once we trigger DT writes via collect() to fetch the List of write statuses, no downstream computation will ever look up the RDD again for MDT record preparation. For RLI and the secondary index, we will do an on-demand read of the data files to fetch the info required to prepare MDT records for these partitions. Here is an illustration of the design.

[Image: design illustration]

So, for RLI and the secondary index, we will do an on-demand read of data files to fetch the info needed to prepare MDT records. With this design change, the entire record generation for RLI and the secondary index becomes resilient to spark task retries. In fact, we trigger the collect() just once for the data table, and after collecting the WriteStatus/HoodieWriteStats from the data table, no downstream caller will ever dereference the dag again, so there is no chance of inconsistencies.
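A hedged sketch of this control flow, with hypothetical stand-in types rather than Hudi's actual classes:

```java
import java.util.List;
import java.util.stream.Collectors;
import org.apache.spark.api.java.JavaRDD;

// Hypothetical stand-in for WriteStatus/HoodieWriteStat.
record WriteStat(String partitionPath, String fileId) {}

class MdtRecordPreparation {
  // Dereference the dag exactly once; everything downstream works off the
  // driver-side list plus deterministic reads of committed files on storage.
  static void commitFlow(JavaRDD<WriteStat> writeStatusRdd) {
    List<WriteStat> writeStats = writeStatusRdd.collect(); // the only collect()

    // MDT partitions backed by commit metadata are built from driver memory;
    // RLI records come from on-demand reads of the touched files (elided).
    List<String> touchedFileIds = writeStats.stream()
        .map(WriteStat::fileId)
        .collect(Collectors.toList());
  }
}
```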

This patch focuses only on RLI. We will work on a follow-up patch for SI.

So, let's try to understand what info we need and how to fetch it.

For RLI, we need the following info for every file group that got touched: the set of record keys newly added to the file group (new RLI entries) and the set of record keys removed from it (RLI entries to delete), as the sketch below pins down.
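A hypothetical holder (not Hudi's API) for that per-file-group info:

```java
import java.util.Set;

// Hypothetical holder for the per-file-group info RLI needs.
record FileGroupRliInfo(
    String partitionPath,
    String fileId,
    Set<String> insertedRecordKeys,  // keys that newly map to this file group
    Set<String> deletedRecordKeys) { // keys whose RLI entries must be removed
}
```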

For now, let's dive into the details of how we plan to get the required info.

How do we fetch the required info for a given file group touched in the current commit of interest:

a. We need to read the latest image of the file slice added as part of the current commit.
b. Optionally, we need to read the previous image of the file slice, or the previous file slice (excluding the files being added as part of the current commit).

The difference between (a) and (b) gives us the required info for both indexes.

The reason (b) is optional: if the new file being considered consists purely of inserts or updates, we don't even need to look up the previous version of the file slice. Note that HoodieWriteStat gives us numInserts, numUpdates and numDeletes, and we can rely on those to deduce this, as the sketch below shows.
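A minimal sketch, assuming HoodieWriteStat-style counters, of deducing whether step (b) is needed for a given file group:

```java
class RliPlanner {
  // Inserts are read straight out of the new file, and updates do not move a
  // record key across file groups, so neither needs the previous image. Only
  // deletes force a diff against the previous image to learn which keys vanished.
  static boolean needsPreviousImage(long numInserts, long numUpdates, long numDeletes) {
    return numDeletes > 0;
  }
}
```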

Let's clarify what "previous image of the file slice" refers to: essentially, it is the latest file slice excluding the files being added in the current commit.

- In the case of a COW table, if a new base file is added to an existing file group, the "previous image of the file slice" refers to the previous file slice (i.e. the previous base file). The same applies to compaction in the case of a MOR table.
- Degenerate case: if a new base file is added to a new file group, there is no "previous image of the file slice". Every record is an insert in this case.
- In the case of a new log file added to an existing file slice, the "previous image of the file slice" refers to the file slice excluding the log file.

A sketch of this resolution follows.
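This sketch models the three cases above with a hypothetical, minimal file-slice type (one base file plus log files), not Hudi's actual file system view API:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical minimal model of a file slice.
record FileSliceView(String baseFile, List<String> logFiles) {}

class PreviousImageResolver {
  // Resolve the "previous image" for a file group, given its file slices
  // ordered latest-first and the set of files added by the current commit.
  static FileSliceView previousImage(List<FileSliceView> slicesLatestFirst,
                                     Set<String> filesInCurrentCommit) {
    FileSliceView latest = slicesLatestFirst.get(0);
    if (filesInCurrentCommit.contains(latest.baseFile())) {
      // New base file (COW write, or MOR compaction): the previous image is
      // the previous file slice; a brand-new file group has none, and every
      // record in it is an insert.
      return slicesLatestFirst.size() > 1 ? slicesLatestFirst.get(1) : null;
    }
    // New log file(s) on an existing slice: the previous image is the same
    // slice with the current commit's log files excluded.
    List<String> priorLogs = latest.logFiles().stream()
        .filter(f -> !filesInCurrentCommit.contains(f))
        .collect(Collectors.toList());
    return new FileSliceView(latest.baseFile(), priorLogs);
  }
}
```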

Computing the record key ➝ fileId mapping differs for base files vs log files. Let's take a look at each of them.

Base file record key mapping computation (RLI): read the record keys from the newly added base file (a), read them from the previous image of the file slice (b) when deletes are involved, and diff the two sets: keys only in (a) are new RLI entries, and keys only in (b) are RLI deletes. A sketch follows.
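A minimal sketch of that diff, reusing the hypothetical FileGroupRliInfo holder from above; readRecordKeys is also hypothetical and stands for scanning the record-key column out of a base file (or the merged previous image):

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

class BaseFileRliDiff {
  // Hypothetical: scan the record keys out of a file. Elided here.
  static Set<String> readRecordKeys(String file) {
    return new HashSet<>();
  }

  static FileGroupRliInfo diff(String partition, String fileId,
                               String newBaseFile, String previousImage) {
    Set<String> current = readRecordKeys(newBaseFile);             // image (a)
    Set<String> previous = previousImage == null
        ? Collections.emptySet()
        : readRecordKeys(previousImage);                           // image (b)

    Set<String> inserted = new HashSet<>(current);  // keys only in (a):
    inserted.removeAll(previous);                   //   new RLI entries
    Set<String> deleted = new HashSet<>(previous);  // keys only in (b):
    deleted.removeAll(current);                     //   RLI deletes
    return new FileGroupRliInfo(partition, fileId, inserted, deleted);
  }
}
```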

Log file record key mapping computation: RLI is an index where inserts cannot go into log files, so we can completely skip reading log files containing only data blocks. In other words, data blocks in log files can only contain updates. Still, some payload implementations could realize deletes via a custom implementation, which may be seen as updates. To account for those cases, we let the next compaction take care of realizing the deletes from the file group of interest.

To summarize, if log files are added in the commit of interest, then for RLI, since inserts cannot go into log files, no new RLI records need to be generated for them; any payload-realized deletes are reconciled at the next compaction (see the sketch below).
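A trivial sketch of that rule (hypothetical helper, not Hudi's API):

```java
import java.util.Collections;
import java.util.List;

class LogFileRliHandling {
  // Inserts can never land in log files, so data blocks carry only updates
  // and the record key -> fileId mapping is unchanged: nothing to index.
  // Deletes hidden inside custom payloads are deferred to the next compaction.
  static List<String> rliRecordsForLogFiles(List<String> logFilesInCommit) {
    return Collections.emptyList();
  }
}
```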

Let's skim through every operation:

Impact

Robust RLI updates even with spark task retries.

Risk level (write none, low medium or high below)

medium

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

Contributor's checklist

nsivabalan commented 1 day ago

@danny0405 : addressed all feedback

hudi-bot commented 1 day ago

CI report:

Bot commands

@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build