apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

[partial-upsert] Partial Upserts and Sorted Column #12397

Open ankitsultana opened 8 months ago

ankitsultana commented 8 months ago

I haven't reproduced this issue yet; the following is based on my understanding of the code, so please evaluate and contest any claims made here.

Issue Description

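When a sorted column is set for a Realtime table, the MutableSegment itself remains unsorted. During segment commit, we read each record from the MutableSegment in sortedDocId order, so a record's docId in the committed ImmutableSegment can differ from its docId in the MutableSegment.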

A segment commit is followed by an addOrReplaceSegment call in the ConcurrentMapPartitionUpsertMetadataManager (call it UMM, for Upsert Metadata Manager), where the oldSegment will be set to the ImmutableSegment. This is needed because the UMM's primary-key map has to be updated so that it points to the new segment, and the docId is also updated as needed.

With Partial Upserts, say we had 4 events for a given primary key in the MutableSegment, ingested in this order: R0, R1, R2, R3. Let's also assume that the comparison column values of these events satisfy: R0 < R1 < (R2 = R3).

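If, after applying the sorted column, their order changes to R0, R1, R3, R2, then the UMM will start pointing to R2 as the valid doc: R2 and R3 tie on the comparison column, and R2 now has the higher docId. Before the commit the valid doc was R3.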

To summarize: if a Partial Upsert table has a sorted column, then users must make sure that the events for a given primary key are emitted with strictly increasing comparison column values.
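
To make the scenario above concrete, here is a toy Python model of it. This is not Pinot code: the Event type, the sort_key values, and the tie-break rule (highest comparison value wins, ties go to the higher docId) are assumptions chosen to match the description in this issue.

```python
from typing import NamedTuple


class Event(NamedTuple):
    name: str
    comparison: int  # comparison column value
    sort_key: int    # sorted column value (hypothetical)


def latest_doc(events):
    """Name of the doc a toy UMM would treat as valid.

    Assumption: highest comparison value wins; ties go to the higher
    docId, where docId is the position in the list.
    """
    best = 0
    for doc_id, event in enumerate(events):
        if event.comparison >= events[best].comparison:
            best = doc_id
    return events[best].name


# Ingestion order in the MutableSegment: R0, R1, R2, R3 with R2 = R3
# on the comparison column.
mutable = [
    Event("R0", 1, 10),
    Event("R1", 2, 20),
    Event("R2", 3, 40),
    Event("R3", 3, 30),
]

# Commit: records are rewritten in sorted-column order.
committed = sorted(mutable, key=lambda e: e.sort_key)

print([e.name for e in committed])
print(latest_doc(mutable), latest_doc(committed))
```

In this model the valid doc is R3 before the commit and R2 after it, purely because the sorted column reordered two records that tie on the comparison column.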

You may ask "What about Full Upsert tables?"

This is not an issue for Full Upsert tables in my opinion, because Pinot could say that in case of ties of comparison column values, any of the records may be picked as the latest record.

For Partial Upsert tables, from a user perspective, the bug above will be seen as random events being dropped and not applied to the Partial Upsert merger, which would lead to inconsistent data.
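
A small sketch of why this shows up as a dropped event. It assumes that each stored row holds the merged state as of that event and that the merger simply lets non-null incoming fields overwrite previous ones; the column names ts, a, b and sort_key are made up for illustration and are not taken from Pinot.

```python
def merge(previous, incoming):
    """Toy partial-upsert merger: non-None incoming fields overwrite."""
    merged = dict(previous)
    merged.update({k: v for k, v in incoming.items() if v is not None})
    return merged


def ingest(events):
    """Store, per docId, the merged row as of that event."""
    stored, state = [], {}
    for event in events:
        state = merge(state, event)
        stored.append(state)
    return stored


def valid_row(rows):
    """Highest ts wins; ties go to the higher docId (assumed rule)."""
    best = rows[0]
    for row in rows[1:]:
        if row["ts"] >= best["ts"]:
            best = row
    return best


events = [
    {"ts": 1, "a": 1, "b": None, "sort_key": 10},  # R0
    {"ts": 2, "a": None, "b": 2, "sort_key": 20},  # R1
    {"ts": 3, "a": 5, "b": None, "sort_key": 40},  # R2
    {"ts": 3, "a": None, "b": 7, "sort_key": 30},  # R3
]

stored = ingest(events)                                  # MutableSegment order
committed = sorted(stored, key=lambda r: r["sort_key"])  # order after commit

print(valid_row(stored))     # merged state including R3's update to b
print(valid_row(committed))  # R2's row, which never saw R3's update
```

After the commit the valid row is the one written for R2, so the value of b that R3 contributed is no longer visible, even though R3 was ingested and merged correctly.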

Discussion

What should be the follow-up here? Some options I see are:

tibrewalpratik17 commented 8 months ago

One more scenario I want to confirm: what happens when the values of the sorted column are equal? Is it guaranteed that the result will be sorted in ascending order of docIds? Sorry, I could have written a unit test myself to confirm; I just hope asking is quicker.

I see this part of code does the processing:

https://github.com/apache/pinot/blob/43dadbfd96a70c19a9ac83bb6c0c35f3fa58bffb/pinot-segment-local/src/main/java/org/apache/pinot/segment/local/indexsegment/mutable/MutableSegmentImpl.java#L1029-L1042

But I'm not sure whether RoaringBitmap guarantees that the batches, and the items within each batch, are returned in increasing order. I didn't find any relevant docs with a quick search.
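
To pin down what the question depends on, here is a toy Python model of building a sorted docId order by walking the distinct values in sorted order and, for each value, walking the docIds that hold it. This is a sketch of my reading of the linked code, not a copy of it; plain lists stand in for the bitmaps, and sorted_doc_id_order is a hypothetical name.

```python
def sorted_doc_id_order(sorted_column_values):
    """docIds ordered by sorted-column value.

    Ties keep ascending docId order only because each per-value docId
    list is built, and then iterated, in ascending order.
    """
    doc_ids_by_value = {}
    for doc_id, value in enumerate(sorted_column_values):
        doc_ids_by_value.setdefault(value, []).append(doc_id)

    order = []
    for value in sorted(doc_ids_by_value):
        order.extend(doc_ids_by_value[value])
    return order


print(sorted_doc_id_order([30, 10, 30, 20, 10]))
```

In this model, equal values come out in ascending docId order exactly when the per-value iteration is ascending, which is the property being asked about for the bitmap iterator.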