Open tibrewalpratik17 opened 3 months ago
Attention: Patch coverage is 64.70588% with 6 lines in your changes missing coverage. Please review.
Project coverage is 62.17%. Comparing base (59551e4) to head (f0d82d2). Report is 444 commits behind head on master.
One potential solution would be to persist the original doc id into a column, then use that to break ties.
Yes, that's what I suggested in one of the threads above. Having a virtual column like $originalDocID would also make sense, and for upserts in particular we can refer to it to resolve comparison ties. For now, this is the only solution I can think of, other than persisting the original docID somewhere in segment metadata, which doesn't seem optimal. Maintaining a virtual column still makes more sense. cc @klsince what are your thoughts on this?
Can we consider adding the stream offset as a virtual column? Then we could allow users to set that virtual column as the comparison column.
Setting the offset field as a comparison column might be a separate discussion. But we can use an offset virtual column to break ties in general (very similar logic to the originalDocID field). Other than higher memory / disk usage compared to the originalDocID field, I don't see any other challenges. Having the offset as a column also comes with benefits such as easier debugging and observability around data.
To the best of my understanding of this issue, I'd +1 keeping the original docId in a virtual column to help break ties.
As for the idea of using the stream offset, I'd prefer to put it in a real column, assuming we can create an ingestion transformation to extract the message offset and write it into the column. Then we can use this column as one of the comparison columns to prevent comparison ties from happening.
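To make the stream-offset idea concrete, here is a minimal sketch of ordering records by the comparison value first and the offset second. The class, field, and method names are hypothetical and are not Pinot APIs. It also assumes that offsets increase monotonically within a partition and that all records for a given primary key arrive on the same partition.

```java
import java.util.Comparator;

// Hypothetical illustration only; none of these types exist in Pinot.
final class OffsetTieBreakSketch {

  // A row reduced to the two values that matter for ordering.
  static final class Row {
    final long comparisonValue; // value of the configured comparison column
    final long streamOffset;    // assumed: message offset extracted at ingestion

    Row(long comparisonValue, long streamOffset) {
      this.comparisonValue = comparisonValue;
      this.streamOffset = streamOffset;
    }
  }

  // Order by comparison value, then by stream offset to break ties.
  static final Comparator<Row> ORDER =
      Comparator.comparingLong((Row r) -> r.comparisonValue)
          .thenComparingLong(r -> r.streamOffset);

  // Returns whichever row should win the upsert for a given primary key.
  static Row latest(Row a, Row b) {
    return ORDER.compare(a, b) >= 0 ? a : b;
  }
}
```

With the offset as the second key, two rows with the same comparison value can only tie if they also share an offset, which should not happen within a single partition under the assumption above.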
Labels: bugfix, upsert
Potential fix for #12397
As called out in the issue:
Based on the assumption that R3's docId will be greater than R2's docId, I am raising this as an enhancement to the fix in #12395: when two records have the same comparison column value, we also check their docIds in the old mutable segment to decide which one to keep.
We use the max docId from the older segment in making this decision. In the example above, this ensures that R3 is the only record kept after dedup.
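The decision rule described above can be sketched roughly as follows. This is not the code in this PR; `Rec`, `originalDocId`, and `shouldReplace` are hypothetical names used only to illustrate the rule, and the comparison value is assumed to be a `long` for simplicity.

```java
// Hypothetical illustration of the tie-breaking rule; not the PR's actual code.
final class DocIdTieBreakSketch {

  // A record reduced to its comparison value and its docId in the old mutable segment.
  static final class Rec {
    final long comparisonValue;
    final int originalDocId; // assumed: docId before the segment was rebuilt

    Rec(long comparisonValue, int originalDocId) {
      this.comparisonValue = comparisonValue;
      this.originalDocId = originalDocId;
    }
  }

  // Returns true if the candidate should replace the current record for the same primary key.
  static boolean shouldReplace(Rec current, Rec candidate) {
    int cmp = Long.compare(candidate.comparisonValue, current.comparisonValue);
    if (cmp != 0) {
      return cmp > 0; // a strictly larger comparison value always wins
    }
    // Tie on the comparison column: keep the record with the larger original docId.
    return candidate.originalDocId > current.originalDocId;
  }
}
```

Under the stated assumption that R3's docId is greater than R2's, `shouldReplace(R2, R3)` is true and `shouldReplace(R3, R2)` is false, so R3 survives regardless of the order in which the two records are visited.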
Tested in our local cluster by adding a sorted-index column to a table with a constant comparison column. Ensured that we are deduping the correct record and persisting it. Added UTs for updating the resolvingComparisonTies method.

cc @ankitsultana @Jackie-Jiang