Open tibrewalpratik17 opened 5 months ago
Scenario 1 is resolved by #13489.
For scenario 2, we should fix the root cause so that the Segment ZK Metadata CRC can never differ from the Segment CRC in deepstore. I will look into the cases where we encounter this. If the fixes turn out to be intrusive, I am planning to introduce a skipCrcMismatch config, specifically for upsert-compaction, as a temporary workaround.
During the execution of the Upsert Compaction task, we perform a three-way equality check of CRCs from different sources of truth: the Segment ZK Metadata CRC, the Segment CRC in deepstore, and the ValidDocID Bitmap CRC (Ref).
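To make the check concrete, here is a minimal sketch of a three-way CRC comparison. It is not the actual Pinot code: `CrcSnapshot`, `check_crcs`, and the returned labels are hypothetical names used only for illustration, and CRCs are modelled as plain strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrcSnapshot:
    """CRCs for one segment as seen by each source (hypothetical structure)."""
    zk_metadata_crc: str
    deepstore_crc: str
    valid_doc_ids_crc: str


def check_crcs(snapshot: CrcSnapshot) -> str:
    """Return 'ok' if all three CRCs agree, else a label naming the mismatch."""
    # ZK metadata and deepstore disagree: the committed copies are inconsistent.
    if snapshot.zk_metadata_crc != snapshot.deepstore_crc:
        return "zk_vs_deepstore_mismatch"
    # ZK and deepstore agree, but the bitmap came from a different segment copy.
    if snapshot.valid_doc_ids_crc != snapshot.zk_metadata_crc:
        return "bitmap_mismatch"
    return "ok"


print(check_crcs(CrcSnapshot("111", "111", "111")))
print(check_crcs(CrcSnapshot("111", "111", "222")))
print(check_crcs(CrcSnapshot("111", "333", "111")))
```

Checking the ZK/deepstore pair first is a design choice in this sketch: when those two disagree there is no single reference CRC to compare the bitmap against.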
There are several scenarios in which this check can fail, and all of them involve replicas having different CRCs for the same segment. If the replicas had identical CRCs, this issue would not arise at all.
Scenario 1: Segment ZK Metadata CRC = Segment CRC deepstore != ValidDocID Bitmap CRC
The leader server updates the ZK metadata and uploads to deepstore during segment commit, so the ZK metadata CRC and the deepstore CRC match. During minion task execution, however, we fetch the validDocID bitmap from one of the replica servers. If that server was not the leader during segment commit and its local segment has a different CRC, the bitmap CRC will not match the other two.
Scenario 2: Segment ZK Metadata CRC != Segment CRC deepstore
I'm not entirely sure about all the cases where this scenario occurs, but thinking out loud, there are at least two possibilities. First, it might happen during a deepstore-upload-retry task, where we randomly choose a replica server to upload to deepstore. If the chosen replica's CRC differs from the one in the segment ZK metadata, we end up with a mismatch. Second, it can happen when both replicas update ZK and upload to deepstore during segment-commit. Say one replica updates ZK but is slow in uploading to deepstore, while the other replica updates ZK afterwards and finishes its deepstore upload first. The slower upload then lands last, so ZK holds the CRC of the second replica while deepstore holds the segment of the first replica.
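The segment-commit race can be sketched as a replay of writes. This is only an illustration under the assumption that both ZK metadata and deepstore behave as last-writer-wins stores; `replay`, the event tuples, and the CRC labels are hypothetical and do not correspond to Pinot APIs.

```python
from typing import Iterable, Optional, Tuple


def replay(events: Iterable[Tuple[str, str]]) -> Tuple[Optional[str], Optional[str]]:
    """Apply (store, crc) writes in order; each store keeps only the last write.

    Assumption: both ZK metadata and deepstore are last-writer-wins.
    """
    zk_crc: Optional[str] = None
    deepstore_crc: Optional[str] = None
    for store, crc in events:
        if store == "zk":
            zk_crc = crc
        elif store == "deepstore":
            deepstore_crc = crc
        else:
            raise ValueError(f"unknown store: {store}")
    return zk_crc, deepstore_crc


# Replica 1 updates ZK first but is slow to upload; replica 2 does both
# steps in between, so replica 1's upload is the last write to deepstore.
events = [
    ("zk", "crc-replica-1"),
    ("zk", "crc-replica-2"),
    ("deepstore", "crc-replica-2"),
    ("deepstore", "crc-replica-1"),
]
zk_crc, deepstore_crc = replay(events)
print(zk_crc, deepstore_crc, zk_crc == deepstore_crc)
```

With this interleaving, ZK ends up with replica 2's CRC and deepstore with replica 1's segment, which is exactly the Scenario 2 inequality.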