Anthony-Nolan / Atlas

A free & open-source Donor Search Algorithm Service
GNU General Public License v3.0

`DuplicateKeyException` thrown by StoreOriginalSearchResults #962

Closed: seanmobrien closed this issue 5 months ago

seanmobrien commented 1 year ago

**Describe the bug**
Specific searches are failing to complete. When these searches are run, ATLAS appears to stop processing in the RunMatchPredictionBatch operation, with SQL errors being logged in StoreOriginalSearchResults.

**To Reproduce**
Steps to reproduce the behavior:

  1. Submit search request: [issue-961-input.txt](https://anthonynolan.sharepoint.com/:t:/r/sites/ATLASTeam/Shared%20Documents/General/issue-961-input.txt?csf=1&web=1&e=1rhCvA)

**Expected behaviour**
Search request completes or an error is logged.

zabeen commented 1 year ago

Thanks @seanmobrien. Could you please share the IDs of the searches that silently failed?

Re: "appears to stop processing" - that might be due to some logs not being tagged with the search id, as described in #963. I.e., the search may have continued further but we can't determine that by querying AI using the search id alone.

StoreOriginalSearchResults is a function on the repeat search functions app, which is a consumer of the matching results, so the SQL errors there may be a separate issue from the silent search failure you are reporting, and may require a new ticket. Will confirm once I receive some ids.

zabeen commented 1 year ago

@seanmobrien, I found two searches that completed successfully yesterday, one has the exact same search HLA as listed in the file (4ce72de0-6392-41ea-8f6e-390c57f1fe19) and the second has very similar HLA which only differs at position A_2 (96759bd1-c7b7-419c-9093-d026d61df5a4). Both searches were mismatch donor searches (6/8 and 8/10, respectively), had half a million results, and took over a day to complete - most of that time was spent on match prediction.

So I am not sure that searches are actually failing silently; they may instead just be taking a long time due to their complexity. As I mentioned above, the logging bug may make the searches look "stuck", but it could just be that AI logs were not being tagged with a request id, giving a false picture. That should hopefully be resolved soon.

I will now focus the investigation on the SQL errors around StoreOriginalSearchResults

seanmobrien commented 1 year ago

Thank you Zabeen! I agree #963 could explain why the search logs drop off - I will get the updates merged and deployed today.

Re: the SQL errors, there is a dead-lettered message (id 30cd9b94c8bd4887828b1bc16e022854) in matching-results-ready/repeat-search that corresponds to the search ID in the exception logs. One thing I noticed is that ATLAS was running a number of large/complicated searches at the time.

zabeen commented 1 year ago

Action: @seanmobrien to paste in screenshots from his investigation of the SQL errors.

zabeen commented 1 year ago

@seanmobrien I can see the same exceptions being thrown in the new UAT env (example), and now that repeat search priority has been scaled up (it is now a potential blocker to full go-live within WMDA), we need to resolve this ASAP.

Do you still have your investigation notes?

IgorKupreychik commented 1 year ago

We've analyzed the issue in the UAT env, and it looks like the real issue is that the function StoreOriginalSearchResults has been called twice for the same searchId. The example supports that theory: here we can see 10 attempts at StoreOriginalSearchResults execution, all of which failed with the error "StoreOriginalSearchResults Cannot insert duplicate key row in object 'RepeatSearch.CanonicalResultSets' with unique index 'IX_CanonicalResultSets_OriginalSearchRequestId'. The duplicate key value is (cca3c7d8-2dd7-4d4e-9d3f-a51f0d3b39b5).". However, since the database already contains data for this search id (and the number of donors found is correct), it seems there was an earlier run of StoreOriginalSearchResults for the same searchId, before the failed ones, which succeeded. So the data needed for repeat search was captured properly, and the error messages we're observing shouldn't affect repeat search functionality.
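For context on why every retry fails: the unique index named in the exception allows at most one canonical result set row per original search request, so any second insert for the same searchId will throw. A minimal EF Core-style sketch of such a mapping, purely illustrative (the entity, property names and context here are assumptions, not the actual Atlas schema):

```csharp
using Microsoft.EntityFrameworkCore;

// Illustrative only - not the actual Atlas schema.
public class CanonicalResultSet
{
    public int Id { get; set; }
    public string OriginalSearchRequestId { get; set; }
}

public class RepeatSearchContext : DbContext
{
    public DbSet<CanonicalResultSet> CanonicalResultSets { get; set; }

    protected override void OnModelCreating(ModelBuilder modelBuilder)
    {
        // One row per original search request: a second insert with the same
        // OriginalSearchRequestId surfaces as the duplicate key error quoted above.
        modelBuilder.Entity<CanonicalResultSet>()
            .ToTable("CanonicalResultSets", schema: "RepeatSearch")
            .HasIndex(r => r.OriginalSearchRequestId)
            .IsUnique()
            .HasDatabaseName("IX_CanonicalResultSets_OriginalSearchRequestId");
    }
}
```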

zabeen commented 10 months ago

Update: the insert duplicate key exception is still being observed in UAT, which suggests #1041 is also still an issue.

zabeen commented 8 months ago

Update: same duplicate key exception has been observed in LIVE-WMDA.

for the given search request ID:

This is definitive proof that matching requests are being processed more than once, though it's not clear why - that is covered by #1041.

To resolve the duplicate key exception, we could make a small change to repeat search to either overwrite existing results, or ignore any subsequent matching results messages for the same ID.
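A rough sketch of the "overwrite existing results" option, under the assumption that repeat search writes results via some repository over `RepeatSearch.CanonicalResultSets` (the interface and method names here are hypothetical, not the actual Atlas code):

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical repository over RepeatSearch.CanonicalResultSets - illustrative names only.
public interface ICanonicalResultSetRepository
{
    Task DeleteBySearchRequestId(string searchRequestId);
    Task Insert(string searchRequestId, IReadOnlyCollection<string> matchedDonorCodes);
}

public class OriginalSearchResultsStorer
{
    private readonly ICanonicalResultSetRepository repository;

    public OriginalSearchResultsStorer(ICanonicalResultSetRepository repository)
    {
        this.repository = repository;
    }

    // "Overwrite" option: a redelivered matching-results message replaces any
    // previously stored rows instead of violating the unique index.
    public async Task StoreOriginalSearchResults(string searchRequestId, IReadOnlyCollection<string> matchedDonorCodes)
    {
        await repository.DeleteBySearchRequestId(searchRequestId);
        await repository.Insert(searchRequestId, matchedDonorCodes);
    }
}
```

The "ignore subsequent messages" alternative is sketched further down the thread.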

SergeyEzh commented 7 months ago

@zabeen, I haven't found anything specific that could cause that behavior. However, it seems to be expected service bus behavior. According to the (documentation), a queue in 'Peek lock' receive mode implements so-called 'at-least-once' delivery. Since the ServiceBusTrigger receives messages in 'Peek lock' mode (link), it is possible that one message will be processed by the function twice. Moreover, the service bus client has an auto-retry policy and can send a message twice if it doesn't get an acknowledgement. So in certain cases the client can put two identical messages in the topic.

I think the case you described in the previous comment is caused by auto-retry on the service bus client. We don't see any additional delivery attempts in the logs, only two identical messages with search results in the queue. I'd propose setting the MessageId property to prove this theory in future (PR that). If it is caused by the client, we'll see two messages with the same MessageId.
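For reference, a sketch of what setting the MessageId could look like on the publishing side, assuming the Azure.Messaging.ServiceBus client is used to publish matching results (the topic name is taken from this thread; the publisher class and payload are illustrative, not the actual Atlas code). With a deterministic MessageId, two copies produced by client auto-retry are easy to correlate in logs, and would be dropped automatically if duplicate detection were enabled on the topic:

```csharp
using System.Text.Json;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

// Illustrative publisher - not the actual Atlas code.
public class MatchingResultsPublisher
{
    private readonly ServiceBusSender sender;

    public MatchingResultsPublisher(ServiceBusClient client)
    {
        // "matching-results-ready" is the topic mentioned earlier in this issue.
        sender = client.CreateSender("matching-results-ready");
    }

    public async Task PublishAsync(object resultsNotification, string searchRequestId)
    {
        var message = new ServiceBusMessage(JsonSerializer.Serialize(resultsNotification))
        {
            // Deterministic id: if the client auto-retries and sends the notification
            // twice, both copies carry the same MessageId.
            MessageId = searchRequestId
        };

        await sender.SendMessageAsync(message);
    }
}
```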

Also, I found a different case. From this log, we can see that RunSearch is triggered twice by the message with sequence number 14644, even though the first run was successful and stored the search results (1st run at 17:12:16 UTC, 2nd run at 17:17:19 UTC).

So, with all said above, I'd propose:

  1. Confirm (with the proposed PR) that the case where there are several messages in matching-results-ready for the same searchId is caused by service bus retries.
  2. Update StoreOriginalSearchResults to check whether results are already stored for the given search id (a rough sketch of this check follows below).
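
A minimal sketch of proposal 2, reusing the hypothetical repository idea from the earlier sketch and extending it with an existence check (again, these names are illustrative, not the actual Atlas implementation):

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

// Hypothetical repository over RepeatSearch.CanonicalResultSets - illustrative names only.
public interface ICanonicalResultSetRepository
{
    Task<bool> ResultSetExists(string searchRequestId);
    Task Insert(string searchRequestId, IReadOnlyCollection<string> matchedDonorCodes);
}

public class OriginalSearchResultsHandler
{
    private readonly ICanonicalResultSetRepository repository;
    private readonly ILogger<OriginalSearchResultsHandler> logger;

    public OriginalSearchResultsHandler(
        ICanonicalResultSetRepository repository,
        ILogger<OriginalSearchResultsHandler> logger)
    {
        this.repository = repository;
        this.logger = logger;
    }

    public async Task StoreOriginalSearchResults(string searchRequestId, IReadOnlyCollection<string> matchedDonorCodes)
    {
        // At-least-once delivery means the same matching-results message can arrive
        // more than once; treat any repeat as a no-op instead of throwing.
        if (await repository.ResultSetExists(searchRequestId))
        {
            logger.LogInformation(
                "Results already stored for search {SearchRequestId}; ignoring duplicate message.",
                searchRequestId);
            return;
        }

        await repository.Insert(searchRequestId, matchedDonorCodes);
    }
}
```

A strictly concurrent redelivery could still race past the existence check, so catching the duplicate key violation and treating it as benign might be a useful fallback.
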
zabeen commented 5 months ago

Testing

Tests passed