Anthony-Nolan / Atlas

A free & open-source Donor Search Algorithm Service
GNU General Public License v3.0
9 stars 5 forks source link

PersistSearchResults failed: "Insufficient memory to continue the execution of the program." #931

Closed zabeen closed 1 year ago

zabeen commented 1 year ago

Details to follow

zabeen commented 1 year ago

Ticket raised from testing #897 on WMDA DEV

@seanmobrien to perform initial investigation and paste findings into ticket description @IgorKupreychik to work on fix

zabeen commented 1 year ago

@seanmobrien Forgot to say on the earlier discussion around this bug, that one easy thing to check is: does reducing batch size resolve the problem, maybe writing even smaller files (5000? 1000?) - just need to update app setting on top-level functions app and replay the matching result message. If memory error still there with a small batch size, suggests something else is going wrong.

daria-sorokina-da commented 1 year ago

From my investigation of the error that happened 31 March ~ 7 am UTC: PersistSearchResults failed, and looks like the function stopped here - as "Download match prediction algorithm results" was never logged: https://github.com/Anthony-Nolan/Atlas/blob/master/Atlas.Functions/Services/ResultsCombiner.cs#L90

 Looks like this method - DownloadMatchPredictionResults - loads all match prediction results in memory to add them to the resulting JSON. There were 443484 donor results to download during the failed run.

zabeen commented 1 year ago

Thanks @daria-sorokina-da , I think the simple solution is to restrict the scope of the combiner service to one matching results blob at a time:

for each result file in batch results folder:
- download blob from matching-results
- Download match prediction result blobs for donors in current result file
- Combine results
- Write out the combined results to atlas-search-results blob storage.

That would also mean we don’t need to explicitly chunk the final combined results ourselves, as the incoming result set is already chunked. So we can remove the SearchResultsBatchSize app setting from the top level function.

daria-sorokina-da commented 1 year ago

For QA:

Config changes for batching:

daria-sorokina-da commented 1 year ago

Testing results: