PersistSearchResults failed: "Insufficient memory to continue the execution of the program."

Anthony-Nolan / Atlas

A free & open-source Donor Search Algorithm Service

GNU General Public License v3.0

9 stars 5 forks source link

PersistSearchResults failed: "Insufficient memory to continue the execution of the program." #931

Closed zabeen closed 1 year ago

zabeen commented 1 year ago

Details to follow

zabeen commented 1 year ago

Ticket raised from testing #897 on WMDA DEV

@seanmobrien to perform initial investigation and paste findings into ticket description @IgorKupreychik to work on fix

zabeen commented 1 year ago

@seanmobrien Forgot to say on the earlier discussion around this bug, that one easy thing to check is: does reducing batch size resolve the problem, maybe writing even smaller files (5000? 1000?) - just need to update app setting on top-level functions app and replay the matching result message. If memory error still there with a small batch size, suggests something else is going wrong.

daria-sorokina-da commented 1 year ago

From my investigation of the error that happened 31 March ~ 7 am UTC: PersistSearchResults failed, and looks like the function stopped here - as "Download match prediction algorithm results" was never logged: https://github.com/Anthony-Nolan/Atlas/blob/master/Atlas.Functions/Services/ResultsCombiner.cs#L90

Looks like this method - DownloadMatchPredictionResults - loads all match prediction results in memory to add them to the resulting JSON. There were 443484 donor results to download during the failed run.

zabeen commented 1 year ago

Thanks @daria-sorokina-da , I think the simple solution is to restrict the scope of the combiner service to one matching results blob at a time:

for each result file in batch results folder:
- download blob from matching-results
- Download match prediction result blobs for donors in current result file
- Combine results
- Write out the combined results to atlas-search-results blob storage.

That would also mean we don’t need to explicitly chunk the final combined results ourselves, as the incoming result set is already chunked. So we can remove the SearchResultsBatchSize app setting from the top level function.

daria-sorokina-da commented 1 year ago

For QA:

please run a search with match prediction with Batching OFF -- make sure that matching-algorithm-results are written as one blob -- make sure that match-prediction-results are written per donor -- make sure that atlas-search-results are batched are written as one blob
please run a search with match prediction with Batching ON -- make sure that matching-algirithm-results are written per batching -- make sure that match-prediction-results are written per donor -- make sure that atlas-search-results are batched are written per the same batching as matching-algirithm-results -- make sure that atlas-search-results have correct matching and match prediction data

Config changes for batching:

\<env>-ATLAS-MATCHING-ALGORITHM-FUNCTIONS AzureStorage:SearchResultsBatchSize 0 (default in AN) or batch size
\<env>-ATLAS-FUNCTIONS AtlasFunction:AzureStorage:ShouldBatchResults false (default in AN) / true

daria-sorokina-da commented 1 year ago

Testing results:

Search with Batching ON - succeeded
Search with Batching OFF - failed, because the reulting file in atlas-search-results does not conatain the Results