Closed zabeen closed 1 year ago
Ticket raised from testing #897 on WMDA DEV
@seanmobrien to perform initial investigation and paste findings into ticket description @IgorKupreychik to work on fix
@seanmobrien Forgot to say on the earlier discussion around this bug, that one easy thing to check is: does reducing batch size resolve the problem, maybe writing even smaller files (5000? 1000?) - just need to update app setting on top-level functions app and replay the matching result message. If memory error still there with a small batch size, suggests something else is going wrong.
From my investigation of the error that happened 31 March ~ 7 am UTC: PersistSearchResults failed, and looks like the function stopped here - as "Download match prediction algorithm results" was never logged: https://github.com/Anthony-Nolan/Atlas/blob/master/Atlas.Functions/Services/ResultsCombiner.cs#L90
Looks like this method - DownloadMatchPredictionResults - loads all match prediction results in memory to add them to the resulting JSON. There were 443484 donor results to download during the failed run.
Thanks @daria-sorokina-da , I think the simple solution is to restrict the scope of the combiner service to one matching results blob at a time:
for each result file in batch results folder:
- download blob from matching-results
- Download match prediction result blobs for donors in current result file
- Combine results
- Write out the combined results to atlas-search-results blob storage.
That would also mean we don’t need to explicitly chunk the final combined results ourselves, as the incoming result set is already chunked. So we can remove the SearchResultsBatchSize
app setting from the top level function.
For QA:
please run a search with match prediction with Batching OFF -- make sure that matching-algorithm-results are written as one blob -- make sure that match-prediction-results are written per donor -- make sure that atlas-search-results are batched are written as one blob
please run a search with match prediction with Batching ON -- make sure that matching-algirithm-results are written per batching -- make sure that match-prediction-results are written per donor -- make sure that atlas-search-results are batched are written per the same batching as matching-algirithm-results -- make sure that atlas-search-results have correct matching and match prediction data
Config changes for batching:
Testing results:
Details to follow