Closed: seanmobrien closed this issue 1 year ago
We have decided to take two approaches to minimise memory usage when writing results:
1) Batch results into multiple files.
2) Write to individual result files using the AppendBlob method.
Result files will now be split into two types:
This change will impact both normal search and repeat search, and will introduce a breaking change in the search API.
TBC: Details about which components need to be amended
@WMDAJesse @mmelchers do you have statistics from HAP-E about largest number of search results observed from a single search?
We have decided to go with the first approach (batch results into multiple files) first. Once it is implemented, we will check whether it avoids the OutOfMemory exception; if not, we will implement the second approach (write to individual result files using the AppendBlob method).
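For reference, here is a minimal sketch of what the second approach could look like with the Azure.Storage.Blobs SDK; the class and method names, container layout, and newline-delimited JSON formatting are assumptions for illustration, not the actual Atlas implementation. The point is that each result is serialized and appended individually, so the full result set never has to be held in memory at once.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;
using Azure.Storage.Blobs.Specialized;

// Sketch of approach 2: append results block by block to an append blob,
// keeping memory usage bounded regardless of result set size.
public static class AppendBlobResultsWriter
{
    public static async Task WriteResultsAsync<T>(
        string connectionString, string containerName, string blobName,
        IAsyncEnumerable<T> results)
    {
        var blob = new AppendBlobClient(connectionString, containerName, blobName);
        await blob.CreateIfNotExistsAsync();

        await foreach (var result in results)
        {
            // One result per append block, written as newline-delimited JSON.
            // Note: append blobs allow at most 50,000 blocks of up to 4 MiB
            // each, so small results would need buffering into larger blocks.
            var line = JsonSerializer.Serialize(result) + "\n";
            using var stream = new MemoryStream(Encoding.UTF8.GetBytes(line));
            await blob.AppendBlockAsync(stream);
        }
    }
}
```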
We propose the following file structure
With this structure we won't have to send result file names within the service bus message; all the results can be read with just the search_id (we will need to run 'ListBlobs' on the folder named after the search_id).
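To illustrate the consumption side, a minimal sketch assuming results are stored under a "search_id/" prefix in the search-results container (the connection string source and search_id input are placeholders):

```csharp
using System;
using Azure.Storage.Blobs;

// Sketch: with the proposed layout, a consumer needs only the search_id to
// find every result file, by listing blobs under the search_id prefix.
var connectionString = Environment.GetEnvironmentVariable("AzureWebJobsStorage");
var searchId = args[0]; // the search_id taken from the results-ready notification

var container = new BlobContainerClient(connectionString, "search-results");
await foreach (var blob in container.GetBlobsAsync(prefix: $"{searchId}/"))
{
    Console.WriteLine(blob.Name); // e.g. "<search_id>/0.json", "<search_id>/1.json"
}
```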
A new field, 'ResultBatched', indicating whether results are batched (i.e. split into multiple files), will be added to the search results notification (for the topics search-results-ready, matching-results-ready and repeat-search-results-ready).
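For illustration, a sketch of the notification payload with the new flag; only ResultBatched comes from this ticket, the other properties are assumptions rather than the actual Atlas contract:

```csharp
// Hypothetical notification shape; only ResultBatched is specified by this ticket.
public class SearchResultsNotification
{
    public string SearchRequestId { get; set; }
    public bool WasSuccessful { get; set; }

    // New flag: true when results are split across multiple batch files,
    // false when a single results file was written.
    public bool ResultBatched { get; set; }
}
```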
I think the largest one we have seen so far was around 700,000 records, though that happens rarely. But we have definitely seen result sets of 500,000+ records.
@WMDAJesse thanks! Would it be possible to share the search request details (specifically, patient HLA and mismatch count) for the top 10 largest searches?
This is the top 10 based on 0 mismatches; all have at least 50 thousand results in the case of 0 mismatches (when run in Hap-E prod). The number of results will go up drastically when you increase the number of allowed mismatches.
Thanks @mmelchers! Do you know what number of allowed mismatches we are targeting for ATLAS? Also, is it possible to get some performance metrics, e.g. how quickly we need these large searches to complete in order to deliver a minimally viable product?
@seanmobrien
Donors (adult): 0, 1, and 2 mismatches. There is a demand for 3 and 4 mismatch searches for donors, but this is not part of the MVP.
Expected time to run: Hap-E has the following for searches with > 100 thousand potential 0 mismatch donors:
0 mismatch: median = 1016 seconds
1 mismatch: median = 1343 seconds
2 mismatch: median = 3788 seconds
So it would be reasonable for ATLAS to have the following:
0 mismatch: median < 2000 seconds
1 mismatch: median < 3000 seconds
2 mismatch: median < 7200 seconds
Cords: in case of n/8 or n/10: 0, 1, 2, 3, and 4 mismatches; in case of n/6: 0, 1, and 2 mismatches.
All cord searches in Hap-E (even with 4 mismatches and more than 50 thousand records) finish within 2500 seconds. The median is < 700 seconds.
So for ATLAS: any cord search with more than 50 thousand records: median < 1400 seconds
@seanmobrien @mmelchers I'll copy the performance requirements to a new ticket
Batching is configured via RESULTS_BATCH_SIZE, which gets applied to the app setting SearchResultsBatchSize on both the matching and repeat search function apps: a value of > 0 will cause results to be written as batched files, else only a single file is written (see the sketch after the checklist below).

What is left on this ticket before it can be closed and moved to the final review column:
@DmitriyShcherbina to add testing notes from AN dev
@seanmobrien to write up testing notes from WMDA dev
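As referenced above, a minimal sketch of how the SearchResultsBatchSize setting could drive the writer; MatchingResult is elided to object here, and UploadFileAsync is a hypothetical placeholder, not the actual Atlas code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Sketch: a batch size of > 0 writes numbered batch files; 0 (or unset)
// preserves the legacy single-file behaviour.
public static class ResultsUploader
{
    public static async Task UploadAsync(string searchId, IReadOnlyList<object> results)
    {
        var batchSize = int.TryParse(
            Environment.GetEnvironmentVariable("SearchResultsBatchSize"), out var parsed)
            ? parsed
            : 0;

        if (batchSize > 0)
        {
            var batchNumber = 0;
            foreach (var batch in results.Chunk(batchSize))
            {
                await UploadFileAsync($"{searchId}/{batchNumber++}.json", batch);
            }
        }
        else
        {
            await UploadFileAsync($"{searchId}.json", results);
        }
    }

    // Placeholder for the actual blob upload.
    private static Task UploadFileAsync(string blobName, IEnumerable<object> batch)
        => Task.CompletedTask;
}
```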
Describe the bug
When a search request that generates a large result set is run, an OutOfMemory exception is thrown while the search result is being serialized for storage in the search-results container. The behaviour is more visible when multiple search requests are being processed at the same time, but with a large enough result set the error occurs on a single search. The OutOfMemory error can be reproduced with a single search request even when the Elastic plan is scaled to the maximum allowed (E3).
To Reproduce
Steps to reproduce the behavior:
Expected behaviour
1) Search results should be successfully uploaded to the storage container even when the result set is large.
2) If a critical out-of-memory error occurs during serialization, an appropriate search-failed message (with retry information) should be added to the search-results-ready topic (see the sketch below).
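A minimal sketch of what publishing such a failure message could look like with the Azure.Messaging.ServiceBus SDK; the payload shape is a hypothetical illustration, not the actual Atlas message contract:

```csharp
using System.Text.Json;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

// Sketch: publish a search-failed notification (with retry info) to the
// search-results-ready topic when serialization fails irrecoverably.
public static class FailureNotifier
{
    public static async Task PublishSearchFailedAsync(
        string connectionString, string searchRequestId, bool willRetry)
    {
        await using var client = new ServiceBusClient(connectionString);
        var sender = client.CreateSender("search-results-ready");

        var payload = new
        {
            SearchRequestId = searchRequestId,
            WasSuccessful = false,
            FailureInfo = new
            {
                Reason = "OutOfMemoryException during result serialization",
                WillRetry = willRetry
            }
        };

        await sender.SendMessageAsync(
            new ServiceBusMessage(JsonSerializer.Serialize(payload)));
    }
}
```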
Screenshots
Inputs/Outputs
{"searchDonorType":"Adult","matchCriteria":{"donorMismatchCount":2,"locusMismatchCriteria":{"a":2,"b":2,"c":2,"dpb1":null,"dqb1":2,"drb1":2},"includeBetterMatches":true},"scoringCriteria":{"lociToScore":["Dpb1"],"lociToExcludeFromAggregateScore":[]},"searchHlaData":{"a":{"position1":"01:01","position2":"01:01"},"b":{"position1":"08:01","position2":"07:02"},"c":{"position1":"07:01","position2":"07:02"},"dpb1":{"position1":"02:01","position2":"87:01"},"dqb1":{"position1":"02:01","position2":"06:02"},"drb1":{"position1":"03:01","position2":"15:01"}},"patientEthnicityCode":null,"patientRegistryCode":null,"runMatchPrediction":true,"donorRegistryCodes":null}
Atlas Build & Runtime Info (please complete the following information):
Additional context
This issue is reliably reproducible in WMDA's dev environment using the input above.