Anthony-Nolan / Atlas

A free & open-source Donor Search Algorithm Service
GNU General Public License v3.0

Chunk search results to prevent OutOfMemory exception when large result sets are serialized #897

Closed (seanmobrien closed this 1 year ago)

seanmobrien commented 1 year ago

Describe the bug
When a search request that generates a large result set is run, an OutOfMemory exception is thrown while the search result is being serialized for storage in the search-results container. The behaviour is more visible when multiple search requests are processed at the same time, but with a large enough result set the error occurs on a single search. The OutOfMemory error can be reproduced with a single search request even when the Elastic plan is scaled to the maximum allowed (E3).
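As an illustration of the failure mode, here is a minimal sketch (not the actual Atlas code; the type names and the streaming alternative are assumptions): serializing the whole result set into one in-memory string allocates the entire payload on the heap before any upload starts, whereas streaming the serializer's output straight into the blob keeps peak memory roughly independent of result-set size.

```csharp
// Hedged sketch: contrast between the allocation pattern that can exhaust
// memory and a streaming upload. Not taken from the Atlas codebase.
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using Azure.Storage.Blobs.Specialized;
using Newtonsoft.Json;

public static class ResultUploadSketch
{
    // Problematic pattern: for 500,000+ results this string (plus the
    // serializer's intermediate buffers) can reach hundreds of MB.
    public static string SerializeInMemory(IEnumerable<object> results) =>
        JsonConvert.SerializeObject(results);

    // Streaming alternative: results are written to the blob as they are
    // serialized, so no full-payload buffer is ever materialized.
    public static async Task SerializeToBlobAsync(
        BlockBlobClient blob, IEnumerable<object> results)
    {
        await using Stream stream = await blob.OpenWriteAsync(overwrite: true);
        using var writer = new StreamWriter(stream);
        using var json = new JsonTextWriter(writer);
        JsonSerializer.CreateDefault().Serialize(json, results);
    }
}
```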

To Reproduce
Steps to reproduce the behavior:

  1. Send a search request that generates a large resultset (example input for WMDA dev environment provided in Inputs section)
  2. In the matching-request topic, observe that the message is redelivered 10 times and ultimately dead-lettered
  3. Using the Diagnose and Troubleshoot tool associated with the matching algorithm function, observe that OutOfMemory exceptions have occurred

Expected behaviour

  1. Search results should be successfully uploaded to the storage container even when the result set is large.
  2. If a critical out-of-memory error occurs during serialization, an appropriate search-failed message (with retry information) should be added to the search-results-ready topic (see the sketch below).
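As a minimal sketch of that failure notification, assuming Azure.Messaging.ServiceBus and a hypothetical SearchFailedNotification shape (none of these names are confirmed by this thread):

```csharp
// Hedged sketch: publishing a search-failed message, with retry information,
// to the search-results-ready topic when serialization faults.
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;
using Newtonsoft.Json;

// Hypothetical message shape; the field names are illustrative only.
public record SearchFailedNotification(
    string SearchRequestId, string Reason, bool WillRetry, int AttemptNumber);

public static class FailureNotifier
{
    public static async Task PublishAsync(
        ServiceBusClient client, SearchFailedNotification notification)
    {
        await using ServiceBusSender sender =
            client.CreateSender("search-results-ready");
        var message = new ServiceBusMessage(
            JsonConvert.SerializeObject(notification));
        await sender.SendMessageAsync(message);
    }
}
```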

Screenshots
(two screenshot attachments: Snag_1ab3ff61, Snag_1ab42846)

Inputs/Outputs

{"searchDonorType":"Adult","matchCriteria":{"donorMismatchCount":2,"locusMismatchCriteria":{"a":2,"b":2,"c":2,"dpb1":null,"dqb1":2,"drb1":2},"includeBetterMatches":true},"scoringCriteria":{"lociToScore":["Dpb1"],"lociToExcludeFromAggregateScore":[]},"searchHlaData":{"a":{"position1":"01:01","position2":"01:01"},"b":{"position1":"08:01","position2":"07:02"},"c":{"position1":"07:01","position2":"07:02"},"dpb1":{"position1":"02:01","position2":"87:01"},"dqb1":{"position1":"02:01","position2":"06:02"},"drb1":{"position1":"03:01","position2":"15:01"}},"patientEthnicityCode":null,"patientRegistryCode":null,"runMatchPrediction":true,"donorRegistryCodes":null}

Atlas Build & Runtime Info (please complete the following information):

Additional context
This issue is reliably reproducible in WMDA's dev environment using the input above.

zabeen commented 1 year ago

HLD (high-level design)

TBC: Details about which components need to be amended

zabeen commented 1 year ago

@WMDAJesse @mmelchers do you have statistics from HAP-E about largest number of search results observed from a single search?

IgorKupreychik commented 1 year ago

We have decided to go with the first approach (batch results into multiple files). Once it's implemented, we can check whether it avoids the OutOfMemory exception; if not, we will implement the second approach (write to individual result files using the AppendBlob method).
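A minimal sketch of the first approach, assuming Azure.Storage.Blobs, a virtual folder per search, and an illustrative batch size (none of these specifics are fixed by this thread):

```csharp
// Hedged sketch: batching results into multiple files, one blob per batch,
// under a virtual folder named after the search id: "<searchId>/<n>.json".
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Azure.Storage.Blobs;
using Newtonsoft.Json;

public static class BatchedResultWriter
{
    public static async Task WriteBatchesAsync(
        BlobContainerClient container,
        string searchId,
        IEnumerable<object> results,
        int resultsPerBatch = 10_000) // assumed batch size, not from the thread
    {
        var batchNumber = 0;
        foreach (var batch in results.Chunk(resultsPerBatch)) // .NET 6+
        {
            // Only one batch at a time is held in memory and serialized.
            var payload = BinaryData.FromString(JsonConvert.SerializeObject(batch));
            await container
                .GetBlobClient($"{searchId}/{batchNumber++}.json")
                .UploadAsync(payload, overwrite: true);
        }
    }
}
```

One blob per batch means each serialization call only ever touches a bounded number of results, which is the property that should relieve the memory pressure.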

We propose the following file structure: each search's results are stored under a virtual "folder" named after the search_id, with one file per batch of results.

With this structure we won't have to send result file names within the service bus message; all the results can be read with just the search_id (we will need to run 'ListBlobs' on the "folder" to get the names of all the result files).
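A sketch of that consumer-side read, under the same assumed layout ("<searchId>/<n>.json"); the method and type names are illustrative:

```csharp
// Hedged sketch: enumerating every batched result file knowing only the
// search id, via a prefix listing ('ListBlobs' on the virtual folder).
using System;
using System.Collections.Generic;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

public static class BatchedResultReader
{
    public static async IAsyncEnumerable<BinaryData> ReadBatchesAsync(
        BlobContainerClient container, string searchId)
    {
        await foreach (BlobItem item in
            container.GetBlobsAsync(prefix: $"{searchId}/"))
        {
            var download = await container
                .GetBlobClient(item.Name)
                .DownloadContentAsync();
            yield return download.Value.Content;
        }
    }
}
```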

A new field, 'ResultBatched', indicating whether the results are batched (i.e. split into multiple files), will be added to the search results notification (for the topics search-results-ready, matching-results-ready, and repeat-search-results-ready).
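For illustration only (the surrounding property names are assumptions, not the actual Atlas notification contract), the notification might gain the flag like this:

```csharp
// Hedged sketch of the notification shape; only the 'ResultBatched' field
// itself is taken from this thread.
public record ResultsNotification(
    string SearchRequestId,
    bool WasSuccessful,
    bool ResultBatched); // true when results are split across multiple files
```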

WMDAJesse commented 1 year ago

> @WMDAJesse @mmelchers do you have statistics from HAP-E about largest number of search results observed from a single search?

I think the largest one we have seen so far was around 700,000 records, but that happens rarely. We have definitely seen result sets of 500,000+ records.

zabeen commented 1 year ago

@WMDAJesse thanks! Would it be possible to share the search request details (specifically, patient HLA and mismatch count) for the top 10 largest searches?

mmelchers commented 1 year ago

top 10 patients with most results 2023.csv

mmelchers commented 1 year ago

This is the top 10 based on 0 mismatches. All have at least 50,000 results in the 0-mismatch case (when run in Hap-E prod). The number of results goes up drastically as the number of allowed mismatches increases.

seanmobrien commented 1 year ago

Thanks @mmelchers! Do you know what number of allowed mismatches we are targeting for ATLAS? Also, is it possible to get some performance metrics, e.g. how quickly these large searches need to complete in order to deliver a minimally viable product?

mmelchers commented 1 year ago

@seanmobrien

Donors (adult): 0, 1, and 2 mismatches. There is demand for 3- and 4-mismatch donor searches, but these are not part of the MVP.

Expected time to run: Hap-E has the following medians for searches with more than 100,000 potential 0-mismatch donors:

  - 0 mismatches: median = 1016 seconds
  - 1 mismatch: median = 1343 seconds
  - 2 mismatches: median = 3788 seconds

So it would be reasonable for ATLAS to have the following:

  - 0 mismatches: median < 2000 seconds
  - 1 mismatch: median < 3000 seconds
  - 2 mismatches: median < 7200 seconds

Cords: for n/8 or n/10 searches: 0, 1, 2, 3, and 4 mismatches; for n/6 searches: 0, 1, and 2 mismatches. All cord searches in Hap-E (even with 4 mismatches and more than 50,000 records) finish within 2500 seconds, and the median is < 700 seconds.

So for ATLAS: any cord search with more than 50,000 records: median < 1400 seconds.

zabeen commented 1 year ago

@seanmobrien @mmelchers I'll copy the performance requirements to a new ticket

zabeen commented 1 year ago

Testing Notes

Search Requests

Repeat Search

zabeen commented 1 year ago

What is left on this ticket before it can be closed and moved to the final review column:

@DmitriyShcherbina to add testing notes from AN dev

@seanmobrien to write up testing notes from WMDA dev