CPU spikes during Data Refresh

daria-sorokina-da commented 1 year ago

Currently, UAT-WMDA-ATLAS-MATCHING-ALGORITHM-FUNCTIONS cannot complete the forced Data Refresh, because the function is restarted partway through when CPU spikes to 100%.

You can see the pattern here: CPU usage rose while the Data Refresh process was running.

As a temporary workaround, UAT-WMDA-ATLAS-ELASTIC-PLAN was scaled up from EP2 to EP3.

zabeen commented 1 year ago

I find this interesting because auto-heal is switched off on the matching algorithm functions app, so how is the app being restarted? Perhaps if it's really bad, Azure will restart it anyway?

zabeen commented 1 year ago

Notes: The db logs for UAT-WMDA-ATLAS show that the last 3 data refresh attempts all failed during the donor import step (I'm surprised, I thought it would be during HLA processing).

We did not observe this issue before when running the same job on WMDA-hosted Atlas, but perhaps we ran that job on EP3; it was a while back now.

jdoherty-nmdp commented 1 year ago

In order to replay a message, RefreshEndUtc (not WasSuccessful) needs to be NULL. Zabeen plans to update a README to indicate this.
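As a rough illustration of the replay precondition above, here is a minimal sketch that resets RefreshEndUtc to NULL for one record. It uses an in-memory SQLite database as a stand-in for the real database, and the table name `DataRefreshHistory`, the helper name, and the sample timestamp are assumptions made for the example; only the column names RefreshEndUtc and WasSuccessful come from this thread.

```python
import sqlite3


def reset_refresh_for_replay(conn, record_id):
    """Set RefreshEndUtc to NULL so the matching message can be replayed.

    WasSuccessful is deliberately left untouched, per the comment above.
    The table name is an assumption, not confirmed by this thread.
    """
    cur = conn.execute(
        "UPDATE DataRefreshHistory SET RefreshEndUtc = NULL WHERE Id = ?",
        (record_id,),
    )
    conn.commit()
    return cur.rowcount  # number of rows changed; 0 means the id was not found


# Stand-in database with one finished, failed refresh record (placeholder values).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DataRefreshHistory ("
    "Id INTEGER PRIMARY KEY, RefreshEndUtc TEXT, WasSuccessful INTEGER)"
)
conn.execute(
    "INSERT INTO DataRefreshHistory VALUES (1, '2000-01-01T00:00:00Z', 0)"
)

updated = reset_refresh_for_replay(conn, 1)
row = conn.execute(
    "SELECT RefreshEndUtc, WasSuccessful FROM DataRefreshHistory WHERE Id = 1"
).fetchone()
```

Checking the returned row count before replaying the message is a cheap way to confirm that the intended record was actually updated.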

Today we ramped up UAT to EP3, then updated the message's DB entry (setting RefreshEndUtc to NULL) and replayed it. Afterwards, notifications for it appeared on the 'WMDA-UAT Support' channel. Unfortunately, the message has retried several times, so something might be going on.

Resource usage has been maxed out at times during the last 48 hours, including just before we triggered today's replay. The usage spikes may be caused by concurrent searches and "challenging tests" being run in the UAT environment. The latter activity isn't expected to be a factor for Prod, but it still might make sense to temporarily scale up to EP3 as part of the quarterly data refresh process, to reduce the chance of updates being affected by resource limits.

zabeen commented 1 year ago

Testing on Live

@mmelchers says the number of searches running against live-wmda-atlas on a weekday is similar to the number being run against live HAP-E during the weekend,
so we will kick off data refresh on live-wmda-atlas on a weekday to see how it performs during off-peak load, which is the expected use case.

First attempt on EP2

Manually increased max delivery count on live-wmda-atlas.data-refresh-requests.matching-algorithm to 100
Kicked off data refresh job (id: 5)
Job failed due to "Data Refresh Failed: $Atlas.MatchingAlgorithm.Exceptions.DonorImportHttpException: Unable to complete donor import: Login failed for user 'matching'."
Re-released Atlas to live-wmda-atlas to ensure the PowerShell script completes successfully and the db passwords are applied
Db password script completed successfully
Retrying refresh

Second attempt on EP2

Id: 6
Password issue has been resolved and donors are being imported into matching-db-b
Job took approx. 24 hours and 15 retries to complete.
The duration is almost exactly the same as the last time the job was run, just after the full load of donors, but that run only involved 6 retries.
133 searches completed successfully during this time.
Unfortunately, there were 258 failed searches, but these were caused by a different issue, #1055.

Actions

The max delivery count on the data refresh job should be increased to 30.
To minimise search issues, data refresh should be run at an off-peak time, e.g., over the weekend.
If the first refresh attempt does not complete on EP2, it should be re-attempted on EP3.
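The three actions above could be captured in a small pre-flight helper along these lines. This is only a sketch: the function names are hypothetical, treating Saturday and Sunday as the off-peak window is an assumption taken from the weekend example, and nothing here talks to Azure.

```python
from datetime import datetime

# Max delivery count proposed in the actions above.
MAX_DELIVERY_COUNT = 30


def is_off_peak(when: datetime) -> bool:
    """Treat the weekend as off-peak (assumption based on the example above)."""
    return when.weekday() >= 5  # Monday is 0, so 5 and 6 are Saturday and Sunday


def plan_for_attempt(attempt_number: int) -> str:
    """Use EP2 for the first attempt and EP3 for any re-attempt."""
    return "EP2" if attempt_number == 1 else "EP3"


# Example: first attempt scheduled for a Saturday.
start = datetime(2024, 1, 6)
print(is_off_peak(start), plan_for_attempt(1), MAX_DELIVERY_COUNT)
```

Keeping the tier choice in one place makes it easy to change if a later run shows EP2 is reliably sufficient.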
