Anthony-Nolan / Atlas

A free & open-source Donor Search Algorithm Service
GNU General Public License v3.0

CPU spikes during Data Refresh #1048

Closed daria-sorokina-da closed 1 year ago

daria-sorokina-da commented 1 year ago

Currently, UAT-WMDA-ATLAS-MATCHING-ALGORITHM-FUNCTIONS cannot complete a forced Data Refresh, because the function app gets restarted partway through when CPU spikes to 100%.

You can see the pattern here: CPU usage rises during the Data Refresh process. [Image]

As a temporary solution, UAT-WMDA-ATLAS-ELASTIC-PLAN was scaled up from EP2 to EP3.

zabeen commented 1 year ago

I find this interesting because auto-heal is switched off on the matching algorithm functions app, so how is the app being restarted? Perhaps if it's really bad, Azure will do it anyway?

zabeen commented 1 year ago

Notes: The db logs for UAT-WMDA-ATLAS show that the last 3 attempts of data refresh all failed during the donor import step (I'm surprised, I thought it would be during HLA processing).

We did not observe this issue before when running the same job on WMDA-hosted Atlas, but perhaps we ran that job on EP3. It was a while back now.

jdoherty-nmdp commented 1 year ago

In order to replay a message, RefreshEndUtc (not WasSuccessful) needs to be NULL. Zabeen plans to update a README to indicate this.

Today we ramped up UAT to EP3, then updated the message's DB entry (setting RefreshEndUtc to NULL) and replayed it. Afterwards, notifications for it appeared on the 'WMDA-UAT Support' channel. Unfortunately, the message has retried several times, so something might be going on.
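For anyone repeating the replay step, here is a minimal sketch of the DB update described above. It runs against an in-memory SQLite table purely for illustration. The table name `DataRefreshHistory`, the `Id` column, and the dummy row are assumptions, not taken from this thread; only `RefreshEndUtc` and `WasSuccessful` are named here, and the real Atlas database may differ.

```python
import sqlite3

# Hypothetical stand-in for the refresh history table. The table name, the Id
# column and the dummy row below are assumptions made for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DataRefreshHistory ("
    "Id INTEGER PRIMARY KEY, RefreshEndUtc TEXT, WasSuccessful INTEGER)"
)
# Dummy record representing a failed refresh attempt that has an end time set.
conn.execute(
    "INSERT INTO DataRefreshHistory VALUES (1, '2000-01-01T00:00:00Z', 0)"
)


def mark_for_replay(connection, record_id):
    """Set RefreshEndUtc to NULL for one record; WasSuccessful is left alone."""
    cursor = connection.execute(
        "UPDATE DataRefreshHistory SET RefreshEndUtc = NULL WHERE Id = ?",
        (record_id,),
    )
    connection.commit()
    return cursor.rowcount  # number of rows changed


updated = mark_for_replay(conn, 1)
row = conn.execute(
    "SELECT RefreshEndUtc, WasSuccessful FROM DataRefreshHistory WHERE Id = 1"
).fetchone()
```

The point of the sketch is that only `RefreshEndUtc` is touched; `WasSuccessful` keeps whatever value it had.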

Resource usage has been maxed out at times during the last 48 hours, including just before we triggered today's replay. The usage spikes may be caused by concurrent searches and "challenging tests" being run in the UAT environment. The latter activity isn't expected to be a factor for Prod, but it still might make sense to temporarily bump up to EP3 as part of the quarterly data refresh process, to reduce the chance of a refresh being affected by resource limits.
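If the temporary bump to EP3 becomes part of the refresh routine, it could be scripted roughly like this. This is only a sketch, not something we have run: it assumes the Azure CLI (`az functionapp plan update`) is installed and logged in, and the resource group name below is a made-up placeholder. The example at the bottom uses a dry-run recorder instead of calling Azure.

```python
import subprocess
from contextlib import contextmanager


def plan_update_command(plan_name, resource_group, sku):
    """Build the Azure CLI command that changes a Functions plan's SKU."""
    return [
        "az", "functionapp", "plan", "update",
        "--name", plan_name,
        "--resource-group", resource_group,
        "--sku", sku,
    ]


@contextmanager
def temporarily_scaled(plan_name, resource_group, up_sku="EP3",
                       down_sku="EP2", run=subprocess.run):
    """Scale the plan up, yield, then scale back down even if the body fails."""
    run(plan_update_command(plan_name, resource_group, up_sku), check=True)
    try:
        yield
    finally:
        run(plan_update_command(plan_name, resource_group, down_sku), check=True)


# Dry run: record the commands instead of executing them.
recorded = []


def dry_run(command, check=True):
    recorded.append(command)


# "example-resource-group" is a placeholder, not the real resource group.
with temporarily_scaled("UAT-WMDA-ATLAS-ELASTIC-PLAN",
                        "example-resource-group", run=dry_run):
    pass  # the data refresh would be triggered and awaited here
```

Scaling back down in a `finally` block means the plan should not be left on EP3 if the refresh fails partway through.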

zabeen commented 1 year ago

Testing on Live

First attempt on EP2

Second attempt on EP2
