daria-sorokina-da closed 1 year ago
I find this interesting because auto-heal is switched off on the matching algorithm functions app, so how is the app being restarted? Perhaps if it's really bad, Azure will do it anyway?
Notes: The db logs for UAT-WMDA-ATLAS show that the last 3 attempts of data refresh all failed during the donor import step (I'm surprised, I thought it would be during HLA processing).
We did not observe this issue before when running the same job on WMDA-hosted Atlas, but perhaps we ran that job on EP3; it was a while back now.
In order to replay a message, RefreshEndUtc (not WasSuccessful) needs to be NULL. Zabeen plans to update a README to indicate this.
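For anyone following along, here is a minimal sketch of the reset step described above. It uses an in-memory SQLite database as a stand-in for the real Atlas database, and the table name `DataRefreshHistory` and the `Id` column are assumptions for illustration only; only `RefreshEndUtc` and `WasSuccessful` come from this thread. Check the actual schema before running anything similar against UAT.

```python
import sqlite3


def reset_refresh_for_replay(conn, record_id):
    """Set RefreshEndUtc to NULL for one record so the message can be replayed.

    WasSuccessful is deliberately left untouched, per the note above.
    Table and key column names are hypothetical.
    """
    cur = conn.execute(
        "UPDATE DataRefreshHistory SET RefreshEndUtc = NULL WHERE Id = ?",
        (record_id,),
    )
    conn.commit()
    return cur.rowcount  # number of rows changed


# Stand-in database with one failed refresh attempt (made-up values).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE DataRefreshHistory ("
    "Id INTEGER PRIMARY KEY, RefreshEndUtc TEXT, WasSuccessful INTEGER)"
)
conn.execute(
    "INSERT INTO DataRefreshHistory VALUES (1, '2024-01-01T03:00:00Z', 0)"
)

updated = reset_refresh_for_replay(conn, 1)
row = conn.execute(
    "SELECT RefreshEndUtc, WasSuccessful FROM DataRefreshHistory WHERE Id = 1"
).fetchone()
print(updated, row)
```

The update is scoped to a single record by key so that earlier refresh attempts keep their history.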
Today we ramped up UAT to EP3, then updated the message's DB entry (setting RefreshEndUtc to NULL) and replayed it. Afterwards, notifications for it appeared on the 'WMDA-UAT Support' channel. Unfortunately, the message has since retried several times, so something may still be going wrong.
Resource usage has been maxed out at times during the last 48 hours, including just before we triggered today's replay. The usage spikes may be caused by concurrent searches and "challenging tests" being run in the UAT environment. The latter activity isn't expected to be a factor for Prod, but it still might make sense to temporarily bump up to EP3 as part of the quarterly data refresh process, to reduce the chances of updates being affected by resource limitations.
live-wmda-atlas.data-refresh-requests.matching-algorithm
to 100
Currently, UAT-WMDA-ATLAS-MATCHING-ALGORITHM-FUNCTIONS cannot complete the forced Data Refresh, because the function app gets restarted partway through when CPU spikes to 100%.
You can see the pattern here - the rise in CPU happened during the Data Refresh process:
As a temporary solution, UAT-WMDA-ATLAS-ELASTIC-PLAN was scaled up from EP2 to EP3.