Anthony-Nolan / Atlas

A free & open-source Donor Search Algorithm Service
GNU General Public License v3.0

Function App restarts at Matching Phase 2 during large search #901

Closed zabeen closed 1 year ago

zabeen commented 1 year ago

Describe the bug: During a search request, Matching Phase 1 returns a large number of donors, as shown in the AI logs. The search does not continue past the log message "Matching timing: Phase 1 complete", which suggests something is going wrong during Phase 2. The message is then replayed until it is dead-lettered, and on every attempt the AI logs terminate at the same point. If the app service plan tier is increased, the search completes.

Diagnostics/troubleshooting doesn't usually mention an app restart or auto-healing, but a restart is the only logical explanation for why the search terminates at the same place and the message is replayed.
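
The replay-until-dead-letter pattern described above can be illustrated with a toy model. This is only a sketch, not Atlas code: `Message`, `run_search`, `deliver_until_settled` and `HostRestarted` are hypothetical names, and the delivery limit of 10 is an assumption based on the Azure Service Bus default (the queue used here may be configured differently). It shows why a host restart during Phase 2 would leave the same final log line on every attempt and no application exception.

```python
from dataclasses import dataclass, field

# Assumption: the queue dead-letters a message once its delivery count
# reaches this limit (10 is the Azure Service Bus default).
MAX_DELIVERY_COUNT = 10


class HostRestarted(Exception):
    """Stands in for the function host being recycled mid-search."""


@dataclass
class Message:
    body: str
    delivery_count: int = 0
    dead_lettered: bool = False
    log: list = field(default_factory=list)


def run_search(message, restart_in_phase_2):
    """Hypothetical handler: logs Phase 1, then may be killed during Phase 2."""
    message.log.append("Matching timing: Phase 1 complete")
    if restart_in_phase_2:
        # A restart kills the process, so the app itself logs no exception.
        raise HostRestarted()
    # Hypothetical log line, used only to mark a successful run in this model.
    message.log.append("search complete")


def deliver_until_settled(message, restart_in_phase_2,
                          max_delivery_count=MAX_DELIVERY_COUNT):
    """Redeliver the message until it completes or is dead-lettered."""
    while message.delivery_count < max_delivery_count:
        message.delivery_count += 1
        try:
            run_search(message, restart_in_phase_2)
            return "completed"
        except HostRestarted:
            continue  # the message is never settled, so the queue redelivers it
    message.dead_lettered = True
    return "dead-lettered"


if __name__ == "__main__":
    failing = Message("large 4/8 CBU search")
    print(deliver_until_settled(failing, restart_in_phase_2=True))
    print(failing.delivery_count, set(failing.log))
```

In this model every attempt ends at "Matching timing: Phase 1 complete", which matches what the AI logs show, and the message is dead-lettered only once the delivery limit is exhausted.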

To Reproduce: I don't have a search request to hand, as we have been doing a lot of tweaking of the app service plan config, and sometimes the same search that fails on a lower plan completes on a higher plan. I may be able to get an example from an AN search: a 4/8 CBU search that failed on a lower tier.

Expected behaviour: The search should run to completion, i.e., either succeed or fail with an explicit error such as OutOfMemoryException.

Inputs/Outputs: Need to obtain; will paste in comments.


zabeen commented 1 year ago

Re: the search HLA needed to reproduce the error: as it belongs to a real patient, I won't paste it here out of concern for privacy. Whoever works on this ticket, please message me directly to obtain it securely.

Extra notes: the search was a 4/8 CBU search that failed on the EP1 plan but succeeded on the EP2 plan.

zabeen commented 1 year ago

HLD

zabeen commented 1 year ago

@luken-an to investigate possibility of toggling auto-heal behaviour

luken-an commented 1 year ago

@zabeen it is possible to disable Proactive Auto-Heal. To do this, go to the relevant Function, then click 'Diagnose and solve problems' in the right hand menu -> 'Diagnostic Tools' -> Auto-Heal. Then click on the Proactive Auto-Heal tab, where there is an option to toggle it on or off.

zabeen commented 1 year ago

If we decide to disable auto-heal permanently then it needs to be encoded within terraform. I couldn't find anything in the terraform docs about how to do this, but this link gives instructions on how to use the portal and terraform plan to discover what the terraform settings should be (i.e., disable auto-heal manually, then run terraform plan to see the manual change that terraform would overwrite).

zabeen commented 1 year ago

Note to dev: when investigating this ticket, run the search with auto-heal manually disabled to see whether the application throws any exceptions, which will give further data about what exactly is causing auto-heal to restart the app.

zabeen commented 1 year ago

Temporarily blocking this ticket until after #897 is merged and performance testing is repeated. Initial investigation suggests that disabling auto-heal is enough to resolve this problem; need to verify the implications of disabling auto-heal.

zabeen commented 1 year ago

@daria-sorokina-da says that auto-heal is disabled on some other AN apps, and that it may not need to be terraformed (i.e., a terraform release may not undo a manual disabling of auto-heal). I have disabled auto-heal on the wmda-dev matching app, ahead of a release, to confirm this.

zabeen commented 1 year ago

Closing this ticket, as disabling auto-heal does not need to be terraformed to be kept in place. I will raise a new tech debt ticket to cover the terraform change, as it would be good to have this applied automatically for a new installation.