Closed: zabeen closed this issue 1 year ago
Re: the search HLA needed to reproduce the error: as it belongs to a real patient, I won't paste it here out of concern for privacy. Whoever works on this ticket, please message me directly to obtain it securely.
Extra notes: the search was a 4/8 CBU search that failed on the EP1 plan but succeeded on the EP2 plan.
`IsAvailableForSearch` and `DonorType`, both of which involve a join onto the `dbo.donors` table. This may impact query performance, even with the proper indexes in place.
@luken-an to investigate possibility of toggling auto-heal behaviour
@zabeen it is possible to disable Proactive Auto-Heal. To do this, go to the relevant Function, then click 'Diagnose and solve problems' in the right-hand menu -> 'Diagnostic Tools' -> Auto-Heal. Then click on the Proactive Auto-Heal tab, where there is an option to toggle it on or off.
If we decide to disable auto-heal permanently then it needs to be encoded within terraform. I couldn't find anything in the terraform docs about how to do this, but this link gives instructions on how to use the portal and `terraform plan` to discover what the terraform settings should be (i.e., disable auto-heal manually, then run `terraform plan` to see the manual change that will be overwritten by terraform).
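The check described above (make the manual change, then see whether `terraform plan` wants to revert it) could be scripted roughly as follows. This is only a sketch: it assumes Terraform is on the PATH and that `terraform init` has already been run in the working directory. `-detailed-exitcode` is a standard `terraform plan` flag (0 = no changes, 1 = error, 2 = changes pending); the function names are hypothetical and not part of this repo.

```python
import subprocess

# Meaning of the exit codes returned by `terraform plan -detailed-exitcode`.
PLAN_OUTCOMES = {
    0: "no changes: terraform would not undo the manual setting",
    1: "error: the plan itself failed",
    2: "changes pending: terraform would overwrite the manual setting",
}


def describe_plan_exit_code(code: int) -> str:
    """Translate a `terraform plan -detailed-exitcode` exit code into a message."""
    return PLAN_OUTCOMES.get(code, f"unexpected exit code {code}")


def check_drift(working_dir: str) -> str:
    """Run `terraform plan` in working_dir and report whether it would change anything.

    Assumes `terraform init` has already been run in working_dir.
    """
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-no-color"],
        cwd=working_dir,
        capture_output=True,
        text=True,
    )
    return describe_plan_exit_code(result.returncode)
```

If the result is "changes pending", the plan output itself should show which attribute terraform intends to reset, which is the setting that would need to be encoded.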
Note to dev: when investigating this ticket, run the search with auto-heal manually disabled to see if any exceptions are thrown by the application, which will give further data about what exactly is causing auto-heal to restart the app.
Temporarily blocking this ticket until after #897 is merged and performance testing is repeated. Initial investigation suggests that disabling auto-heal is enough to resolve this problem; need to verify the implications of disabling auto-heal.
@daria-sorokina-da says that auto-heal is disabled on some other AN apps, and that it may not need to be terraformed (i.e., a terraform release may not undo a manual disabling of auto-heal). I have disabled auto-heal on the wmda-dev matching app, ahead of a release, for confirmation.
Closing this ticket, as disabling auto-heal does not need to be terraformed to be kept in place. I will raise a new tech debt ticket to cover the terraform change, as it would be good to have this applied automatically for a new installation.
Describe the bug
During a search request, a large number of donors is returned by Matching Phase 1, as shown in the AI logs. The search does not continue past the log message "Matching timing: Phase 1 complete", which suggests something is going wrong during Phase 2. The message is then replayed until it is dead-lettered, with the AI logs terminating at the same point each time. If the app service plan tier is increased, the search completes.
Diagnostics/troubleshooting doesn't usually mention app restart or auto-healing, but a restart is the only logical explanation for why the search terminates at the same place and the message is replayed.
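To make the suspected failure mode concrete, here is a small self-contained simulation of a message being redelivered when the worker is recycled part-way through Phase 2. It is a sketch under stated assumptions, not Atlas code: `MAX_DELIVERY_COUNT`, `WorkerRestarted`, `SearchMessage` and both functions are hypothetical names, and the delivery limit of 10 is an assumed value, since the real queue setting is not stated in this ticket.

```python
from dataclasses import dataclass, field

MAX_DELIVERY_COUNT = 10  # assumed value; the real queue setting is not stated here


class WorkerRestarted(Exception):
    """Stands in for the host process being recycled mid-search (e.g. by auto-heal)."""


@dataclass
class SearchMessage:
    search_id: str
    delivery_count: int = 0
    log: list = field(default_factory=list)


def run_search(message: SearchMessage, restart_in_phase_2: bool) -> None:
    """Simulate one attempt at the search, logging as the real app would."""
    message.log.append("Matching timing: Phase 1 complete")
    if restart_in_phase_2:
        # The process dies here, so the application never logs an exception.
        raise WorkerRestarted()
    message.log.append("Phase 2 complete")


def deliver_until_settled(message: SearchMessage, restart_in_phase_2: bool) -> str:
    """Redeliver the message until it completes or exceeds the delivery limit."""
    while message.delivery_count < MAX_DELIVERY_COUNT:
        message.delivery_count += 1
        try:
            run_search(message, restart_in_phase_2)
            return "completed"
        except WorkerRestarted:
            continue  # the message becomes available again and is retried
    return "dead-lettered"


failing = SearchMessage("example-search")
outcome = deliver_until_settled(failing, restart_in_phase_2=True)
```

In this model every attempt leaves the same last log line and no exception, which matches the symptom described: the logs always stop at "Matching timing: Phase 1 complete" and the message ends up dead-lettered.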
To Reproduce
I don't have a search request to hand, as we have been doing a lot of tweaking of the app service plan config, and sometimes the same search that fails on a lower plan completes on a higher plan. I may be able to get an example for AN search, for a 4/8 CBU search that failed on a lower tier.
Expected behaviour
The search should complete: either fail with an explicit error, e.g., `OutOfMemoryException`, or succeed.
Inputs/Outputs
Need to obtain, will paste in comments.
Atlas Build & Runtime Info (please complete the following information):