paulbatum opened 3 years ago
Tagging @AnatoliB @alrod @mathewc - fyi
@paulbatum @pragnagopa @fabiocav I have started looking into this. Can you share some thoughts on what signals can help Host identify if a worker is faulty?
I would suggest starting with a function app that follows a simple failure pattern - specifically, write the function code such that when it's running in worker 1 it always succeeds, and when it's running in worker 2 it always fails. You can use the shared filesystem to coordinate this behavior. Then you can observe the signals the host receives in this case - e.g. every execution it dispatches to worker 2 fails.
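A minimal sketch of the kind of coordination described above, assuming a shared filesystem between workers (the marker path, function name, and "first claimer wins" rule are all illustrative assumptions, not the real app's code):

```python
import os
import tempfile

# Hypothetical marker file on the shared filesystem. The first worker process
# to handle a request claims it with its own id; any worker that later reads
# a different owner id plays the "always failing" worker.
MARKER = os.path.join(tempfile.gettempdir(), "healthy_worker_marker.txt")

def handle_request():
    my_id = str(os.getpid())
    # Claim the marker if no worker has claimed it yet.
    if not os.path.exists(MARKER):
        with open(MARKER, "w") as f:
            f.write(my_id)
    with open(MARKER) as f:
        healthy_id = f.read()
    if healthy_id != my_id:
        # This process is the designated "unhealthy" worker: always fail.
        raise RuntimeError("simulated failure in unhealthy worker")
    return "OK"
```

With two worker processes serving the same app, one will consistently succeed and the other will consistently throw, giving the host a clean per-worker failure signal to observe.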
Because part of the solution here will be to introduce monitoring for each worker to track the percentage of failed invocations, invocation latencies, and other health metrics, a good first step would be to introduce this monitoring along with logging/metrics. That would enable us to query production logs to see how many apps are in this state and which patterns are prevalent, which we can use to guide the feature. It will also give us a good idea of how many recycles would be initiated once the feature is turned on.
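A sketch of what that per-worker monitoring could look like, assuming a sliding window over recent invocations (the class name, window size, and tracked metrics are illustrative, not the host's actual design):

```python
from collections import deque

class WorkerHealthMonitor:
    """Tracks recent invocation outcomes for a single language worker."""

    def __init__(self, window_size=100):
        # Sliding windows of the most recent outcomes (True = success)
        # and latencies; old entries fall off automatically via maxlen.
        self.results = deque(maxlen=window_size)
        self.latencies_ms = deque(maxlen=window_size)

    def record(self, succeeded, latency_ms):
        self.results.append(succeeded)
        self.latencies_ms.append(latency_ms)

    @property
    def invocation_count(self):
        return len(self.results)

    @property
    def failure_rate(self):
        if not self.results:
            return 0.0
        return 1.0 - (sum(self.results) / len(self.results))
```

Emitting `invocation_count` and `failure_rate` per worker into the existing logs would be enough to run the proposed production queries before any recycling behavior is turned on.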
Based on @paulbatum's suggestion, ran an experiment with details below -
Repro Steps -
Logs -
//cus
FunctionsLogs
| where TIMESTAMP >= datetime(2022-08-02 21:37:17.626)
| where AppName == "surg-net-test33"
| order by TIMESTAMP asc
This is the only signal/exception I received in Kusto logs after running the above experiment.
Microsoft.Azure.WebJobs.Host.FunctionInvocationException :
Exception while executing function: Functions.HttpTrigger2 --->
Microsoft.Azure.WebJobs.Script.Workers.Rpc.RpcException : Result: Failure
Exception: Azure.RequestFailedException: There is already a lease present.
@paulbatum please let me know if this test setup needs any improvements.
@mathewc does your suggestion mean to first add appropriate monitoring in the Host to emit health metrics for workers, and then use production logs to understand the pattern and identify the signals given by faulty workers?
cc @fabiocav
Moving to sprint 128 as @surgupta-msft is still defining the requirements and identifying whether additional logs/telemetry would be required for us to be able to generate reliable signals to drive the host behavior.
@surgupta-msft that test setup is exactly what I had in mind! You could now use that to experiment with host changes where it collects metrics about the health of each worker - for example, something like a count of attempted invocations and a success rate. In this example, with only two requests sent, I don't think that's enough data to make a decision. The hard part of this task is figuring out the appropriate thresholds for taking action (recycling a worker).
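The "not enough data" point above can be made concrete with a small decision sketch. The threshold values here are illustrative placeholders, not tuned recommendations:

```python
def should_recycle(invocation_count, failure_rate,
                   min_invocations=20, failure_threshold=0.5):
    """Decide whether a worker's recent history justifies recycling it.

    Requires a minimum number of observed invocations before acting, so a
    sample of two requests (as in the experiment above) never triggers a
    recycle regardless of its failure rate.
    """
    if invocation_count < min_invocations:
        return False  # not enough data to make a decision
    return failure_rate >= failure_threshold
```

The real work is choosing `min_invocations` and `failure_threshold` from production data rather than guessing them, which is why the monitoring-first step matters.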
@surgupta-msft is still iterating on the design and getting feedback from stakeholders. Moving this to 129, but this is currently an investigation/design item until we settle on an approach
This is a tricky subject, and I am worried about our host making unilateral decisions on what is an "unhealthy" worker. For the original problem statement, I see two different issues we should address:
Additionally, answering the question of what constitutes an "unhealthy" worker is highly contextual. It depends on many customer factors that are opaque to us. I do see value in a worker recycling system, but it needs to be approached cautiously. For starters, who are we to say the customer's code is in a "bad state"? We don't know what the code is, or whether this behavior is intentional. Maybe they have chosen to fail these requests for some reason unknown to us. Or what if their definition of "unhealthy" diverges from ours? When addressing customer issues, we may think in hindsight "oh, it's obvious there is a bug in their code", but do we believe that is something we can automate detecting?
We need to be cautious with this because it may not be something that can be applied evenly to all customers.
Update - we discussed the design offline and are planning to execute this feature in 3 parts -
Part-2 and Part-3 can be combined depending on the production impact determined from Part-1.
Have seen a few scenarios where the functions host is running multiple language workers and some of them are consistently failing, while others are working fine. Often this involves some user code component going into a bad state.
In most cases this is resolved with a restart of the function app, but that is a heavy-handed approach that requires manual intervention. The functions host should detect this scenario and attempt to resolve it by recycling the language workers that are failing. It will still be up to the user to analyze their logs and fix their problems, but the platform can reduce the impact until that happens by automatically recycling failing workers.
The single language worker case is a bit trickier. There's no obvious signal that a restart will fix the problem. I'd suggest the first version of this improvement should only kick in when there are multiple language workers.
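The multiple-worker restriction above can be sketched as a comparative check: a worker is only flagged when at least one sibling worker is healthy, which is the signal that a restart might actually help. Function name and thresholds are illustrative assumptions:

```python
def pick_workers_to_recycle(worker_failure_rates,
                            unhealthy_threshold=0.9,
                            healthy_threshold=0.1):
    """Flag workers for recycling only when their peers provide a baseline.

    worker_failure_rates maps worker id -> recent failure rate in [0, 1].
    With a single worker there is no baseline, so nothing is recycled; if
    every worker is failing, the problem likely isn't worker-local, so
    nothing is recycled either.
    """
    if len(worker_failure_rates) < 2:
        return []
    any_healthy = any(rate <= healthy_threshold
                      for rate in worker_failure_rates.values())
    if not any_healthy:
        return []
    return [worker_id for worker_id, rate in worker_failure_rates.items()
            if rate >= unhealthy_threshold]
```

For example, one worker failing every invocation while its sibling succeeds would be flagged, but a lone worker (or an app where all workers fail) would not be, matching the cautious first version suggested above.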