paulbatum opened 3 years ago
Tagging @AnatoliB @alrod @mathewc - fyi
@paulbatum @pragnagopa @fabiocav I have started looking into this. Can you share some thoughts on what signals can help Host identify if a worker is faulty?
I would suggest starting with a function app that follows a simple failure pattern - specifically, write the function code such that when it's running in worker 1 it always succeeds, and when it's running in worker 2 it always fails. You can use the shared filesystem to coordinate this behavior. Then you can observe the signals the host receives in this case - e.g. every execution it dispatches to worker 2 fails.
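A minimal sketch of the kind of coordination described above, assuming a shared filesystem between workers (the marker path, function name, and "first claimer wins" rule are all illustrative assumptions, not the real app's code):

```python
import os
import tempfile

# Hypothetical marker file on the shared filesystem. The first worker process
# to handle a request claims it with its own id; any worker that later reads
# a different owner id plays the "always failing" worker.
MARKER = os.path.join(tempfile.gettempdir(), "healthy_worker_marker.txt")

def handle_request():
    my_id = str(os.getpid())
    # Claim the marker if no worker has claimed it yet.
    if not os.path.exists(MARKER):
        with open(MARKER, "w") as f:
            f.write(my_id)
    with open(MARKER) as f:
        healthy_id = f.read()
    if healthy_id != my_id:
        # This process is the designated "unhealthy" worker: always fail.
        raise RuntimeError("simulated failure in unhealthy worker")
    return "OK"
```

With two worker processes serving the same app, one will consistently succeed and the other will consistently throw, giving the host a clean per-worker failure signal to observe.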
Because part of the solution here will be to introduce monitoring for each worker to track the percentage of failed invocations, invocation latencies, and other health metrics, a good first step would be to introduce this monitoring along with logging/metrics. That would enable us to query production logs to see how many apps are in this state and which patterns are prevalent, which we can use to guide the feature. It will also give us a good idea of how many recycles would be initiated once the feature is turned on.
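A sketch of what that per-worker monitoring could look like, assuming a sliding window over recent invocations (the class name, window size, and tracked metrics are illustrative, not the host's actual design):

```python
from collections import deque

class WorkerHealthMonitor:
    """Tracks recent invocation outcomes for a single language worker."""

    def __init__(self, window_size=100):
        # Sliding windows of the most recent outcomes (True = success)
        # and latencies; old entries fall off automatically via maxlen.
        self.results = deque(maxlen=window_size)
        self.latencies_ms = deque(maxlen=window_size)

    def record(self, succeeded, latency_ms):
        self.results.append(succeeded)
        self.latencies_ms.append(latency_ms)

    @property
    def invocation_count(self):
        return len(self.results)

    @property
    def failure_rate(self):
        if not self.results:
            return 0.0
        return 1.0 - (sum(self.results) / len(self.results))
```

Emitting `invocation_count` and `failure_rate` per worker into the existing logs would be enough to run the proposed production queries before any recycling behavior is turned on.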
Based on @paulbatum's suggestion, ran an experiment with details below -
Repro Steps -
Logs -
//cus
FunctionsLogs
| where TIMESTAMP >= datetime(2022-08-02 21:37:17.626)
| where AppName == "surg-net-test33"
| order by TIMESTAMP asc
This is the only signal/exception I received in Kusto logs after running the above experiment.
Microsoft.Azure.WebJobs.Host.FunctionInvocationException :
Exception while executing function: Functions.HttpTrigger2 --->
Microsoft.Azure.WebJobs.Script.Workers.Rpc.RpcException : Result: Failure
Exception: Azure.RequestFailedException: There is already a lease present.
@paulbatum please let me know if this test setup needs any improvements.
@mathewc does your suggestion mean to first add appropriate monitoring in the Host to emit health metrics for workers, and then use production logs to understand the pattern and identify the signals given by faulty workers?
cc @fabiocav
Moving to sprint 128 as @surgupta-msft is still defining the requirements and identifying whether additional logs/telemetry would be required for us to be able to generate reliable signals to drive the host behavior.
@surgupta-msft that test setup is exactly what I had in mind! You could now use that to experiment with host changes where it collects metrics about the health of each worker - for example, something like a count of attempted invocations and a success rate. In this example, with only two requests sent, I don't think that's enough data to make a decision. The hard part of this task is figuring out the appropriate thresholds for taking action (recycling a worker).
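The "not enough data" point above can be made concrete with a small decision sketch. The threshold values here are illustrative placeholders, not tuned recommendations:

```python
def should_recycle(invocation_count, failure_rate,
                   min_invocations=20, failure_threshold=0.5):
    """Decide whether a worker's recent history justifies recycling it.

    Requires a minimum number of observed invocations before acting, so a
    sample of two requests (as in the experiment above) never triggers a
    recycle regardless of its failure rate.
    """
    if invocation_count < min_invocations:
        return False  # not enough data to make a decision
    return failure_rate >= failure_threshold
```

The real work is choosing `min_invocations` and `failure_threshold` from production data rather than guessing them, which is why the monitoring-first step matters.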
@surgupta-msft is still iterating on the design and getting feedback from stakeholders. Moving this to 129, but this is currently an investigation/design item until we settle on an approach
This is a tricky subject, and I am worried about our host making unilateral decisions on what is an "unhealthy" worker. For the original problem statement, I see two different issues we should address:
Additionally, answering the question of what constitutes an "unhealthy" worker is highly contextual. It depends on many customer factors that are opaque to us. I do see value in a worker recycling system, but it needs to be approached cautiously. For starters, who are we to say the customer's code is in a "bad state"? We don't know what the code is, or whether this behavior is intentional. Maybe they have chosen to fail these requests for some reason unknown to us. Or what if their definition of "unhealthy" diverges from ours? When addressing customer issues, we may think in hindsight "oh, it's obvious there is a bug in their code", but do we believe that is something we can automate detecting?
We need to be cautious with this because it may not be something that can be applied evenly to all customers.
Update - we discussed the design offline and are planning to execute this feature in 3 parts -
Part-2 and Part-3 can be combined depending on the production impact determined from Part-1.
Have seen a few scenarios where the functions host is running multiple language workers and some of them are consistently failing, while others are working fine. Often this involves some user code component going into a bad state.
In most cases this is resolved with a restart of the function app, but that is a heavy-handed approach that requires manual intervention. The functions host should detect this scenario and attempt to resolve it by recycling the language workers that are failing. It will still be up to the user to analyze their logs and fix their problems, but the platform can reduce the impact until that happens by automatically recycling failing workers.
The single language worker case is a bit trickier. There's no obvious signal that a restart will fix the problem. I'd suggest the first version of this improvement should only kick in when there are multiple language workers.
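The multiple-worker restriction above can be sketched as a comparative check: a worker is only flagged when at least one sibling worker is healthy, which is the signal that a restart might actually help. Function name and thresholds are illustrative assumptions:

```python
def pick_workers_to_recycle(worker_failure_rates,
                            unhealthy_threshold=0.9,
                            healthy_threshold=0.1):
    """Flag workers for recycling only when their peers provide a baseline.

    worker_failure_rates maps worker id -> recent failure rate in [0, 1].
    With a single worker there is no baseline, so nothing is recycled; if
    every worker is failing, the problem likely isn't worker-local, so
    nothing is recycled either.
    """
    if len(worker_failure_rates) < 2:
        return []
    any_healthy = any(rate <= healthy_threshold
                      for rate in worker_failure_rates.values())
    if not any_healthy:
        return []
    return [worker_id for worker_id, rate in worker_failure_rates.items()
            if rate >= unhealthy_threshold]
```

For example, one worker failing every invocation while its sibling succeeds would be flagged, but a lone worker (or an app where all workers fail) would not be, matching the cautious first version suggested above.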