HMoen commented 2 years ago

@kshyju answer below works at the moment, but now we recently started to see the following trace in AppInsights for our AF Cosmos DB triggers : "Soon retries will not be supported for function '[Function Name]'. For more information, please visit http://aka.ms/func-retry-policies."

The Retry examples section also shows "Retry policies aren't yet supported when running in an isolated process." and the Retries section reflects no support for the Cosmos DB trigger/binding.

What's the path forward for AF Cosmos DB triggers running out-of-process?

Yes, retry policies are supported in isolated(out-of-process) function apps. You can enable it by adding the retry section to your host config. Here is an example host.json

{
  "version": "2.0",
  "retry": {
    "strategy": "fixedDelay",
    "maxRetryCount": 2,
    "delayInterval": "00:00:03"
  }
}

The reason why I'm asking is the documentation mentioning Retries require NuGet package Microsoft.Azure.WebJobs >= 3.0.23

That documentation which refers the usage of ExponentialBackoffRetry attribute is for in-proc function apps.

Please give it a try and let us know if you run into any problems.

Originally posted by @kshyju in https://github.com/Azure/azure-functions-dotnet-worker/issues/832#issuecomment-1072545934

HMoen commented 1 year ago

Thanks for the feedback. I opened this issue specifically for out-of-proc functions, so will await the response of @shibayan or @Ved2806 before closing.

your-eggcellency commented 1 year ago

Is there any update on retry policies for CosmosDB triggers on out-of-process functions? .NET 7+ supports only out-of-process functions and without retry policies CosmosDB trigger is unusable because it will keep losing changes.

Perhaps this question should be taken to azure-functions-dotnet-worker repository?

joeizy commented 1 year ago

Overview

Hello, after some extensive testing with the latest version of nuget packages for out-of-process dotnet cosmos change feed function here's what I discovered the behavior to be. PLEASE correct me where I missed something 😄

Goal: My goal is to be able to use Azure Functions with Cosmos Change Feed as a way to reliably process changes coming from Cosmos DB. (Think replication to another collection, notifying consumers, etc.)

Problem: I found that the behavior of code written in the way one might expect to write it (i.e. throw exception in change feed processor function) still advances the change feed in edge cases making the provided tooling for Azure Function Change Feed Trigger into a "best effort" process instead of reliable processing (see scenarios below). This behavior means that changes in the Change Feed can be lost unexpectedly (to the developer writing code) in edge cases.

Environment

func --version
4.0.5198

dotnet --version
7.0.304

*.csproj
<Project Sdk="Microsoft.NET.Sdk">
    <PropertyGroup>
        <TargetFramework>net7.0</TargetFramework>
        <AzureFunctionsVersion>V4</AzureFunctionsVersion>
        <OutputType>Exe</OutputType>
        <ImplicitUsings>enable</ImplicitUsings>
        <Nullable>enable</Nullable>
    </PropertyGroup>
    <ItemGroup>
        <PackageReference Include="Azure.Storage.Queues" Version="12.14.0" />
        <PackageReference Include="Microsoft.Azure.Functions.Worker" Version="1.14.1" />
        <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.CosmosDB" Version="4.3.0" />
        <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Http" Version="3.0.13" />
        <PackageReference Include="Microsoft.Azure.Functions.Worker.Extensions.Storage.Queues" Version="5.1.2" />
        <PackageReference Include="Microsoft.Azure.Functions.Worker.Sdk" Version="1.10.0" />
    </ItemGroup>
...
</Project>

OS
Edition Windows 11 Pro
Version 22H2
Installed on    ‎10/‎15/‎2022
OS build    22621.1848
Experience  Windows Feature Experience Pack 1000.22642.1000.0

The failure conditions / edge cases I tried are meant to simulate:

The function runtime process crashing
The isolated worker process crashing
A dependency of the user function code being down for an extended period of time (i.e. a service outage)
The user function code having a bug requiring a code deployment (wherein the function runtime is gracefully shutdown)

NOTE: I did these tests with the [FixedDelayRetry] attribute on my function. It's my understanding (and assumption) that this behavior is similar if not worse with out the retry attribute.

Findings

NOTE: When I say "runtime process" I'm referring to func.exe from the Azure Function Core Tools locally. When I say "worker process" or "isolated process" I'm referring to dotnet.exe running my user function code locally.

The change feed will still advance in all of these cases:

Unhandled Exceptions in User Function Code will still advance the change feed
Add ExponentialBackoffRetry[] or FixedDelayRetry[] attributes will retry in-memory but if the runtime process is gracefully stopped (i.e. code deployment, environment restart, etc.) it will still advance the change feed
Repeated throwing exception in Function will still advance the change feed
Repeated crashing the worker process (i.e. with Environment.FailFast() in the User Function Code) will still advance the change feed
Exiting the worker process with a non-zero status code (i.e. Environment.Exit(-1)) seems that it will hang indefinitely but when you go to stop the runtime process, it will advance the change feed.

The change feed will not advance in any of these cases:

The runtime process is killed or terminates unexpectedly (i.e. host VM crashed, Process is terminated via task manager, etc.).
The runtime process starts successfully, but the worker process does not start successfully. (i.e. The worker process crashes before host.Run())
The runtime and worker processes start and begin to process a message but then the worker (isolated) process crashes and the auto-restart of the worker process (which is controlled by the runtime process) exceeds the number of retries (i.e. the worker process crashed and never successfully started) with this Error Message: Exceeded language worker restart retry count for runtime:dotnet-isolated. Shutting down and proactively recycling the Functions Host to recover. This crash of the worker process during startup must take place before the host.Run() in Program.cs. NOTE: If the worker process successfully restarts (ie. host.Run() succeeds) it doesn't matter that you continue to crash the process in the function code, the feed will still advance; you must crash the worker process before host.Run().

NOTE: I did find it unusual that the runtime Cosmos trigger so was lenient with when it advances the change feed lease/tracker and expected it to be a lot more conservative. For example, even when I see this in the logs which clearly indicates a scenario where the worker is not happy during processing the change feed tracking is still advancing.

Language Worker Process exited. Pid=43372.
Exceeded language worker restart retry count for runtime:dotnet-isolated. Shutting down and proactively recycling the Functions Host to recover
Lease 0 encountered an unhandled user exception during processing.

How to get reliable processing (workaround)

The only thing I was able to get to work was to use Environment.FailFast() in the User Function Code when encountering an issue that is not recoverable to kill the isolated worker processed. I must crash the worker process, simply throwing exceptions in user code or even Environment.Exit(1) doesn't work here.

Then, when the runtime process restarts the worker process, I need to detect that failure condition in the bootstrap / startup code (inside the Program.cs) of the worker process and crash the process before host.Run() is called. Crashing here can be throwing unhandled exception or Environment.FailFast() or even Environment.Exit([non-zero value]) with non-zero exit code it doesn't matter as long as host.Run() isn't called and the process exits in a failure condition.

In this scenario, the runtime process will restart the worker process a few times then give up with error message Exceeded language worker restart retry count for runtime:dotnet-isolated. Shutting down and proactively recycling the Functions Host to recover and the runtime process will exit. The Change Feed is not advanced and the next time the runtime process is started it will attempt to process the change in the feed again.

NOTE: "Just putting the message on a queue from code in your change feed function" (i.e. ServiceBus or Storage) won't work here because that operation can also fail and if the current in-progress change is not properly retried it will be lost. Even in the "put it on a queue" scenario, we need to reliably get the value from change feed to the queue without dropping changes. Due to the above edge cases that isn't possible without this workaround from that I can see.

Narvalex commented 10 months ago

hey @joeizy , how would you exactly detect in Program.cs that you crashed in the previous run?

joeizy commented 10 months ago

Hello @Narvalex, I was not detecting a previous failure (IMO this is not productive), I was predicting whether the next attempt to process the message would knowing fail (i.e. does it have the prerequisites/dependencies to succeed). The assumption is that the User Function Code is idempotent AND that it will not fail unless there is a dependency which is failing or inaccessible. So, if I can check whether the dependencies are working I'm set.

Consider a case where the User Function Code receives the change feed item and puts a message on a durable queue.

If the durable queue is down or the function is misconfigured with an incorrect connection string or there is misconfigured networking between the function and the queue, the attempt to enqueue the message will fail. This can be checked before host.Run() is called during startup (for example by putting a dummy type message).

This is NOT perfect as I'm sure you can imagine other scenarios where things might fail which you cannot proactively check for but it is quite common for a dependency failure to be the cause (and one of the main uncontrollable reasons) and it's common to want to be resilient to this case (i.e. not lose / drop messages due to intermittent failures).

NOTE: It's my understanding that further resilience requires a change to the Cosmos Trigger Code (which controls advancement to the next change in the feed) to adopt an at-least-once-successful message delivery pattern with dead-letter queue (which I believe most people are expecting it already does) OR to delegate advancement of the cursor / change feed to the User Function Code (advanced use cases).

Azure / azure-webjobs-sdk-extensions

Retry policies for isolated (out-of-process) AF Cosmos DB triggers/bindings soon deprecated? #783

Overview

Findings

How to get reliable processing (workaround)