
Retry policies for isolated (out-of-process) AF Cosmos DB triggers/bindings soon deprecated? #783

Open HMoen opened 2 years ago

HMoen commented 2 years ago

@kshyju's answer below works at the moment, but we recently started to see the following trace in AppInsights for our AF Cosmos DB triggers: "Soon retries will not be supported for function '[Function Name]'. For more information, please visit http://aka.ms/func-retry-policies."

The Retry examples section also shows "Retry policies aren't yet supported when running in an isolated process." and the Retries section reflects no support for the Cosmos DB trigger/binding.

What's the path forward for AF Cosmos DB triggers running out-of-process?

Yes, retry policies are supported in isolated (out-of-process) function apps. You can enable them by adding the retry section to your host config. Here is an example host.json:

{
  "version": "2.0",
  "retry": {
    "strategy": "fixedDelay",
    "maxRetryCount": 2,
    "delayInterval": "00:00:03"
  }
}

The reason I'm asking is that the documentation mentions retries require NuGet package Microsoft.Azure.WebJobs >= 3.0.23.

That documentation, which refers to the usage of the ExponentialBackoffRetry attribute, is for in-proc function apps.

Please give it a try and let us know if you run into any problems.

Originally posted by @kshyju in https://github.com/Azure/azure-functions-dotnet-worker/issues/832#issuecomment-1072545934

mattchenderson commented 2 years ago

Because this is not exclusive to the isolated model, we should move this item to the extension repo. My understanding is that this would be a feature request for the Cosmos DB extension.

MikeNicholls commented 2 years ago

Will there be any guidance on retries for Cosmos DB triggered functions before the October 2022 deadline when retry policy support will be removed?

Cosmos DB is currently the outlier on the table of retry support for triggers/bindings, with "n/a" and "Not configurable" listed. https://docs.microsoft.com/en-us/azure/azure-functions/functions-bindings-error-pages#retries

Ved2806 commented 2 years ago

Hi @ealsur Could you please help with this issue?

iamarvil commented 2 years ago

Will there be any guidance on retries for Cosmos DB triggered functions before the October 2022 deadline when retry policy support will be removed?

Cosmos DB is currently the outlier on the table of retry support for triggers/bindings, with "n/a" and "Not configurable" listed. https://docs.microsoft.com/en-us/azure/azure-functions/functions-bindings-error-pages#retries

I agree. We built our change feed processing with retries in mind and were just about to roll out our solution. While moving to WebJobs with our own change feed processor is a viable option, we would lose other benefits that Azure Functions provides. Any guidance for Cosmos DB change feed retries would be appreciated.

ealsur commented 2 years ago

At least at this time, there is no Retry Policy support on the Cosmos DB extension that is built in or uses any of those configurations.

HMoen commented 2 years ago

@ealsur, any plans to implement this before the October 2022 deadline mentioned above? If not, let us know so we can plan our own retry mechanism (queue, etc.). Thanks.

ealsur commented 2 years ago

If I understand the notice correctly, the retry policy is being removed from all triggers that currently support it:

The retry policy support in the runtime for triggers other than Timer and Event Hubs is being removed after this feature becomes generally available (GA). Preview retry policy support for all triggers other than Timer and Event Hubs will be removed in October 2022.

So adding Retry Policy support would not make sense as it does not align with the notice.

We currently have no planned work to add custom retry configuration, but that can change if this is a feature that is widely requested (as an opt-in configuration). Technically, the Trigger could allow infinite retries or no retries (nothing intermediate) because we internally use the Change Feed Processor and those would be the only available options.

HMoen commented 2 years ago

@ealsur, thanks for the feedback. Unfortunately, infinite or none won't work in our environment. We piggybacked off dead-letter queues before and removed that functionality when we found out about the built-in retry mechanism.

Link added to upvote idea.

mpaul31 commented 2 years ago

The removal of this has a very large impact on many of our production apps that use the change feed (Cosmos DB trigger) with exponential infinite retries. We obviously don't want to lose any messages and rely heavily on this. @ealsur Is there any way to get this added before the deadline, or possibly a sample workaround? I would much rather not bring in more infrastructure (queues) and build something specifically for this in all our apps.

ealsur commented 2 years ago

@mpaul31 I don't understand what the request is here. Right now the Change Feed Trigger has no retry policy, so it never retries; how are you creating apps that have exponential infinite retries? The deadline that the documentation shows is a deadline after which the Retry Policies will be removed. For Cosmos DB they are not even there in the first place, so what would be the point of adding them if they are going to be removed anyway?

This issue requires input from the Azure Functions team for clarification @Ved2806

mpaul31 commented 2 years ago

@ealsur We use the following attribute on our change feed trigger method:

[ExponentialBackoffRetry(-1, "00:00:01", "00:00:10")]

It would be awesome if we could configure similar behavior on the CosmosDBTrigger.
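
For reference, here is a minimal in-proc sketch of how that attribute is typically applied to a Cosmos DB triggered function, assuming the preview retry support in Microsoft.Azure.WebJobs (>= 3.0.23); the database, container, and connection setting names below are placeholders, not values from this thread:

using System.Collections.Generic;
using Microsoft.Azure.Documents;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class ProductsChangeFeed
{
    // -1 = unlimited retries, with delays growing from 1 second up to 10 seconds
    [FunctionName("ProductsChangeFeed")]
    [ExponentialBackoffRetry(-1, "00:00:01", "00:00:10")]
    public static void Run(
        [CosmosDBTrigger(
            databaseName: "items-db",
            collectionName: "products-container",
            ConnectionStringSetting = "MyCosmosDbConnStr",
            LeaseCollectionName = "leases")] IReadOnlyList<Document> input,
        ILogger log)
    {
        log.LogInformation($"Documents modified: {input.Count}");
        // Any unhandled exception thrown here is what the (preview) host-level retry policy reacts to.
    }
}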

ealsur commented 2 years ago

There is no code in the Cosmos DB Trigger or extension that reacts to [ExponentialBackoffRetry(-1, "00:00:01", "00:00:10")].

The Trigger code in fact has no error retry logic. Here is where your Function code is called: https://github.com/Azure/azure-webjobs-sdk-extensions/blob/dev/src/WebJobs.Extensions.CosmosDB/Trigger/CosmosDBTriggerListener.cs#L225. If there was an execution error, it is simply logged and the next batch of changes is processed. This is mentioned in the Cosmos DB documentation (https://docs.microsoft.com/en-us/azure/cosmos-db/sql/troubleshoot-changefeed-functions#some-changes-are-missing-in-my-trigger).

There isn't (and never was) any logic that would pick up that ExponentialBackoffRetry decorator information in the Cosmos DB Trigger. Whether the Functions runtime is doing something with it, I don't know, but the Trigger code always moves forward. My understanding is that the expectation that these RetryPolicy decorators have any impact on the CosmosDBTrigger is incorrect; they are basically not doing anything, hence, when they get deprecated, nothing will change.

HMoen commented 2 years ago

@ealsur, on this thread https://github.com/Azure/azure-functions-dotnet-worker/issues/832#issuecomment-1072545934, @kshyju mentioned the ExponentialBackoffRetry attribute only works (worked) for in-proc function apps.

ealsur commented 2 years ago

I don't know how that works. From the Trigger perspective, there is nothing that blocks the checkpointing and moving forward if the execution of the code fails.

Once the ProcessChangesAsync method completes in the Trigger code without errors, the state is saved and the next batch is obtained, and the execution of the user code can never make the method fail because we are using TryExecuteAsync:

FunctionResult result = await this._executor.TryExecuteAsync(new TriggeredFunctionData() { TriggerValue = docs }, cancellationToken);

For any logic to correctly affect the Trigger and make it retry the current batch (not checkpoint), the ProcessChangesAsync method should fail. That is how the Change Feed Processor would do the actual retry.

There is also no in-proc specific code in the CosmosDBTriggerListener either.

The linked comment is not saying that the configuration applies to CosmosDBTrigger either.
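
To make that concrete, here is a hypothetical sketch (explicitly not what the extension does today) of what ProcessChangesAsync would have to do for a failed execution to block checkpointing:

// Hypothetical change inside ProcessChangesAsync -- NOT the current behavior.
FunctionResult result = await this._executor.TryExecuteAsync(
    new TriggeredFunctionData() { TriggerValue = docs }, cancellationToken);

if (!result.Succeeded)
{
    // Re-throwing would make ProcessChangesAsync fail, so the Change Feed Processor
    // would not checkpoint the lease and the same batch would be delivered again.
    throw result.Exception ?? new InvalidOperationException("Function execution failed.");
}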

HMoen commented 2 years ago

The issue https://github.com/Azure/azure-functions-dotnet-worker/issues/832 was opened by someone looking specifically for retry logic for the CosmosDBTrigger, and @kshyju offered the linked comment as the solution. That issue should be updated to reflect your comments.

ealsur commented 2 years ago

I will let @kshyju comment then; as the author of the CosmosDBTrigger, I am not aware of any behavior affected by these policies in the Trigger source code.

kshyju commented 2 years ago

The issue Azure/azure-functions-dotnet-worker#832 was opened by someone looking specifically for retry logic for the CosmosDBTrigger, and @kshyju offered the linked comment as the solution. That issue should be updated to reflect your comments.

The retry option I mentioned is a Functions runtime feature that will retry the function execution when it fails the first time.

For example, if you have a function like this.

[Function("Function1")]
public void Run([CosmosDBTrigger(
    databaseName: "items-db",
    collectionName: "products-container",
    ConnectionStringSetting = "MyCosmosDbConnStr",
    LeaseCollectionName = "leases")] IReadOnlyList<MyDocument> input)
{
    if (input != null && input.Count > 0)
    {
        _logger.LogInformation($"Documents modified: {input.Count}");
        _logger.LogInformation($"First document Id: {input[0].Id}");

        // example to simulate failed function execution
        if (input[0].Name.Contains("foo"))
        {
            throw new ArgumentNullException("Bad name for document");
        }
    }
}

and configure your host.json to enable retry behavior

{
  "version": "2.0",
  "retry": {
    "strategy": "fixedDelay",
    "maxRetryCount": 2,
    "delayInterval": "00:00:03"
  }
}

Let's say you edit some document in your Cosmos DB collection. The Function1 function will be triggered, and if the document name contains "foo", the function code simply throws, which creates a failed function execution. The function runtime (host) will retry the same trigger again after 3 seconds, and this happens up to 2 times.

So in short, the retry behavior is something the Functions runtime/host does based on what is configured in host.json. Hope this helps.

HMoen commented 2 years ago

@kshyju, thanks for your response. Will the retry behavior you describe above remain supported in the runtime for out-of-proc functions?

I opened this issue because, as stated above, in early June we started to see the following trace in AppInsights for our AF Cosmos DB triggers: "Soon retries will not be supported for function '[Function Name]'. For more information, please visit http://aka.ms/func-retry-policies."

ealsur commented 2 years ago

The function runtime (host) will retry the same trigger again after 3 seconds, and this happens up to 2 times.

Will the execution be with the same input (basically the same IReadOnlyList<MyDocument> input)? I'm intrigued to understand how this works; I have not added any logic in the Trigger that would stop the checkpointing, unless this decorator is basically stopping the instance?

kshyju commented 2 years ago

@ealsur Yes, the same list. The logic is somewhere in the Functions host, I believe.

kshyju commented 2 years ago

@kshyju, thanks for your response. Will the retry behavior you describe above remain supported in the runtime for out-of-proc functions?

I opened this issue because, as stated above, in early June we started to see the following trace in AppInsights for our AF Cosmos DB triggers: "Soon retries will not be supported for function '[Function Name]'. For more information, please visit http://aka.ms/func-retry-policies."

This is the current behavior as of today. Let me circle back with the team to understand what the migration path is and update you.

mpaul31 commented 2 years ago

@kshyju Any updates on the migration path or postponing the removal?

kshyju commented 2 years ago

The retry behavior is currently a preview feature. As the deprecation notice mentions, this feature will only be available for Timer and Event Hubs triggers at GA. Preview features are considered experimental and meant to collect feedback, and not all of them make it to GA.

This will be applicable for both in-proc and isolated.

mpaul31 commented 2 years ago

I get that, but just like staging slots, which were in preview for a very long time (1+ years), some of us adopt things sooner out of necessity. What about guidance then? Will the host just ignore the attribute or stop the app from starting? I'm sure many were excited to see this as it relates to the Cosmos trigger and are now going to face the same fate.

shibayan commented 2 years ago

The fundamental issue is not that CosmosDBTrigger cannot be configured with a retry policy, but that CosmosDBTrigger advances the lease when a Function fails to execute.

I believe this could be solved if, instead of a retry policy, CosmosDBTrigger could use exponential backoff to extend the Change Feed polling time without advancing the lease on Function failure.

mpaul31 commented 2 years ago

Just to be clear, the retry policy/attribute does indeed work today, i.e., with lease checkpointing working correctly. It appears to use an in-memory retry policy similar to Polly.

cheekp commented 2 years ago

I get that, but just like staging slots, which were in preview for a very long time (1+ years), some of us adopt things sooner out of necessity. What about guidance then? Will the host just ignore the attribute or stop the app from starting? I'm sure many were excited to see this as it relates to the Cosmos trigger and are now going to face the same fate.

Agreed, I imagine this feature will have had a lot of adoption out of necessity over its 1-2 years of availability. With the loss of ExponentialBackoffRetry, transient failures are going to become a lot more expensive again, with knock-on effects on compute, source/target operations, logging, etc. Customers need guidance on what will happen in October and how to mitigate without the out-of-the-box in-memory retry option. Removing preview support for retry patterns seems a strange decision given the alignment towards Azure cloud best practices.

asos-martinsmith commented 2 years ago

This is certainly required for Cosmos.

Before I started using this attribute I encountered issues where "in flight" batches of changes were lost when performing a release.

I was advised (funnily enough by @ealsur!) to use the retry policy to mitigate this here

https://github.com/Azure/azure-webjobs-sdk-extensions/issues/716#issuecomment-824920182

In general too the change feed processing may well be writing to things with 99.999% availability and there needs to be a way to handle the 0.001% without just skipping that batch and moving on.

ealsur commented 2 years ago

I was advised (funnily enough by @ealsur!) to use the retry policy to mitigate this here

My response does not advise you to use the retry policies; it is about what happens if your function crashes before checkpointing. I mentioned the documentation about the retry policies that were in preview because my understanding back then was that Functions was building a generic retry layer for all extensions; I did not explicitly say to use it. I don't work on the Functions team and have no insight into the roadmap or plans for their features. I wasn't even aware these Retry Policies affected the Cosmos DB Trigger (as you can see on this thread).

In general too the change feed processing may well be writing to things with 99.999% availability and there needs to be a way to handle the 0.001% without just skipping that batch and moving on.

There is: a try/catch approach in the Function lets you handle any individual item failure. This pattern is commonly used (because retrying the entire batch is not something you would want to do in some cases): the individual items that failed processing are sent to a queue or another store for later processing or analysis (why did they fail? is the failure transient?).
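
A minimal sketch of that per-item pattern, assuming the in-proc programming model and a hypothetical Storage queue output binding named deadletter-items (all names here are illustrative, not from this thread):

using System;
using System.Collections.Generic;
using Microsoft.Azure.Documents;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;
using Newtonsoft.Json;

public static class ChangeFeedWithDeadLetter
{
    [FunctionName("ChangeFeedWithDeadLetter")]
    public static void Run(
        [CosmosDBTrigger(
            databaseName: "items-db",
            collectionName: "products-container",
            ConnectionStringSetting = "MyCosmosDbConnStr",
            LeaseCollectionName = "leases")] IReadOnlyList<Document> input,
        [Queue("deadletter-items", Connection = "MyStorageConnStr")] ICollector<string> deadLetter,
        ILogger log)
    {
        foreach (Document doc in input)
        {
            try
            {
                // Process each change individually here.
            }
            catch (Exception ex)
            {
                // Only the failing item goes to the queue for later analysis or replay;
                // the rest of the batch is processed and the lease checkpoints normally.
                log.LogError(ex, "Failed to process document {Id}", doc.Id);
                deadLetter.Add(JsonConvert.SerializeObject(doc));
            }
        }
    }
}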

asos-martinsmith commented 2 years ago

There are two separate issues here.

The first one is that I have certainly observed that deploying changes to the function app means that batches end up being aborted (maybe due to a thread abort) while the lease is still updated, so the net effect is that the batch is just missed. After using this preview feature I did not observe any instances of this. When it is removed, I will no longer have confidence in any method of deployment that avoids this (we never got to the bottom of this issue in the linked thread).

The second is that I don't want to provision a whole load of largely redundant resources for the 0.001% case (and write code to then merge from the fallback storage to the main storage). I just want the checkpoint not to be written and for it to be retried later (as happens with the exponential retry). Is there anything that can be put in the catch block that will achieve the suppression of checkpointing?

(Typical use case for me is to use the Change feed to upsert to a different Cosmos - e.g. to maintain a materialized view of a collection so pretty much all errors will be transient such as throttling and succeed on retry)

ealsur commented 2 years ago

Is there anything that can be put in the catch block that will achieve the suppression of checkpointing?

Not at this point. As I mentioned in https://github.com/Azure/azure-webjobs-sdk-extensions/issues/783#issuecomment-1216727630, aligning with the Change Feed Processor, the only potential alternatives we could offer for the Cosmos DB Trigger are either an infinite retry (retry on any error) or no retry. There is no exponential backoff retry support in the Change Feed Processor. When checkpointing is aborted due to a processing error, the lease is released and any instance (the same or another) could claim it and retry, so storing the number of previous attempts on an instance would not apply.

That could change if the Retry Policies were exposed to extension authors in a way that we could invoke/call/signal them, but the model, from what I understand in this thread, is that Retry Policies exist and apply outside of the extension scope.

shibayan commented 2 years ago

Unless an alternative to the CosmosDBTrigger retry policy is provided by the Azure Functions team, we will unfortunately have to pin the Azure Functions Runtime version. Pinning FUNCTIONS_EXTENSION_VERSION to the version before the retry policy behavior is changed should give us additional time.

Would our request be better considered if we created a support request? I recall the removal of Azure Functions Proxies in the v4 upgrade. (BTW, Functions Proxies are about to be put back after being removed)

iamarvil commented 2 years ago

As a temporary fix, we'll be wrapping our triggers with Polly in the meantime just to cover transient errors. Consumption plan timeouts would be the next thing to worry about, but considering that processing usually takes less than a second anyway, I don't think it's too much of a concern. We're just unsure how much cost it will add to our Azure Functions bill.

I would consider adding a try/catch block to send failing documents to queues, but we really rely on the order of documents coming through.

A retry-forever with a fixed wait duration from the Cosmos DB trigger would be great though, and is exactly what we need.
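
For reference, a rough sketch of that kind of in-function wrapping with Polly; the retry count and delays below are arbitrary assumptions, and the names are placeholders:

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;
using Polly;

public static class ChangeFeedWithPolly
{
    // Retry the batch up to 5 times with exponential backoff (2s, 4s, 8s, ...).
    private static readonly IAsyncPolicy RetryPolicy = Policy
        .Handle<Exception>()
        .WaitAndRetryAsync(5, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)));

    [FunctionName("ChangeFeedWithPolly")]
    public static async Task Run(
        [CosmosDBTrigger(
            databaseName: "items-db",
            collectionName: "products-container",
            ConnectionStringSetting = "MyCosmosDbConnStr",
            LeaseCollectionName = "leases")] IReadOnlyList<Document> input,
        ILogger log)
    {
        // If all retries are exhausted and the final attempt still throws, the trigger
        // logs the failure and checkpoints anyway, so the batch is skipped (the
        // limitation discussed in this thread). Keep the total retry time well under
        // the Consumption plan timeout.
        await RetryPolicy.ExecuteAsync(() => ProcessBatchAsync(input, log));
    }

    private static Task ProcessBatchAsync(IReadOnlyList<Document> input, ILogger log)
    {
        log.LogInformation("Processing {Count} changes", input.Count);
        return Task.CompletedTask; // placeholder for the real processing
    }
}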

sayyaari commented 2 years ago

@ealsur, is the retry policy support for the Cosmos DB Trigger that you added recently an interim change (valid only until October 2022 according to this document), or does the document just need to be updated?

ealsur commented 2 years ago

@sayyaari The change I added is for the 4.x extension that will release sometime next month; the document would need to be updated, I guess.

sayyaari commented 2 years ago

Thanks @ealsur. Still, we have a major problem: when the Azure Function is terminated for any reason, the lease is advanced. Per my investigation of the code, this is because the ProcessChangesAsync method of CosmosDBTriggerListener, which handles the ChangeFeedProcessor, doesn't throw any exception that would stop the lease from advancing in such a situation; it just logs the failed results and progresses gracefully. Is there any plan to change this behaviour and make it, for example, more configurable in the next release?

ealsur commented 2 years ago

@sayyaari I believe you should create a separate issue that we can discuss independently. The Azure Function being terminated and the Listener not aborting on errors are two different things. Please see: https://github.com/Azure/azure-webjobs-sdk-extensions/issues/716#issuecomment-824910846

mpaul31 commented 1 year ago

Unless an alternative to the CosmosDBTrigger retry policy is provided by the Azure Functions team, we will unfortunately have to pin the Azure Functions Runtime version. Pinning FUNCTIONS_EXTENSION_VERSION to the version before the retry policy behavior is changed should give us additional time.

Would our request be better considered if we created a support request? I recall the removal of Azure Functions Proxies in the v4 upgrade. (BTW, Functions Proxies are about to be put back after being removed)

@shibayan Is pinning the version the route you are going with? I am debating this approach vs. adding a Polly retry policy.

shibayan commented 1 year ago

@mpaul31 I am currently looking at ways to pin Azure Functions Runtime for safety and then support the new Cosmos DB Extension v4. The reason is to avoid losing retry support for CosmosDBTrigger in Azure Function Runtime due to platform updates before the Cosmos DB Extension v4 migration is complete.

The Azure Functions team has disclosed that the previous retry policy will be removed in October, so the risk is that we cannot tell exactly when in October the removal will take place.

mpaul31 commented 1 year ago

@mpaul31 I am currently looking at ways to pin Azure Functions Runtime for safety and then support the new Cosmos DB Extension v4. The reason is to avoid losing retry support for CosmosDBTrigger in Azure Function Runtime due to platform updates before the Cosmos DB Extension v4 migration is complete.

The Azure Functions team has disclosed that the previous retry policy will be removed in October, so the risk is that we cannot tell exactly when in October the removal will take place.

I am in the exact same boat. Updating the FUNCTIONS_EXTENSION_VERSION app setting to version 3.15.0.0 seems less invasive at this point.

Ved2806 commented 1 year ago

Hi @HMoen Is this resolved for you?

HMoen commented 1 year ago

The Azure Functions team has disclosed that the previous retry policy will be removed in October, so the risk is that we cannot tell exactly when in October the removal will take place.

@Ved2806, has this been confirmed?

Ved2806 commented 1 year ago

Hi @shibayan Could you please confirm?

shibayan commented 1 year ago

@Ved2806 I believe it will be difficult to confirm until the new Cosmos DB Extension v4 goes GA.

The Cosmos DB Extension v4 should reach GA before the retries in Cosmos DB Extension v3 are removed from Azure Functions Runtime. https://www.nuget.org/packages/Microsoft.Azure.WebJobs.Extensions.CosmosDB/

ealsur commented 1 year ago

For what it's worth, the RC package contains the code for the Retry Policy integration: https://www.nuget.org/packages/Microsoft.Azure.WebJobs.Extensions.CosmosDB/4.0.0-rc

shibayan commented 1 year ago

The retry policy works in v4.0.0-rc, but the scale controller no longer monitors the Change Feed when used with the Consumption plan.

As a result, when a change occurs in the Change Feed, the Azure Function is not started and processing is not performed.

ealsur commented 1 year ago

@shibayan That is the reason the package is not yet GA; deployment of Scale Controller support is not completed yet (reference https://azure.microsoft.com/en-us/updates/azure-cosmos-db-azure-functions-extension-version-40-in-public-preview/)

iamarvil commented 1 year ago

It's finally GA! Thank you very much @ealsur!

We can finally update now!

HMoen commented 1 year ago

@shibayan @ealsur, does this enhancement apply to out-of-proc functions as well, or are the docs accurate?

(screenshot of the retry support table from the Azure Functions documentation)

ealsur commented 1 year ago

@HMoen I cannot comment; I am not aware of how out-of-proc works, but based on the documentation you linked, there is no support for any extension.