HMoen opened this issue 2 years ago
Because this is not exclusive to the isolated model, we should move this item to the extension repo. My understanding is that this would be a feature request for the Cosmos DB extension.
Will there be any guidance on retries for Cosmos DB triggered functions before the October 2022 deadline when retry policy support will be removed?
Cosmos DB is currently the outlier on the table of retry support for triggers/bindings, with "n/a" and "Not configurable" listed. https://docs.microsoft.com/en-us/azure/azure-functions/functions-bindings-error-pages#retries
Hi @ealsur Could you please help with this issue?
Will there be any guidance on retries for Cosmos DB triggered functions before the October 2022 deadline when retry policy support will be removed?
Cosmos DB is currently the outlier on the table of retry support for triggers/bindings, with "n/a" and "Not configurable" listed. https://docs.microsoft.com/en-us/azure/azure-functions/functions-bindings-error-pages#retries
I agree. We built our change feed processing with retries in mind and were just about to roll out our solution. While moving to WebJobs with our own change feed processor is a viable option, we would lose other benefits that Azure Functions provides. Any guidance for Cosmos DB change feed retries would be appreciated.
At least at this time, there is no retry policy support in the Cosmos DB extension that is built in or that uses any of those configurations.
@ealsur , any plans to implement before the October 2022 deadline mentioned above? If not, so we can plan our own retry mechanism (queue etc). Thanks.
If I understand the notice correctly, the retry policy is being removed from all triggers that currently support it:
The retry policy support in the runtime for triggers other than Timer and Event Hubs is being removed after this feature becomes generally available (GA). Preview retry policy support for all triggers other than Timer and Event Hubs will be removed in October 2022.
So adding Retry Policy support would not make sense as it does not align with the notice.
We currently have no planned work to add custom retry configuration, but that can change if this is a feature that is widely requested (as an opt-in configuration). Technically, the Trigger could allow infinite retries or no retries (nothing intermediate) because we internally use the Change Feed Processor and those would be the only available options.
@ealsur , thanks for the feedback. Infinite or none won't work in our environment, unfortunately. We piggybacked off dead-letter queues before and removed that functionality when we found out about the built-in retry mechanism.
Link added to upvote idea.
The removal of this has a very large impact on many of our production apps that use the change feed (Cosmos DB trigger) with infinite exponential retries. We obviously don't want to lose any messages and rely heavily on this. @ealsur Is there any way to get this added before the deadline, or possibly a sample workaround? I would much rather not bring in more infrastructure (queues) and build something specifically for this in all our apps.
@mpaul31 I don't understand what the request here is. Right now the Change Feed Trigger has no retry policy, so it never retries; how are you creating apps that have infinite exponential retries? The deadline the documentation shows is the date after which the retry policies will be removed. For Cosmos DB they were never there in the first place, so there is no point in adding them if they are going to be removed anyway.
This issue requires input from the Azure Functions team for clarification @Ved2806
@ealsur We use the following attribute on our change feed trigger method:

```csharp
[ExponentialBackoffRetry(-1, "00:00:01", "00:00:10")]
```

It would be awesome if we could configure similar behavior on the `CosmosDBTrigger`.

There is no code in the Cosmos DB Trigger or extension that reacts to `[ExponentialBackoffRetry(-1, "00:00:01", "00:00:10")]`.
The Trigger code in fact has no error retry logic. Here is where your Function code is called: https://github.com/Azure/azure-webjobs-sdk-extensions/blob/dev/src/WebJobs.Extensions.CosmosDB/Trigger/CosmosDBTriggerListener.cs#L225. If there was an execution error, it is simply logged and the next batch of changes is processed. This is mentioned in the Cosmos DB documentation (https://docs.microsoft.com/en-us/azure/cosmos-db/sql/troubleshoot-changefeed-functions#some-changes-are-missing-in-my-trigger).
There isn't (and never was) any logic that would pick up that `ExponentialBackoffRetry` decorator information in the Cosmos DB Trigger. Whether the Functions runtime is doing something with it, I don't know, but the Trigger code always moves forward. My understanding is that the expectation that these retry policy decorators had any impact on the `CosmosDBTrigger` is incorrect; they are basically not doing anything, hence when they get deprecated, nothing will change.
@ealsur , on this thread https://github.com/Azure/azure-functions-dotnet-worker/issues/832#issuecomment-1072545934, @kshyju mentioned the ExponentialBackoffRetry only worked for in-proc function apps.
I don't know how that works. From the Trigger's perspective, nothing blocks the checkpointing and moving forward if the execution of the code fails.
Once the `ProcessChangesAsync` method completes in the Trigger code without errors, the state is saved and the next batch is obtained. The execution of the user code can never make the method fail because we are using `TryExecuteAsync`:

```csharp
FunctionResult result = await this._executor.TryExecuteAsync(new TriggeredFunctionData() { TriggerValue = docs }, cancellationToken);
```
For any logic to correctly affect the Trigger and make it retry the current batch (not checkpoint), the `ProcessChangesAsync` method would have to fail. That is how the Change Feed Processor would do the actual retry.

There is also no in-proc specific code in the `CosmosDBTriggerListener` either.
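The checkpointing mechanics being described can be sketched in plain C#. This is a simplified stand-in, not the actual extension code; all names are hypothetical. The point is only that the checkpoint advances when the batch handler returns normally and stays put when it throws, which is exactly what `TryExecuteAsync` prevents from ever happening in the real listener:

```csharp
using System;
using System.Collections.Generic;

// Simplified stand-in for the Change Feed Processor loop: the lease
// (checkpoint) only advances when the batch handler completes without
// throwing. These names are illustrative, not the real extension types.
public static class ChangeFeedSketch
{
    public static int Checkpoint { get; private set; }

    public static void ProcessBatches(IReadOnlyList<string[]> batches, Action<string[]> handler)
    {
        while (Checkpoint < batches.Count)
        {
            try
            {
                handler(batches[Checkpoint]);
                Checkpoint++; // advance the lease only on success
            }
            catch
            {
                // On failure the lease is NOT advanced; the same batch is
                // retried (in reality, by this or any other instance that
                // claims the released lease). Here we retry immediately.
            }
        }
    }
}
```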
The linked comment is not saying that the configuration applies to CosmosDBTrigger either.
The issue https://github.com/Azure/azure-functions-dotnet-worker/issues/832, was opened by someone looking specifically for retry logic for CosmosDBTrigger and @kshyju offered the linked comment as the solution. That issue should be updated to reflect your comments.
I will let @kshyju comment then. As the author of the CosmosDBTrigger, I am not aware of any behavior affected by these policies in the Trigger source code.
The issue Azure/azure-functions-dotnet-worker#832, was opened by someone looking specifically for retry logic for CosmosDBTrigger and @kshyju offered the linked comment as the solution. That issue should be updated to reflect your comments.
The retry option I mentioned is a Functions runtime feature which will retry the function execution when it fails.
For example, if you have a function like this:

```csharp
[Function("Function1")]
public void Run([CosmosDBTrigger(
    databaseName: "items-db",
    collectionName: "products-container",
    ConnectionStringSetting = "MyCosmosDbConnStr",
    LeaseCollectionName = "leases")] IReadOnlyList<MyDocument> input)
{
    if (input != null && input.Count > 0)
    {
        _logger.LogInformation($"Documents modified: {input.Count}");
        _logger.LogInformation($"First document Id: {input[0].Id}");

        // example to simulate failed function execution
        if (input[0].Name.Contains("foo"))
        {
            throw new ArgumentNullException("Bad name for document");
        }
    }
}
```
and configure your `host.json` to enable retry behavior:

```json
{
  "version": "2.0",
  "retry": {
    "strategy": "fixedDelay",
    "maxRetryCount": 2,
    "delayInterval": "00:00:03"
  }
}
```
Let's say you edit some document in your Cosmos DB collection. The `Function1` function will be triggered, and if the document name contains "foo", the function code simply throws, which creates a failed function execution. The function runtime (host) will retry the same trigger again after 3 seconds, and this happens 2 times.
So in short, the retry behavior is something the Functions runtime/host does based on what is configured in host.json. Hope this helps.
@kshyju , thanks for your response. Is the retry behavior you describe above permanent in the runtime for out-of-proc functions?
I opened this issue since, as stated above, in early June we started to see the following trace in App Insights for our AF Cosmos DB triggers: "Soon retries will not be supported for function '[Function Name]'. For more information, please visit http://aka.ms/func-retry-policies."
The function runtime(host) will retry the same trigger again after 3 seconds and this happens 2 times.
Will the execution be with the same input (basically the same `IReadOnlyList<MyDocument> input` list)? I'm intrigued to understand how this works. I have not added any logic in the Trigger that would stop the checkpointing, unless this decorator is basically stopping the instance?
@ealsur Yes, the same list. The logic is somewhere in the Functions host, I believe.
@kshyju , thanks for your response. The retry behavior you depict above is/will be permanent in the runtime for out-of-proc functions?
I opened this issue since as stated above early June we started to see the following trace in AppInsights for our AF Cosmos DB triggers: "Soon retries will not be supported for function '[Function Name]'. For more information, please visit http://aka.ms/func-retry-policies."
This is the current behavior as of today. Let me circle back with the team to understand the migration path and update you.
@kshyju Any updates on the migration path or postponing the removal?
The retry behavior is currently a preview feature. As the deprecation notice mentions, this feature will only be available for Timer and Event Hubs triggers at GA. Preview features are considered experimental and meant to collect feedback, and not all of them make it to GA.
This will be applicable for both in-proc and isolated.
I get that, but just like staging slots, which were in preview for a very long time (1+ years), some adopt things sooner out of necessity. What about guidance, then? Will the host just ignore the attribute, or stop the app from starting? I'm sure there were many excited to see this as it relates to the Cosmos trigger who are now going to face the same fate.
The fundamental issue is not that `CosmosDBTrigger` cannot be configured with a retry policy, but that `CosmosDBTrigger` advances the lease when a Function fails to execute.
I believe this could be solved if `CosmosDBTrigger`, instead of a retry policy, could use exponential backoff to extend the polling time for the Change Feed without advancing the lease on Function failure.
Just to be clear, the retry policy/attribute does indeed work today, i.e. with lease checkpointing working correctly. It appears to be using an in-memory retry policy similar to Polly.
I get that, but just like staging slots, which were in preview for a very long time (1+ years), some adopt things sooner out of necessity. What about guidance, then? Will the host just ignore the attribute, or stop the app from starting? I'm sure there were many excited to see this as it relates to the Cosmos trigger who are now going to face the same fate.
Agreed. I imagine this feature has had a lot of adoption out of necessity over its 1-2 years of availability. With the loss of ExponentialBackoffRetry, transient failures are going to become a lot more expensive again, with knock-on effects on compute, source/target operations, logging, etc. Customers need guidance on what will happen in October and how to mitigate without the out-of-the-box in-memory retry option. Removing preview support for retry patterns seems a strange decision given the alignment toward Azure cloud best practices.
This is certainly required for Cosmos.
Before I started using this attribute, I encountered issues where "in-flight" batches of changes were lost when performing a release.
I was advised (funnily enough by @ealsur!) to use the retry policy to mitigate this here:
https://github.com/Azure/azure-webjobs-sdk-extensions/issues/716#issuecomment-824920182
In general, too, change feed processing may well be writing to things with 99.999% availability, and there needs to be a way to handle the 0.001% without just skipping that batch and moving on.
I was advised (funnily enough by @ealsur!) to use the retry policy to mitigate this here
My response does not advise you to use the retry policies; it is about what happens if your function crashes before checkpointing. I mentioned the documentation about the retry policies that were in preview because my understanding back then was that Functions was building a generic retry layer for all extensions; I did not explicitly say to use it. I don't work on the Functions team and have no insight into the roadmap or plans for their features. I wasn't even aware these retry policies affected the Cosmos DB Trigger (as you can see on this thread).
In general too the change feed processing may well be writing to things with 99.999% availability and there needs to be a way to handle the 0.001% without just skipping that batch and moving on.
There is: a `try/catch` approach in the Function can let you handle any individual item failure. This pattern is commonly used (because retrying the entire batch is not something you would want to do in some cases): the individual items that failed processing are sent to a queue or other storage for later processing or analysis (why did they fail? is the failure transient?).
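A minimal sketch of that pattern, assuming the dead-letter destination is abstracted away (here it is an in-memory list; in practice it would be, e.g., a Storage queue output binding):

```csharp
using System;
using System.Collections.Generic;

public static class DeadLetterSketch
{
    // Stand-in for a real dead-letter queue (e.g. a Storage queue binding).
    public static readonly List<string> DeadLetters = new List<string>();

    // Process each change individually; failed items are parked in the
    // queue instead of failing (and re-running) the whole batch.
    public static void ProcessBatch(IEnumerable<string> docs, Action<string> process)
    {
        foreach (var doc in docs)
        {
            try
            {
                process(doc);
            }
            catch (Exception)
            {
                DeadLetters.Add(doc); // keep the item for later analysis/retry
            }
        }
    }
}
```

Note this trades batch ordering for progress: later items are still processed after an earlier one fails, which is exactly why it does not fit scenarios that depend on document order.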
There are two separate issues here.
The first is that I have certainly observed that deploying changes to the function app means batches end up being aborted (maybe due to a thread abort) while the lease is still updated, so the net effect is that the batch is just missed. After using this preview feature I did not observe any instances of this. When it is removed, I will no longer have confidence in any method of deployment that avoids this (the linked thread never got to the bottom of the issue).
The second is that I don't want to provision a whole load of largely redundant resources for the 0.001% case (and write code to then merge from the fallback storage into the main storage). I just want the checkpoint not to be written and the batch to be retried later (as happens with the exponential retry). Is there anything that can be put in the catch block that will achieve the suppression of checkpointing?
(A typical use case for me is using the change feed to upsert into a different Cosmos DB, e.g. to maintain a materialized view of a collection, so pretty much all errors will be transient, such as throttling, and will succeed on retry.)
Is there anything that can be put in the catch block that will achieve the suppression of checkpointing?
Not at this point. As I mentioned in https://github.com/Azure/azure-webjobs-sdk-extensions/issues/783#issuecomment-1216727630, aligning with the Change Feed Processor, the only potential alternatives we could offer for the Cosmos DB Trigger are either an infinite retry (retry on any error) or no retry. There is no exponential backoff retry support in the Change Feed Processor. When checkpointing is aborted due to a processing error, the lease is released; any instance (the same or another) could claim it and retry, so storing the number of previous attempts on an instance would not work.
Unless the retry policies are exposed to extension authors in a way that we could invoke/call/signal them somehow; but from what I understand in this thread, the model is that retry policies exist and apply outside of the extension's scope.
Unless an alternative to the `CosmosDBTrigger` retry policy is provided by the Azure Functions team, we will unfortunately have to pin the Azure Functions Runtime version. Pinning `FUNCTIONS_EXTENSION_VERSION` to the version before the retry policy behavior was changed should give us additional time.
Would our request be better considered if we created a support request? I recall the removal of Azure Functions Proxies in the v4 upgrade. (BTW, Functions Proxies are about to be brought back after being removed.)
As a temporary fix, we'll be wrapping our triggers with Polly in the meantime just to cover transient errors. Consumption plan timeouts would be the next thing to worry about, but considering that processing usually takes less than a second anyway, I don't think it's too much of a concern. We're just unsure how much cost it will add to our Azure Functions bill.
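For reference, the kind of in-memory wrapper being described could look like the following dependency-free sketch (with Polly itself you would likely use `Policy.Handle<Exception>().WaitAndRetry(...)` instead; all names here are illustrative):

```csharp
using System;
using System.Threading;

public static class RetrySketch
{
    // Retry an action up to maxAttempts times with a fixed delay between
    // attempts, rethrowing the last exception if every attempt fails.
    public static void Execute(Action action, int maxAttempts, TimeSpan delay)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                action();
                return;
            }
            catch when (attempt < maxAttempts)
            {
                // Presumed transient failure: wait, then try the same work again.
                Thread.Sleep(delay);
            }
        }
    }
}
```

Inside the trigger, the batch-processing body would be passed in as `action`; note that, unlike the removed retry policy, this retry time counts against the Consumption plan function timeout, as the comment above anticipates.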
I would consider adding a try/catch block to send failing items to queues, but we really rely on the order of documents coming through.
A retry-forever with a fixed wait duration on the Cosmos DB trigger would be great, though, and is exactly what we need.
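Assuming the preview host-level retry settings (where, per the Functions documentation, a `maxRetryCount` of `-1` means retry indefinitely; the 5-second delay below is just an example), that behavior would have been expressed as:

```json
{
  "version": "2.0",
  "retry": {
    "strategy": "fixedDelay",
    "maxRetryCount": -1,
    "delayInterval": "00:00:05"
  }
}
```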
@ealsur is the retry policy support for the Cosmos DB Trigger that you've added recently an interim change until October 2022 according to this document, or does the document just need to be updated?
@sayyaari The change I added is for the 4.x extension that will release sometime next month; the document would need to be updated, I guess.
Thanks @ealsur. Still, we have a major problem: when the Azure Function is terminated for any reason, it advances the lease. Per my investigation of the code, this is because the `ProcessChangesAsync` method of `CosmosDBTriggerListener`, which drives the `ChangeFeedProcessor`, doesn't throw any exception, so the lease still advances in such situations; it just logs failed results and progresses gracefully. Is there any plan to change this behaviour, for example to make it more configurable in the next release?
@sayyaari I believe you should create a different issue that we can discuss separately. Azure Function being terminated and the Listener not aborting on errors are two different things. Please see: https://github.com/Azure/azure-webjobs-sdk-extensions/issues/716#issuecomment-824910846
Unless an alternative to the `CosmosDBTrigger` retry policy is provided by the Azure Functions team, we will unfortunately have to pin the Azure Functions Runtime version. Pinning `FUNCTIONS_EXTENSION_VERSION` to the version before the retry policy behavior was changed should give us additional time. Would our request be better considered if we created a support request? I recall the removal of Azure Functions Proxies in the v4 upgrade. (BTW, Functions Proxies are about to be brought back after being removed.)
@shibayan Is pinning the version the route you are going with? I am debating this approach vs. adding a Polly retry policy.
@mpaul31 I am currently looking at ways to pin the Azure Functions Runtime for safety and then adopt the new Cosmos DB Extension v4. The reason is to avoid losing retry support for `CosmosDBTrigger` in the Azure Functions Runtime due to platform updates before the Cosmos DB Extension v4 migration is complete.
The Azure Functions team has disclosed that the previous retry policy will be removed in October; the risk is that we cannot tell exactly when in October the removal will take place.
@mpaul31 I am currently looking at ways to pin the Azure Functions Runtime for safety and then adopt the new Cosmos DB Extension v4. The reason is to avoid losing retry support for `CosmosDBTrigger` in the Azure Functions Runtime due to platform updates before the Cosmos DB Extension v4 migration is complete. The Azure Functions team has disclosed that the previous retry policy will be removed in October; the risk is that we cannot tell exactly when in October the removal will take place.
I am in the exact same boat. Updating the app setting `FUNCTIONS_EXTENSION_VERSION` value to version `3.15.0.0` seems less invasive at this point.
Hi @HMoen Is this resolved for you?
The Azure Functions team has disclosed that the previous retry policy will be removed in October; the risk is that we cannot tell exactly when in October the removal will take place.
@Ved2806 , has this been confirmed?
Hi @shibayan Could you please confirm?
@Ved2806 I believe it will be difficult to confirm until the new Cosmos DB Extension v4 goes GA.
The Cosmos DB Extension v4 should reach GA before the retries for the v3 extension are removed from the Azure Functions Runtime. https://www.nuget.org/packages/Microsoft.Azure.WebJobs.Extensions.CosmosDB/
For what it's worth, the `rc` package contains the code for the retry policy integration: https://www.nuget.org/packages/Microsoft.Azure.WebJobs.Extensions.CosmosDB/4.0.0-rc
Retry policy works in v4.0.0-rc, but the Scale Controller no longer monitors the Change Feed when used with the Consumption Plan.
As a result, when a change occurs in the Change Feed, the Azure Function is not started and processing is not performed.
@shibayan That is the reason the package is not yet GA; Scale Controller support has not finished deploying yet (reference https://azure.microsoft.com/en-us/updates/azure-cosmos-db-azure-functions-extension-version-40-in-public-preview/)
It's finally on GA! Thank you very much @ealsur!
We can finally update now!
@shibayan @ealsur , does this enhancement apply to out-of-proc functions as well, or are the docs accurate?
@HMoen I cannot comment; I am not aware of how out-of-proc works, but based on the documentation you linked, there is no support for any extension.
@kshyju's answer below works at the moment, but we recently started to see the following trace in App Insights for our AF Cosmos DB triggers: "Soon retries will not be supported for function '[Function Name]'. For more information, please visit http://aka.ms/func-retry-policies."
The Retry examples section also shows "Retry policies aren't yet supported when running in an isolated process." and the Retries section reflects no support for the Cosmos DB trigger/binding.
What's the path forward for AF Cosmos DB triggers running out-of-process?
Yes, retry policies are supported in isolated (out-of-process) function apps. You can enable them by adding the retry section to your host config. Here is an example host.json.
The documentation that refers to the usage of the `ExponentialBackoffRetry` attribute is for in-proc function apps. Please give it a try and let us know if you run into any problems.
Originally posted by @kshyju in https://github.com/Azure/azure-functions-dotnet-worker/issues/832#issuecomment-1072545934