Allow option for us to manually checkpoint for Cosmos change feed or option to not update leases on error

martinsmith123456 commented 3 years ago

Please provide a succinct description of the issue.

Repro steps

On unhandled error in the function app the cosmos trigger will checkpoint anyway - meaning this batch is never processed. Whilst this is useful for avoiding poison messages and the processing being permanently stuck it is not always the behaviour I want.

Often I want "at least once successfully processed" semantics not just "at least once seen"

If my trigger encounters a transient error trying to access other cloud resources I would like the failures to be automatically retried.

I have also observed that we sometimes lose in flight documents when the function app is turned off for deployments (seemingly somehow the function is aborted but the leases collection is still updated so it resumes from after the lost documents on restart)

Expected behavior

That we should be able to manually call checkpoint or to configure an option to only checkpoint after successful processing (and accept the risk that this may cause processing to be stuck if our code is not robust)

MartinWickman commented 3 years ago

If a function app crashes (is stopped/killed) during processing of a document/batch, the trigger will continue with the next document/batch once the app is up again, effectively losing that change. This is not really acceptable considering there just isn't a way to safely stop the function app without risk of losing documents.

I think this should work as the Event Hub trigger, which will retry the same event only if the app does not respond at all. Without knowing the details, it seems to me that the trigger should only update the checkpoint after the function has completed, not before (which it looks like to me).

ealsur commented 3 years ago

I think this should work as the Event Hub trigger, which will retry the same event only if the app does not respond at all. Without knowing the details, it seems to me that the trigger should only update the checkpoint after the function has completed, not before (which it looks like to me).

This is exactly how it works. The checkpoint only happens after the Function finishes execution, not before. If there is a crash that stops all running code while your Function executing is in-flight, then the point where the checkpoint happens never reaches.

martinsmith123456 commented 3 years ago

I think this should work as the Event Hub trigger, which will retry the same event only if the app does not respond at all. Without knowing the details, it seems to me that the trigger should only update the checkpoint after the function has completed, not before (which it looks like to me).

This is exactly how it works. The checkpoint only happens after the Function finishes execution, not before. If there is a crash that stops all running code while your Function executing is in-flight, then the point where the checkpoint happens never reaches.

What does stopping the function app in the portal actually do?

I definitely encountered a case when using the change feed to copy documents from a collection with constant writes that I had batches of missed documents that correspond to documents that would have been in flight when the function app was shut down.

It looks as though somehow the function was aborted but the code that then updates the leases still managed to run somehow.

Even without this scenario I would still want this request anyway though because transient errors can happen and the simplest and best solution for this would be to just retry the same batch. (which is the default behaviour of the change feed processor before the function app wraps it)

ealsur commented 3 years ago

What does stopping the function app in the portal actually do?

I don't know in this case, but would not be an instance crash.

Even without this scenario I would still want this request anyway though because transient errors can happen and the simplest and best solution for this would be to just retry the same batch. (which is the default behaviour of the change feed processor before the function app wraps it)

I think this is what https://docs.microsoft.com/en-us/azure/azure-functions/functions-bindings-error-pages?tabs=csharp#retry-policies-preview is about.

As you mentioned, this is a wrapping of the Change Feed Processor. And Change Feed Processor only checkpoints after the execution, not before. So if a checkpoint happened, it means the Function delegate completed execution.

v-anvari commented 3 years ago

Hi @martinsmith123456 , Let us know if this can be closed as the required information is provided

martinsmith123456 commented 3 years ago

Hi @martinsmith123456 , Let us know if this can be closed as the required information is provided

Yes thanks, I'll add retry policy attributes to all my Comsos triggered functions. Hopefully that will solve this!

sayyaari commented 2 years ago

Hi @ealsur I asked the following question on another issue thread and you properly guide me to raise the question here.

Thanks for adding retry policy support for version 4.x, but still we have a major problem (as mentioned in this thread) which is when the Azure Function is terminated for any reason and it would advance the lease. Per my investigation of the code, it is because the ProcessChangesAsync method of CosmosDBTriggerListener which handles the ChangeFeedProcessor doesn't throw any exception which is the expected behavior per this document so that it doesn't advance the checkpoint. Instead, It just logs when it faces the failed results and progresses gracefully while we expect that at least it gives the consumers some options of custom error handling mechanism at this level so that they mindfully decide what to do in case of such errors.

As you know, the consequence of the current behavior is that all the documents in that batch will be lost.

Is there any plan to change this behavior and make it for example more configurable (through Options or some error handling delegates) for the next release as supporting the retry policy?

ealsur commented 2 years ago

@sayyaari Can you clarify what evidence do you have to say that if the Function is terminated the checkpoint still happens? As I described in this thread, when a Function is terminated, all executing code should stop, including the checkpoint, that happens after your Function code executes. So if the Function stops while your code executes, it would not reach the checkpoint.

The retry policies add support for retrying a failed batch already, by Functions design, there cannot be 2 layers of retry handling. I added the retry policies as requested, adding another configuration that would generate a retry on the CFP code would go against the guidelines.

ealsur commented 2 years ago

We understand as Function Terminated as the instance shutting down/crashing.

The reason the Function Trigger does not automatically retry on errors is by design, and aligns with other event based triggers (for example Event Hub https://github.com/Azure/azure-sdk-for-net/blob/a5f2ba22c34a6d931b3717aa36f1c286c4647d13/sdk/eventhub/Microsoft.Azure.WebJobs.Extensions.EventHubs/src/Listeners/EventHubListener.cs#L184).

sayyaari commented 2 years ago

Thanks @ealsur and sorry for delay as I tried to find some time so that I can elaborate more clearly.

As you mentioned by function termination I mean the instance is shutting down/crashing. We understand as Function Terminated as the instance shutting down/crashing.

@sayyaari Can you clarify what evidence do you have to say that if the Function is terminated the checkpoint still happens? As I described in this thread, when a Function is terminated, all executing code should stop, including the checkpoint, that happens after your Function code executes. So if the Function stops while your code executes, it would not reach the checkpoint.

I will try to describe my discovery, understanding and suggestion so that my point be more clearer this time.
The Azure Function execution is happening inside this loop https://github.com/Azure/azure-webjobs-sdk/blob/919d2d63419691166a6e36f1c1785f4c2c31aeb3/src/Microsoft.Azure.WebJobs.Host/Executors/FunctionExecutorExtensions.cs#L31
The RetryStrategy works until the Azure Function is running because all the information about the strategy and its current state like how many times the retry has been occurred are kept inside this infinite while loop in the TryExecuteAsync of FunctionExecutorExtensions
This loop will continue until o The batch is processed successfully o Or the retry strategy if any exist reaches its termination condition (like max retry times) o Or the function is terminated (e.g. shutting down)
The Azure function is called inside the loop by calling the Function executor. https://github.com/Azure/azure-webjobs-sdk/blob/919d2d63419691166a6e36f1c1785f4c2c31aeb3/src/Microsoft.Azure.WebJobs.Host/Executors/FunctionExecutorExtensions.cs#L50
The function executor (https://github.com/Azure/azure-webjobs-sdk/blob/dev/src/Microsoft.Azure.WebJobs.Host/Executors/FunctionExecutor.cs) catches any exception happens inside the Azure Functions and then returns FunctionResult.
If the FunctionResult is not null means the function execution faced exception and the loop check for the retry strategy
If the Retry Strategy can be used, then a Task.Delay operation will be executed between each retry inside the loop and this Delay operation is wrapped by a try/catch to catch any TaskCanceledException so that if the Azure Function Termination (e.g shut down) happens during this delay, it can break the loop. https://github.com/Azure/azure-webjobs-sdk/blob/919d2d63419691166a6e36f1c1785f4c2c31aeb3/src/Microsoft.Azure.WebJobs.Host/Executors/FunctionExecutorExtensions.cs#L83
In nutshell, if the Azure Function is terminated this loop will break and return the FunctionResult. to its caller (the caller in the stacktrace which I want to emphasize) the ProcessChangesAsync in the CosmosDBTriggerListener https://github.com/Azure/azure-webjobs-sdk-extensions/blob/79e2b16b6703561bd6a7254fe4a91246136aa5dc/src/WebJobs.Extensions.CosmosDB/Trigger/CosmosDBTriggerListener.cs#L222
This ProcessChangesAsync is the one of the places which I think the changes can happen.
This function ProcessChangesAsync is the delegate passed to the ChangeFeedProcessor to process the document.
According to this document https://learn.microsoft.com/en-us/azure/cosmos-db/sql/change-feed-processor?tabs=dotnet#error-handling if this delegate throws any exception the checkpoint won’t proceed and it will be retried . To be clear for other readers it should be mentioned that this retry is not the same as the retry strategy like FixedDelayRetry used to annotate Azure Functions, it is a built in retry of the change feed processor when the check point doesn't proceed).
But as it can be observed in the ProcessChangesAsync https://github.com/Azure/azure-webjobs-sdk-extensions/blob/79e2b16b6703561bd6a7254fe4a91246136aa5dc/src/WebJobs.Extensions.CosmosDB/Trigger/CosmosDBTriggerListener.cs#L222 , this method doesn’t throw any exception when the FunctionResult is not a success even the failure is because the Azure Function is terminated (e.g. shut down)

My point is that:

ProcessChangesAsync should have some information about the cause of the failure result so that if the failure is due to Azure Function termination (like Shut down) it throws exception so that the checkpoint doesn’t proceed.
As I mentioned in my above comments, this behaviour can be configurable.
I am sure if we accept this requirement as a legitimate requirement, then there are many ways to cover it.

ealsur commented 2 years ago

ProcessChangesAsync should have some information about the cause of the failure result so that if the failure is due to Azure Function termination (like Shut down) it throws exception so that the checkpoint doesn’t proceed.

That is not up to the Extension, the contract is defined by the Functions runtime, if this is exposed, then we (extension authors) can interpret it, but this is not something extension authors can act on, this needs to be first done at the Functions level.

In nutshell, if the Azure Function is terminated this loop will break and return the FunctionResult. to its caller (the caller in the stacktrace which I want to emphasize) the ProcessChangesAsync in the CosmosDBTriggerListener

If the Function crashes, why would code execution continue? Last week we actually investigated an incident that was exactly this case. The user was exhausting the instance resources and the instance crashed, the checkpoint was not saved and the Function kept retrying the same batch over and over, because the user code execution exhausted resources every time.

The code you have in the private repository TestCosmosDbTriggerRetry does not generate an instance crash but an unhandled runtime exception, it is not the same. Unhandled runtime exceptions are not retried as documented: https://learn.microsoft.com/en-us/azure/cosmos-db/sql/troubleshoot-changefeed-functions#some-changes-are-missing-in-my-trigger

Function crashing could be simulated by:

Exhausting memory/CPU
Exhausting time (maximum runtime is configured as 5min)

In any of the cases, make sure the Function runs successfully at least once to generate a saved state it can fallback/retry on.

sayyaari commented 2 years ago

In our case, the Function termination happened in reality when a new Function instance is deployed or stop manually and that function retry strategy was kicked off because off some exception during the document process. The function was not crashing because of the Resource Exhaustion.
I simulate the function stop/terminate in dubbing mode in Vs Code by pressing
- Ctrl+C in terminal and then you should be able to see Invocation cancelled - exiting retry loop. which come from logging in this line:
- https://github.com/Azure/azure-webjobs-sdk/blob/919d2d63419691166a6e36f1c1785f4c2c31aeb3/src/Microsoft.Azure.WebJobs.Host/Executors/FunctionExecutorExtensions.cs#L90
- Please Terminate the function after the exception in my code is thrown which kicks off the retry strategy
I updated the code and now the batch size is 1
There are 3 documents in my local CosmosDb
If you press Ctrl+C in terminal after the function throws Exception("Test") in my code and then start the function again you can see that the next item will be processed. I tested this pattern with all of the items in my db.

But the interesting observation is that:

if you remove the "throw" exception line which means the the retry strategy will not be kicked off then repeat the experiment then the document will be retried in the next function run as you mentioned in the previous comments.

My understanding per debugging your codes (I managed to debug it with the help of Resharper) if the retry strategy is kicked off then following code of the function extension which creates a delay between each retry will catch and swallow TaskCanceledException exception https://github.com/Azure/azure-webjobs-sdk/blob/919d2d63419691166a6e36f1c1785f4c2c31aeb3/src/Microsoft.Azure.WebJobs.Host/Executors/FunctionExecutorExtensions.cs#L83

and as I described the control flow in my previous comment the checkpoint will proceed.

                    try
                    {
                        // If the invocation is cancelled retries must stop.
                        await Task.Delay(nextDelay, cancellationToken);
                    }
                    catch (TaskCanceledException)
                    {
                        logger.LogExitFromRetryLoop();
                        break;
                    }

Is there any reason not to throw the exception?

ealsur commented 2 years ago

@sayyaari That code is outside of the Cosmos DB extension, so I cannot comment on why it is like that or how it should behave (the code you linked does not even belong to this repo). As you saw on the other thread, I discovered that Retry Policies were affecting the Cosmos DB Trigger recently, there is no code on the Trigger itself that interacts with the policies in any way. The code in the Trigger does the same TryExecute pattern as the Event Hub and other event based triggers, they all work the same way.

My experience is with instance crashing and Function terminating (by terminating, stopping running code), not deployments, it looks like for deployments or hitting CTRL+C then the instance is still running code, the question is mainly for the Functions folks, which events are related to this process and how are these communicated upwards. From an Extensions perspective, we only interact with the Executor, and we can only see what it returns. If there is any signal in particular that it returns for this case, then sure we can act on it, but I don't see any.

sayyaari commented 2 years ago

@ealsur The code is outside the CosmosDb extension. I thought it is part of the area you and your team are working on it as well. Does it mean I should post my question on the repo of the code?

ealsur commented 2 years ago

@sayyaari I work on the Cosmos DB Extension: https://github.com/Azure/azure-webjobs-sdk-extensions/tree/dev/src/WebJobs.Extensions.CosmosDB, this repo contains that extension. I don't work nor have scope on any other Function code or behavior (such as what the Retry Policies actually do). I got involved because the ask was to add the required decorators on the Cosmos DB Extension for Retry Policies to detect and be able to apply to it, just like it does for Event Hub, Kafka, and other Triggers.

Azure / azure-webjobs-sdk-extensions