Azure / azure-webjobs-sdk-extensions

Azure WebJobs SDK Extensions
MIT License

Allow option for us to manually checkpoint for Cosmos change feed or option to not update leases on error #716

Closed: martinsmith123456 closed this issue 3 years ago

martinsmith123456 commented 3 years ago


Repro steps

On an unhandled error in the function app the Cosmos trigger will checkpoint anyway, meaning the batch is never processed. Whilst this is useful for avoiding poison messages and permanently stuck processing, it is not always the behaviour I want.

Often I want "at least once successfully processed" semantics, not just "at least once seen".

If my trigger encounters a transient error trying to access other cloud resources, I would like the failures to be automatically retried.

I have also observed that we sometimes lose in-flight documents when the function app is turned off for deployments (seemingly the function is aborted but the leases collection is still updated, so on restart it resumes from after the lost documents).

Expected behavior

We should be able to manually call checkpoint, or to configure an option to only checkpoint after successful processing (accepting the risk that this may leave processing stuck if our code is not robust).
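One workaround that is possible today, sketched below as a minimal example rather than a recommendation: retry transient failures inside the function body so that the invocation only completes (and the lease only advances) once processing has actually succeeded. The sketch assumes the in-process model with the 3.x Cosmos DB extension; the database, collection, and connection names, the retry count, and ProcessBatchAsync are placeholders.

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Microsoft.Azure.Documents;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Extensions.Logging;

    public static class CopyDocuments
    {
        [FunctionName("CopyDocuments")]
        public static async Task Run(
            [CosmosDBTrigger("db", "source",
                ConnectionStringSetting = "CosmosDB",
                LeaseCollectionName = "leases",
                CreateLeaseCollectionIfNotExists = true)] IReadOnlyList<Document> changes,
            ILogger log)
        {
            // Retry transient failures in-process: as described above, an unhandled
            // exception does not stop the trigger from checkpointing the batch, so
            // the safest place to retry is before the invocation completes.
            for (var attempt = 1; ; attempt++)
            {
                try
                {
                    await ProcessBatchAsync(changes);
                    return;
                }
                catch (Exception ex) when (attempt < 5)
                {
                    // If the fifth attempt also throws, the exception propagates,
                    // the batch is checkpointed anyway, and those documents are skipped.
                    log.LogWarning(ex, "Attempt {Attempt} failed, retrying", attempt);
                    await Task.Delay(TimeSpan.FromSeconds(attempt * 2));
                }
            }
        }

        // Placeholder for the real processing/copying logic.
        private static Task ProcessBatchAsync(IReadOnlyList<Document> batch)
            => Task.CompletedTask;
    }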

MartinWickman commented 3 years ago

If a function app crashes (is stopped/killed) during processing of a document/batch, the trigger will continue with the next document/batch once the app is up again, effectively losing that change. This is not really acceptable considering there just isn't a way to safely stop the function app without risk of losing documents.

I think this should work like the Event Hub trigger, which will retry the same event only if the app does not respond at all. Without knowing the details, it seems to me that the trigger should only update the checkpoint after the function has completed, not before (which is what it looks like to me).

ealsur commented 3 years ago

> I think this should work like the Event Hub trigger, which will retry the same event only if the app does not respond at all. Without knowing the details, it seems to me that the trigger should only update the checkpoint after the function has completed, not before (which is what it looks like to me).

This is exactly how it works. The checkpoint only happens after the Function finishes execution, not before. If there is a crash that stops all running code while your Function execution is in flight, then the point where the checkpoint happens is never reached.

martinsmith123456 commented 3 years ago

> I think this should work like the Event Hub trigger, which will retry the same event only if the app does not respond at all. Without knowing the details, it seems to me that the trigger should only update the checkpoint after the function has completed, not before (which is what it looks like to me).
>
> This is exactly how it works. The checkpoint only happens after the Function finishes execution, not before. If there is a crash that stops all running code while your Function execution is in flight, then the point where the checkpoint happens is never reached.

What does stopping the function app in the portal actually do?

I definitely encountered a case, when using the change feed to copy documents from a collection with constant writes, where I had batches of missed documents corresponding to documents that would have been in flight when the function app was shut down.

It looks as though the function was aborted but the code that then updates the leases somehow still managed to run.

Even without this scenario I would still want this anyway, because transient errors can happen and the simplest and best solution would be to just retry the same batch (which is the default behaviour of the change feed processor before the function app wraps it).

ealsur commented 3 years ago

> What does stopping the function app in the portal actually do?

I don't know in this case, but it would not be an instance crash.

> Even without this scenario I would still want this anyway, because transient errors can happen and the simplest and best solution would be to just retry the same batch (which is the default behaviour of the change feed processor before the function app wraps it).

I think this is what https://docs.microsoft.com/en-us/azure/azure-functions/functions-bindings-error-pages?tabs=csharp#retry-policies-preview is about.

As you mentioned, this is a wrapping of the Change Feed Processor. And Change Feed Processor only checkpoints after the execution, not before. So if a checkpoint happened, it means the Function delegate completed execution.
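To make that ordering concrete, here is a simplified sketch of the loop being described; the method names are placeholders and this is an assumed illustration, not the actual extension or Change Feed Processor code.

    using System;
    using System.Collections.Generic;
    using System.Threading;
    using System.Threading.Tasks;

    // Simplified illustration of the ordering described above; not the real code.
    static class CheckpointOrderSketch
    {
        public static async Task RunAsync(CancellationToken cancellationToken)
        {
            while (!cancellationToken.IsCancellationRequested)
            {
                var batch = await ReadNextBatchAsync(cancellationToken); // changes for this lease
                await InvokeFunctionAsync(batch, cancellationToken);     // the Function delegate runs here
                await SaveCheckpointAsync(batch, cancellationToken);     // lease advances only after the delegate returns
                // If the instance crashes while the delegate is still running, the
                // checkpoint call above is never reached, so the same batch is
                // re-read when the processor restarts.
            }
        }

        // Placeholders standing in for the Change Feed Processor internals.
        private static Task<IReadOnlyList<string>> ReadNextBatchAsync(CancellationToken ct)
            => Task.FromResult<IReadOnlyList<string>>(Array.Empty<string>());

        private static Task InvokeFunctionAsync(IReadOnlyList<string> batch, CancellationToken ct)
            => Task.CompletedTask;

        private static Task SaveCheckpointAsync(IReadOnlyList<string> batch, CancellationToken ct)
            => Task.CompletedTask;
    }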

v-anvari commented 3 years ago

Hi @martinsmith123456, let us know if this can be closed, as the required information has been provided.

martinsmith123456 commented 3 years ago

> Hi @martinsmith123456, let us know if this can be closed, as the required information has been provided.

Yes thanks, I'll add retry policy attributes to all my Cosmos-triggered functions. Hopefully that will solve this!
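For reference, a sketch of what that looks like with the in-process model, assuming the preview retry-policy attributes from Microsoft.Azure.WebJobs and the 3.x Cosmos DB extension; the database, collection, connection, and retry values are placeholders.

    using System.Collections.Generic;
    using Microsoft.Azure.Documents;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Extensions.Logging;

    public static class OrdersChangeFeed
    {
        // Retry the whole invocation (and therefore the whole batch) up to five
        // times with a fixed 10-second delay before the failure is given up on.
        [FixedDelayRetry(5, "00:00:10")]
        [FunctionName("OrdersChangeFeed")]
        public static void Run(
            [CosmosDBTrigger("db", "orders",
                ConnectionStringSetting = "CosmosDB",
                LeaseCollectionName = "leases")] IReadOnlyList<Document> changes,
            ILogger log)
        {
            log.LogInformation("Processing {Count} changed documents", changes.Count);
            // Throwing here triggers the retry policy; the lease is only advanced
            // once the invocation (including any retries) finishes.
        }
    }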

sayyaari commented 2 years ago

Hi @ealsur, I asked the following question on another issue thread and you guided me to raise it here.

Thanks for adding retry policy support in version 4.x, but we still have a major problem (as mentioned in this thread): when the Azure Function is terminated for any reason, the lease is still advanced. Per my investigation of the code, this is because the ProcessChangesAsync method of CosmosDBTriggerListener, which handles the ChangeFeedProcessor, doesn't throw an exception (the behavior that, per this document, is what prevents the checkpoint from advancing). Instead, it just logs the failed results and progresses gracefully, while we would expect it to at least give consumers some option of a custom error handling mechanism at this level, so that they can mindfully decide what to do in case of such errors.

As you know, the consequence of the current behavior is that all the documents in that batch will be lost.

Is there any plan to change this behavior in the next release and make it more configurable (for example through options or some error handling delegates), as was done with the retry policy support?
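For readers following along, the behavior described above (a listener observing a failed result but not throwing) might be sketched roughly like this. It is an assumed illustration built on the public WebJobs listener interfaces, not the actual CosmosDBTriggerListener source.

    using System.Threading;
    using System.Threading.Tasks;
    using Microsoft.Azure.WebJobs.Host.Executors;
    using Microsoft.Extensions.Logging;

    // Rough sketch of the pattern being described; not the actual
    // CosmosDBTriggerListener source.
    public sealed class ListenerSketch
    {
        private readonly ITriggeredFunctionExecutor _executor;
        private readonly ILogger _logger;

        public ListenerSketch(ITriggeredFunctionExecutor executor, ILogger logger)
        {
            _executor = executor;
            _logger = logger;
        }

        public async Task ProcessChangesAsync(object changes, CancellationToken cancellationToken)
        {
            var input = new TriggeredFunctionData { TriggerValue = changes };
            FunctionResult result = await _executor.TryExecuteAsync(input, cancellationToken);

            if (!result.Succeeded)
            {
                // The failure is only logged; returning normally (rather than
                // throwing) lets the Change Feed Processor advance the lease
                // for this batch regardless of the outcome.
                _logger.LogError(result.Exception, "Function execution failed");
            }
        }
    }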

ealsur commented 2 years ago

@sayyaari Can you clarify what evidence you have that the checkpoint still happens if the Function is terminated? As I described in this thread, when a Function is terminated, all executing code should stop, including the checkpoint, which happens after your Function code executes. So if the Function stops while your code is executing, it would not reach the checkpoint.

The retry policies already add support for retrying a failed batch; by Functions design, there cannot be two layers of retry handling. I added the retry policy support as requested; adding another configuration that would generate a retry in the CFP code would go against the guidelines.

ealsur commented 2 years ago

We understand "Function terminated" as the instance shutting down/crashing.

The Function Trigger does not automatically retry on errors by design, and this aligns with other event-based triggers (for example Event Hub: https://github.com/Azure/azure-sdk-for-net/blob/a5f2ba22c34a6d931b3717aa36f1c286c4647d13/sdk/eventhub/Microsoft.Azure.WebJobs.Extensions.EventHubs/src/Listeners/EventHubListener.cs#L184).

sayyaari commented 2 years ago

Thanks @ealsur, and sorry for the delay; I needed to find some time so that I could elaborate more clearly.

> @sayyaari Can you clarify what evidence you have that the checkpoint still happens if the Function is terminated? As I described in this thread, when a Function is terminated, all executing code should stop, including the checkpoint, which happens after your Function code executes. So if the Function stops while your code is executing, it would not reach the checkpoint.

My point is that:

ealsur commented 2 years ago

> ProcessChangesAsync should have some information about the cause of the failure result, so that if the failure is due to Azure Function termination (like a shutdown) it throws an exception and the checkpoint doesn't proceed.

That is not up to the Extension; the contract is defined by the Functions runtime. If this information is exposed, then we (extension authors) can interpret it, but it is not something extension authors can act on today; it first needs to be done at the Functions level.

> In a nutshell, if the Azure Function is terminated, this loop will break and return the FunctionResult to its caller (the caller in the stack trace, which I want to emphasize): ProcessChangesAsync in the CosmosDBTriggerListener.

If the Function crashes, why would code execution continue? Last week we actually investigated an incident that was exactly this case. The user was exhausting the instance resources and the instance crashed; the checkpoint was not saved and the Function kept retrying the same batch over and over, because the user code exhausted resources every time.

The code you have in the private repository TestCosmosDbTriggerRetry does not generate an instance crash but an unhandled runtime exception, which is not the same thing. Unhandled runtime exceptions are not retried, as documented: https://learn.microsoft.com/en-us/azure/cosmos-db/sql/troubleshoot-changefeed-functions#some-changes-are-missing-in-my-trigger

Function crashing could be simulated by:

  1. Exhausting memory/CPU
  2. Exhausting time (maximum runtime is configured as 5min)

In either case, make sure the Function runs successfully at least once first, to generate a saved state it can fall back to and retry from.
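For instance, a minimal sketch of option 2 (exhausting the configured timeout), assuming the in-process model, the 3.x Cosmos DB extension, and a functionTimeout of five minutes in host.json; the function, database, collection, and connection names are placeholders.

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Microsoft.Azure.Documents;
    using Microsoft.Azure.WebJobs;
    using Microsoft.Extensions.Logging;

    public static class CrashSimulation
    {
        [FunctionName("CrashSimulation")]
        public static async Task Run(
            [CosmosDBTrigger("db", "orders",
                ConnectionStringSetting = "CosmosDB",
                LeaseCollectionName = "leases")] IReadOnlyList<Document> changes,
            ILogger log)
        {
            log.LogInformation("Received {Count} documents, now stalling past the timeout", changes.Count);

            // Option 2 above: stall well beyond the configured functionTimeout and
            // ignore cancellation, so the host has to abort the in-flight invocation
            // instead of letting it complete normally.
            await Task.Delay(TimeSpan.FromMinutes(30));
        }
    }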

sayyaari commented 2 years ago

But the interesting observation is that:

My understanding, from debugging your code (I managed to debug it with the help of ReSharper), is that if the retry strategy kicks in, the following code, which creates the delay between each retry, will catch and swallow the TaskCanceledException: https://github.com/Azure/azure-webjobs-sdk/blob/919d2d63419691166a6e36f1c1785f4c2c31aeb3/src/Microsoft.Azure.WebJobs.Host/Executors/FunctionExecutorExtensions.cs#L83

and, as I described in the control flow in my previous comment, the checkpoint will then proceed.

                    try
                    {
                        // If the invocation is cancelled retries must stop.
                        await Task.Delay(nextDelay, cancellationToken);
                    }
                    catch (TaskCanceledException)
                    {
                        logger.LogExitFromRetryLoop();
                        break;
                    }

Is there any reason not to rethrow the exception?
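For illustration only, the alternative being asked about would presumably look like the following hypothetical change to the snippet quoted above (not a proposed or reviewed patch):

    try
    {
        // If the invocation is cancelled retries must stop.
        await Task.Delay(nextDelay, cancellationToken);
    }
    catch (TaskCanceledException)
    {
        logger.LogExitFromRetryLoop();
        throw; // propagate the cancellation instead of breaking out silently
    }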

ealsur commented 2 years ago

@sayyaari That code is outside of the Cosmos DB extension, so I cannot comment on why it is like that or how it should behave (the code you linked does not even belong to this repo). As you saw on the other thread, I only recently discovered that Retry Policies were affecting the Cosmos DB Trigger; there is no code in the Trigger itself that interacts with the policies in any way. The code in the Trigger uses the same TryExecute pattern as the Event Hub and other event-based triggers; they all work the same way.

My experience is with instance crashing and Function terminating (by terminating, I mean stopping running code), not deployments. It looks like for deployments, or when hitting CTRL+C, the instance is still running code. The question is mainly for the Functions folks: which events relate to this process and how are they communicated upwards? From an Extensions perspective, we only interact with the Executor, and we can only see what it returns. If there were any particular signal it returned for this case, then sure, we could act on it, but I don't see any.

sayyaari commented 2 years ago

@ealsur The code is outside the Cosmos DB extension. I thought it was part of the area you and your team work on as well. Does that mean I should post my question on that code's repo?

ealsur commented 2 years ago

@sayyaari I work on the Cosmos DB Extension (https://github.com/Azure/azure-webjobs-sdk-extensions/tree/dev/src/WebJobs.Extensions.CosmosDB); this repo contains that extension. I don't work on, nor have scope over, any other Functions code or behavior (such as what the Retry Policies actually do). I got involved because the ask was to add the required decorators to the Cosmos DB Extension so that Retry Policies could detect it and apply to it, just as they do for Event Hub, Kafka, and other Triggers.