Azure / azure-sdk-for-net

This repository is for active development of the Azure SDK for .NET. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/dotnet/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-net.
MIT License
5.47k stars 4.8k forks source link

[FEATURE REQ] [EventsHubs Extension] Do not checkpoint in case of Functions host exception #28653

Closed RohitRanjanMS closed 6 months ago

RohitRanjanMS commented 2 years ago

Library name

Microsoft Azure WebJobs SDK EventHubs Extension

Please describe the feature.

We always checkpoint irrespective of the outcome of the Functions execution.

"When all function execution completes (with or without errors), checkpoints are added to the associated storage account. When check-pointing succeeds, all 1,000 messages are never retrieved again.".

Currently, we do not have the ability to differentiate between exception in the customer code vs exception in the Functions host but this PR is going to address this.

Proposal is to skip checkpoint in case of Functions host failure.

jsquire commented 2 years ago

//cc: @JoshLove-msft

JoshLove-msft commented 2 years ago

Can you clarify what is meant by a function host failure? Would this mean that the customer function is never actually invoked? If that is the assumption, then I would agree that the proposed change makes sense.

RohitRanjanMS commented 2 years ago

yes, for example if the language worker fails to initialize, the host will error out and recycle with the following exception. System.InvalidOperationException : Did not find any initialized language workers

JoshLove-msft commented 2 years ago

Can you clarify how the extension will know whether this scenario has occurred? Will there be a new property exposed in the FunctionResult?

RohitRanjanMS commented 2 years ago

Once the PR to differentiate between host and customer error is merged, we will have to make the required changes to include this in the FunctionResult.

mathewc commented 2 years ago

The referenced PR is specific to OOP/GRPC scenarios. However I think the issue being discussed here is more general. For example, you'll have the same situation in an in-proc csx functions case. Csx functions are demand compiled/loaded, so in the EH case we'll be processing events, dispatching to the invoker, invocations will fail due to compilation error, and we'll checkpoint. Thus, just as in the OOP channel failure case, no user code was actually invoked.

So I think we'll need a general solution that applies to both. This issue only arises in Azure Functions usage of the WebJobs SDK, because it's employing a strategy of generating an IL function wrapper that defines the actual function the SDK will see/invoke, and that function internally dispatches to its own Azure Functions generated invoker which in the cases above is responsible for actually dispatching to user code (csx function or via OOP worker channel).

For example, if the WebJobs SDK defined a special exception type that Azure Functions could throw from its invoker pipeline when we're unable to dispatch to user code, then the trigger executor here could detect that and set some special flag on FunctionResult (e.g. MethodInvoked) and extensions could base any custom logic they want on that. A similar scenario can happen if the function declares input bindings that fail for whatever reason. Input bindings are resolved BEFORE the user code is actually called, so in this EH case you'd also likely not want to checkpoint if one or more bindings failed. So we'll likely want to find a way to reflect that as well via FunctionResult.MethodInvoked.

paulbatum commented 1 year ago

I suggest being careful about this change, when it comes to defining what constitutes a "host" error. If the error is repeatable, the function app potentially gets stuck, processing the same events over and over again (because the checkpoint is not written). In the consumption plan that bills per execution, this could turn into a very large bill for the customer. These problems are discussed in more detail in these issues: https://github.com/Azure/azure-functions-host/issues/2439 https://github.com/Azure/azure-webjobs-sdk/issues/1597

We might want to point customers that really want this behavior to the use of retry policies (and fix the bugs where retry policies do not work correctly).

JoshLove-msft commented 1 year ago

I think the idea is that we want to NOT checkpoint when the customer function was not given a chance to execute. More details are included here - https://github.com/Azure/azure-functions-host/issues/8994

paulbatum commented 1 year ago

That doesn't change the risks associated with indefinitely retrying the same execution with the same data over and over, without having had the customer opt in to a retry policy.

jsquire commented 1 year ago

Agreed, but it would seem better than the unexpected data loss that occurs by checkpointing events that customer code wasn't given an opportunity to process. Perhaps the better way forward would be some kind of "reject event(s)" method that would force an immediate checkpoint so that those events did not get dispatched for processing again.

paulbatum commented 1 year ago

Perhaps a sweet spot between these two extremes would be implementing an implicit exponential backoff retry policy for failures that occur before user code runs?

jsquire commented 1 year ago

That makes sense to me, I don't see any value in spamming executions that have repeated failures. We'd still need to be able to identify whether customer code ran or not so that checkpointing could be suppressed.

github-actions[bot] commented 6 months ago

Hi @RohitRanjanMS, we deeply appreciate your input into this project. Regrettably, this issue has remained unresolved for over 2 years and inactive for 30 days, leading us to the decision to close it. We've implemented this policy to maintain the relevance of our issue queue and facilitate easier navigation for new contributors. If you still believe this topic requires attention, please feel free to create a new issue, referencing this one. Thank you for your understanding and ongoing support.