Allow a function to handle execution timeout gracefully and prevent process restart

paulbatum commented 6 years ago

So the consumption plan for functions has a default execution timeout of 5 minutes. It's not great to allow your functions to hit this timeout because when it happens, your entire process ends up getting restarted (because it's the only way to force the execution to stop - task.Abort() does not exist). This will be disruptive to other long-running functions in the same process.

The challenge is that as a function author who is aware of this, there's not much you can do to address it. In the case of C#, you could update your function to take a CancellationToken and check the state of that token (or pass it into async APIs your function is calling). However, even if you do this, today the system will still terminate the process (because it does not check to see if your function actually honored the cancellation request).
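As a rough illustration of what "checking the state of the token" looks like from the function author's side, here is a minimal language-neutral sketch in Python. It is not Azure Functions code: `threading.Event` stands in for the .NET CancellationToken, and `process_items` and its doubling "work" are hypothetical placeholders.

```python
import threading


def process_items(items, cancel_token: threading.Event):
    """Process items one at a time, stopping early once cancellation is requested.

    cancel_token is a stand-in for a CancellationToken (assumption for this sketch).
    Returns the results for the items that were fully processed.
    """
    done = []
    for item in items:
        # Check the token between units of work so the function stops at a
        # safe point instead of being interrupted in the middle of an item.
        if cancel_token.is_set():
            break
        done.append(item * 2)  # placeholder for the real work
    return done
```

The point of the pattern is that the function returns promptly once the token is signaled, which is exactly the behavior the host currently does not check for.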

So, this work item tracks the idea of making the timeout mechanism smarter. It would do the following:

If the function does not take a cancellation token, no change in behavior (the process will get killed).
If the function takes a cancellation token, signal the token once the timeout has expired, wait some period of time (say 5 seconds), and then check whether the function task has completed. If the task has completed, we assume the timeout was handled gracefully and do not kill the process.
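The two rules above can be sketched as follows. This is only a sketch of the proposal in Python's asyncio, not how the Functions host is implemented: `run_with_graceful_timeout`, the `takes_token` flag, the returned strings, and the use of `asyncio.Event` as the token are all assumptions made for illustration, and `task.cancel()` merely stands in for restarting the process.

```python
import asyncio


async def run_with_graceful_timeout(func, timeout, grace, takes_token=True):
    """Run func and report "completed", "graceful", or "kill".

    All names here are hypothetical; asyncio.Event plays the role of the
    cancellation token in this sketch.
    """
    cancel_event = asyncio.Event()
    coro = func(cancel_event) if takes_token else func()
    task = asyncio.create_task(coro)

    # Phase 1: let the function run until the execution timeout.
    done, _ = await asyncio.wait({task}, timeout=timeout)
    if done:
        return "completed"

    if takes_token:
        # Phase 2: signal the token, then give the function a grace period.
        cancel_event.set()
        done, _ = await asyncio.wait({task}, timeout=grace)
        if done:
            return "graceful"  # timeout was handled; no restart needed

    # No token, or the token was ignored: fall back to today's behavior.
    task.cancel()  # stand-in for killing/restarting the process
    return "kill"


async def cooperative(cancel_event):
    # Honors the token: stops shortly after it is signaled.
    while not cancel_event.is_set():
        await asyncio.sleep(0.01)


async def stubborn(cancel_event):
    # Ignores the token entirely.
    await asyncio.sleep(10)
```

With a short timeout, `cooperative` ends in the "graceful" branch and `stubborn` ends in "kill". The `grace` parameter is the knob discussed later in the thread (5 seconds versus several minutes).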

In order for this approach to work for multiple languages, we need a way to support the equivalent of cancellation tokens for out-of-proc languages, which is tracked by https://github.com/Azure/azure-webjobs-sdk-script/issues/2152.

oshevnin commented 5 years ago

Any progress with this issue?

It's currently not possible to gracefully shut down running function instances when something exceptional happens (timeout exceeded, host stopped or restarted). Even handling the cancellation token doesn't help, as mentioned above.

Waiting for 5 seconds can be too aggressive. Is there a reason not to wait 5/10 minutes (the max function duration) after the cancellation token is signaled, before a hard stop of the instance? That way all active functions would finish their work gracefully even if their cancellation token handling is not perfect.

Cross-referencing https://github.com/Azure/Azure-Functions/issues/866 since it may be related as well

@fabiocav

kreaton commented 5 years ago

When a timeout occurs in a Function App, it appears from my testing (C#, V2) that any logging to Application Insights also goes away, i.e. it is not possible to trace what happened in the function before the timeout. The timeout exception itself also doesn't seem to be logged to Application Insights and thus cannot be monitored.

ishepherd commented 5 years ago

This is, tbh, hella dumb and renders the CancellationTokens rather useless. If the CancellationToken is signalled, it's already too late to save your process.

Any chance? @jeffhollan @eduardolaureano

Any chance you could be nice to the .NET guys by doing this before #2152 :)

mciprijanovic commented 5 years ago

I have a case closely related to this topic, and I haven't been able to find a suitable answer for a long time. I have a function with an EventHubTrigger. Messages from the event hub are pulled in batches. According to the documentation, those messages are checkpointed when the function ends. This means that if the process stops for any reason (host stop, restart...), the function is prevented from checkpointing the received messages, and the next time the function starts it will pick up the same already-received but uncheckpointed messages. According to everything mentioned here, I have no option to gracefully stop the function, that is, to somehow tell it to stop after the function ends and the checkpoint occurs but before it starts again and pulls a new set of messages, so I must handle possible duplicates in my code. Is this true, or is there a solution for controlled shutdown?

paulbatum commented 5 years ago

When processing any Event Hubs workload, you need to write your code to allow for duplicates, because Event Hubs does not provide "at most once" guarantees. Even if you were able to handle the shutdown case correctly, there are other cases where your code might need to handle duplicates, for example, if a partition lease is lost.
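One common way to allow for duplicates is to make the handler idempotent by remembering which events were already handled. The following is a minimal Python sketch under stated assumptions: events are plain dicts with hypothetical `partition` and `sequence_number` keys, `process_batch` is an illustrative helper rather than an Event Hubs SDK API, and the in-memory `seen` set stands in for whatever durable store a real function would need.

```python
def process_batch(events, seen, handle):
    """Handle each event at most once, keyed by (partition, sequence_number).

    seen is a set of keys already processed (in-memory here; a real function
    would need durable storage, which is an assumption left out of this sketch).
    Returns the number of events actually handled.
    """
    handled = 0
    for event in events:
        key = (event["partition"], event["sequence_number"])
        if key in seen:
            continue  # duplicate delivery, e.g. after a restart or lost lease
        handle(event)
        seen.add(key)  # record the key only after the handler succeeded
        handled += 1
    return handled
```

Recording the key after the handler runs means a crash in between can still cause one repeat, so the handler itself should also tolerate being run twice for the same event.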

gkindov commented 3 years ago

Hi, not sure if this is the right place to ask, but is there a function-app-level host shutdown event I can intercept so I can clean up static resources used by the whole app? For example, I need to call Serilog.Log.CloseAndFlush(). I can't find anything in the docs, only the Startup event where I register the logger. Thanks.

ishepherd commented 3 years ago

@gkindov I don't think so. I suggest asking the folks in the Azure Functions Discord. https://discord.gg/YEQPcCsY

derekrprice commented 3 years ago

I have an issue related to @kreaton's. We use an external performance and error monitoring tool that needs to close gracefully in order to log to a remote server. When the function is killed with prejudice, it never gets a chance to log the performance information that it has recorded, so we don't have any traces to use to track down what is causing the runs to take so long.

alonfirestein commented 2 years ago

Any chance that this issue was resolved by anyone? Is it possible to catch and handle the execution timeout gracefully instead of having the process killed or restarted? In my case an unlimited timeout or retries aren't an effective option, so any answer would be appreciated.

duncanthescot commented 1 year ago

There should definitely be an event which fires before the timeout so processes can be shut down gracefully.

borislavml commented 1 year ago

Does anybody know something about that mysterious event that fires before the timeout? We really need this in our functions!

ChristianPardun commented 4 months ago

Has progress been made on this issue? How can timeouts be handled appropriately? Is there a C# event or delegate to use when a timeout occurs? Thank you very much for your help.

Azure / azure-functions-host

Allow a function to handle execution timeout gracefully and prevent process restart #2153