pengchen0692 opened 2 years ago
Hi @pengchen0692, I'm going to move this to the Durable Functions GitHub repo since this appears to be an issue with Functions and not with the Durable Task Framework.
This sounds like expected behavior for functions that take too long to execute. Basically, if the Functions host detects a timeout in a function execution, it responds by restarting itself. This is to mitigate problems with runaway functions. The reason this continues indefinitely is that the Durable Function doesn't actually fail: it just retries after the host finishes starting back up.
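To make the mechanism concrete, here is a toy Python model (not Durable Functions code) of an activity that always exceeds functionTimeout and always triggers a host recycle. ActivityRun, simulate, and max_deliveries are hypothetical names. In particular, max_deliveries is not a Durable Functions setting; it only illustrates how capping redeliveries would turn the endless loop into a Failed status.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ActivityRun:
    """Toy model of one activity work item waiting in a task-hub queue."""
    name: str
    deliveries: int = 0
    log: List[str] = field(default_factory=list)


def simulate(run: ActivityRun, max_steps: int,
             max_deliveries: Optional[int] = None) -> str:
    """Deliver an activity that always times out and always recycles the host.

    A recycle records no failure event, so the work item is simply delivered
    again and the orchestration stays Running. max_deliveries is a
    hypothetical cap (not a Durable Functions setting) that converts repeated
    redelivery into an explicit failure; max_steps only bounds the demo.
    """
    for _ in range(max_steps):
        run.deliveries += 1
        run.log.append(f"{run.name} delivery {run.deliveries}: timeout, host recycled")
        if max_deliveries is not None and run.deliveries >= max_deliveries:
            return "Failed"
    # No failure was ever recorded, so the orchestration never leaves Running.
    return "Running"


if __name__ == "__main__":
    print(simulate(ActivityRun("SayHello"), max_steps=100))                    # Running
    print(simulate(ActivityRun("SayHello"), max_steps=100, max_deliveries=3))  # Failed
```

Without a cap, the only thing that ends the loop in this model is the demo's max_steps bound, which mirrors why the real orchestration can stay Running indefinitely.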
If you need to run activity functions that last longer than 10 minutes, you'll need to use either an App Service Plan or an Elastic Premium plan. You might also need to increase the function timeout value in host.json: https://docs.microsoft.com/en-us/azure/azure-functions/functions-host-json#functiontimeout.
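A small Python sketch for sanity-checking a host.json functionTimeout value before deploying. The plan ceilings are assumptions based on the comment above: Consumption tops out at 10 minutes, while App Service and Elastic Premium plans allow longer runs, so this sketch treats them as unbounded and accepts "-1" only there. Verify these limits against the current host.json reference; the function names and plan keys are hypothetical.

```python
import json
from typing import Optional

# Assumed ceilings in seconds; None means "no fixed ceiling" in this sketch.
# Check the current Azure Functions docs before relying on these numbers.
PLAN_MAX_TIMEOUT_SECONDS = {
    "consumption": 10 * 60,
    "elastic_premium": None,
    "app_service": None,
}


def parse_function_timeout(value: str) -> Optional[int]:
    """Parse a functionTimeout string in "hh:mm:ss" form into seconds.

    Returns None for "-1", which this sketch treats as unbounded.
    """
    if value.strip() == "-1":
        return None
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def check_timeout(host_json_text: str, plan: str, activity_seconds: int) -> str:
    """Report whether an activity of the given length fits the configured timeout."""
    config = json.loads(host_json_text)
    raw = config.get("functionTimeout")
    if raw is None:
        # Plan defaults are not modeled here.
        return "functionTimeout not set; plan default applies"
    timeout = parse_function_timeout(raw)
    ceiling = PLAN_MAX_TIMEOUT_SECONDS[plan]
    if timeout is None and ceiling is not None:
        return f"unbounded timeout is not allowed on {plan}"
    if timeout is not None and ceiling is not None and timeout > ceiling:
        return f"functionTimeout exceeds the {plan} ceiling of {ceiling}s"
    if timeout is not None and activity_seconds > timeout:
        return "activity will time out"
    return "ok"


if __name__ == "__main__":
    host = '{"version": "2.0", "functionTimeout": "00:30:00"}'
    print(check_timeout(host, "consumption", 600))
    print(check_timeout(host, "elastic_premium", 1200))
```

With a 30-minute functionTimeout, the first call reports that the value is above the assumed Consumption ceiling, and the second reports "ok" because a 20-minute activity fits.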
Hi Chris, I tried this with a simple Durable Functions project. It looks like the orchestration failed after the second execution timed out, and it did not retry any more.
The project is the template project, except for these changes:
- Thread.Sleep in SayHello, to explicitly make it take longer
- functionTimeout in host.json, so that the execution times out
I deployed it to AKS.
After triggering it over HTTP, what I see is:
Failed
Please let me know if you need more details; I can share the project files and Kubernetes YAML files offline.
Thanks,
Peng Chen
Unfortunately, it looks like the behavior of the Azure Functions host is inconsistent. Sometimes the host recycles and sometimes it doesn't. If it doesn't recycle, the timed-out activity function execution is surfaced as an ordinary failure. I'm not sure there's much we can do about this. What's the behavior you expect or want?
In our case, we would prefer the orchestration to eventually fail after several tries. The underlying reason is that the infinite loop leaves the orchestration status as Running forever, and the customer won't know anything is wrong until they have waited an unreasonable amount of time (possibly days).
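Until the orchestration fails reliably on its own, one hedge against instances sitting in Running for days is a periodic check. This Python sketch assumes you already have status records shaped like the JSON returned by the Durable Functions HTTP status query (instanceId, runtimeStatus, createdTime); fetching them is out of scope, and find_stuck_instances and the threshold are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def find_stuck_instances(statuses: List[Dict[str, str]], now: datetime,
                         max_running: timedelta) -> List[str]:
    """Return IDs of instances that have been Running longer than max_running.

    Each dict is assumed to carry instanceId, runtimeStatus and createdTime
    (an ISO 8601 UTC timestamp such as "2024-01-01T00:00:00Z").
    """
    stuck = []
    for status in statuses:
        if status["runtimeStatus"] != "Running":
            continue
        # fromisoformat does not accept a trailing "Z" before Python 3.11.
        created = datetime.fromisoformat(status["createdTime"].replace("Z", "+00:00"))
        if now - created > max_running:
            stuck.append(status["instanceId"])
    return stuck


if __name__ == "__main__":
    now = datetime(2024, 1, 3, tzinfo=timezone.utc)
    sample = [
        {"instanceId": "a", "runtimeStatus": "Running", "createdTime": "2024-01-01T00:00:00Z"},
        {"instanceId": "b", "runtimeStatus": "Failed", "createdTime": "2024-01-01T00:00:00Z"},
    ]
    print(find_stuck_instances(sample, now, timedelta(hours=6)))  # ['a']
```

The threshold should sit comfortably above the longest legitimate run (for example, a few multiples of functionTimeout) so that slow but healthy orchestrations are not flagged.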
I hit a weird issue where an activity failed with a timeout, and then the entire execution went into an infinite loop. Unfortunately, I am not able to reproduce it, so I'm putting all the information I have here in case you have any insight.
The pattern I see is:
ReindexBatchV2Async executes and times out after 30 min.
Please let me know if you need any more information.