Orchestration goes into infinite loop after timeout on kubernetes

Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions

MIT License


Orchestration goes into infinite loop after timeout on kubernetes #2101

Open pengchen0692 opened 2 years ago

pengchen0692 commented 2 years ago

I hit a weird issue where an activity failed due to a timeout, and then the entire execution went into an infinite loop. Unfortunately I am not able to repro it, so I am putting all the information I have here and wondering if you have any insight.

The pattern I see is:

1. Activity ReindexBatchV2Async executes and times out after 30 min
2. The application shuts down after 10 min
3. The application starts
4. Go to step 1

Please let me know if you need any more information.

cgillum commented 2 years ago

Hi @pengchen0692, I'm going to move this to the Durable Functions GitHub repo since this appears to be an issue with Functions and not with the Durable Task Framework.

cgillum commented 2 years ago

This sounds like expected behavior for functions that take too long to execute. Basically, if the Functions host detects a timeout in a function execution, it will respond by restarting itself. This is to mitigate problems of runaway functions. The reason this continues indefinitely is because the Durable Function doesn't actually fail. It just retries again after the host finishes starting back up again.
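
The sequence described above can be sketched as a toy Python model. It does not call any Azure Functions or Durable Functions API; `simulate_host`, `activity_minutes`, `function_timeout_minutes`, and `max_restarts` are hypothetical names chosen only for illustration, and `max_restarts` exists solely so the simulation terminates.

```python
def simulate_host(activity_minutes, function_timeout_minutes, max_restarts):
    """Toy model of the restart loop; not real Functions host behavior.

    Returns (completed, restarts). The cap max_restarts is an assumption added
    so the simulation ends; the loop described in the thread has no such cap.
    """
    restarts = 0
    while restarts < max_restarts:
        if activity_minutes <= function_timeout_minutes:
            # The activity finishes within the timeout, so no restart is needed.
            return True, restarts
        # Timeout detected: the host restarts itself. The activity is not
        # recorded as failed, so the same work is picked up again unchanged.
        restarts += 1
    return False, restarts


# An activity that always needs longer than the timeout never completes.
print(simulate_host(activity_minutes=30, function_timeout_minutes=10, max_restarts=5))
```

Because nothing changes between iterations, an activity that exceeds the timeout once will exceed it on every retry, which is why the loop never ends on its own.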

If you need to run activity functions that last longer than 10 minutes, you'll need to use either an App Service Plan or an Elastic Premium plan. You might also need to increase the function timeout value in host.json: https://docs.microsoft.com/en-us/azure/azure-functions/functions-host-json#functiontimeout.

pengchen0692 commented 2 years ago

Hi Chris, I tried this with a simple Durable Functions project. It looks like the orchestration is marked as failed after the second execution times out, and it does not retry any more.

The project is the template project except:

Added Thread.Sleep in SayHello to explicitly make it take longer

Updated functionTimeout in host.json so that the execution times out

I deployed it to AKS.

After triggering it over HTTP, what I see is:

1. The orchestration execution times out and causes the host to restart
2. The host picks up the failed orchestration and it fails again, but the host doesn't restart this time; instead the orchestration is marked as Failed

Please let me know if you need more details, I could share project files and Kubernetes yaml files offline.

Thanks,
Peng Chen

cgillum commented 2 years ago

Unfortunately, it looks like the behavior with the Azure Functions host is inconsistent. Sometimes the host recycles and sometimes it doesn't. If it doesn't recycle, then the activity function execution is surfaced as an ordinary failure. I'm not sure if there's much we can do about this. What's the behavior you expect or want?

pengchen0692 commented 2 years ago

In our case, we would prefer the orchestration to eventually fail after several tries. The underlying reason is that the infinite loop leaves the orchestration status as Running forever, and the customer won't know anything is wrong until they have waited an unreasonably long time (could be days).
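
A minimal sketch of the behavior we are asking for, in plain Python rather than any Durable Functions API. Everything here is hypothetical: `start_attempt`, `MAX_ATTEMPTS`, and the `store` dict are illustrative names, and `store` stands in for some durable storage that survives host restarts (an in-memory counter would be lost on each restart).

```python
MAX_ATTEMPTS = 3  # hypothetical limit; the thread only says "several tries"


def start_attempt(store, instance_id, max_attempts=MAX_ATTEMPTS):
    """Record one more attempt for instance_id and return the resulting status.

    `store` is an assumption: a dict standing in for storage that persists
    across host restarts.
    """
    record = store.setdefault(instance_id, {"attempts": 0, "status": "Running"})
    if record["status"] == "Failed":
        # Already given up; do not count further attempts.
        return "Failed"
    record["attempts"] += 1
    if record["attempts"] > max_attempts:
        # Too many attempts: surface a terminal status instead of looping.
        record["status"] = "Failed"
    return record["status"]


store = {}
statuses = [start_attempt(store, "instance-1") for _ in range(5)]
print(statuses)
```

With a cap like this, the status would move from Running to Failed after a bounded number of restarts, so a customer polling the status would see a terminal state instead of waiting indefinitely.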