Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions
MIT License

Orchestrator Instances stuck in Pending state however logs show completed #2306

Closed kanupriya15025 closed 1 year ago

kanupriya15025 commented 1 year ago

Description

We are running load tests on our function app that require roughly 1,000 workflows running concurrently. I noticed that for a significant number of orchestrators, the storage table shows "Pending" for a long time (30-40 minutes). However, the logs show that the orchestrator instance completed successfully, although that execution was also delayed.

Expected behavior

Orchestrator instances should start and finish promptly and update their status correctly.

App Details

azure-functions==1.11.2
azure-functions-durable==1.1.6

Screenshots

orchestrator shows as Pending; analytics shows as Completed

If deployed to Azure

We have access to a lot of telemetry that can help with investigations. Please provide as much of the following information as you can to help us investigate!

cgillum commented 1 year ago

Hi @kanupriya15025. I took a look at the telemetry we have for Kanupriya_Schedule_Load_30138_11:07:2022:09:30:30 (thanks for providing these details) and I was able to confirm that there was about an 18-minute gap between the time you started this workflow and the time it actually started and transitioned into the Running status.

What I noticed is that there's a significant delay for all messaging events that go through your control queues. The work-item queue, which is responsible for triggering activity functions, has no delays. This makes me think the problem has something to do with how your app processes orchestrations. I'm also seeing a significant delay between the time our extension starts running an orchestration and the time the orchestration completes. This should also be close to zero when the system is healthy (assuming your orchestrator is following all the expected rules about not doing I/O, etc.).

You mentioned that this is a Python app. Which version of Python are you using? My best guess at this point is that you're using one of the older versions, which has very strict concurrency limits enabled. This is the most common cause for the exact behavior you're seeing. More information on this can be found here.

If after researching your concurrency settings you're still not able to resolve this issue, I suggest opening a support ticket with Azure Functions. An engineer should be able to dig further into your case to understand what's happening.
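For reference, the concurrency-related knobs for a Python app are typically the FUNCTIONS_WORKER_PROCESS_COUNT and PYTHON_THREADPOOL_THREAD_COUNT app settings, plus the Durable Task throttles in host.json. A minimal host.json sketch with placeholder values (not recommendations for your app):

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "maxConcurrentOrchestratorFunctions": 10,
      "maxConcurrentActivityFunctions": 10
    }
  }
}
```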

kanupriya15025 commented 1 year ago

Thanks @cgillum. It was my mistake: autoscaling was disabled in my app's configuration. After enabling it, that issue went away. However, a different issue has started showing up since then.

Storage shows Running but the orchestration actually completed

Now some of the tasks are stuck in the Running state in the storage table even though they are complete. Attaching a screenshot of this.

cgillum commented 1 year ago

Thanks for confirming, @kanupriya15025.

The new issue you're describing is commonly seen when tasks are started in an orchestrator function but not completed. For example, if you have a durable timer, an activity, or a sub-orchestrator task that was started but not yielded (or canceled, in the case of durable timers), then the orchestration will remain in the Running state until those tasks have completed in the background. We mention this in the context of timers in our documentation here.
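As a rough sketch of what that can look like in Python (the activity names here are just placeholders), an orchestrator like the following can leave the instance in the Running state even after it returns:

```python
import azure.durable_functions as df

def orchestrator_function(context: df.DurableOrchestrationContext):
    # Scheduled but never yielded: per the behavior described above, the
    # instance can remain in the Running state until this background task
    # finishes, even though the orchestrator function itself has returned.
    context.call_activity("SendNotification", "fire-and-forget")

    # Yielded as expected, so its completion is tracked by the orchestration.
    result = yield context.call_activity("ProcessOrder", "order-1")
    return result

main = df.Orchestrator.create(orchestrator_function)
```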

Could this be the cause of the issue you're seeing?

kanupriya15025 commented 1 year ago

@cgillum In my case, we are using "yield" before all timer calls, e.g. yield context.create_timer(next_wait_time).

Is that correct?

davidmrdavid commented 1 year ago

@kanupriya15025: That yield seems fine. However, are you cancelling those timers in case they don't fire prior to the orchestrator returning?
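For reference, a minimal sketch of the timeout pattern with explicit timer cancellation (the activity name and wait time here are placeholders):

```python
from datetime import timedelta

import azure.durable_functions as df

def orchestrator_function(context: df.DurableOrchestrationContext):
    deadline = context.current_utc_datetime + timedelta(minutes=5)

    activity_task = context.call_activity("ProcessItem", "work-item-1")
    timeout_task = context.create_timer(deadline)

    # Race the activity against the timer.
    winner = yield context.task_any([activity_task, timeout_task])

    if winner == activity_task:
        # Cancel the still-pending timer; otherwise the instance stays in
        # the Running state until the timer fires in the background.
        timeout_task.cancel()
        return activity_task.result
    return "timed out"

main = df.Orchestrator.create(orchestrator_function)
```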

davidmrdavid commented 1 year ago

Closing due to inactivity