Azure / azure-functions-durable-js

JavaScript library for using the Durable Functions bindings
https://www.npmjs.com/package/durable-functions
MIT License

Long running Functions hangs and then timeout #544

Closed shyamal890 closed 8 months ago

shyamal890 commented 9 months ago

Describe the bug
For some reason, long-running functions hang and then time out. See the screenshot below: we are continuously logging from the activity function, but suddenly it stops logging, and after a while the orchestration times out.

[screenshot: activity function logs stop before the orchestration times out]

Investigative information

If deployed to Azure App Service

If you don't want to share your Function App name or Functions names on GitHub, please be sure to provide your Invocation ID, Timestamp, and Region - we can use this to look up your Function App/Function. Provide an invocation id per Function. See the Functions Host wiki for more details.

To Reproduce
A long-running, CPU-intensive activity function.

While not required, providing your orchestrator's source code in anonymized form is often very helpful when investigating unexpected orchestrator behavior.
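For illustration, a minimal sketch of the shape of such an orchestration, assuming the classic function.json-based JavaScript programming model; this is not the actual orchestrator source, and the input handling is an assumption:

```javascript
// Orchestrator sketch (assumed shape, not the actual source):
// calls a single long-running, CPU-intensive activity.
const df = require("durable-functions");

module.exports = df.orchestrator(function* (context) {
    const input = context.df.getInput();
    const result = yield context.df.callActivity("durableActScrapeWebsite", input);
    return result;
});
```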

Expected behavior

Actual behavior

Screenshots: added above.

Known workarounds: none.

Additional context

nytian commented 9 months ago

Hi, thanks for the information provided; I am starting the investigation. I will update here once I find anything useful.

nytian commented 9 months ago

Hi @shyamal890, I would like to confirm the expected behavior of this activity function. What does the failed activity function durableActScrapeWebsite do? How long should it take to return to the orchestrator? I queried the failed instanceId, and this activity function never returns to the orchestrator; the function then fails due to the timeout. Thanks!

shyamal890 commented 9 months ago

The failed activity durableActScrapeWebsite scrapes a particular website through puppeteer. You are right that it timed out, but the activity function hung: we log each scrape step, so it should have produced a log entry, yet for some reason it doesn't.
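For reference, a rough sketch of what that per-step logging looks like, assuming a typical puppeteer flow; the navigation options and return shape are illustrative, not the actual code:

```javascript
// Activity sketch (illustrative only): scrape a page with puppeteer,
// logging before and after each step so progress is visible.
const puppeteer = require("puppeteer");

module.exports = async function (context, url) {
    context.log(`durableActScrapeWebsite: starting ${url}`);
    const browser = await puppeteer.launch({ args: ["--no-sandbox"] });
    try {
        const page = await browser.newPage();
        context.log(`durableActScrapeWebsite: navigating to ${url}`);
        await page.goto(url, { waitUntil: "networkidle2", timeout: 60000 });
        const html = await page.content();
        context.log(`durableActScrapeWebsite: scraped ${html.length} bytes`);
        return { url, length: html.length };
    } finally {
        await browser.close();
    }
};
```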

Also, we recently started noticing that logs are getting mixed up between functions, i.e. an HTTP function's logs show up under the durable function and vice versa. That shouldn't happen, right?

Here's the setting in host.json

"functionTimeout": "03:00:00"

nytian commented 9 months ago

So, the activity function should return within 3 hours but it didn't, right? I just want to confirm this part. Thanks!

I will also query the instance ID you provided on the other issue page to check. Will update once I find anything.

nytian commented 9 months ago

Hi @shyamal890, I queried the instances you provided, 983c33044cfa4df6b0669502b56bfea9 and fc13abeb0d1d466bb10b29f265789151. Both failed for the same reason: the message that schedules the activity function in the work-item queue could not be processed successfully.

The messageId is 10286b8f-dab6-40a2-88ec-4400a6e9baa5 for this issue and 4257ccf6-d903-4158-a1e0-ea4f079c5343 for issue #545.

[screenshots: processing attempts for the two queue messages]

As you can see, many workers tried to process it but none succeeded, and the functions eventually failed due to the timeout. I also checked the age of messages for all the queues. In the chart below, the blue series, which is the work-item queue, shows a much larger message age than the other control queues.

[screenshot: message age per queue]

Since you mentioned that the activity function should return to the orchestrator within 3 hours, I would assume the problem is the work-item queue being unhealthy. There are a couple of options:

  1. Move to another task hub. This is the easiest way, and it gives us completely new queues. However, since everything is new, we lose the history stored in the old task hub, so in-flight functions cannot resume from where they left off and will start from the beginning. (See the configuration sketch after this list.)
  2. Create a new work-item queue. We can manually rename the current work-item queue to work-item-backup, then restart the app, and a new work-item queue will be created. This is a little more complicated, but it keeps all the history and lets the functions resume.
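As a sketch of option 1, switching to a new task hub is typically done by changing the durable task hub name in host.json (MyNewTaskHub below is a placeholder name):

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "MyNewTaskHub"
    }
  }
}
```

The new task hub gets its own control and work-item queues, which is why in-flight orchestrations in the old hub do not carry over.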

For the timeout problem specifically, the timeout setting is designed to ensure that a function invocation runs on a single VM for at most 3 hours; this prevents a function from running forever. The work-item queue can be read and written by any active worker, and from the screenshot we can see different workers trying to process the message whenever it became visible. So, as long as the work does not keep running on one worker for 3 hours, it won't time out.

I would suggest trying either of the options above to see if that mitigates the issue. Please feel free to reach out if there are any further questions. Thanks!

microsoft-github-policy-service[bot] commented 9 months ago

This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 4 days. It will be closed if no further activity occurs within 3 days of this comment.