Closed shyamal890 closed 8 months ago
Hi, thanks for the information provided and I am starting the investigations. I will update here once I found anything useful.
Hi, @shyamal890 I would like to confirm this activity function expected behavior. What is the failed activity function durableActScrapeWebsite is working for? For how long should this activity function return to the orchestrator? I queried the failed instanceId, and for this activity function, it seems that it never returns to the orchestrator. And the function failed due to the time out. Thanks!
The failed activity durableActScrapeWebsite is scraping a particular website through puppeteer. You are right that it timed out but the activity function hanged. we are logging each scrape activity and hence it should have returned a log but for some reason it doesn't.
Also, recently started noticing that logs are getting mixed up between functions, that is a http function's log is showing up in durable function and vice-versa. I think this shouldn't happen right?
Here's the setting in host.json
"functionTimeout": "03:00:00"
So, the activity function should return in 3 hours but it didn't, right? I just want to confirm this part. Thanks!
And I will query the instnace Id you provided in another issue page too to check. Will update once I found any.
Hi, @shyamal890. I queried the instanced you provided 983c33044cfa4df6b0669502b56bfea9
and fc13abeb0d1d466bb10b29f265789151
. The two are from the same reason: the message that scheduled activity function in the work-item queue cannot be processed successfully.
The messageId is 10286b8f-dab6-40a2-88ec-4400a6e9baa5
for this issue and 4257ccf6-d903-4158-a1e0-ea4f079c5343
for issue #545 .
As you can see, there were many workers trying to process it but couldn't succeed. And finally, the functions failed due to time out. I checked the age of messages of all the queues. From the picture below we can see that the blue one which is the work-item queue, has much larger Age than the other control queues.
Since you mentioned that the activity function should return to the orchestration in 3 hours, I would assume the problem should because of the work-item queue being unhealthy. There are several options we can do:
For the timeout problem, specifically, the timeout setting is designed to ensure that the function runs on a single VM for a maximum of 3 hours. This is to avoid permanent function since nothing can run forever. And the work-item queue can be read and wrote by any active worker. From the screenshot we can also see there would be different worker trying to process the message when it became visible. So, as long as it doesn't keep running at one worker for 3 hours, it won't time out.
I would suggest trying either of the options above to see if we can mitigate the issue. And please feel free to reach out if there is any further questions. Thanks!
This issue has been automatically marked as stale because it has been marked as requiring author feedback but has not had any activity for 4 days. It will be closed if no further activity occurs within 3 days of this comment.
Describe the bug For some reason long running functions are timing out after doing nothing. Check the screenshot below, we are continuously logging from the activity function. But suddenly it stops logging and then after a while the orchestration timeouts.
Investigative information
If deployed to Azure App Service
To Reproduce Long running CPU intensive activity function
Expected behavior
Actual behavior
Screenshots Added above
Known workarounds None
Additional context