Azure / durabletask

Durable Task Framework allows users to write long running persistent workflows in C# using the async/await capabilities.
Apache License 2.0
1.5k stars 289 forks source link

Tasks are running even when orchestration is complete #1137

Open saurav-microsoft opened 1 month ago

saurav-microsoft commented 1 month ago

Hi, I am using DTF with Azure Storage Provider.

I have an orchestration which contains several tasks. The orchestration completes successfully. However I see that there are 2 tasks which are hidden in the workItem queue. Even after the orchestration is complete, these tasks keep on executing forever.

I tried to debug from logs to understand what is going on here, but could not go further. Below is an example of a good task (task#3) and a bad task (task#5) which executes forever.

LifeStory of Task#3

Sending [TaskScheduled#3] message to newprovisioninghub-workitems for instance 'c2102a9a4d5942e4a1bf91e0767b50c8'

Fetched [TaskScheduled#3] message from newprovisioninghub-workitems (delay = 45ms)

Processing [TaskScheduled#3] (total delay = 45ms)

.......... Runs the Task - <>

Sending [TaskCompleted#3] message to newprovisioninghub-control-00 for instance 'c2102a9a4d5942e4a1bf91e0767b50c8'

Deleting [TaskScheduled#3] message from newprovisioninghub-workitems

Fetched [TaskCompleted#3] message from newprovisioninghub-control-00 (delay = 90ms)

Processing [TaskCompleted#3] (total delay = 166ms)

Deleting [TaskCompleted#3] message from newprovisioninghub-control-00


LifeStory of Task#5

Sending [TaskScheduled#5] message to newprovisioninghub-workitems for instance 'c2102a9a4d5942e4a1bf91e0767b50c8'

Fetched [TaskScheduled#5] message from newprovisioninghub-workitems (delay = 0ms)

Processing [TaskScheduled#5] (total delay = 0ms)

newprovisioninghub-control-00: No new messages were found - backing off

Renewing [TaskScheduled#5] message on newprovisioninghub-workitems for an additional 1200 seconds

Sending [TaskCompleted#5] message to newprovisioninghub-control-00 for instance 'c2102a9a4d5942e4a1bf91e0767b50c8'

Deleting [TaskScheduled#5] message from newprovisioninghub-workitems

Fetched [TaskCompleted#5] message from newprovisioninghub-control-00 (delay = 251ms)

Failed to update or delete message [TaskScheduled#5] with ID = c4eb20ac-48be-47d2-bf25-dd202a67c8e4 because it no longer exists

Processing [TaskCompleted#5] (total delay = 335ms)

Deleting [TaskCompleted#5] message from newprovisioninghub-control-00

saurav-microsoft commented 1 month ago

@davidmrdavid @cgillum - Requesting your help if I can do something more to add logs, or any other information

saurav-microsoft commented 1 month ago

Similar to https://github.com/Azure/durabletask/issues/1049

cgillum commented 1 month ago

According to those logs, it looks like the message should be getting deleted. Are there any errors in the logs? What is the content of the queue messages?

saurav-microsoft commented 1 month ago

Apart from this log I do not see any error in the logs,

Failed to update or delete message [TaskScheduled#5] with ID = c4eb20ac-48be-47d2-bf25-dd202a67c8e4 because it no longer exists

The content of the queue message of type TaskScheduledEvent

saurav-microsoft commented 1 month ago

Also I observed a pattern that the task where renew happens executes forever,

Renewing [TaskScheduled#5] message on newprovisioninghub-workitems for an additional 1200 seconds

@cgillum Although it says that the message was not found, I can still see that message in my queue. Initially it was hidden though.

Failed to update or delete message [TaskScheduled#5] with ID = 24ffa893-77fe-4afa-b4bb-eb2c0ba618cf because it no longer exists

image

cgillum commented 1 month ago

Although it says that the message was not found, I can still see that message in my queue. Initially it was hidden though.

This message could also mean that the message exists, but it was hidden. The messages become hidden when they are being processed by some other listener. In your case, the fact that there's a "Renewing [TaskScheduled#5] message on newprovisioninghub-workitems for an additional 1200 seconds" tells me that the processing of this message is taking a long time, and that it will remain hidden for up to 1200 seconds (20 minutes).

What version of the Microsoft.Azure.DurableTask.AzureStorage nuget package are you using?

saurav-microsoft commented 1 month ago

Yes, the Task does some heavy work which usually takes longer. In that case, it is okay that my long running Task messages are renewed. It's just that it should be deleting it when the processing is done which I believe is failing here.

I am using Microsoft.Azure.DurableTask.AzureStorage-2.0.0-rc.

Moved to this NuGet to move away from StorageConnectionString and use TokenCredential

saurav-microsoft commented 1 month ago

I also checked Azure Queue docs and that it throws 404 when pop receipt is not what it gave away in the GET call. Not sure, if it is a bug here

davidmrdavid commented 1 month ago

@saurav-microsoft - please note that thet -rc pacakges are not production ready, we cannot recommend them to be used in production.

That said - we'ved fixed a few major bugs since the rc release. Can you upgrade to the rc.3 package and see if your issue reproduces?

davidmrdavid commented 1 month ago

In particular, there was an error in the updating logic for the popReceipt property, which seems to match your symptoms (note I've only taken a cursory look at this thread, apologies if I've missed something major)

saurav-microsoft commented 1 month ago

Yes, it was indeed pop receipt issue during renewal flows. It works now for me. @davidmrdavid Can you please help me with a tentative timeline when you are planning for final release? We are not in production yet, but we plan to go live soon. I was hoping I consume the right NuGet before that.