**Open** · papadeltasierra opened this issue 1 year ago
We traced this back through the DTF code (the current code seems similar to the tagged release that we were using) and can suggest at least three possible solutions:

1. In `src\DurableTask.AzureStorage\Storage\AzureStorageClient.cs`, the code already catches `StorageException` but does no special processing for `ArgumentOutOfRangeException`. It is assumed that the resulting `DurableTaskStorageException` would flow back to the user, so perhaps the `catch` statement could be extended to cover `ArgumentOutOfRangeException` as well (a rough sketch follows this list).
2. It would also be useful if the error could be mapped from the Azure Storage information (which the user might know little about) to the actual DTF field that the user has set incorrectly and needs to change.
3. Allow the `ArgumentOutOfRangeException` to flow back up the stack but treat such exceptions as FATAL (`IsFatal` in `src\DurableTask.Core\Common\Utils.cs`), so that `src\DurableTask.Core\WorkItemDispatcher.cs`, line 405, does not just keep retrying the orchestration/task. The drawback is that the `ArgumentOutOfRangeException` returned to the user will again name a field the user knows nothing about (`initialVisibilityDelay`).
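As a rough sketch of options 1 and 2 (illustrative only; the wrapper method name and shape below are assumptions, not the actual `AzureStorageClient.cs` code), the catch could be extended and the storage-level parameter name translated into something the DTF user recognises:

```csharp
// Illustrative sketch only: the wrapper name/shape here is an assumption,
// not the actual AzureStorageClient.cs implementation.
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Storage;   // StorageException (legacy SDK used by this release)

static class StorageRequestSketch
{
    static async Task<T> MakeStorageRequestAsync<T>(Func<Task<T>> storageRequest)
    {
        try
        {
            return await storageRequest();
        }
        catch (StorageException ex)
        {
            // Existing behaviour: storage failures are wrapped and rethrown
            // (as DurableTaskStorageException in the real code).
            throw new InvalidOperationException("Storage operation failed.", ex);
        }
        catch (ArgumentOutOfRangeException ex) when (ex.ParamName == "initialVisibilityDelay")
        {
            // Proposed: translate the storage SDK's parameter name into the
            // DTF-level concept the user actually controls (the timer
            // fire-at time), so the surfaced error is actionable.
            throw new ArgumentOutOfRangeException(
                "fireAt",
                "Durable timers backed by Azure Storage queues cannot be scheduled "
                + "more than 7 days in the future. Original error: " + ex.Message);
        }
    }
}
```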
).Adding additional context to this bug. We encountered the same problem, where we were scheduling an activity to run 14 days later, and similarly did not see any indication of an error - however over a few weeks our system's performance degraded dramatically (from around a second per orchestration to sometimes more than an hour).
What may be different for us is that we are using durable entities, and interactions with these entities have become shockingly slow. The log below timed an entity operation (in milliseconds).
We have noted in our investigation that although there are 158 queued messages in the control-worker-01 queue (see image 1 below) the worker process does not appear to pick them up (image 2, for the same period). While we understand that many of these may be queued with a visibility delay (despite the errors) we would expect that should not impact the processing of messages which are queued without any visibility delay - however it appears that they are impacted.
Also worth noting that we only encountered the problem when we migrated to an isolated function running DotNet 8 - previously we ran in-process functions on DotNet 6 and there was no problem (over 2 years).
Image 1: Storage account showing 158 queued messages, some of which are visible (note that the messages displayed are erratic).

Image 2: Control worker logs showing no new messages found during the same period (NB: valid jobs were queued).

The exception message shows that each time this exception occurs the worker waits 10 s before continuing, which will compound the problem as the backlog of messages grows.

Example of message content (probably not relevant).

Overall workflow trace showing the resultant delay (around 15 minutes in this case).
We wrote a recurring orchestrator, recurring every 15 minutes, and wanted it to run for a long time. We set the expiration to 15 days and this seemed to work, but... We later discovered that the first iteration of the orchestrator happened but subsequent iterations did not, and there was no indication of an error back to our code and no obvious logs. We were eventually able to find logging that indicated DTF was failing to run the orchestration and then retrying, and it continued to do this every 10 minutes, forever!
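For context, a minimal sketch of the kind of pattern involved, assuming the long wait is implemented as a durable timer via `OrchestrationContext.CreateTimer` (the class and names here are illustrative, not our actual code):

```csharp
// Illustrative sketch only: an orchestration that waits ~15 days using a
// durable timer. With the Azure Storage backend this becomes a queue message
// with an initialVisibilityDelay of ~15 days, which triggers the failure
// described below.
using System;
using System.Threading.Tasks;
using DurableTask.Core;

public class LongDelayOrchestration : TaskOrchestration<string, string>
{
    public override async Task<string> RunTask(OrchestrationContext context, string input)
    {
        DateTime fireAt = context.CurrentUtcDateTime.AddDays(15);

        // The timer never fires: the underlying AddMessageAsync call throws
        // ArgumentOutOfRangeException, which is swallowed and retried forever.
        await context.CreateTimer(fireAt, "timer-expired");

        return "done";
    }
}
```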
Having debugged this (against tagged release `durabletask.azurestorage-v1.13.6`), we have traced the issue to this exception.

The call to the `CloudQueue.AddMessageAsync` method (https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.storage.queue.cloudqueue.addmessageasync?view=azure-dotnet-legacy#definition) seems to have an undocumented limit on the `initialVisibilityDelay` of 7 days. Setting it to <= 7 days works; setting anything more fails with the error above.
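For reference, a minimal standalone sketch of the storage-level behaviour, assuming the legacy `Microsoft.Azure.Storage.Queue` SDK; the connection string and queue name are placeholders:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Storage;
using Microsoft.Azure.Storage.Queue;

class Repro
{
    static async Task Main()
    {
        CloudStorageAccount account = CloudStorageAccount.Parse("<storage-connection-string>");
        CloudQueue queue = account.CreateCloudQueueClient().GetQueueReference("repro-queue");
        await queue.CreateIfNotExistsAsync();

        // Works: initialVisibilityDelay of exactly 7 days.
        await queue.AddMessageAsync(
            new CloudQueueMessage("seven days"),
            timeToLive: null,
            initialVisibilityDelay: TimeSpan.FromDays(7),
            options: null,
            operationContext: null);

        // Fails: anything over 7 days throws ArgumentOutOfRangeException
        // naming the initialVisibilityDelay parameter, even though the linked
        // documentation does not call out the limit.
        await queue.AddMessageAsync(
            new CloudQueueMessage("more than seven days"),
            timeToLive: null,
            initialVisibilityDelay: TimeSpan.FromDays(7) + TimeSpan.FromMinutes(1),
            options: null,
            operationContext: null);
    }
}
```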