Open papadeltasierra opened 7 months ago
We traced this back through the DTF code (current code seems similar to the tagged release that we were using) and can suggest at least three possible solutions.
- In `src\DurableTask.AzureStorage\Storage\AzureStorageClient.cs`, the code already catches `StorageException` but does no special processing for `ArgumentOutOfRangeException`. We assume that the `DurableTaskStorageException` would flow back to the user, so perhaps the `catch` statement could be extended. However, it would also be useful if the error could map from the Azure Storage information (which the user might know little about) to the actual DTF field that the user has set incorrectly and needs to change.
- Allow `ArgumentOutOfRangeException` to flow back up the stack, but now treat such exceptions as fatal (`IsFatal` in `src\DurableTask.Core\Common\Utils.cs`) so that `src\DurableTask.Core\WorkItemDispatcher.cs`, line 405, will not just keep retrying the orchestration/task. The issue with this is that the `ArgumentOutOfRangeException` returned to the user will again indicate a field that the user knows nothing about (`initialVisibilityDelay`).
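The two suggestions above can be sketched without touching DTF itself. This is only an illustration under stated assumptions: `UserConfigurationException`, `StorageCallSketch`, `TranslateAsync`, `IsNonRetriable` and `RunWithRetryAsync` are hypothetical names, not DTF or Azure Storage APIs, and the retry loop is a simplified stand-in for whatever the dispatcher really does.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical exception type: stands in for whatever DTF would surface to the user.
public sealed class UserConfigurationException : Exception
{
    public UserConfigurationException(string userFacingField, Exception inner)
        : base($"The value configured for '{userFacingField}' was rejected by the storage layer: {inner.Message}", inner)
    {
        UserFacingField = userFacingField;
    }

    // The setting the user actually controls, rather than the storage parameter name.
    public string UserFacingField { get; }
}

public static class StorageCallSketch
{
    // First suggestion: catch the storage-level exception and re-throw it
    // in terms of the field the user set.
    public static async Task<T> TranslateAsync<T>(Func<Task<T>> storageCall, string userFacingField)
    {
        try
        {
            return await storageCall();
        }
        catch (ArgumentOutOfRangeException e)
        {
            throw new UserConfigurationException(userFacingField, e);
        }
    }

    // Second suggestion: classify these exceptions as non-retriable.
    public static bool IsNonRetriable(Exception e) =>
        e is ArgumentOutOfRangeException || e is UserConfigurationException;

    // Simplified retry loop: retries transient failures, lets non-retriable ones escape.
    // Returns the number of attempts that were needed.
    public static async Task<int> RunWithRetryAsync(Func<Task> work, int maxAttempts)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                await work();
                return attempt;
            }
            catch (Exception e) when (!IsNonRetriable(e) && attempt < maxAttempts)
            {
                // Transient failure: fall through and try again.
            }
        }
    }
}
```

With this shape, an out-of-range argument ends the loop on the first attempt and reaches the caller, instead of being retried indefinitely.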
We wrote a recurring orchestrator, recurring every 15 minutes, and wanted it to run for a long time. We set the expiration to 15 days and this seemed to work, but we later discovered that the first iteration of the orchestrator ran while subsequent iterations did not. No error was reported back to our code and there were no obvious logs. We were eventually able to find evidence that DTF was failing to run the orchestration and then retrying, and that it continued to do this every 10 minutes, forever!
Having debugged this (against tagged release `durabletask.azurestorage-v1.13.6`), we have traced the issue to this exception. The call to the `CloudQueue.AddMessageAsync` method, https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.storage.queue.cloudqueue.addmessageasync?view=azure-dotnet-legacy#definition, seems to have an undocumented limit of 7 days on the `initialVisibilityDelay`. Setting 7 days or less works; setting anything more fails with the error above.
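One way to surface the problem early would be to check the delay before it ever reaches the queue call. The following is a minimal sketch, assuming the 7-day ceiling observed above; `VisibilityDelayGuard`, `Validate` and the `userFacingName` parameter are hypothetical and not part of DTF or the Azure Storage SDK.

```csharp
using System;

// Hypothetical helper; not part of DurableTask or the Azure Storage SDK.
public static class VisibilityDelayGuard
{
    // Assumption: the 7-day ceiling observed in this issue.
    public static readonly TimeSpan MaxVisibilityDelay = TimeSpan.FromDays(7);

    // Returns the delay unchanged when it is in range; otherwise throws an
    // exception that names the setting the user controls.
    public static TimeSpan Validate(TimeSpan requestedDelay, string userFacingName)
    {
        if (requestedDelay < TimeSpan.Zero || requestedDelay > MaxVisibilityDelay)
        {
            throw new ArgumentOutOfRangeException(
                userFacingName,
                requestedDelay,
                $"Delay must be between 0 and {MaxVisibilityDelay.TotalDays} days.");
        }

        return requestedDelay;
    }
}
```

For example, `VisibilityDelayGuard.Validate(TimeSpan.FromDays(15), "expiration")` would throw with `ParamName` set to `expiration` rather than `initialVisibilityDelay`, which is closer to what the user actually configured.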