Closed: pluralmonad closed this issue 6 years ago.
Thanks for opening the issue. Just to clarify, you have a parent orchestrator which calls a sub-orchestrator, and both are succeeding, correct? In spite of this, you're seeing repeated failures in your logs for the sub-orchestrator?
A few other questions (apologies if I missed any of these details from your original post):
I would like to investigate this more deeply to see if this might be an issue with the framework. Thanks for any additional information you can share.
> Thanks for opening the issue. Just to clarify, you have a parent orchestrator which calls a sub-orchestrator, and both are succeeding, correct? In spite of this, you're seeing repeated failures in your logs for the sub-orchestrator?
Yes and yes, with the addition that the parent orchestrator and its sub-orchestrations report success while those errors continue to be logged.
Instance ID from the log: 2018-06-08T19:02:38.957 [Info] Function completed (Success, Id=fdf4da1f-9a3d-4729-a7a2-5cdbad985d34, Duration=534ms)
Region: Central US
The errors apparently never stop: all automated triggers are disabled, and the error is still being logged as we speak.
I cannot say for sure, since the log statements with the errors do not report an instance ID, but that is certainly possible.
I just realized that might be the execution ID in that log statement. Here is an instance ID that was just used to start a new instance: 7aade37c-98da-4e5d-b3dd-ce943c360858
It was an old execution that did not seem to be able to complete, I suppose due to the nature of that exception. I managed to catch the queue message in one of the task hub control queues (these had previously appeared empty in Storage Explorer). It had a dequeue count over 4000. Clearing that queue seems to have stopped the log messages, and I assume the broken replays as well.
Is there a configuration value in the host.json file I can set to safeguard against a similar problem in the future? Ideally I would not have to monitor the control queues directly. Or is simply not using the "StartWithRetryAsync" call enough? I said above that I am not using any retry calls, but since this instance had been failing for some time, it is possible it was created with such a call in the past.
Aside from the logs (and App Insights...) being flooded, I would be most concerned with preventing a large bill from accruing due to constant try/fail cycles.
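For anyone who hits the same symptom, here is a rough sketch of peeking the task hub control queues for messages with runaway dequeue counts, rather than watching them by hand in Storage Explorer. It uses the classic Microsoft.WindowsAzure.Storage SDK; the task hub name (`durablefunctionshub`) and partition count (4) are the defaults and may differ in your app:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;

class ControlQueueMonitor
{
    static async Task Main()
    {
        // Connect with the same storage account the function app uses.
        CloudStorageAccount account = CloudStorageAccount.Parse(
            Environment.GetEnvironmentVariable("AzureWebJobsStorage"));
        CloudQueueClient client = account.CreateCloudQueueClient();

        // Default task hub name and partition count assumed here (4 partitions).
        for (int partition = 0; partition < 4; partition++)
        {
            CloudQueue queue = client.GetQueueReference($"durablefunctionshub-control-{partition:00}");
            if (!await queue.ExistsAsync())
            {
                continue;
            }

            // Peek does not dequeue anything, so this is safe to run against a
            // live task hub; it returns at most 32 messages per call.
            foreach (CloudQueueMessage message in await queue.PeekMessagesAsync(32))
            {
                if (message.DequeueCount > 100)
                {
                    Console.WriteLine(
                        $"{queue.Name}: message {message.Id} has DequeueCount = {message.DequeueCount}");
                }
            }
        }
    }
}
```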
Interesting. Thanks for providing these details. I'm investigating a similar issue involving sub-orchestrations and I wonder if these two issues are related. Did you happen to save the contents of the problematic queue message? I'd be interested to know what the event type was.
I looked at instance 7aade37c-98da-4e5d-b3dd-ce943c360858 and that one appears to have completed successfully.
I went back and searched for messages with 1000+ dequeue counts and I found two. In both cases I detected the following error which is causing the issue:
TaskOrchestrationDispatcher-736e389c57914edb8a7a42301521c04c-0: Exception while processing workItem 9c8858c0-20d8-4361-a5a8-ac873e59afe8
Exception: Microsoft.WindowsAzure.Storage.StorageException : Element 1 in the batch returned an unexpected response code.
at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndExecuteAsync[T](IAsyncResult result) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Executor\Executor.cs:line 50
at Microsoft.WindowsAzure.Storage.Table.CloudTable.EndExecuteBatch(IAsyncResult asyncResult) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Table\CloudTable.cs:line 207
at Microsoft.WindowsAzure.Storage.Core.Util.AsyncExtensions.<>c__DisplayClass1`1.<CreateCallback>b__0(IAsyncResult ar) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Util\AsyncExtensions.cs:line 66
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at DurableTask.AzureStorage.Tracking.AzureTableTrackingStore.<UploadHistoryBatch>d__42.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.ValidateEnd(Task task)
at DurableTask.AzureStorage.Tracking.AzureTableTrackingStore.<UpdateStateAsync>d__34.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.ValidateEnd(Task task)
at DurableTask.AzureStorage.AzureStorageOrchestrationService.<CompleteTaskOrchestrationWorkItemAsync>d__76.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.ValidateEnd(Task task)
at DurableTask.Core.TaskOrchestrationDispatcher.<OnProcessWorkItemAsync>d__20.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at DurableTask.Core.TaskOrchestrationDispatcher.<OnProcessWorkItemSessionAsync>d__19.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at DurableTask.Core.WorkItemDispatcher`1.<ProcessWorkItemAsync>d__34.MoveNext()
Inner Exception:
The relevant part is this:
StorageException : Element 1 in the batch returned an unexpected response code.
This tells me that something in the DurableTask.AzureStorage package is failing to append rows to your History table. I wasn't able to find much more than this, though the issue may be one of the items described here.
Are there any other hints you can provide that might help explain this? For example, does your parent orchestration catch the sub-orchestration failure and schedule some additional work? If so, could it potentially be a lot of work? I don't think this code takes the 4 MB batch limit into consideration, and I wonder if that could be the root cause.
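For context, Azure Table storage entity-group transactions are limited to 100 operations and roughly 4 MB of total payload (about 1 MB per entity), and every entity in a batch must share a partition key. The sketch below only illustrates what staying under those limits looks like; it is not the actual DurableTask.AzureStorage code, and the per-row size estimate is a simplification:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Table;

static class HistoryBatchSketch
{
    // Table storage limits: 100 operations per batch, ~4 MB total payload,
    // ~1 MB per entity, and a single partition key per batch.
    const int MaxOperationsPerBatch = 100;
    const long MaxBatchPayloadBytes = 4 * 1024 * 1024;

    public static async Task InsertInChunksAsync(
        CloudTable historyTable,
        IEnumerable<DynamicTableEntity> rows,
        long estimatedBytesPerRow)
    {
        var batch = new TableBatchOperation();
        long batchBytes = 0;

        foreach (DynamicTableEntity row in rows)
        {
            // Flush the current batch before it would exceed either limit.
            if (batch.Count == MaxOperationsPerBatch ||
                batchBytes + estimatedBytesPerRow > MaxBatchPayloadBytes)
            {
                await historyTable.ExecuteBatchAsync(batch);
                batch = new TableBatchOperation();
                batchBytes = 0;
            }

            batch.InsertOrReplace(row);
            batchBytes += estimatedBytesPerRow;
        }

        if (batch.Count > 0)
        {
            await historyTable.ExecuteBatchAsync(batch);
        }
    }
}
```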
I've separately opened a couple related issues based on this investigation:
Depending on how long those poison messages had been hanging around, it very well could have been an oversized-message issue. The job was initially implemented with activities returning data and passing it on to the next step (likely over 4 MB at certain steps), which I refactored after it appeared the job host was writing the data to storage between every step.
I have made a fix for the 4 MB case (https://github.com/Azure/azure-functions-durable-extension/issues/339) and have also added a bunch of extra tracing to make it easier to diagnose these kinds of issues. Closing this for now. Next week we'll be releasing v1.5.0, which will contain these fixes. Please let us know if you still run into these problems when you upgrade to v1.5.0.
I am using durable orchestration function(s) to send batches of HTTP requests to a third-party API. The job has been experiencing intermittent failures coming back as `System.Net.Http.HttpRequestException`.

The job is scheduled using a timer trigger, which sends a queue message to a function with an `OrchestrationClient` binding. That function checks for a running instance, starts a new instance if one does not already exist, and handles a few other scenarios. It is the only function that starts the orchestration, and it always checks for an existing instance before starting one. I disabled all scheduled triggers for the orchestration and had been kicking it off manually through the portal while testing.

The HTTP issues appear to be fixed, as the job has now completed every time I have kicked it off and watched it through the streaming logs. Although the orchestration succeeds, and I can see that it actually has because the third-party data is all updated, the log continues to show the exception/stack-trace message below that it was failing with earlier, only now it appears after the orchestration has completed successfully. Another indicator that it is completing is that the final step updates a table entity recording its execution status, which is happening as expected.
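For reference, a minimal sketch of the check-before-start pattern described above, written against the v1.x `DurableOrchestrationClient` API; the function, queue, orchestrator, and instance ID names are placeholders rather than the ones from this app:

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Host;

public static class OrchestrationStarter
{
    [FunctionName("StartBatchJob")]
    public static async Task Run(
        [QueueTrigger("start-batch-job")] string queueMessage,
        [OrchestrationClient] DurableOrchestrationClient client,
        TraceWriter log)
    {
        // A fixed instance ID gives singleton behavior: only one run at a time.
        const string instanceId = "batch-job-singleton";

        DurableOrchestrationStatus status = await client.GetStatusAsync(instanceId);
        bool notRunning = status == null ||
            status.RuntimeStatus == OrchestrationRuntimeStatus.Completed ||
            status.RuntimeStatus == OrchestrationRuntimeStatus.Failed ||
            status.RuntimeStatus == OrchestrationRuntimeStatus.Terminated;

        if (notRunning)
        {
            await client.StartNewAsync("BatchJobOrchestrator", instanceId, queueMessage);
            log.Info($"Started orchestration '{instanceId}'.");
        }
        else
        {
            log.Info($"Orchestration '{instanceId}' is already running; skipping.");
        }
    }
}
```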
I want to apologize ahead of time, as this is a lot of text, but I wanted to show the entire exception/stack-trace message that appears in the log repeatedly AFTER the orchestration has already completed.
That is the exception I had been troubleshooting all day and I finally got the job to run to completion without error, but that log message block still shows up repeatedly after completion.
I have seen the top-level orchestration show up as executing and completing while running for less than 1 second, often only a handful of milliseconds (while indicating that no work is being done, since no log statements from within the job code are printed). I thought that it might have been because I was reusing a static instance ID to enforce singleton behavior, but I have since stopped that and now store a `LastKnownInstanceID` in table storage and cycle it as needed using new GUIDs.

I have some thoughts on what might be throwing a wrench into the orchestrations and DTFx completing cleanly, but they seem unlikely, so I was hoping to get some feedback before adding any more detail to this already long-winded issue.
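And a small sketch of the `LastKnownInstanceID` cycling described above, using a table entity and `Guid.NewGuid()`; the entity shape and keys are invented for illustration:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Table;

// A table entity that records the last instance ID handed out for the job.
public class JobStateEntity : TableEntity
{
    public string LastKnownInstanceID { get; set; }
}

public static class InstanceIdStore
{
    // Returns the stored instance ID, replacing it with a fresh GUID when the
    // previous instance has finished (or when nothing has been stored yet).
    public static async Task<string> GetOrCycleAsync(CloudTable table, bool previousInstanceFinished)
    {
        TableResult result = await table.ExecuteAsync(
            TableOperation.Retrieve<JobStateEntity>("jobstate", "batch-job"));

        var entity = (JobStateEntity)result.Result
            ?? new JobStateEntity { PartitionKey = "jobstate", RowKey = "batch-job" };

        if (string.IsNullOrEmpty(entity.LastKnownInstanceID) || previousInstanceFinished)
        {
            entity.LastKnownInstanceID = Guid.NewGuid().ToString("N");
            await table.ExecuteAsync(TableOperation.InsertOrReplace(entity));
        }

        return entity.LastKnownInstanceID;
    }
}
```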
I can show code and share App Service details as needed. Thank you for taking the time to read the wall of text :)
P.S. One more detail that may be material to the issue: the failure is coming from a sub-orchestration initiated by the top-level orchestration. Both complete successfully, as discussed above, before those errors show up in the log.
P.P.S. No calls to sub-orchestrations or activities are being made with retry.
Adding .csproj for assembly version reference,