Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions

Phantom failures showing in log after orchestration completes successfully #334

Closed pluralmonad closed 6 years ago

pluralmonad commented 6 years ago

I am using durable orchestration function(s) to send batches of HTTP requests to a 3rd party API.

The job has been experiencing intermittent failures coming back as System.Net.Http.HttpRequestException.

The job is scheduled using a timer trigger, which sends a queue message to a function with an OrchestrationClient binding. That function checks for a running instance, starts a new one if none exists, and handles a few other scenarios. It is the only function that starts the orchestration, and it always checks for an existing instance before starting a new one. I disabled all scheduled triggers for the orchestration and had been kicking it off manually through the portal while testing.
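For context, a minimal sketch of what that starter roughly looks like against the 1.x API (the starter name, queue name, and fixed instance ID below are placeholders; only "Vds-Orchestrator" matches the real job, and the "other scenarios" are omitted):

using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Host;

public static class VdsStarter
{
    [FunctionName("Vds-StartOrchestration")] // placeholder starter name
    public static async Task Run(
        [QueueTrigger("vds-start-requests")] string message,    // placeholder queue name
        [OrchestrationClient] DurableOrchestrationClient starter,
        TraceWriter log)
    {
        // Instance ID used to enforce singleton behavior (later cycled via table
        // storage, as described further down in this thread).
        string instanceId = "vds-orchestrator-singleton";        // placeholder

        var status = await starter.GetStatusAsync(instanceId);
        bool alreadyRunning = status != null &&
            (status.RuntimeStatus == OrchestrationRuntimeStatus.Running ||
             status.RuntimeStatus == OrchestrationRuntimeStatus.Pending ||
             status.RuntimeStatus == OrchestrationRuntimeStatus.ContinuedAsNew);

        if (!alreadyRunning)
        {
            await starter.StartNewAsync("Vds-Orchestrator", instanceId, message);
            log.Info($"Started orchestration instance '{instanceId}'.");
        }
        else
        {
            log.Info($"Instance '{instanceId}' is already running; not starting a new one.");
        }
    }
}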

The HTTP issues appear to be fixed, as the job has now completed every time I have kicked it off and watched it through the streaming logs. Although the orchestration succeeds (and I can see that it actually has, because the 3rd party data is all updated), the log continues to show the exception/stack trace below that it was failing with earlier, even after the orchestration has completed successfully. Another indicator that it is completing is that the final step updates a table entity recording its execution status, which is happening as expected.

I apologize ahead of time for the amount of text, but I wanted to show the entire exception/stack trace message that appears in the log repeatedly AFTER the orchestration has already completed.

Microsoft.Azure.WebJobs.FunctionFailedException : The orchestrator function 'Vds-Orchestrator' failed: "The activity function 'Vds-SyncContacts' failed: "Failed to deserialize exception from TaskActivity: {"ClassName":"System.Net.Http.HttpRequestException","Message":"Error while copying content to a stream.","Data":null,"InnerException":{"ClassName":"System.IO.IOException","Message":"Unable to read data from the transport connection: The connection was closed.","Data":null,"InnerException":null,"HelpURL":null,"StackTraceString":"   at System.Net.ConnectStream.EndWrite(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)","RemoteStackTraceString":null,"RemoteStackIndex":0,"ExceptionMethod":"8\nEndWrite\nSystem, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089\nSystem.Net.ConnectStream\nVoid EndWrite(System.IAsyncResult)","HResult":-2146232800,"Source":"System","WatsonBuckets":null},"HelpURL":null,"StackTraceString":"   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.ValidateEnd(Task task)
   at WebJob.Utilities.HubSpotApiUtil.<SendRequestsAndHandleResponses>d__6.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.ValidateEnd(Task task)
   at WebJob.Functions.Vds.SyncContacts.<Run>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionInvoker`2.<InvokeAsync>d__9.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.<InvokeAsync>d__24.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.<ExecuteWithWatchersAsync>d__23.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.<ExecuteWithLoggingAsync>d__22.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.ValidateEnd(Task task)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.<ExecuteWithLoggingAsync>d__16.MoveNext()","RemoteStackTraceString":null,"RemoteStackIndex":0,"ExceptionMethod":"8\nThrowForNonSuccess\nmscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089\nSystem.Runtime.CompilerServices.TaskAwaiter\nVoid ThrowForNonSuccess(System.Threading.Tasks.Task)","HResult":-2146233088,"Source":"mscorlib","WatsonBuckets":null,"SafeSerializationManager":{"m_serializedStates":[{}]},"CLR_SafeSerializationManager_RealType":"System.Net.Http.HttpRequestException, System.Net.Http, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a"}". See the function execution logs for additional details.". See the function execution logs for additional details. ---> System.Exception : The activity function 'Vds-SyncContacts' failed: "Failed to deserialize exception from TaskActivity: {"ClassName":"System.Net.Http.HttpRequestException","Message":"Error while copying content to a stream.","Data":null,"InnerException":{"ClassName":"System.IO.IOException","Message":"Unable to read data from the transport connection: The connection was closed.","Data":null,"InnerException":null,"HelpURL":null,"StackTraceString":"   at System.Net.ConnectStream.EndWrite(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)","RemoteStackTraceString":null,"RemoteStackIndex":0,"ExceptionMethod":"8\nEndWrite\nSystem, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089\nSystem.Net.ConnectStream\nVoid EndWrite(System.IAsyncResult)","HResult":-2146232800,"Source":"System","WatsonBuckets":null},"HelpURL":null,"StackTraceString":"   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.ValidateEnd(Task task)
   at WebJob.Utilities.HubSpotApiUtil.<SendRequestsAndHandleResponses>d__6.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.ValidateEnd(Task task)
   at WebJob.Functions.Vds.SyncContacts.<Run>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionInvoker`2.<InvokeAsync>d__9.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.<InvokeAsync>d__24.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.<ExecuteWithWatchersAsync>d__23.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.<ExecuteWithLoggingAsync>d__22.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.ValidateEnd(Task task)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.<ExecuteWithLoggingAsync>d__16.MoveNext()","RemoteStackTraceString":null,"RemoteStackIndex":0,"ExceptionMethod":"8\nThrowForNonSuccess\nmscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089\nSystem.Runtime.CompilerServices.TaskAwaiter\nVoid ThrowForNonSuccess(System.Threading.Tasks.Task)","HResult":-2146233088,"Source":"mscorlib","WatsonBuckets":null,"SafeSerializationManager":{"m_serializedStates":[{}]},"CLR_SafeSerializationManager_RealType":"System.Net.Http.HttpRequestException, System.Net.Http, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a"}". See the function execution logs for additional details. ---> System.Exception : Failed to deserialize exception from TaskActivity: {"ClassName":"System.Net.Http.HttpRequestException","Message":"Error while copying content to a stream.","Data":null,"InnerException":{"ClassName":"System.IO.IOException","Message":"Unable to read data from the transport connection: The connection was closed.","Data":null,"InnerException":null,"HelpURL":null,"StackTraceString":"   at System.Net.ConnectStream.EndWrite(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)","RemoteStackTraceString":null,"RemoteStackIndex":0,"ExceptionMethod":"8\nEndWrite\nSystem, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089\nSystem.Net.ConnectStream\nVoid EndWrite(System.IAsyncResult)","HResult":-2146232800,"Source":"System","WatsonBuckets":null},"HelpURL":null,"StackTraceString":"   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.ValidateEnd(Task task)
   at WebJob.Utilities.HubSpotApiUtil.<SendRequestsAndHandleResponses>d__6.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.ValidateEnd(Task task)
   at WebJob.Functions.Vds.SyncContacts.<Run>d__3.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionInvoker`2.<InvokeAsync>d__9.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.<InvokeAsync>d__24.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.<ExecuteWithWatchersAsync>d__23.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.<ExecuteWithLoggingAsync>d__22.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.ValidateEnd(Task task)
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.<ExecuteWithLoggingAsync>d__16.MoveNext()","RemoteStackTraceString":null,"RemoteStackIndex":0,"ExceptionMethod":"8\nThrowForNonSuccess\nmscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089\nSystem.Runtime.CompilerServices.TaskAwaiter\nVoid ThrowForNonSuccess(System.Threading.Tasks.Task)","HResult":-2146233088,"Source":"mscorlib","WatsonBuckets":null,"SafeSerializationManager":{"m_serializedStates":[{}]},"CLR_SafeSerializationManager_RealType":"System.Net.Http.HttpRequestException, System.Net.Http, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a"} ---> System.Exception : Exception has been thrown by the target of an invocation. ---> System.Exception : Unable to cast object of type 'Newtonsoft.Json.Linq.JObject' to type 'System.Runtime.Serialization.ISafeSerializationData'. 

   End of inner exception 

   End of inner exception 

   End of inner exception 

   End of inner exception

   at async Microsoft.Azure.WebJobs.DurableOrchestrationContext.CallDurableTaskFunctionAsync[TResult](String functionName,FunctionType functionType,String instanceId,RetryOptions retryOptions,Object input)

   at async WebJob.Functions.Orchestrator.Run(DurableOrchestrationContext context,TraceWriter log,CancellationToken systemToken)

That is the exception I had been troubleshooting all day. I finally got the job to run to completion without error, but that log message block still shows up repeatedly after completion.

I have seen the top-level orchestration show up as executing and completing while running for less than 1 second, often only a handful of milliseconds (and apparently doing no work, since none of the log statements inside the job code are printed). I thought that might have been because I was reusing a static instanceId to effect singleton behavior, but I have since stopped doing that and now store a LastKnownInstanceID in table storage and cycle it as needed using new GUIDs.

I have some thoughts on what could be throwing a wrench into the orchestrations and DTFx completing cleanly, but they seem unlikely, so I was hoping to get some feedback before adding any more detail to this already long-winded issue.

I can show code and share App Service details as needed. Thank you for taking the time to read the wall of text :)

P.S. One more detail that may be material to the issue is that the failure is coming from a sub-orchestration initiated by the top-level orchestration. Both complete successfully, as discussed above, before those errors show up in the log.

P.P.S. No calls to sub-orchestrations or activities are being made with retry.

Adding the .csproj for package/assembly version reference:

<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <TargetFramework>net461</TargetFramework>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="Dapper" Version="1.50.5" />
    <PackageReference Include="Microsoft.Azure.WebJobs.Extensions.DurableTask" Version="1.4.1" />
    <PackageReference Include="Microsoft.Azure.WebJobs.Extensions.SendGrid" Version="2.2.0" />
    <PackageReference Include="Microsoft.NET.Sdk.Functions" Version="1.0.13" />
    <PackageReference Include="System.Threading.Tasks.Extensions" Version="4.4.0" />
  </ItemGroup>
  <ItemGroup>
    <Reference Include="Microsoft.CSharp" />
    <Reference Include="System.Configuration" />
  </ItemGroup>
  <ItemGroup>
    <None Update="host.json">
      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
    </None>
    <None Update="local.settings.json">
      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
      <CopyToPublishDirectory>Never</CopyToPublishDirectory>
    </None>
  </ItemGroup>
</Project>
cgillum commented 6 years ago

Thanks for opening the issue. Just to clarify, you have a parent orchestrator which calls a sub-orchestrator, and both are succeeding, correct? In spite of this, you're seeing repeated failures in your logs for the sub-orchestrator?

A few other questions (apologies if I missed any of these details from your original post):

What is the instance ID of the orchestration that is logging these failures?
Which Azure region is your function app deployed to?
Do the errors eventually stop, or are they still being logged?
Is it possible these failures are coming from an older orchestration instance rather than the latest run?

I would like to investigate this more deeply to see if this might be an issue with the framework. Thanks for any additional information you can share.

pluralmonad commented 6 years ago

Thanks for opening the issue. Just to clarify, you have a parent orchestrator which calls a sub-orchestrator, and both are succeeding, correct? In spite of this, you're seeing repeated failures in your logs for the sub-orchestrator?

Yes and yes, with the addition that the parent orchestrator and its sub-orchestrations report success while those errors continue to be logged.

Instance ID from the log: 2018-06-08T19:02:38.957 [Info] Function completed (Success, Id=fdf4da1f-9a3d-4729-a7a2-5cdbad985d34, Duration=534ms)

Region: Central US

The errors apparently never stop: all automated triggers are disabled, and it is still logging the error as we speak.

I cannot say for sure, since the log statements with the errors do not report an instance ID, but that is certainly possible.

pluralmonad commented 6 years ago

Just realized that might be the execution ID in that log statement. Here is an instance ID that was just used to start a new instance: 7aade37c-98da-4e5d-b3dd-ce943c360858

pluralmonad commented 6 years ago

It was an old execution that did not seem to be able to complete, I suppose due to the nature of that exception. I managed to catch the queue message in one of the task hub control queues (these had previously appeared empty in Storage Explorer). It had a dequeue count over 4000. Clearing that queue seems to have stopped the log messages, and I guess the broken replays as well.
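For reference, this is roughly how the control queues can be inspected for stuck messages without dequeuing anything (a sketch using the WindowsAzure.Storage SDK; it assumes the default task hub name "DurableFunctionsHub" and the default 4 control-queue partitions):

using System;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;

public static class ControlQueueInspector
{
    // Peek at the task hub control queues and report dequeue counts.
    public static async Task DumpDequeueCountsAsync(string storageConnectionString)
    {
        var account = CloudStorageAccount.Parse(storageConnectionString);
        var queueClient = account.CreateCloudQueueClient();

        // Durable Functions v1 names its control queues "<taskhub>-control-NN".
        for (int partition = 0; partition < 4; partition++)
        {
            var queue = queueClient.GetQueueReference($"durablefunctionshub-control-{partition:D2}");
            if (!await queue.ExistsAsync()) continue;

            // Peek (without removing) up to 32 messages and report their dequeue counts.
            foreach (CloudQueueMessage msg in await queue.PeekMessagesAsync(32))
            {
                Console.WriteLine($"{queue.Name}: DequeueCount={msg.DequeueCount}");
            }
        }
    }
}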

Is there a configuration value in the host.json file I can set to safeguard against a similar problem in the future? Ideally I would like not to have to monitor the control queues directly. Or is simply not using the "StartWithRetryAsync" call functionality enough? I said above that I am not using any retry calls, but since this instance had been failing for some time, it is possible it was created with such a call in the past.
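To be explicit about what "with retry" means here, a minimal sketch against the 1.x API (the sub-orchestrator name and inputs are placeholders; the plain call is the form the job uses today, and the WithRetry variant is the one it does not):

using System;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;

public static class OrchestratorSketch
{
    [FunctionName("Vds-Orchestrator")]
    public static async Task Run([OrchestrationTrigger] DurableOrchestrationContext context)
    {
        // Plain call: a failure surfaces to the orchestrator once as a
        // FunctionFailedException and the framework does not retry it.
        await context.CallSubOrchestratorAsync("Vds-SubOrchestrator", null);

        // Call with retry: the framework re-schedules the sub-orchestration on
        // failure according to RetryOptions (first retry interval, max attempts).
        var retry = new RetryOptions(TimeSpan.FromSeconds(30), 3);
        await context.CallSubOrchestratorWithRetryAsync("Vds-SubOrchestrator", retry, null);
    }
}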

Aside from the destruction of the logs (and App Insights...), I would be most concerned with preventing a large bill from accruing due to constant try/fail cycles.

cgillum commented 6 years ago

Interesting. Thanks for providing these details. I'm investigating a similar issue involving sub-orchestrations and I wonder if these two issues are related. Did you happen to save the contents of the problematic queue message? I'd be interested to know what the event type was.

cgillum commented 6 years ago

I looked at instance 7aade37c-98da-4e5d-b3dd-ce943c360858 and that one appears to have completed successfully.

I went back and searched for messages with 1000+ dequeue counts and found two. In both cases I detected the following error, which is causing the issue:

TaskOrchestrationDispatcher-736e389c57914edb8a7a42301521c04c-0: Exception while processing workItem 9c8858c0-20d8-4361-a5a8-ac873e59afe8
Exception: Microsoft.WindowsAzure.Storage.StorageException : Element 1 in the batch returned an unexpected response code.
       at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.EndExecuteAsync[T](IAsyncResult result) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Executor\Executor.cs:line 50
   at Microsoft.WindowsAzure.Storage.Table.CloudTable.EndExecuteBatch(IAsyncResult asyncResult) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Table\CloudTable.cs:line 207
   at Microsoft.WindowsAzure.Storage.Core.Util.AsyncExtensions.<>c__DisplayClass1`1.<CreateCallback>b__0(IAsyncResult ar) in c:\Program Files (x86)\Jenkins\workspace\release_dotnet_master\Lib\ClassLibraryCommon\Core\Util\AsyncExtensions.cs:line 66
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at DurableTask.AzureStorage.Tracking.AzureTableTrackingStore.<UploadHistoryBatch>d__42.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.ValidateEnd(Task task)
   at DurableTask.AzureStorage.Tracking.AzureTableTrackingStore.<UpdateStateAsync>d__34.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.ValidateEnd(Task task)
   at DurableTask.AzureStorage.AzureStorageOrchestrationService.<CompleteTaskOrchestrationWorkItemAsync>d__76.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.ValidateEnd(Task task)
   at DurableTask.Core.TaskOrchestrationDispatcher.<OnProcessWorkItemAsync>d__20.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at DurableTask.Core.TaskOrchestrationDispatcher.<OnProcessWorkItemSessionAsync>d__19.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at DurableTask.Core.WorkItemDispatcher`1.<ProcessWorkItemAsync>d__34.MoveNext()
Inner Exception: 

The relevant part is this:

StorageException : Element 1 in the batch returned an unexpected response code.

This tells me that something in the DurableTask.AzureStorage package is failing to append rows to your History table. I wasn't able to find much more than this, though the issue may be one of the items described here.

Are there any other hints you can provide that might help explain this? For example, does your parent orchestration catch the sub-orchestration failure and schedule some additional work? If so, could it potentially be a lot of work? I don't think this code takes the 4 MB batch limit into consideration, and I wonder if this could be the root cause.

I've separately opened a couple of related issues based on this investigation.

pluralmonad commented 6 years ago

Depending on how long those poison messages were hanging around, it very well could have been an oversized-message issue. The job was initially implemented with activities returning data and passing it to the next step (likely over 4 MB at certain steps); I refactored that after it appeared the job host was writing the data to storage between every step.
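The shape of that refactoring, roughly (a sketch only; the activity name, container name, and FetchContactsFromApiAsync helper are placeholders):

using System;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

public static class FetchContactsActivity
{
    // Sketch: instead of returning a multi-MB payload from an activity (which the
    // framework persists to the task hub History table between steps), write the
    // payload to blob storage and return only the blob URI to the orchestrator.
    [FunctionName("Vds-FetchContacts")] // placeholder activity name
    public static async Task<string> Run([ActivityTrigger] string query)
    {
        string largeJson = await FetchContactsFromApiAsync(query); // placeholder helper

        var account = CloudStorageAccount.Parse(
            Environment.GetEnvironmentVariable("AzureWebJobsStorage"));
        var container = account.CreateCloudBlobClient().GetContainerReference("vds-payloads");
        await container.CreateIfNotExistsAsync();

        var blob = container.GetBlockBlobReference($"contacts-{Guid.NewGuid():N}.json");
        await blob.UploadTextAsync(largeJson);

        // The orchestrator passes this small URI string to the next activity
        // instead of the payload itself.
        return blob.Uri.ToString();
    }

    private static Task<string> FetchContactsFromApiAsync(string query)
    {
        // Placeholder for the real 3rd-party API call.
        throw new NotImplementedException();
    }
}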

cgillum commented 6 years ago

I have made a fix for the 4 MB case (https://github.com/Azure/azure-functions-durable-extension/issues/339) and have also added a bunch of extra tracing to make it easier to diagnose these kinds of issues. Closing this for now. Next week we'll be releasing v1.5.0, which will contain these fixes. Please let us know if you still run into these problems when you upgrade to v1.5.0.