Azure / durabletask

Durable Task Framework allows users to write long-running persistent workflows in C# using the async/await capabilities.
Apache License 2.0

Durable function stopped but with status Running - Split brain detected error #1046

Closed: potomato closed this issue 3 weeks ago

potomato commented 8 months ago

I use a small number of durable functions for frequent (per-minute) singleton jobs. They are launched by a regular timer-triggered function, and each orchestration function invokes one worker function.

Periodically we find that the orchestration and worker functions have stopped being invoked, because the timer-triggered function receives a Running status when it queries the durable function, even though the orchestration hasn't actually run for (potentially) days.
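For context, this is roughly what the launcher looks like (a simplified sketch; we're on the .NET isolated model, and the orchestrator name and instance id here are made up):

```csharp
// Simplified launcher sketch (isolated worker model); names are illustrative.
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;
using Microsoft.DurableTask.Client;

public static class SingletonLauncher
{
    [Function("LaunchSingletonJob")]
    public static async Task Run(
        [TimerTrigger("0 * * * * *")] TimerInfo timer,   // once per minute
        [DurableClient] DurableTaskClient client)
    {
        const string instanceId = "singleton-job";        // fixed id keeps it a singleton

        // Ask Durable Functions (ultimately the Instances table) for the current status.
        OrchestrationMetadata? metadata = await client.GetInstanceAsync(instanceId);

        // Only start a new run when the previous one has reached a terminal state.
        if (metadata is null ||
            metadata.RuntimeStatus is OrchestrationRuntimeStatus.Completed
                or OrchestrationRuntimeStatus.Failed
                or OrchestrationRuntimeStatus.Terminated)
        {
            await client.ScheduleNewOrchestrationInstanceAsync(
                "SingletonJobOrchestrator",
                options: new StartOrchestrationOptions(instanceId));
        }
        // When the status is stuck at "Running" (the problem in this issue),
        // this branch never executes and the job silently stops being scheduled.
    }
}
```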

I have captured and investigated App Insights logs, which include "Split brain detected: Another worker already updated the history for this instance - the 4 history event result(s) will be discarded". It seems the Completed status was discarded.

I don't believe another worker was running, though. There was only ever one of each of HostInstanceId, InvocationId and ProcessId in the logs. The error seems to have come from an Azure Storage "precondition not met" response, i.e. a conditional header that wasn't satisfied.

I've attached a snippet of logs from the incident. I have more detail if required.

What exactly has happened to cause this and what can I look for in logs, or what actions can I take to mitigate this? Currently we have to go and clean up Storage Tables to make it work again. DurableFailureLogsForGH.csv

Thanks.

cgillum commented 8 months ago

I was working on an investigation recently where I saw very similar symptoms. An orchestration got stuck in a "Running" state and the root cause seems to be an unexpected "split brain" warning resulting in an ExecutionCompleted history event being discarded. It also seemed like a potentially bogus split-brain warning because there was only a single worker running.

My theory at this point is that some error was returned from Azure Storage that was miscategorized as an eTag violation. I didn't have access to the Azure Storage logs to confirm what the actual error was. By chance, did this happen to you recently enough that you're able to get the Azure Storage logs and see what the error was?
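For context on what a genuine eTag violation looks like: the table update is conditional on an If-Match eTag, and a mismatch comes back as an HTTP 412. A generic illustration using Azure.Data.Tables (not necessarily the exact client or code path DurableTask.AzureStorage uses; the table and entity names are made up):

```csharp
// Generic illustration of an eTag-guarded table update and the 412 it produces
// on a mismatch. Table/entity names are made up; this is not the DTFx code path.
using System.Threading.Tasks;
using Azure;
using Azure.Data.Tables;

var table = new TableClient("<storage-connection-string>", "TestHubNameHistory");
TableEntity entity = await table.GetEntityAsync<TableEntity>("instance-id", "row-key");

entity["CustomStatus"] = "Completed";
try
{
    // Succeeds only if the row hasn't changed since we read it (If-Match: <etag>).
    await table.UpdateEntityAsync(entity, entity.ETag, TableUpdateMode.Replace);
}
catch (RequestFailedException ex) when (ex.Status == 412)
{
    // 412 Precondition Failed: another writer updated the row first. This is the
    // condition that gets surfaced as the "split brain detected" warning; the open
    // question is whether some other storage error is being bucketed here too.
}
```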

cgillum commented 8 months ago

Looking at your logs, regardless of whether the warning was categorized correctly or not, the fundamental issue seems to be that the History table was updated to show that the orchestration completed, but the Instances table was never updated due to the error. That leaves this orchestration in an inconsistent state, since we aren't able to run a transaction across these two tables. We don't correctly recover from this because the worker only considers the runtime status in the History table when deciding whether to discard work items and doesn't consider the Instances table.

This particular case is odd because the History update operation succeeded on the Azure Storage side but was reported as a failure on the client side, preventing us from updating the Instances table with the Completed status (the Instances table is what we use for status query operations). The Azure Storage logs could help here because we'd want to confirm whether the operation actually succeeded or failed from the storage API perspective. That would at least let us know whether there's a problem in Azure Storage itself or in the Azure Storage SDK we're using.

Regardless, I think there is a behavior we need to fix on our side: double-check the status of an orchestration in the Instances table before deciding to discard a work item. I could see other corner-case failure conditions where this would be beneficial.
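Very roughly, the kind of guard I mean, sketched with hypothetical stand-in types and helpers (this is not the actual DurableTask.AzureStorage code):

```csharp
// Hypothetical sketch of the extra check; the types and helpers here are
// stand-ins, not the real DurableTask.AzureStorage implementation.
using System.Threading.Tasks;

public enum RuntimeStatus { Running, Completed, Failed, Terminated }

public class WorkItemGuardSketch
{
    // Stand-in lookups/updates for the two tables.
    protected virtual Task<RuntimeStatus> ReadHistoryStatusAsync(string instanceId)
        => Task.FromResult(RuntimeStatus.Completed);
    protected virtual Task<RuntimeStatus> ReadInstancesStatusAsync(string instanceId)
        => Task.FromResult(RuntimeStatus.Running);
    protected virtual Task RepairInstancesStatusAsync(string instanceId, RuntimeStatus status)
        => Task.CompletedTask;

    // Called when an eTag violation makes us think another worker owns the instance.
    public async Task HandleEtagViolationAsync(string instanceId)
    {
        RuntimeStatus history = await ReadHistoryStatusAsync(instanceId);
        RuntimeStatus instances = await ReadInstancesStatusAsync(instanceId);

        // Today only the History table is consulted before discarding the work item.
        // The proposal: if History says Completed but Instances still says Running,
        // repair the Instances row instead of leaving the orchestration stuck.
        if (history == RuntimeStatus.Completed && instances == RuntimeStatus.Running)
        {
            await RepairInstancesStatusAsync(instanceId, RuntimeStatus.Completed);
        }

        // ...then discard the stale work item as before.
    }
}
```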

potomato commented 8 months ago

Thanks very much for your replies. Your comments sound promising.

Unfortunately I don't think we have diagnostic logging on the storage account where this happened. I'll check on Monday morning (it's Friday night here) but I don't think we'll get the detail we'd like for this one. I will look at environments that have logging to see if we have any Split brain errors, and if so will report back with storage logs.

But as you say, if there's a way to handle the error better and keep state more consistent then we'll be better off.

Thanks for your work on Durable functions, and also Dapr!

potomato commented 8 months ago

Hi, I checked across all our subscriptions and unfortunately we don't have any environment where storage logs are turned on and we have also had a Split Brain trace.

If it happens in an environment with logging turned on I can capture it then.

In the meantime is there anything I can do to help progress the fix?

Thanks.

Iamjithender commented 2 months ago

Hi @potomato, @cgillum, we are also facing a similar issue: for our periodic workflows, the History table shows the state as ExecutionCompleted, but the Instances table status is still Running for a particular instance. Because of this inconsistency, our workflows are not getting queued again, since the Instances table status for that instance is still Running.

How can we resolve this, and what is the reason behind these inconsistencies across the two tables? Did you release any fix for this issue, @cgillum (the behavior change you mentioned on the DTFx side to double-check both tables)?

potomato commented 2 months ago

Hi @cgillum,

We've had another case of a durable function hanging today, this time in production where we have storage logs turned on. It should run every 30 seconds, but you can see a big gap where it stopped and the ops team had to restart it by deleting the 'Running' Storage Table record:

"timestamp [UTC]",success,resultCode,duration,sdkVersion "9/16/2024, 9:09:37.104 AM",True,0,"14.7332","azurefunctions: 4.634.1.22670" "9/16/2024, 9:10:03.740 AM",True,0,"13.9809","azurefunctions: 4.634.1.22670" "9/16/2024, 9:10:59.999 AM",True,0,"19.3236","azurefunctions: 4.634.1.22670" "9/16/2024, 9:44:02.164 AM",True,0,"96.2632","azurefunctions: 4.634.1.22670" "9/16/2024, 9:44:49.148 AM",True,0,"20.5686","azurefunctions: 4.634.1.22670" "9/16/2024, 9:45:03.271 AM",True,0,"14.0505","azurefunctions: 4.634.1.22670"

I saw these two exceptions reported just after the last run before the gap:

"timestamp [UTC]",problemId,assembly,innermostMessage,sdkVersion,itemCount "9/16/2024, 9:11:00.119 AM","DurableTask.Core.Exceptions.OrchestrationAlreadyExistsException at DurableTask.AzureStorage.AzureStorageOrchestrationService+d113.MoveNext","DurableTask.AzureStorage, Version=1.17.0.0, Culture=neutral, PublicKeyToken=d53979610a6e89dd","An Orchestration instance with the status Running already exists.","azurefunctions: 4.634.1.22670",2 "9/16/2024, 9:11:00.121 AM","DurableTask.Core.Exceptions.OrchestrationAlreadyExistsException at DurableTask.AzureStorage.AzureStorageOrchestrationService+d113.MoveNext","DurableTask.AzureStorage, Version=1.17.0.0, Culture=neutral, PublicKeyToken=d53979610a6e89dd","An Orchestration instance with the status Running already exists.","azurefunctions: 4.634.1.22670",2

I've looked in the storage account and have StorageBlobLogs records for various blobs under ..-applease, ..-leases/, azure-webjobs-hosts/ etc., so I can dig into logs and answer any queries you have (as per your message above).

These logs won't stay around forever - I think I can pull them for up to 30 days.

Thanks.

cgillum commented 2 months ago

We've received at least one other report of this through the official Azure Functions support channel and the team has released a small workaround for cases like this. It's available in the latest release of the DurableTask.AzureStorage package: https://github.com/Azure/durabletask/releases/tag/durabletask.azurestorage-v1.17.5.

Basically, if you set AzureStorageOrchestrationServiceSettings.AllowReplayingTerminalInstances to true, then you should be able to terminate the orchestration, or send an arbitrary external event to it, to bring it to a proper completion status. You might be able to recreate the instance directly too, though I don't think that scenario was tested.
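At the DTFx level that looks roughly like this (names are illustrative and the storage connection configuration is omitted; treat it as a sketch rather than a tested recipe):

```csharp
// Sketch: enable the new setting, then terminate the stuck instance so it
// reaches a proper terminal status. Names are illustrative; storage connection
// configuration is omitted.
using DurableTask.AzureStorage;
using DurableTask.Core;

var settings = new AzureStorageOrchestrationServiceSettings
{
    TaskHubName = "MyTaskHub",              // illustrative task hub name
    AllowReplayingTerminalInstances = true, // opt-in behavior from 1.17.5
    // ...plus whatever storage connection configuration your hub already uses
};

var service = new AzureStorageOrchestrationService(settings);
var client = new TaskHubClient(service);

// Terminate the instance whose Instances-table status is stuck at "Running".
var stuckInstance = new OrchestrationInstance { InstanceId = "singleton-job" };
await client.TerminateInstanceAsync(stuckInstance, "Repairing inconsistent status");

// Alternatively, raising an arbitrary external event should also drive it to completion:
// await client.RaiseEventAsync(stuckInstance, "AnyEvent", eventData: null);
```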

Adding @davidmrdavid for visibility.

potomato commented 1 month ago

Thanks for the update, it sounds promising.

I've been looking at various repos and trying to find out how I'd set the AllowReplayingTerminalInstances property to true for out-of-process durable functions. I presume an environment variable with the right prefix will do the trick but I can't find where it gets bound to config.

Can you advise how I can set this to true for out-of-process Azure durable functions?

davidmrdavid commented 1 month ago

I don't think we've made a release of .NET isolated that supports this yet. Sorry, I thought this was a DurableTask issue, but it seems this is DF isolated.

What I need to do to unlock this in .NET isolated is simply to make a release of the worker extension that transitively depends on the release Chris linked above. I'm looking to make one such release in the coming days.

@potomato - could you ping me later on Thursday this week for a status update on this? I'll be looking to make progress on this release.

In the meantime, the procedure to enable this setting is to set it in your host.json, nested inside extensions -> durableTask -> storageProvider. You should be able to try this with the latest WebJobs extension release in the in-process model, and you'll soon be able to set this in the out-of-process model using the same host.json approach.
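Concretely, something along these lines in host.json (a sketch; the casing follows the usual host.json conventions, so double-check against the docs once they're updated):

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "storageProvider": {
        "allowReplayingTerminalInstances": true
      }
    }
  }
}
```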

davidmrdavid commented 1 month ago

@potomato: the worker extension release is out: https://github.com/Azure/azure-functions-durable-extension/releases/tag/v1.1.6-worker-extension . You should just need to reference this extension in your isolated app and set the host.json config as noted, and it should work. Please let me know if you need more guidance; I realize the instructions for the host.json part are a bit handwavy at this time (a docs update is still pending).

potomato commented 1 month ago

Hi @davidmrdavid, that's great - thanks very much (and it's not even Thursday yet!)

I've upgraded the reference and will set the property to true in non-prod environments and monitor for a while before promoting to production. I'm actually setting the flag via the env var AzureFunctionsJobHost__extensions__durableTask__storageProvider__allowReplayingTerminalInstances - I can tell it's parsing it correctly because if I put nonsense in there it throws that it can't convert to bool.

How easy would it be to run queries regularly to see if the original scenario is being handled? I guess there are a lot of potential causes. If it never happens in non-prod and everything works as normal, I'll promote and hopefully it'll just never happen again, but it would be great to see the fix working. Will the OrchestrationAlreadyExistsExceptions I listed above still get logged?

Thanks again.

davidmrdavid commented 3 weeks ago

Sorry for the delayed response. Good idea on checking the presence of the feature through configuration parsing; I'll remember that trick. I don't think the OrchestrationAlreadyExistsException would get logged anymore, since the setting prevents the error in the first place.

I'm not against emitting a log here (and we're open to open-source contributions; here's the relevant PR for guidance: https://github.com/Azure/durabletask/pull/1159/files), but I worry it could get noisy, and we already have a bit of a challenge with being too chatty in the logs.

How has the feature been working for you? Did you end up promoting it to prod, or has the lack of a log been a blocker?

In any case, I'll be closing this thread since the release was made, but we can continue the conversation nonetheless.