Azure / azure-functions-durable-extension

Durable Task Framework extension for Azure Functions
MIT License

Update Storage SDKs to fix deadlock #1403

Open mhoeger opened 4 years ago

mhoeger commented 4 years ago

Per @ConnorMcMahon, there is a deadlock issue with the Azure Storage SDKs. We found that these deadlocks could leave an application stuck to the point where it would fail to process Durable Functions triggers. Because of this, whenever any storage operation takes longer than 2 minutes, we shut down the worker with the following exception:

The operation * did not complete in '00:02:00'. Terminating the process to mitigate potential deadlock.

The only real way around this is to update Durable Functions to use a newer version of the Storage SDK. This is a rather large undertaking, and we have no timeframe for when that will happen.
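
For context on what that mitigation looks like, below is a simplified sketch of the watchdog behavior. It is not the actual DurableTask.AzureStorage.TimeoutHandler implementation; the class and method names are invented, and only the 2-minute timeout and the FailFast message mirror the extension's real behavior.

```csharp
using System;
using System.Threading.Tasks;

// Simplified sketch of the "time out, then FailFast" mitigation described above.
static class StorageTimeoutSketch
{
    static readonly TimeSpan Timeout = TimeSpan.FromMinutes(2);

    public static async Task<T> ExecuteWithTimeoutAsync<T>(string operationName, Func<Task<T>> operation)
    {
        Task<T> work = operation();

        // If the storage call doesn't finish within the timeout, assume it is
        // deadlocked and kill the process so the platform can restart it,
        // instead of letting the host hang indefinitely.
        Task completed = await Task.WhenAny(work, Task.Delay(Timeout));
        if (completed != work)
        {
            Environment.FailFast(
                $"The operation '{operationName}' did not complete in '{Timeout}'. " +
                "Terminating the process to mitigate potential deadlock.");
        }

        return await work;
    }
}
```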

vantascloud commented 4 years ago

Ah, this is what I'm suddenly experiencing.

devu1997 commented 4 years ago

Is this the cause of the error below, which I am experiencing intermittently? The error causes the app to stop mid-run, so is there any workaround until the new version of durable-functions is released?

FailFast:
The operation 'ListLeases' with id '***' did not complete in '00:02:00'. Terminating the process to mitigate potential deadlock.
   at System.Environment.FailFast(System.String, System.Exception)
   at DurableTask.AzureStorage.TimeoutHandler+<ExecuteWithTimeout>d__1`1[[System.__Canon, System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].MoveNext()
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.__Canon, System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[DurableTask.AzureStorage.TimeoutHandler+<ExecuteWithTimeout>d__1`1[[System.__Canon, System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]], DurableTask.AzureStorage, Version=1.7.2.0, Culture=neutral, PublicKeyToken=d53979610a6e89dd]].MoveNext()
   at System.Threading.Tasks.AwaitTaskContinuation.RunOrScheduleAction(System.Runtime.CompilerServices.IAsyncStateMachineBox, Boolean)
   at System.Threading.Tasks.Task.RunContinuations(System.Object)
   at System.Threading.Tasks.Task`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].TrySetResult(System.Threading.Tasks.VoidTaskResult)
   at System.Threading.Tasks.Task+DelayPromise.Complete()
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
   at System.Threading.TimerQueueTimer.CallCallback()
   at System.Threading.TimerQueueTimer.Fire()
   at System.Threading.TimerQueue.FireNextTimers()
Abort trap: 6
cgillum commented 4 years ago

@devu1997 Yes, we added the FailFast you're seeing so that the app can auto-recover from the Storage SDK deadlock (without this, the process would hang indefinitely, or until the app is restarted).

Unfortunately there is no workaround. In our experience, this happens most often when the network connection is slow. It might not be your issue, but make sure your Azure Storage account is in the same region as your function app.

devu1997 commented 4 years ago

If the function app stops because of this, will it restart on its own, or do we have to restart it manually?

cgillum commented 4 years ago

@devu1997 yes, it will restart on its own without you needing to do anything.

cgillum commented 3 years ago

FYI @ConnorMcMahon we may want to consider prioritizing this soon. It may also be a blocker for Azurite v3 support.

panmanphil commented 3 years ago

We are now seeing a restart every 5 minutes, 24/7, across all of our environments, with the same ListLeases error shown above. This happens even when there is no other user-initiated activity. We're at a loss about what to do right now, and why it happens every 5 minutes is also a mystery. On our production system, we see the spikes alternate between instances.

What is the deadlock with? What competing processes are forcing it? In our case, we have an App Service plan hosting 2 orchestrated functions, each with its own activities but sharing the same code base.

cgillum commented 3 years ago

@panmanphil the fact that it's every 5 minutes sounds like it's a different problem, perhaps triggered by a poison message in one of your queues. I suggest opening a new issue so we can investigate your case separately.

panmanphil commented 3 years ago

In a test environment where the every-5-minutes symptom is NOT happening, I can still get that error in the event log. I had previously run func purge-history in this environment. I have also tried disabling all but one of the functions, but that made no difference. We are separately going to open an incident with support.

panmanphil commented 3 years ago

I have more information now. The every-5-minutes timing seems to be a coincidence: that's roughly how long it took a very CPU-intensive function to make ListLeases fail, the host to crash, the very same function to restart automatically, and the cycle to repeat. We got our zombie fixed, but the problem remains: we cannot run our function on sufficiently large sets of data without the deadlock mitigation ("The operation 'ListLeases' with id '***' did not complete in '00:02:00'. Terminating the process to mitigate potential deadlock.") kicking in.

I was very surprised to see in tracing that lease renewal is almost constant, with over 6K traces like this in an hour: Starting lease renewal with token xxxxxx630-f7cf-4fff-ab61-xxxxxxxx. Could it simply be that the function starves the CPU and makes this renewal fail in a call to ListLeases? We are working to reduce the CPU use of this function, but the reality is that we moved this work into a function precisely so we could isolate its high CPU and memory use. This is year 2 with this function, and it worked very well for us on large datasets over the last year. The FailFast intended to prevent a deadlock may prevent us from using this at all.

sebastianburckhardt commented 3 years ago

I have been investigating the storage SDK deadlocks and found that one specific circumstance that seems to trigger them is running a tight spin loop on a thread after an await (azure/azure-sdk-for-net#16825). Perhaps the same happens for running other CPU-intensive work.

My current workaround is to insert Thread.Sleep(10) into the spin loop; that seems to prevent the deadlock for whatever reason. Perhaps putting one of those before your CPU-intensive function could help? This is just a guess, of course; as of now I have no idea why the deadlock actually happens.
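
A minimal sketch of that workaround, assuming a CPU-bound loop that runs on a thread pool thread after an await (the work inside the loop and all names are hypothetical; only the Thread.Sleep(10) call reflects the suggestion above):

```csharp
using System.Threading;
using System.Threading.Tasks;

class SpinLoopWorkaroundSketch
{
    public static async Task<long> CrunchAsync(int iterations)
    {
        await Task.Yield(); // continue on a thread pool thread, as in the linked repro

        long acc = 0;
        for (int i = 0; i < iterations; i++)
        {
            // Stand-in for the real CPU-intensive work.
            for (int j = 0; j < 100_000; j++)
            {
                acc += (acc ^ j) % 97;
            }

            // Workaround: briefly yield the thread so storage SDK callbacks
            // and timers are not starved by the tight loop.
            Thread.Sleep(10);
        }

        return acc;
    }
}
```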

panmanphil commented 3 years ago

I wondered if that might help. We are processing business tax forms, and there definitely are some heavy spin loops where we could try this, or even try it every 500th pass or something like that. Kind of not loving the idea of Thread.Sleep in 2020! But as a mitigation while this gets worked out, it seems worth trying. I'll report back with what we find.

We are also implementing a check to see whether we are in a replay and to exit gracefully if we are, so the host doesn't crash over and over again.
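
For anyone doing something similar, here is a minimal sketch of a replay check using the Durable Functions 2.x in-process API. context.IsReplaying is the real property on IDurableOrchestrationContext; the function names and what 'exit gracefully' should actually do are application-specific and purely illustrative.

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class ImportOrchestratorSketch
{
    [FunctionName("ImportOrchestrator")]
    public static async Task RunOrchestrator(
        [OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        // IsReplaying is true while the framework rebuilds orchestration state
        // from history, e.g. after the host was restarted by the FailFast above.
        if (!context.IsReplaying)
        {
            // Side effects that should run only once per logical execution
            // (logging, alerting, bailing out of a known-bad run, etc.) go here.
        }

        // The CPU-heavy work itself belongs in an activity function.
        await context.CallActivityAsync("ImportAndConvertFiles", null);
    }
}
```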

panmanphil commented 3 years ago

I tried various versions of adding Thread.Sleep, like once every 250 iterations for 100 ms, or once every 50 iterations for 50 ms. No joy here; still getting FailFast on either ListLeases or GetMessages. We found that with a different type of file import (our function imports and converts data files), it may fail at a different stage in the process, but the host still crashes. The thread sleeps are of course slowing down the process too.

panmanphil commented 3 years ago

We continue to struggle with this. Among the many things we've tried: Thread.Sleep, changing the storage account type from GRS to LRS, removing some configuration for unused file sharing, splitting this function into multiple activities, and many tweaks to reduce CPU consumption. We are working with support on this as well. One question I have, probably because I'm missing some understanding of how durable functions work: with tracing on, we see thousands of messages like the one below, nearly 3K in half an hour.

xxxxxxxxxxxxxxxxxxxxxxxxx-applease: Lease renewal with token 573866ef-0d75-a11f-903e-4494417f702d succeeded

These leases seem to be for the task hub. Is it really necessary to constantly renew them? ListLeases is the most common call that times out and forces the FailFast.

ConnorMcMahon commented 3 years ago

@panmanphil,

We plan on tackling this in the near future, as we know it is causing a lot of pain. That being said, this is a big undertaking, so I wouldn't expect it until early 2021.

As for the lease renewal, we do perform a lot of blob lease renewals. We have to, because we use blob leases to ensure that each orchestration partition stays on at most one VM at a time. You can read more about our partition logic here. The rough equation for the number of blob lease operations is Num_VMs * PartitionCount * 6 * 60 per hour. This sounds like a lot, but these operations are fairly cheap in terms of both cost and performance, and they are functionally required for the product to work. We could perhaps reduce them by a factor of 2 to 3 by increasing the lease renewal interval to 20 or 30 seconds, but that has various consequences we would have to explore.
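
To make that equation concrete (assuming the default PartitionCount of 4; the actual value for this app isn't stated here): an app scaled out to 4 VMs would perform roughly 4 * 4 * 6 * 60 = 5,760 blob lease operations per hour, which is in the same ballpark as the ~6K renewal traces per hour reported earlier in this thread.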

The real question is why they are appearing in your logs, as these are Information level (if not Trace level). That looks like it may be related to #1561, which I haven't had a chance to root cause yet.

panmanphil commented 3 years ago

As a follow-up for anybody running into this before the final fix is in place: we got our systems running with a combination of steps. First, we spent a lot of time tuning the algorithm to reduce CPU use: removing unnecessary work in tight loops, switching to a lighter hashing algorithm, and more. Good to do anyway; time well spent. Then, taking back what I wrote earlier, we found that Thread.Sleep does actually help, but we needed to be more aggressive: in our case, a full 2-second sleep every 1000 iterations. And finally, we needed to move to a higher service plan with more CPU. You can spike to 100% CPU, but not with multiple instances running simultaneously. This is the biggest bummer, because normal cost-effective horizontal scaling can't help you here, but at least we are running.
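
For anyone reaching for the same mitigation, a minimal sketch of the "sleep every N iterations" pattern is below. The per-row work and all names are hypothetical; only the 1000-iteration interval and the 2-second sleep are the figures that worked in our case, and the right values will depend on your workload.

```csharp
using System;
using System.Threading;

static class PeriodicYieldSketch
{
    // Pause the loop's thread every N iterations so storage SDK callbacks can run.
    private const int IterationsBetweenSleeps = 1000;
    private static readonly TimeSpan SleepDuration = TimeSpan.FromSeconds(2);

    public static void ProcessRows(int rowCount)
    {
        for (int i = 1; i <= rowCount; i++)
        {
            // Stand-in for the real per-row hashing / conversion work.
            ComputeRowHash(i);

            if (i % IterationsBetweenSleeps == 0)
            {
                Thread.Sleep(SleepDuration);
            }
        }
    }

    private static int ComputeRowHash(int seed) => (seed * 397) ^ (seed >> 3);
}
```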

ConnorMcMahon commented 3 years ago

So this will be partially fixed in v2.4.1 by this PR: https://github.com/Azure/durabletask/pull/500. That change allows us to hit this deadlock a few times before restarting.

This makes the intermittent case far less likely, but where the deadlock is caused by high CPU consumption, it may happen consistently enough that this only delays the inevitable restart.

msebrenes commented 2 years ago

I see there's a partial fix for the bug, but the full fix hasn't been released yet. Is there any update on this?
I think Connor is no longer working at Microsoft, so I was wondering whether this case has been reassigned or can be picked up by another engineer.

cgillum commented 2 years ago

This issue is still on the team's backlog. The primary issue that tracks the storage SDK upgrade is here: https://github.com/Azure/durabletask/issues/516.

tom-configo commented 2 years ago

As a follow-up for anybody running into this before the final fix is in place. ... In our case it was a full 2 second sleep every 1000 iterations ...

@panmanphil In your case, roughly how often was every 1000 iterations (i.e., how often did you need to perform the 2-second sleep, in terms of a time interval)?

panmanphil commented 2 years ago

I've moved on to a new company and project and don't recall enough of the details. But the iterations were in a tight loop that, among other things, calculated a hash that drove up CPU use, so I'd say a short time interval.

kjlimxxxx commented 2 years ago

I just moved my function into a Docker container, and I found that when my function hits this error, the container is terminated with exit code 139 (SIGSEGV). Is this expected behavior?

lilyjma commented 9 months ago

Hi all, we recently released the Durable Functions extension v3 preview. This new major version uses the latest Azure Storage SDKs! (It also has huge cost benefits for users of the Azure Storage backend.) Please give it a try if you can and let us know if you're still seeing deadlock issues.

Please note that upgrading to DF extension v3 may be a breaking change for some .NET in-process users.

More details in this blog post