Azure / azure-storage-net

Microsoft Azure Storage Libraries for .NET
Apache License 2.0

Deadlock in GetMessage* APIs from Azure Storage Queue #890

Closed by ericsuhong 4 years ago

ericsuhong commented 5 years ago

Which service (blob, file, queue, table) does this issue concern?

queue

Which version of the SDK was used?

Microsoft.Azure.Storage.Queue 10.0.3; the issue is also present in 9.4.2

Which platform are you using? (ex: .NET Core 2.1)

.NET Core 2.2

What problem was encountered?

We have code in which multiple role instances read from the same queue simultaneously.

Each role instance creates a single CloudQueue instance and reads from the queue on a single thread, similar to the following code:

CloudQueue queue = new CloudQueue(...);
while (true) {
    Console.WriteLine("Before get message");
    var message = queue.GetMessage(...);
    Console.WriteLine("After get message");

    // process the message

    queue.DeleteMessage(message, ...);
}

How can we reproduce the problem in the simplest way?

Run the above code on many role instances (~100 instances). Some role instances will suddenly hang: they print "Before get message" and then never make progress.

The same issue occurs with all GetMessage* API flavors (GetMessages, GetMessageAsync, GetMessagesAsync).

Have you found a mitigation/solution?

We found that the deadlock does not occur with the WindowsAzure.Storage NuGet package, so this is a regression introduced by the move to Microsoft.Azure.Storage.Queue.

wadekb commented 5 years ago

Not sure if this is related, but I'm seeing a similar situation with only one queue reader:

Console.WriteLine("Before getting messages");
var messages = await _queue.GetMessagesAsync(32, TimeSpan.FromMinutes(5), null, null);
Console.WriteLine("After getting messages");

My code sits on the await call, then after around 15 minutes throws:

Microsoft.Azure.Storage.StorageException: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
...
ErrorCode:AuthenticationFailed

After this exception occurs, everything recovers and the reader starts processing messages again.

Also, if I kill the service while it is stuck on the await call and restart it (before the 15 minutes have passed), it processes normally.

This happens randomly: it can work fine for a day and then hang, hang again an hour later, and then be fine for another day.

wadekb commented 5 years ago

Further to the above, I've changed my code to:

var messages = await _pingsQueue.GetMessagesAsync(
  32, 
  TimeSpan.FromMinutes(5), 
  new QueueRequestOptions() {
    RetryPolicy = new Microsoft.Azure.Storage.RetryPolicies.NoRetry(),
    MaximumExecutionTime = TimeSpan.FromSeconds(60)
  }, 
  null);

I no longer get the authentication error. Instead I get:

Microsoft.Azure.Storage.StorageException: The client could not finish the operation within specified timeout.

The call to GetMessagesAsync generally takes around 100 ms, so I'm not sure what causes it to run for the full 60 seconds before timing out.

cgillum commented 4 years ago

We are seeing similar issues in the GetMessage* APIs and also in the ListBlobs APIs. We are also on .NET Core 2.2, but with WindowsAzure.Storage v9.3.3.

Memory dump callstacks are included in the linked issue: https://github.com/Azure/azure-functions-durable-extension/issues/961

ManfredLange commented 4 years ago

Problem

CloudQueue.GetMessage() works at times, then blocks and doesn't return. The program is single-threaded.

Runtime Environment

Update 14 Oct 2019

To diagnose further we created a simple console program that does the following:

Here are our findings:

Recommendation: If you encounter blocking calls against Azure Storage queues or blobs (e.g. GetMessage(), AddMessage(), DeleteMessage()), check whether you are running the old Azure Storage Emulator. If so, try the new open-source Azurite emulator.

References:

Related keywords (to make it easier for search engines): hangs - blocking call - deadlock

brobichaud commented 4 years ago

We are also seeing Azure Storage Queue calls hang using Microsoft.Azure.Storage.Queue v11.1.2. We're seeing it in an 8-year-old Worker service that had been rock solid until we migrated to this NuGet package from the older WindowsAzure.Storage v8.6 package. Now we see periodic hangs on simple methods like GetMessages() or DeleteMessage().

We only see this in our production systems, which process high volumes of messages (approximately 1 million per day), but the hang seems random. It can process a couple of million messages before hanging; then I restart my container and it hangs after just tens of thousands, then it goes back to running for a million.

This seems like a pretty fundamental bug that ought to be a very high priority, yet this dates back a full 8 months with no resolution. What gives?

brobichaud commented 4 years ago

After some experimentation we have found that the problem goes away when using the async versions of the queue methods that accept a CancellationToken. We had been using the synchronous calls for years, and in this NuGet package those sync calls were causing issues, while the async calls (only the overloads that accept a cancellation token) do not exhibit the problem.
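
A minimal sketch of what that workaround might look like, applied to the read loop from the original report. The class and method names (QueueReader, RunAsync), the 60-second per-call budget, and the 1-second back-off are assumptions made for illustration, not details from this thread. How a cancelled call surfaces (OperationCanceledException or a wrapped StorageException) is not confirmed here, so the sketch lets any such exception propagate to the caller.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Storage.Queue;

public static class QueueReader
{
    // Hypothetical helper: polls a queue using only the async overloads
    // that accept a CancellationToken.
    public static async Task RunAsync(CloudQueue queue, CancellationToken shutdown)
    {
        while (!shutdown.IsCancellationRequested)
        {
            // Link a per-iteration token to the shutdown token so that a stuck
            // call is cancelled after a fixed budget (60 s is an assumed value).
            using (var perCall = CancellationTokenSource.CreateLinkedTokenSource(shutdown))
            {
                perCall.CancelAfter(TimeSpan.FromSeconds(60));

                CloudQueueMessage message = await queue.GetMessageAsync(perCall.Token);
                if (message == null)
                {
                    // The queue is empty; wait briefly before polling again.
                    await Task.Delay(TimeSpan.FromSeconds(1), shutdown);
                    continue;
                }

                // process the message

                await queue.DeleteMessageAsync(message, perCall.Token);
            }
        }
    }
}
```

The linked token source means that either a shutdown request or the per-call budget cancels the pending call, so a caller can decide whether to retry or stop when an exception comes back.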

I still feel like this is a hugely destabilizing bug that ought to be a high priority. Anyone walking up to Azure and using this nuget will likely walk away from Azure thinking the storage system is unreliable.

kasobol-msft commented 4 years ago

This has been addressed in 11.1.4.