Azure / azure-sdk-for-net

This repository is for active development of the Azure SDK for .NET. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/dotnet/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-net.

Multiple messages in Azure Storage Queue contain duplicate message content; cannot find out what inserted the duplicate message #37014

Open pavoldecky opened 1 year ago

pavoldecky commented 1 year ago

Library name and version

Azure.Storage.Queues 12.14.0

Query/Question

Hello

I am facing strange behavior in Azure Storage Queue that results in a couple of messages containing duplicate content.

The application moves items from a source system to a target system. The current architecture is as follows:

  1. A Windows service running in a VM retrieves data from the source system and sends it to Service Bus in batches (.NET 6).
  2. An Azure Function processes each batch of items from Service Bus; each item is stored in Azure Table Storage and sent separately to an Azure Storage Queue.
  3. Another Windows service in a VM reads messages from the queue and ingresses the items into the target system. When ingress is done for a specific item, the corresponding queue message is deleted from the queue.

Every message in the Azure Storage Queue contains a single item with a unique id. I noticed that some messages in the queue contain duplicate item content: for example, one message with queue message id 1 contains the item with item id 1, and a second message with queue message id 2 contains the same item with item id 1.

The Azure Function sends data to the storage queue with the following call:

await QueueClient.SendMessageAsync(item.ToJson(), timeToLive: TimeSpan.FromSeconds(-1));

The Windows service reads messages from the queue with the following code:

await QueueClient.ReceiveMessagesAsync(maxMessages: 32, visibilityTimeout: TimeSpan.FromHours(1));

The Windows service deletes messages from the queue with this code:

await this.QueueClient.DeleteMessageAsync(MessageId, PopReceipt);
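Put together, the send/receive/delete flow looks roughly like the sketch below. This is only an illustration of the pattern described above, not our production code; queueClient, messageText and ingressItemAsync are simplified stand-ins (in the real service the message body comes from item.ToJson()).

using System;
using System.Threading.Tasks;
using Azure.Storage.Queues;
using Azure.Storage.Queues.Models;

public static class QueuePipelineSketch
{
    // Producer side: enqueue one item as JSON; TimeSpan.FromSeconds(-1) means the message never expires.
    public static Task SendAsync(QueueClient queueClient, string messageText) =>
        queueClient.SendMessageAsync(messageText, timeToLive: TimeSpan.FromSeconds(-1));

    // Consumer side: receive up to 32 messages, keep them invisible for 1 hour while they are processed,
    // and delete each one only after its item has been ingressed successfully.
    public static async Task ReceiveAndProcessAsync(QueueClient queueClient, Func<string, Task> ingressItemAsync)
    {
        QueueMessage[] messages = await queueClient.ReceiveMessagesAsync(
            maxMessages: 32, visibilityTimeout: TimeSpan.FromHours(1));

        foreach (QueueMessage message in messages)
        {
            await ingressItemAsync(message.MessageText);
            await queueClient.DeleteMessageAsync(message.MessageId, message.PopReceipt);
        }
    }
}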

During testing, I noticed this behavior:

  1. The Windows service receives, in one batch, 2 different queue messages containing the same item:

    Retrieve item from queue [migration-py6xs2lersgbjwluevzm] {"ItemId":"50613a9a-e081-f57c-dbd6-f854d947ffe1", "QueueMessageId":"04e02661-a977-403b-859f-f3d6da632a95"}
    Retrieve item from queue [migration-py6xs2lersgbjwluevzm] {"ItemId":"50613a9a-e081-f57c-dbd6-f854d947ffe1", "QueueMessageId":"46499ca0-e6ac-42e7-9cea-56c9a10f20da"}
  2. From the function logs in Application Insights, I can see that the message was sent to the storage queue only once, so this call should NOT have created duplicates.

  3. I enabled diagnostic settings for the storage queues, and logs are sent to Log Analytics.

  4. I ran the following Kusto query:

StorageQueueLogs
| where  ObjectKey  has '04e02661-a977-403b-859f-f3d6da632a95' or ObjectKey has '46499ca0-e6ac-42e7-9cea-56c9a10f20da'
| project  TimeGenerated, OperationName, StatusCode, DurationMs, ServerLatencyMs, ObjectKey

It returns this:

image

As you can see in the output, there is only one PutMessage operation, for the message with id 46499ca0-e6ac-42e7-9cea-56c9a10f20da. The second message I received from the queue, with id 04e02661-a977-403b-859f-f3d6da632a95, has no corresponding PutMessage operation, so I have no idea how it was inserted into the queue. The output also shows that both messages were deleted from the queue, so I assume both of them really existed in the queue. Also, that PutMessage operation took 2014 ms, while the rest of the put operations took less than 10 ms.

  5. During more digging in Log Analytics I noticed a couple of failed operations in the queue:
StorageQueueLogs
| where TimeGenerated > ago(12h) and StatusText !contains "Success"
| summarize count() by OperationName
| top 10 by count_ desc

image

Most of them are related to OperationTimedOut:

StorageQueueLogs
| where TimeGenerated > ago(12h) and StatusText !contains "Success"
| summarize count() by StatusText
| top 10 by count_ desc

image

When I started looking more deeply at what was happening with the problematic queue, I noticed 2 PutMessage operations with StatusText OperationTimedOut. What is very suspicious is that for the mentioned queue I have exactly 2 duplicate items.

StorageQueueLogs
| where Uri has 'migration-py6xs2lersgbjwluevzm' 
| where StatusCode == 500
| project  TimeGenerated, OperationName, StatusCode,StatusText, DurationMs, ServerLatencyMs, ObjectKey
| order by TimeGenerated asc 

image

According to the timestamps, the PutMessage operation for one of the duplicate items was executed at 2023-06-14T10:02:32.4264083Z, and both OperationTimedOut PutMessage operations were executed around 2023-06-14T10:02:29.5077103Z, i.e. before the PutMessage operation for message 46499ca0-e6ac-42e7-9cea-56c9a10f20da was executed.

So I would like some help with answers to the following questions.

  1. Is it possible that the call to insert the message into the queue from the Azure Function timed out at 2023-06-14T10:02:29.5077103Z, the retry logic in the Azure.Storage.Queues SDK tried the insert again, and the retry succeeded at 2023-06-14T10:02:32.4264083Z, creating the message with message id 46499ca0-e6ac-42e7-9cea-56c9a10f20da and its corresponding PutMessage log, while the first, failed insert operation was still running in the background and eventually also finished successfully, creating the second message with the duplicate content and message id 04e02661-a977-403b-859f-f3d6da632a95 but without a corresponding PutMessage log in Log Analytics?
  2. If the previous assumption is not correct, do you have any idea what can cause messages with duplicate content in an Azure Storage Queue, and how it is possible for a message to be inserted into the queue without a corresponding PutMessage log in Log Analytics?
  3. Do you know of anything else I can check to better understand what is happening inside the queue, and how I can find out what in the world inserted the message with id 04e02661-a977-403b-859f-f3d6da632a95?

For better insight: I inserted 966015 messages into multiple queues and found only 6 messages with duplicate content, so this probably happens only under heavier load on the queues.

According to this Kusto query, I can see plenty of long-running queue operations shortly before the incident, the longest with a latency of around 3 minutes:

let window = 5m;
let eventTime = todatetime('2023-06-14T10:02:32.4264083Z');
StorageQueueLogs
| where TimeGenerated between ((eventTime - window) .. (eventTime + window)) 
| top 30 by DurationMs desc
| project TimeGenerated, OperationName, DurationMs, ServerLatencyMs, ClientLatencyMs = DurationMs - ServerLatencyMs

image

Any help with this issue is really appreciated, thank you very much.

Environment

Windows services using .NET 6

github-actions[bot] commented 1 year ago

Thank you for your feedback. This has been routed to the support team for assistance.

github-actions[bot] commented 1 year ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @xgithubtriage.

pavoldecky commented 1 year ago

Hi

After some more digging into this issue and after adding more logs, I have more info to share.

First of all, the original problem occurred while we were using the following QueueClientOptions setup:

var queueClientOptions = new QueueClientOptions();
queueClientOptions.MessageEncoding = QueueMessageEncoding.None;  // raw, unencoded message bodies
queueClientOptions.Retry.Mode = RetryMode.Exponential;           // exponential back-off between retries
queueClientOptions.Retry.MaxRetries = 5;                         // retry failed requests up to 5 times

return new QueueServiceClient(config.StorageAccount, queueClientOptions);

I changed the QueueClientOptions setup as follows:

var queueClientOptions = new QueueClientOptions();
queueClientOptions.MessageEncoding = QueueMessageEncoding.None;
//queueClientOptions.Retry.Mode = RetryMode.Exponential;
queueClientOptions.Retry.MaxRetries = 0;                         // no SDK-level retries
queueClientOptions.Retry.Delay = TimeSpan.FromMilliseconds(0);

return new QueueServiceClient(config.StorageAccount, queueClientOptions);

As you can see, in the first example I was retrying any failed operation, while in the second example retries should be turned off.

I ran more tests and noticed the following behavior.

  1. For one call inserting a message into the Storage Queue, I received the following exception:

Failed to insert an item with ItemId [70be710c-1c3c-3018-eba5-caedc0b88921] to migration queue [migration-o2qtylpkmseg2kfhnjqr] NOTE: ItemId is our internal identification.

And here is the response from Azure:

Operation could not be completed within the specified time.
RequestId:a68f3c93-7003-0071-2192-a0952d000000
Time:2023-06-16T20:41:27.5213174Z
Status: 500 (Operation could not be completed within the specified time.)
ErrorCode: OperationTimedOut

Content:
<?xml version="1.0" encoding="utf-8"?><Error><Code>OperationTimedOut</Code><Message>Operation could not be completed within the specified time.
RequestId:a68f3c93-7003-0071-2192-a0952d000000
Time:2023-06-16T20:41:27.5213174Z</Message></Error>

Headers:
Server: Windows-Azure-Queue/1.0 Microsoft-HTTPAPI/2.0
x-ms-request-id: a68f3c93-7003-0071-2192-a0952d000000
x-ms-version: 2018-11-09
x-ms-error-code: OperationTimedOut
Date: Fri, 16 Jun 2023 20:41:26 GMT
Connection: close
Content-Length: 245
Content-Type: application/xml
  2. I can also see a log in StorageQueueLogs for the mentioned correlation:

image

  3. I would assume that if the PutMessage operation failed with 500 OperationTimedOut, the message would not be created in the queue, but that is not the case here.

  4. When receiving data from queue migration-o2qtylpkmseg2kfhnjqr for processing, I noticed that we received one message containing our own item with id 70be710c-1c3c-3018-eba5-caedc0b88921, which according to the first point was NOT inserted into the queue:

Retrieve item from queue [migration-o2qtylpkmseg2kfhnjqr]| {"ItemId":"70be710c-1c3c-3018-eba5-caedc0b88921", "MessageId":"5533d7b8-1d30-4bb1-beee-3589d051a895"}

  5. When I check StorageQueueLogs for the mentioned queue MessageId, again I cannot find any PutMessage operation, only the DeleteMessage we call when processing of a message from the queue is done. Since I cannot find any PutMessage operation for queue message id 5533d7b8-1d30-4bb1-beee-3589d051a895, I believe the message was created in the queue by the call with correlation id a68f3c93-7003-0071-2192-a0952d000000, even though that call received a 500 error.

image

So now I do not understand what is happening:

  1. Is it normal behavior that when we receive a 500 OperationTimedOut from the queue, the message is created in the queue anyway?
  2. Why is the OperationTimedOut received after 3 Ms?

This behavior explains why I had duplicate messages in the queue in my first question: the original failed request caused an automatic retry, but in the end both the original and the retried operation inserted a message into the queue.
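To make the producer-side failure mode concrete, here is a minimal, hedged sketch (not our actual code) of how such a failure surfaces to the caller: SendMessageAsync throws an Azure.RequestFailedException with Status 500 and ErrorCode "OperationTimedOut", and at that point the producer cannot know whether the message was committed, so the only safe interpretation is "possibly enqueued".

using System;
using System.Threading.Tasks;
using Azure;
using Azure.Storage.Queues;

public static class SendWithTimeoutHandling
{
    // Returns true when the send definitely succeeded; false when it timed out on the service side,
    // in which case the message may or may not have been committed anyway.
    public static async Task<bool> TrySendAsync(QueueClient queueClient, string messageText)
    {
        try
        {
            await queueClient.SendMessageAsync(messageText, timeToLive: TimeSpan.FromSeconds(-1));
            return true;
        }
        catch (RequestFailedException ex) when (ex.Status == 500 && ex.ErrorCode == "OperationTimedOut")
        {
            // The operation may or may not have succeeded on the server side, so the item
            // has to be treated as "possibly enqueued" (e.g. remembered for consumer-side dedup).
            return false;
        }
    }
}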

Thanks for any help

pgorbar commented 1 year ago

Update/Sample app:

Hello/Good morning,

We can provide an update. We are able to reproduce this issue with a simple console application. We would like to know: 1) whether something is wrong with the code, 2) whether duplicates are expected/possible on an OperationTimedOut exception, and 3) whether this is some other issue related to the SDK or to Azure.

Here is the link to the sample application which you can try or take a look at: https://github.com/pgorbar/QueueDuplicatesDemo/blob/main/QueueDuplicatesDemo/Investigation.cs

In our sample runs (20 queues filled in parallel, each with 50k items), we end up with a few duplicates (1 - 3) in some of the queues on every run.
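For readers who prefer a self-contained outline over the repository link, the following is a rough sketch of the same idea (the linked Investigation.cs is the authoritative version; the queue names, counts and connection string below are placeholders): fill several queues in parallel with messages that each carry a unique id, then drain them and count the ids seen more than once.

using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;
using Azure.Storage.Queues;
using Azure.Storage.Queues.Models;

public static class DuplicateRepro
{
    public static async Task RunAsync(string connectionString, int queueCount = 20, int messagesPerQueue = 50_000)
    {
        var service = new QueueServiceClient(connectionString);

        // Fill the queues in parallel; every message body is a fresh GUID.
        await Task.WhenAll(Enumerable.Range(0, queueCount).Select(async i =>
        {
            QueueClient queue = service.GetQueueClient($"dup-repro-{i}");
            await queue.CreateIfNotExistsAsync();
            for (int m = 0; m < messagesPerQueue; m++)
            {
                await queue.SendMessageAsync(Guid.NewGuid().ToString());
            }
        }));

        // Drain the queues and count how often each message body was seen.
        var seen = new ConcurrentDictionary<string, int>();
        await Task.WhenAll(Enumerable.Range(0, queueCount).Select(async i =>
        {
            QueueClient queue = service.GetQueueClient($"dup-repro-{i}");
            while (true)
            {
                QueueMessage[] batch = await queue.ReceiveMessagesAsync(maxMessages: 32);
                if (batch.Length == 0) break;
                foreach (QueueMessage msg in batch)
                {
                    seen.AddOrUpdate(msg.MessageText, 1, (_, n) => n + 1);
                    await queue.DeleteMessageAsync(msg.MessageId, msg.PopReceipt);
                }
            }
        }));

        Console.WriteLine($"Bodies received more than once: {seen.Count(kvp => kvp.Value > 1)}");
    }
}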

Thanks for any suggestions or answers.

JerryBian commented 1 year ago

Any update on this? We are using version 12.15.0.0 and see exactly the same issue.

It's very annoying, as we didn't spot this issue a few versions back.

pavoldecky commented 1 year ago

@JerryBian we created a support ticket in Azure and received the following answer

I followed up internally and got the below update:

This is by design.  In these error conditions, a transaction may have been committed durably, but a failure and/or timeout could occur afterward.  With queues, this is a little challenging.  The guidance link says that if you get a 500: "Please query the server state before retrying the operation.".  But there isn't a great way to do this with queues (as far as we can tell) without peeking or popping messages.  And depending on the design, a Consumer may have already popped it before the Producer can Peek to see if it is there or not.  Given that this situation is unavoidable (a 500 may happen, and it may or may not have been committed on the backend), the customer will probably benefit from looking at the differences between "exactly once" and "at least once" designs and how to handle various cases.  There's a stack overflow thread that may be useful:  https://stackoverflow.com/questions/44204973/difference-between-exactly-once-and-at-least-once-guarantees

https://learn.microsoft.com/en-us/rest/api/storageservices/common-rest-api-error-codes

OperationTimedOut   Internal Server Error (500) The operation could not be completed within the permitted time. **The operation may or may not have succeeded on the server side.** Please query the server state before retrying the operation.

We hoped to receive more info in this GitHub issue, but unfortunately it went without any response, so we decided to abandon Azure Storage Queues for our solution and use a different approach.
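For anyone landing here with the same problem: given that the service only guarantees at-least-once behavior in this failure mode, the practical mitigation is an idempotent consumer. Below is a minimal sketch of that idea, assuming each message carries a stable business id (our ItemId) and some store that can record processed ids atomically; the IProcessedItemStore interface and extractItemId delegate are hypothetical placeholders, not part of the SDK.

using System;
using System.Threading.Tasks;
using Azure.Storage.Queues;
using Azure.Storage.Queues.Models;

// Hypothetical store of already-processed item ids (e.g. a table keyed by the id, where
// TryMarkProcessedAsync inserts the id and returns false if it was already present).
public interface IProcessedItemStore
{
    Task<bool> TryMarkProcessedAsync(string itemId);
}

public static class IdempotentConsumer
{
    public static async Task ProcessBatchAsync(
        QueueClient queueClient,
        IProcessedItemStore store,
        Func<string, string> extractItemId,
        Func<string, Task> ingressAsync)
    {
        QueueMessage[] messages = await queueClient.ReceiveMessagesAsync(maxMessages: 32);

        foreach (QueueMessage message in messages)
        {
            string itemId = extractItemId(message.MessageText);

            // Only the first message carrying a given ItemId is ingressed; duplicates produced
            // by retried or timed-out PutMessage calls are simply deleted without processing.
            if (await store.TryMarkProcessedAsync(itemId))
            {
                await ingressAsync(message.MessageText);
            }

            await queueClient.DeleteMessageAsync(message.MessageId, message.PopReceipt);
        }
    }
}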

JerryBian commented 1 year ago


Appreciate it.

Yeah, we have the same feeling. Azure Storage Queue is not our best choice; we have other issues with it as well. Maybe a Service Bus queue is a candidate.

cloudfy commented 1 year ago

Has anyone tried downgrading to a previous version? We have not seen these issues on 12.11.0 and below, but we now see them almost every day, causing severe impact to our customers.

Compared to the above, we do not get any 500 errors in the logs, only 201 on PUT, but we can see in Azure diagnostics that two PUTs exist for one SDK call.

cloudfy commented 1 year ago

We've opened a case, as this can be reproduced using the REST API as well. Since the SDK wraps the API, Azure must solve this in the API.

amnguye commented 3 months ago

Closing, as this is not an SDK-specific issue but a REST API issue.