Open · pavoldecky opened this issue 1 year ago
Thank you for your feedback. This has been routed to the support team for assistance.
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @xgithubtriage.
Hi
After some more digging into this issue and after adding more logs, I have more info to share.
First of all, the original problem originated when we were using the following QueueClientOptions setup:
```csharp
var queueClientOptions = new QueueClientOptions();
queueClientOptions.MessageEncoding = QueueMessageEncoding.None;
queueClientOptions.Retry.Mode = RetryMode.Exponential;
queueClientOptions.Retry.MaxRetries = 5;
return new QueueServiceClient(config.StorageAccount, queueClientOptions);
```
I changed the setup for QueueClientOptions as follows:
```csharp
var queueClientOptions = new QueueClientOptions();
queueClientOptions.MessageEncoding = QueueMessageEncoding.None;
//queueClientOptions.Retry.Mode = RetryMode.Exponential;
queueClientOptions.Retry.MaxRetries = 0;
queueClientOptions.Retry.Delay = TimeSpan.FromMilliseconds(0);
return new QueueServiceClient(config.StorageAccount, queueClientOptions);
```
As you can see, in the first example I was retrying any failed operation, while in the second example retries should be turned off.
I ran more tests and noticed the following behavior.
```
Failed to insert an item with ItemId [70be710c-1c3c-3018-eba5-caedc0b88921] to migration queue [migration-o2qtylpkmseg2kfhnjqr]
```

NOTE: ItemId is our internal identifier.
And here is the response from Azure:

```
Operation could not be completed within the specified time.
RequestId:a68f3c93-7003-0071-2192-a0952d000000
Time:2023-06-16T20:41:27.5213174Z
Status: 500 (Operation could not be completed within the specified time.)
ErrorCode: OperationTimedOut

Content:
<?xml version="1.0" encoding="utf-8"?><Error><Code>OperationTimedOut</Code><Message>Operation could not be completed within the specified time.
RequestId:a68f3c93-7003-0071-2192-a0952d000000
Time:2023-06-16T20:41:27.5213174Z</Message></Error>

Headers:
Server: Windows-Azure-Queue/1.0 Microsoft-HTTPAPI/2.0
x-ms-request-id: a68f3c93-7003-0071-2192-a0952d000000
x-ms-version: 2018-11-09
x-ms-error-code: OperationTimedOut
Date: Fri, 16 Jun 2023 20:41:26 GMT
Connection: close
Content-Length: 245
Content-Type: application/xml
```
So if the PutMessage operation fails with 500 OperationTimedOut, I would expect that the message is not created in the queue, but that is not the case here.
When I was receiving data from queue migration-o2qtylpkmseg2kfhnjqr for processing, I noticed that we received one message containing our own item with id 70be710c-1c3c-3018-eba5-caedc0b88921, which according to the first point was NOT inserted into the queue:

```
Retrieve item from queue [migration-o2qtylpkmseg2kfhnjqr]| {"ItemId":"70be710c-1c3c-3018-eba5-caedc0b88921", "MessageId":"5533d7b8-1d30-4bb1-beee-3589d051a895"}
```

So now I do not understand what is happening.
This behavior explains why I had duplicate messages in the queue in my first question: the original failed request caused an automatic retry, but in the end both the original and the retried operation inserted a message into the queue.
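For completeness, this is roughly how a producer could treat that 500 as an ambiguous outcome instead of retrying blindly (a minimal sketch with SDK retries disabled; the console logging is only illustrative, not our production code):

```csharp
using System;
using System.Threading.Tasks;
using Azure;
using Azure.Storage.Queues;

public static class AmbiguousSend
{
    // Sketch: send a message and treat 500 OperationTimedOut as
    // "may or may not have been committed", rather than as a clean failure.
    public static async Task SendAsync(QueueClient queueClient, string itemJson)
    {
        try
        {
            await queueClient.SendMessageAsync(itemJson, timeToLive: TimeSpan.FromSeconds(-1));
        }
        catch (RequestFailedException ex)
            when (ex.Status == 500 && ex.ErrorCode == "OperationTimedOut")
        {
            // A blind retry here is what produces the duplicate: the first
            // PutMessage may already have been committed on the backend.
            // Record the item id so a later step can reconcile/de-duplicate.
            Console.WriteLine($"Ambiguous send, item may already be queued: {itemJson}");
        }
    }
}
```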
Thanks for any help
Hello/Good morning,
We can provide an update: we are able to reproduce this issue with a simple console application. We would like to know: 1) is something wrong with the code, 2) are duplicates expected/possible on an OperationTimedOut exception, and 3) is this some other issue related to the SDK or Azure?
Here is the link to the sample application which you can try or take a look at: https://github.com/pgorbar/QueueDuplicatesDemo/blob/main/QueueDuplicatesDemo/Investigation.cs
In our sample run, 20 queues are filled in parallel, each with 50k items. With every run, we end up with a few duplicates (1-3) in some of the queues.
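For anyone who doesn't want to open the repo, the load is roughly the following (a simplified sketch; the connection string, queue names, and message payload here are placeholders, the real code is in Investigation.cs):

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using Azure.Storage.Queues;

// Sketch of the reproduction: fill several queues in parallel with many
// small messages; with enough volume, an occasional 500 OperationTimedOut
// shows up and some queues end up with a duplicate item.
const int QueueCount = 20;
const int MessagesPerQueue = 50_000;

var options = new QueueClientOptions { MessageEncoding = QueueMessageEncoding.None };
var service = new QueueServiceClient("<connection-string>", options);

await Parallel.ForEachAsync(Enumerable.Range(0, QueueCount), async (q, ct) =>
{
    QueueClient queue = service.GetQueueClient($"duplicates-demo-{q}");
    await queue.CreateIfNotExistsAsync(cancellationToken: ct);

    for (int i = 0; i < MessagesPerQueue; i++)
    {
        // Each message carries a unique id so duplicates can be detected later.
        await queue.SendMessageAsync($"{{\"ItemId\":\"{Guid.NewGuid()}\"}}", ct);
    }
});
```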
Thanks for any suggestions or answers.
Any update on this? We are using version 12.15.0.0 and see exactly the same issue.
It's very annoying, as we didn't spot this issue a few versions back.
@JerryBian we created a support ticket in Azure and received the following answer
I followed up internally and got the below update:
This is by design. In these error conditions, a transaction may have been committed durably, but a failure and/or timeout could occur afterward. With queues, this is a little challenging. The guidance link says that if you get a 500: "Please query the server state before retrying the operation.". But there isn't a great way to do this with queues (as far as we can tell) without peeking or popping messages. And depending on the design, a Consumer may have already popped it before the Producer can Peek to see if it is there or not. Given that this situation is unavoidable (a 500 may happen, and it may or may not have been committed on the backend), the customer will probably benefit from looking at the differences between "exactly once" and "at least once" designs and how to handle various cases. There's a stack overflow thread that may be useful: https://stackoverflow.com/questions/44204973/difference-between-exactly-once-and-at-least-once-guarantees
https://learn.microsoft.com/en-us/rest/api/storageservices/common-rest-api-error-codes
| Error code | HTTP status code | User message |
|------------|------------------|--------------|
| OperationTimedOut | Internal Server Error (500) | The operation could not be completed within the permitted time. **The operation may or may not have succeeded on the server side.** Please query the server state before retrying the operation. |
We hoped we would receive more info from this GitHub issue, but unfortunately it went without any response. So we decided to abandon Azure storage queues for our solution and use a different approach.
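For anyone who wants to stay on storage queues, the practical consequence of that answer is to treat delivery as at-least-once and de-duplicate on the consumer side using the ItemId carried inside each message body. A minimal sketch (the in-memory HashSet is just a stand-in for whatever durable de-duplication store a real system would need):

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using Azure.Storage.Queues;
using Azure.Storage.Queues.Models;

// Sketch: at-least-once consumption, de-duplicated on our own ItemId.
var processedItemIds = new HashSet<Guid>();

var queue = new QueueClient("<connection-string>", "migration-queue",
    new QueueClientOptions { MessageEncoding = QueueMessageEncoding.None });

QueueMessage[] messages = await queue.ReceiveMessagesAsync(
    maxMessages: 32, visibilityTimeout: TimeSpan.FromHours(1));

foreach (QueueMessage message in messages)
{
    Guid itemId = JsonDocument.Parse(message.Body.ToString())
        .RootElement.GetProperty("ItemId").GetGuid();

    if (processedItemIds.Add(itemId))
    {
        // First time we see this item: process it here.
    }
    // Duplicate or not, remove the message from the queue once handled.
    await queue.DeleteMessageAsync(message.MessageId, message.PopReceipt);
}
```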
Appreciate it.
Yeah, we have the same feelings. Azure storage queue is not our best choice; we have other issues as well. Maybe the Service Bus queue is a candidate.
Has anyone tried downgrading to a previous version? We did not see these issues on 12.11.0 and below, but we now see them almost every day, causing severe impact to our customers.
Compared to the above, we do not get any 500 errors in the logs, only 201 on PUT, but we see in the Azure diagnostics that two PUTs exist for one SDK call.
We've opened a case as this can be simulated using the REST API as well. As the SDK wraps the API, Azure must solve this in the API.
Closing, as this is not an SDK-specific issue; it is a REST API issue.
Library name and version
Azure.Storage.Queues 12.14.0
Query/Question
Hello
I am facing strange behavior in the Azure storage queue, resulting in a couple of messages containing duplicate content.
The application moves items from the source system to the target system. The current architecture is as follows:
Every message in the Azure storage queue contains a single item with a unique id. I noticed that some messages in the Azure storage queue contain duplicate item content. So I have 2 messages in the queue: one with queue message id 1 that contains the item with id 1, and a second with queue message id 2 that contains the same item with id 1.
The Azure function ingests data into the storage queue with the following call:

```csharp
await QueueClient.SendMessageAsync(item.ToJson(), timeToLive: TimeSpan.FromSeconds(-1));
```

The Windows service reads messages from the queue with the following code:

```csharp
await QueueClient.ReceiveMessagesAsync(maxMessages: 32, visibilityTimeout: TimeSpan.FromHours(1));
```

The Windows service deletes messages from the queue with this code:

```csharp
await this.QueueClient.DeleteMessageAsync(MessageId, PopReceipt);
```
During testing, I noticed this behavior:
The Windows service receives, in one batch, 2 different queue messages with the same item.
From the function logs in App Insights, I can see the message was sent to the storage queue only once, so this call should NOT create duplicates.
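Duplicates like this can be spotted by grouping a received batch by the ItemId embedded in each message body, roughly like this (a simplified sketch; the JSON shape of the item is assumed):

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;
using Azure.Storage.Queues.Models;

public static class DuplicateFinder
{
    // Sketch: report ItemIds that appear in more than one queue message.
    public static IEnumerable<string> FindDuplicateItemIds(IEnumerable<QueueMessage> batch) =>
        batch
            .GroupBy(m => JsonDocument.Parse(m.Body.ToString())
                                      .RootElement.GetProperty("ItemId").GetString())
            .Where(g => g.Count() > 1)
            .Select(g => $"ItemId {g.Key} appears in messages: " +
                         string.Join(", ", g.Select(m => m.MessageId)));
}
```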
I enabled diagnostic settings for storage queues, and the logs are sent to Log Analytics.
The following Kusto query returned this result:
As you can see, in the output there is only one PutMessage operation for the message with id 46499ca0-e6ac-42e7-9cea-56c9a10f20da. The second message I received from the queue, with id 04e02661-a977-403b-859f-f3d6da632a95, doesn't have a corresponding PutMessage operation, so I have no idea how it could have been inserted into the queue. Also, from the output you can see that both messages were deleted from the queue, so I assume both of them really existed in the queue. Finally, the PutMessage operation took 2014 ms, while the rest of the put operations took less than 10 ms.
In the diagnostics logs I can also see a number of failed operations, most of which are related to OperationTimedOut.
When I started looking more deeply at what is happening with the problematic queue, I noticed 2 PutMessage operations with StatusText OperationTimedOut. What is very suspicious is that for the mentioned queue I have exactly 2 duplicate items.
According to the timestamps, the PUT operation for one of the duplicate items was executed at 2023-06-14T10:02:32.4264083Z, and both OperationTimedOut PutMessage operations were executed around 2023-06-14T10:02:29.5077103Z, so this was before the PutMessage operation for message 46499ca0-e6ac-42e7-9cea-56c9a10f20da was executed.
So I would like some help with answers to the following questions.
To give better insight: I was able to insert 966015 messages into multiple queues and have only 6 messages with duplicate content, so this probably only happens under heavier load on the queues.
According to this Kusto query, I can see plenty of queue operations before the incident, with the longest latency being around 3 minutes.
Any help with this issue is really appreciated, thank you very much.
Environment
Windows services using .NET 6