Closed danielmarbach closed 6 years ago
Btw. the repro sample uses the HTTP URI so that wireshark can be attached. I have a full wireshark capture of a test but since it contains my unencrypted credentials I won't attach it here and it can be easily recaptured with the repro if required.
@danielmarbach , thanks for the detailed report and repro. We'll investigate this and get back to you.
@gokarnm let me know if you need help or additional information
Some more information.
I created two more branches. One is https://github.com/danielmarbach/SqsLinuxRepro/tree/client-per-receive which creates a client per concurrent receive loop. The issue still occurs
Another one is https://github.com/danielmarbach/SqsLinuxRepro/tree/multiple-starts-bash where I start the same program with just one receive loop multiple times with a bash script. The issue still occurs.
I was able to narrow it down
if you enable the following in the repro
AWSConfigs.LoggingConfig.LogTo = LoggingOptions.Console;
you can observe
AmazonSQSClient 45|2017-12-19T16:54:41.797Z|ERROR|TaskCanceledException making request ReceiveMessageRequest to http://sqs.eu-central-1.amazonaws.com/. Attempt 1. --> System.Threading.Tasks.TaskCanceledException: A task was canceled.
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.ConfiguredTaskAwaitable`1.ConfiguredTaskAwaiter.GetResult()
at System.Net.Http.HttpClient.<FinishSendAsyncUnbuffered>d__59.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Amazon.Runtime.HttpWebRequestMessage.<GetResponseAsync>d__20.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Amazon.Runtime.Internal.HttpHandler`1.<InvokeAsync>d__9`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Amazon.Runtime.Internal.Unmarshaller.<InvokeAsync>d__3`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Amazon.SQS.Internal.ValidationResponseHandler.<InvokeAsync>d__1`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Amazon.Runtime.Internal.ErrorHandler.<InvokeAsync>d__5`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Amazon.Runtime.Internal.CallbackHandler.<InvokeAsync>d__9`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Amazon.Runtime.Internal.CredentialsRetriever.<InvokeAsync>d__7`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Amazon.Runtime.Internal.RetryHandler.<InvokeAsync>d__10`1.MoveNext()
which seems to be related to
https://github.com/dotnet/corefx/issues/25800
In the cases the exception happens we are never able to receive anything on the client side.
I verified again with wireshark and the service always correctly returns the ReceiveMessageResult
Any updates on this? It is a major .NET Core blocker for us. I can try to update latest .NET Core and report back here but I would love to see some interaction happening on this issue
Sorry for the delayed response. We weren't able to reproduce this on Mac or Windows. We are pretty confident that this is a .NET Core framework issue--probably with the linux httpclient it uses.
We've discussed a few workaround that we can provide in the SDK until a new HttpClient is released in .NET Core 2.1. Once we have a workaround solution that is reasonable, we'll ping back on this thread.
@sstevenkang I'm not confident anymore that it is only related to .NET Core. I was able to repro the same problem on Windows consistently on multiple machines with FIFO queues. I updated the repro solution https://github.com/danielmarbach/SqsLinuxRepro/tree/fifo-queues // @seanfeldman
Running code from fifo-queues
branch on Windows
2018-03-05 22:45:35 (0) - Receiving
2018-03-05 22:45:35 (1) - Receiving
2018-03-05 22:45:35 (0) - Received 2 / OK / 1290
2018-03-05 22:45:35 (0) - deleting 2018-03-05 22:23:56 attempt 4 / 1
2018-03-05 22:45:35 (0) - deleting 2018-03-05 22:23:45 attempt 3 / 1
2018-03-05 22:45:35 (0) - deleted 2018-03-05 22:23:56 attempt 4 / 1
2018-03-05 22:45:36 (0) - deleted 2018-03-05 22:23:45 attempt 3 / 1
2018-03-05 22:45:36 (0) - Receiving
2018-03-05 22:45:37 (Main) - Sending attempt 1 1
2018-03-05 22:45:37 (1) - Received 1 / OK / 776
2018-03-05 22:45:37 (1) - deleting 2018-03-05 22:45:37 attempt 1 / 1
2018-03-05 22:45:37 (Main) - Sent attempt 1 1
2018-03-05 22:45:38 (1) - deleted 2018-03-05 22:45:37 attempt 1 / 1
2018-03-05 22:45:38 (1) - Receiving
2018-03-05 22:45:46 (0) - cancelled
2018-03-05 22:45:46 (1) - cancelled
2018-03-05 22:45:47 (0) - Receiving
2018-03-05 22:45:47 (1) - Receiving
2018-03-05 22:45:49 (Main) - Sending attempt 2 1
2018-03-05 22:45:49 (Main) - Sent attempt 2 1
2018-03-05 22:45:58 (0) - cancelled
2018-03-05 22:45:58 (1) - cancelled
2018-03-05 22:45:59 (0) - Receiving
2018-03-05 22:45:59 (1) - Receiving
2018-03-05 22:46:01 (Main) - Sending attempt 3 1
2018-03-05 22:46:01 (Main) - Sent attempt 3 1
2018-03-05 22:46:10 (1) - cancelled
2018-03-05 22:46:10 (0) - cancelled
2018-03-05 22:46:10 (0) - Receiving
2018-03-05 22:46:10 (1) - Receiving
2018-03-05 22:46:12 (Main) - Sending attempt 4 1
2018-03-05 22:46:13 (Main) - Sent attempt 4 1
2018-03-05 22:46:19 (0) - Received 1 / OK / 776
2018-03-05 22:46:19 (0) - deleting 2018-03-05 22:45:49 attempt 2 / 1
2018-03-05 22:46:20 (0) - deleted 2018-03-05 22:45:49 attempt 2 / 1
2018-03-05 22:46:20 (0) - Receiving
2018-03-05 22:46:22 (1) - cancelled
2018-03-05 22:46:22 (0) - cancelled
The outcome is that the messages are not picked up, see
3/5/2018 11:51:34 PM (0) - Receiving
3/5/2018 11:51:34 PM (1) - Receiving
3/5/2018 11:51:34 PM (0) - Received 3 / OK / 1747
3/5/2018 11:51:34 PM (0) - deleting 3/5/2018 11:43:58 PM attempt 4 / 1
3/5/2018 11:51:34 PM (0) - deleting 3/5/2018 11:42:41 PM attempt 3 / 1
3/5/2018 11:51:34 PM (0) - deleting 3/5/2018 11:43:46 PM attempt 3 / 1
3/5/2018 11:51:34 PM (0) - deleted 3/5/2018 11:43:58 PM attempt 4 / 1
3/5/2018 11:51:35 PM (0) - deleted 3/5/2018 11:42:41 PM attempt 3 / 1
3/5/2018 11:51:35 PM (0) - deleted 3/5/2018 11:43:46 PM attempt 3 / 1
3/5/2018 11:51:35 PM (0) - Receiving
3/5/2018 11:51:36 PM (Main) - Sending attempt 1 1
3/5/2018 11:51:36 PM (0) - Received 1 / OK / 757
3/5/2018 11:51:36 PM (0) - deleting 3/5/2018 11:51:36 PM attempt 1 / 1
3/5/2018 11:51:36 PM (Main) - Sent attempt 1 1
3/5/2018 11:51:36 PM (0) - deleted 3/5/2018 11:51:36 PM attempt 1 / 1
3/5/2018 11:51:36 PM (0) - Receiving
3/5/2018 11:51:44 PM (1) - cancelled
3/5/2018 11:51:44 PM (0) - cancelled
3/5/2018 11:51:46 PM (0) - Receiving
3/5/2018 11:51:46 PM (1) - Receiving
3/5/2018 11:51:48 PM (Main) - Sending attempt 2 1
3/5/2018 11:51:49 PM (Main) - Sent attempt 2 1
3/5/2018 11:51:56 PM (1) - cancelled
3/5/2018 11:51:56 PM (0) - cancelled
3/5/2018 11:51:59 PM (0) - Receiving
3/5/2018 11:51:59 PM (1) - Receiving
3/5/2018 11:52:01 PM (Main) - Sending attempt 3 1
3/5/2018 11:52:02 PM (Main) - Sent attempt 3 1
3/5/2018 11:52:09 PM (1) - cancelled
3/5/2018 11:52:09 PM (0) - cancelled
3/5/2018 11:52:10 PM (0) - Receiving
3/5/2018 11:52:10 PM (1) - Receiving
3/5/2018 11:52:12 PM (Main) - Sending attempt 4 1
3/5/2018 11:52:12 PM (Main) - Sent attempt 4 1
3/5/2018 11:52:20 PM (1) - cancelled
3/5/2018 11:52:20 PM (0) - cancelled
It also doesn't appear to be limited to .NET Core. If the target framework of the repro project is changed to net452
, the same problem is demonstrated.
The effect is that the message stays in flight for 30 seconds which means they are actually processed but never handed over to the upper layers of the code.
Reproed it again on freshly setup cloud Linux machine with Ubuntu 17.10 and dotnet core version 2.1.4. And like @bording said it is also possible to repro this on full framework.
Guys seriously, this is a major blocker for us. Can we get some turn around here?
For the record NET452 target
@sstevenkang any updates?
We don't have a definitive solution yet but here's an update on our running theory:
When a message is received but not deleted you have to wait for the message visibility timeout to elapse before receiving it again. That's the well known pattern.
That's what's happening here - but with a twist. It looks like there's a race condition when the request gets canceled. When the CancellationSource gets triggered control returns to the calling code but the socket isn't closed yet.
So even though a message is sent after the previous ReceiveMessage request has been canceled the socket from the previous receive is still open. SQS replies over the old socket with the new message, considers it received, and subjects the message to the visibility timeout.
We're still looking into how the CancellationToken works within the HTTP client, and possible solutions to that.
One possible workaround is to call CancellationSource.CancelAfter with a value greater than the visibility timeout (which is 30 seconds by default). Another (less desirable) is to put a Thread.Sleep after any OperationCanceledExceptions to allow the socket time to be closed.
@danielmarbach Will either of those work for you? Can you tell us more about your use case and how this is causing issues?
@vellozzi thanks for coming back to us and investigating more.
We'll investigate if we can incorporate your proposal with the CancellationTokenSource. What I'm not yet understanding is why that would change anything. The token source we currently pass to the SDK is only canceled from our perspective when we shut down the application and that is not what we are doing when we see that happening in our environments. The Thread.Sleep option is not something I would like to add to our production code TBH.
Can you tell us more about your use case and how this is causing issues?
We detected this in our acceptance testing environment. We are developing an SQS adapter for NServiceBus. We recently added support for .NET Core. Based on that we run a set of acceptance tests on .NET Core on Windows, Linux and also the full framework on Windows. The default test timeout was 20 seconds and we suddenly started seing those tests failing. When we investigated we found out that in certain circumstances the messages that we sent remain not visible for the visibility timeout and thus the tests fail. We got worried because we figured that could also happen to our customers. We got also worried because we figured it might also indicate a potential message loss (which we haven't confirmed yet). For now, we decided to lift the restriction to not run on Linux because we think in the worst case messages will get delayed by 30 seconds (which is something we'll document on our docs site). But of course we'd love to see the SDK just work ;)
Are you saying that the CancellationSource is only cancelled on shutdown and your problem isn't occurring at shutdown? If that's the case then I may not be seeing the same problem you are. The issue I'm seeing only happens when the CancellationSource is cancelled.
Can you change the sample code to take the CancellationSource out of the equation and commit that to your GitHub repo? I would do it but I want to make sure the sample accurately represents your production code.
I agree that there is no message loss and that the worst case is that you have to wait for the message visibility period to expire.
I extended the program a bit to include the received timespan by doing simple math between ApproximateFirstReceiveTimestamp
and SentTimestamp
. A run with the existing settings showed again the delay
daniel@DESKTOP-SUIHDT7:/mnt/c/p/SqsLinuxRepro/SqsRepro$ dotnet run SqsRepro.csproj -c Release
3/27/18 7:09:27 AM (0) - Receiving
3/27/18 7:09:28 AM (1) - Receiving
3/27/18 7:09:30 AM (Main) - Sending attempt 1 1
3/27/18 7:09:30 AM (Main) - Sent attempt 1 1
3/27/18 7:09:30 AM (1) - Received 1 / OK / 1131
Elapsed '00:00:00.0030000' (1) - to receive message with body '3/27/18 7:09:30 AM attempt 1 / 1'
3/27/18 7:09:30 AM (1) - deleting 3/27/18 7:09:30 AM attempt 1 / 1
3/27/18 7:09:30 AM (1) - deleted 3/27/18 7:09:30 AM attempt 1 / 1
3/27/18 7:09:30 AM (1) - Receiving
3/27/18 7:09:37 AM (1) - cancelled
3/27/18 7:09:37 AM (0) - cancelled
3/27/18 7:09:38 AM (0) - Receiving
3/27/18 7:09:38 AM (1) - Receiving
3/27/18 7:09:40 AM (Main) - Sending attempt 2 1
3/27/18 7:09:40 AM (Main) - Sent attempt 2 1
3/27/18 7:09:48 AM (0) - cancelled
3/27/18 7:09:48 AM (1) - cancelled
3/27/18 7:09:48 AM (0) - Receiving
3/27/18 7:09:48 AM (1) - Receiving
3/27/18 7:09:50 AM (Main) - Sending attempt 3 1
3/27/18 7:09:50 AM (Main) - Sent attempt 3 1
3/27/18 7:09:50 AM (1) - Received 1 / OK / 1131
Elapsed '00:00:00.0010000' (1) - to receive message with body '3/27/18 7:09:50 AM attempt 3 / 1'
3/27/18 7:09:50 AM (1) - deleting 3/27/18 7:09:50 AM attempt 3 / 1
3/27/18 7:09:50 AM (1) - deleted 3/27/18 7:09:50 AM attempt 3 / 1
3/27/18 7:09:50 AM (1) - Receiving
3/27/18 7:09:58 AM (0) - cancelled
3/27/18 7:09:58 AM (1) - cancelled
3/27/18 7:09:58 AM (0) - Receiving
3/27/18 7:09:58 AM (1) - Receiving
3/27/18 7:10:00 AM (Main) - Sending attempt 4 1
3/27/18 7:10:00 AM (Main) - Sent attempt 4 1
3/27/18 7:10:08 AM (0) - cancelled
3/27/18 7:10:08 AM (1) - cancelled
daniel@DESKTOP-SUIHDT7:/mnt/c/p/SqsLinuxRepro/SqsRepro$ dotnet run SqsRepro.csproj -c Release
3/27/18 7:11:20 AM (0) - Receiving
3/27/18 7:11:21 AM (1) - Receiving
3/27/18 7:11:21 AM (1) - Received 1 / OK / 1131
3/27/18 7:11:21 AM (0) - Received 1 / OK / 1131
Elapsed '00:01:28.4026165' (1) - to receive message with body '3/27/18 7:10:00 AM attempt 4 / 1'
3/27/18 7:11:21 AM (1) - deleting 3/27/18 7:10:00 AM attempt 4 / 1
Elapsed '00:01:48.6386170' (0) - to receive message with body '3/27/18 7:09:40 AM attempt 2 / 1'
3/27/18 7:11:21 AM (0) - deleting 3/27/18 7:09:40 AM attempt 2 / 1
3/27/18 7:11:21 AM (0) - deleted 3/27/18 7:09:40 AM attempt 2 / 1
3/27/18 7:11:21 AM (0) - Receiving
3/27/18 7:11:21 AM (1) - deleted 3/27/18 7:10:00 AM attempt 4 / 1
3/27/18 7:11:21 AM (1) - Receiving
3/27/18 7:11:23 AM (Main) - Sending attempt 1 1
3/27/18 7:11:23 AM (Main) - Sent attempt 1 1
3/27/18 7:11:23 AM (1) - Received 1 / OK / 1131
Elapsed '00:00:00.0030000' (1) - to receive message with body '3/27/18 7:11:23 AM attempt 1 / 1'
3/27/18 7:11:23 AM (1) - deleting 3/27/18 7:11:23 AM attempt 1 / 1
3/27/18 7:11:23 AM (1) - deleted 3/27/18 7:11:23 AM attempt 1 / 1
3/27/18 7:11:23 AM (1) - Receiving
3/27/18 7:11:30 AM (0) - cancelled
3/27/18 7:11:30 AM (1) - cancelled
3/27/18 7:11:30 AM (0) - Receiving
3/27/18 7:11:30 AM (1) - Receiving
3/27/18 7:11:32 AM (Main) - Sending attempt 2 1
3/27/18 7:11:32 AM (Main) - Sent attempt 2 1
3/27/18 7:11:40 AM (1) - cancelled
3/27/18 7:11:40 AM (0) - cancelled
3/27/18 7:11:40 AM (0) - Receiving
3/27/18 7:11:40 AM (1) - Receiving
3/27/18 7:11:42 AM (Main) - Sending attempt 3 1
3/27/18 7:11:43 AM (Main) - Sent attempt 3 1
3/27/18 7:11:43 AM (0) - Received 1 / OK / 1131
Elapsed '00:00:00.0020000' (0) - to receive message with body '3/27/18 7:11:42 AM attempt 3 / 1'
3/27/18 7:11:43 AM (0) - deleting 3/27/18 7:11:42 AM attempt 3 / 1
3/27/18 7:11:43 AM (0) - deleted 3/27/18 7:11:42 AM attempt 3 / 1
3/27/18 7:11:43 AM (0) - Receiving
3/27/18 7:11:50 AM (0) - cancelled
3/27/18 7:11:50 AM (1) - cancelled
3/27/18 7:11:51 AM (0) - Receiving
3/27/18 7:11:51 AM (1) - Receiving
3/27/18 7:11:53 AM (Main) - Sending attempt 4 1
3/27/18 7:11:53 AM (Main) - Sent attempt 4 1
3/27/18 7:12:01 AM (0) - cancelled
3/27/18 7:12:01 AM (1) - cancelled
Then I increased the timeout of the cancellation token source to 40 seconds which is more than the visibility timeout and still I get
daniel@DESKTOP-SUIHDT7:/mnt/c/p/SqsLinuxRepro/SqsRepro$ dotnet run SqsRepro.csproj -c Release
3/27/18 7:15:50 AM (0) - Receiving
3/27/18 7:15:51 AM (1) - Receiving
3/27/18 7:15:53 AM (Main) - Sending attempt 1 1
3/27/18 7:15:53 AM (Main) - Sent attempt 1 1
3/27/18 7:15:53 AM (0) - Received 1 / OK / 1131
Elapsed '00:00:00.0030000' (0) - to receive message with body '3/27/18 7:15:53 AM attempt 1 / 1'
3/27/18 7:15:53 AM (0) - deleting 3/27/18 7:15:53 AM attempt 1 / 1
3/27/18 7:15:53 AM (0) - deleted 3/27/18 7:15:53 AM attempt 1 / 1
3/27/18 7:15:53 AM (0) - Receiving
3/27/18 7:16:11 AM (1) - Received 0 / OK / 240
3/27/18 7:16:11 AM (1) - Receiving
3/27/18 7:16:13 AM (0) - Received 0 / OK / 240
3/27/18 7:16:13 AM (0) - Receiving
3/27/18 7:16:30 AM (1) - cancelled
3/27/18 7:16:30 AM (0) - cancelled
3/27/18 7:16:30 AM (0) - Receiving
3/27/18 7:16:30 AM (1) - Receiving
3/27/18 7:16:32 AM (Main) - Sending attempt 2 1
3/27/18 7:16:32 AM (Main) - Sent attempt 2 1
3/27/18 7:16:32 AM (0) - Received 1 / OK / 1131
Elapsed '00:00:00.0040000' (0) - to receive message with body '3/27/18 7:16:32 AM attempt 2 / 1'
3/27/18 7:16:32 AM (0) - deleting 3/27/18 7:16:32 AM attempt 2 / 1
3/27/18 7:16:32 AM (0) - deleted 3/27/18 7:16:32 AM attempt 2 / 1
3/27/18 7:16:32 AM (0) - Receiving
3/27/18 7:16:50 AM (1) - Received 0 / OK / 240
3/27/18 7:16:50 AM (1) - Receiving
3/27/18 7:16:53 AM (0) - Received 0 / OK / 240
3/27/18 7:16:53 AM (0) - Receiving
3/27/18 7:17:10 AM (1) - cancelled
3/27/18 7:17:10 AM (0) - cancelled
3/27/18 7:17:10 AM (0) - Receiving
3/27/18 7:17:10 AM (1) - Receiving
3/27/18 7:17:12 AM (Main) - Sending attempt 3 1
3/27/18 7:17:12 AM (Main) - Sent attempt 3 1
3/27/18 7:17:30 AM (0) - Received 0 / OK / 240
3/27/18 7:17:30 AM (0) - Receiving
3/27/18 7:17:30 AM (1) - Received 0 / OK / 240
3/27/18 7:17:30 AM (1) - Receiving
3/27/18 7:17:43 AM (1) - Received 1 / OK / 1131
Elapsed '00:00:37.4222483' (1) - to receive message with body '3/27/18 7:17:12 AM attempt 3 / 1'
3/27/18 7:17:43 AM (1) - deleting 3/27/18 7:17:12 AM attempt 3 / 1
3/27/18 7:17:43 AM (1) - deleted 3/27/18 7:17:12 AM attempt 3 / 1
3/27/18 7:17:43 AM (1) - Receiving
3/27/18 7:17:50 AM (0) - cancelled
3/27/18 7:17:50 AM (1) - cancelled
3/27/18 7:17:51 AM (0) - Receiving
3/27/18 7:17:51 AM (1) - Receiving
3/27/18 7:17:53 AM (Main) - Sending attempt 4 1
3/27/18 7:17:53 AM (Main) - Sent attempt 4 1
3/27/18 7:17:53 AM (0) - Received 1 / OK / 1131
Elapsed '00:00:00.0010000' (0) - to receive message with body '3/27/18 7:17:53 AM attempt 4 / 1'
3/27/18 7:17:53 AM (0) - deleting 3/27/18 7:17:53 AM attempt 4 / 1
3/27/18 7:17:53 AM (0) - deleted 3/27/18 7:17:53 AM attempt 4 / 1
3/27/18 7:17:53 AM (0) - Receiving
3/27/18 7:18:11 AM (1) - Received 0 / OK / 240
3/27/18 7:18:11 AM (1) - Receiving
3/27/18 7:18:13 AM (0) - Received 0 / OK / 240
3/27/18 7:18:13 AM (0) - Receiving
Sending attempt 3 was never received. Was only received in the next run of the repro (output not shown here)
When I change
var receiveResult = await sqsClient.ReceiveMessageAsync(new ReceiveMessageRequest
{
MaxNumberOfMessages = 10,
QueueUrl = queueUrl,
WaitTimeSeconds = 20,
AttributeNames = new List<string>
{
"SentTimestamp",
"ApproximateFirstReceiveTimestamp",
"ApproximateReceiveCount"
},
},
token).ConfigureAwait(false);
to
var receiveResult = await sqsClient.ReceiveMessageAsync(new ReceiveMessageRequest
{
MaxNumberOfMessages = 10,
QueueUrl = queueUrl,
WaitTimeSeconds = 20,
AttributeNames = new List<string>
{
"SentTimestamp",
"ApproximateFirstReceiveTimestamp",
"ApproximateReceiveCount"
},
},
CancellationToken.None).ConfigureAwait(false);
the issue disappears. It looks like when passing in the cancellation token (no matter what the timeout is) under some circumstances things leak which causes messages to be delayed for the visibility timeout period.
One could say that this is an acceptable workaround for us for now. So in the worst case we will delay the shutdown of an endpoint by 20 seconds.
I pushed an update to the repro sample
I had a chat @normj and based on the repro steps we provided we are fairly certain what we are seeing here is what @vellozzi already explained in https://github.com/aws/aws-sdk-net/issues/796#issuecomment-375494537. Since this is not a message loss scenario for us and in the worst case messages will be delayed for the duration of the visibility timeout I'm closing this issue.
We will discuss internally whether we want to pass in the token or not pass in the token.
Thanks all for chiming in!
I probably have the same issue happening with Rebus: https://stackoverflow.com/questions/67656986/a-message-received-from-sqs-by-a-stopped-aws-ecs-instance-is-handled-by-rebus-on
The suggested workaround of not passing cancellationToken into sqsClient.ReceiveMessageAsync()
but passing in CancellationToken.None
works for me as well.
@vellozzi Any luck with We're still looking into how the CancellationToken works within the HTTP client, and possible solutions to that.? I have the same issue with Rebus, which I fixed using the workaround you recommended - passing CancellationToken.None
into sqsClient.ReceiveMessageAsync()
. Cheers.
I found this: https://stackoverflow.com/a/45562557/379279 HttpClient cancellation doesn't kill underlying TCP call
: HttpClient re-uses connections as much as possible, hence the reason why the connection is not closed on cancel. When canceling a request like that, the HttpClient will just stop sending/receiving data to/from the other end. It will not send anything to inform the other end that it was cancelled.
So I guess that's it - the solution is to stop using cancellations with HttpClient
, and the workaround looks like the proper solution.
Under .NET Core with Linux messages sent by the SQS client might not be picked up by ReceiveMessages even though the messages are available in the queue, when the following factors are given:
Expected Behavior
Messages sent to the queue should be picked up by the ReceiveMessages call as soon as possible. Execute https://github.com/danielmarbach/SqsLinuxRepro under .NET Core in Windows and the following should happen:
Result: all messages consumed
Current Behavior
One or multiple messages are in flight during the execution time but never picked up the the SDK and propagated to the user code, more information see Steps to reproduce.
Possible Solution
Steps to Reproduce (for bugs)
Execute https://github.com/danielmarbach/SqsLinuxRepro under .NET Core on Linux with a concurrency higher than 2 and observe a similar pattern to:
As we can see sent attempt 2 was never received although the receive time is high enough in the repro to actually pick it up.
but the queue browser shows multiple receive attempts.
Context
At Particular Software we are preparing a Release candidate of a .NET Core SQS transport for NServiceBus and we detected this on our CI environment. Multiple of our tests failed under .NET Core under Linux only. This is currently a blocker for us.
Your Environment
.NET Core Info
dotnet --info
: