aws / aws-sdk-net

The official AWS SDK for .NET. For more information on the AWS SDK for .NET, see our web site:
http://aws.amazon.com/sdkfornet/
Apache License 2.0
2.05k stars 852 forks source link

SQS: Sent messages might not get picked up by ReceiveMessages on .NET Core running on Linux #796

Closed danielmarbach closed 6 years ago

danielmarbach commented 6 years ago

Under .NET Core with Linux messages sent by the SQS client might not be picked up by ReceiveMessages even though the messages are available in the queue, when the following factors are given:

Expected Behavior

Messages sent to the queue should be picked up by the ReceiveMessages call as soon as possible. Execute https://github.com/danielmarbach/SqsLinuxRepro under .NET Core in Windows and the following should happen:

PS C:\p\SqsLinuxRepro\SqsRepro> dotnet run
16/11/2017 10:43:35 (0) - Receiving
16/11/2017 10:43:35 (1) - Receiving
16/11/2017 10:43:37 (Main) - Sending attempt 1 1
16/11/2017 10:43:37 (1) - Received 1 / OK / 884
16/11/2017 10:43:37 (1) - deleting 16/11/2017 10:43:37 attempt 1 / 1
16/11/2017 10:43:37 (1) - deleted 16/11/2017 10:43:37 attempt 1 / 1
16/11/2017 10:43:37 (1) - Receiving
16/11/2017 10:43:38 (Main) - Sent attempt 1 1
16/11/2017 10:43:45 (0) - cancelled
16/11/2017 10:43:45 (1) - cancelled
16/11/2017 10:43:45 (0) - Receiving
16/11/2017 10:43:45 (1) - Receiving
16/11/2017 10:43:47 (Main) - Sending attempt 2 1
16/11/2017 10:43:47 (0) - Received 1 / OK / 884
16/11/2017 10:43:47 (0) - deleting 16/11/2017 10:43:47 attempt 2 / 1
16/11/2017 10:43:47 (0) - deleted 16/11/2017 10:43:47 attempt 2 / 1
16/11/2017 10:43:47 (0) - Receiving
16/11/2017 10:43:48 (Main) - Sent attempt 2 1
16/11/2017 10:43:55 (0) - cancelled
16/11/2017 10:43:55 (1) - cancelled
16/11/2017 10:43:55 (0) - Receiving
16/11/2017 10:43:55 (1) - Receiving
16/11/2017 10:43:57 (Main) - Sending attempt 3 1
16/11/2017 10:43:57 (1) - Received 1 / OK / 884
16/11/2017 10:43:57 (1) - deleting 16/11/2017 10:43:57 attempt 3 / 1
16/11/2017 10:43:58 (1) - deleted 16/11/2017 10:43:57 attempt 3 / 1
16/11/2017 10:43:58 (1) - Receiving
16/11/2017 10:43:58 (Main) - Sent attempt 3 1
16/11/2017 10:44:05 (1) - cancelled
16/11/2017 10:44:05 (0) - cancelled
16/11/2017 10:44:05 (0) - Receiving
16/11/2017 10:44:05 (1) - Receiving
16/11/2017 10:44:07 (Main) - Sending attempt 4 1
16/11/2017 10:44:08 (1) - Received 1 / OK / 884
16/11/2017 10:44:08 (1) - deleting 16/11/2017 10:44:07 attempt 4 / 1
16/11/2017 10:44:08 (1) - deleted 16/11/2017 10:44:07 attempt 4 / 1
16/11/2017 10:44:08 (1) - Receiving
16/11/2017 10:44:09 (Main) - Sent attempt 4 1
16/11/2017 10:44:15 (1) - cancelled
16/11/2017 10:44:15 (0) - cancelled

Result: all messages consumed

image

Current Behavior

One or multiple messages are in flight during the execution time but never picked up the the SDK and propagated to the user code, more information see Steps to reproduce.

Possible Solution

Steps to Reproduce (for bugs)

Execute https://github.com/danielmarbach/SqsLinuxRepro under .NET Core on Linux with a concurrency higher than 2 and observe a similar pattern to:

11/16/17 10:37:06 AM (0) - Receiving
11/16/17 10:37:06 AM (1) - Receiving
11/16/17 10:37:08 AM (Main) - Sending attempt 1 1
11/16/17 10:37:08 AM (0) - Received 1 / OK / 885
11/16/17 10:37:08 AM (0) - deleting 11/16/17 10:37:08 AM attempt 1 / 1
11/16/17 10:37:08 AM (0) - deleted 11/16/17 10:37:08 AM attempt 1 / 1
11/16/17 10:37:08 AM (0) - Receiving
11/16/17 10:37:09 AM (Main) - Sent attempt 1 1
11/16/17 10:37:16 AM (1) - cancelled
11/16/17 10:37:16 AM (0) - cancelled
11/16/17 10:37:16 AM (0) - Receiving
11/16/17 10:37:16 AM (1) - Receiving
11/16/17 10:37:18 AM (Main) - Sending attempt 2 1
11/16/17 10:37:19 AM (Main) - Sent attempt 2 1
11/16/17 10:37:26 AM (0) - cancelled
11/16/17 10:37:26 AM (1) - cancelled
11/16/17 10:37:26 AM (0) - Receiving
11/16/17 10:37:26 AM (1) - Receiving
11/16/17 10:37:28 AM (Main) - Sending attempt 3 1
11/16/17 10:37:28 AM (1) - Received 1 / OK / 885
11/16/17 10:37:28 AM (1) - deleting 11/16/17 10:37:28 AM attempt 3 / 1
11/16/17 10:37:28 AM (1) - deleted 11/16/17 10:37:28 AM attempt 3 / 1
11/16/17 10:37:28 AM (1) - Receiving
11/16/17 10:37:29 AM (Main) - Sent attempt 3 1
11/16/17 10:37:36 AM (0) - cancelled
11/16/17 10:37:36 AM (1) - cancelled
11/16/17 10:37:36 AM (0) - Receiving
11/16/17 10:37:36 AM (1) - Receiving
11/16/17 10:37:38 AM (Main) - Sending attempt 4 1
11/16/17 10:37:38 AM (1) - Received 1 / OK / 885
11/16/17 10:37:38 AM (1) - deleting 11/16/17 10:37:38 AM attempt 4 / 1
11/16/17 10:37:38 AM (1) - deleted 11/16/17 10:37:38 AM attempt 4 / 1
11/16/17 10:37:38 AM (1) - Receiving
11/16/17 10:37:39 AM (Main) - Sent attempt 4 1
11/16/17 10:37:46 AM (0) - cancelled
11/16/17 10:37:46 AM (1) - cancelled

As we can see sent attempt 2 was never received although the receive time is high enough in the repro to actually pick it up.

image

image

but the queue browser shows multiple receive attempts.

Context

At Particular Software we are preparing a Release candidate of a .NET Core SQS transport for NServiceBus and we detected this on our CI environment. Multiple of our tests failed under .NET Core under Linux only. This is currently a blocker for us.

Your Environment

.NET Core Info

.NET Command Line Tools (2.0.0)

Product Information:
 Version:            2.0.0
 Commit SHA-1 hash:  cdcd1928c9

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.15063
 OS Platform: Windows
 RID:         win10-x64
 Base Path:   C:\Program Files\dotnet\sdk\2.0.0\

Microsoft .NET Core Shared Framework Host

  Version  : 2.0.3
  Build    : a9190d4a75f4a982ae4b4fa8d1a24526566c69df
.NET Command Line Tools (2.0.3)

Product Information:
 Version:            2.0.3
 Commit SHA-1 hash:  eb1d5ee318

Runtime Environment:
 OS Name:     ubuntu
 OS Version:  16.04
 OS Platform: Linux
 RID:         ubuntu.16.04-x64
 Base Path:   /usr/share/dotnet/sdk/2.0.3/

Microsoft .NET Core Shared Framework Host

  Version  : 2.0.3
  Build    : a9190d4a75f4a982ae4b4fa8d1a24526566c69df
<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>netcoreapp2.0</TargetFramework>
    <CopyLocalLockFileAssemblies>true</CopyLocalLockFileAssemblies>
    <LangVersion>7.1</LangVersion>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="AWSSDK.SQS" Version="3.3.3.2" />
  </ItemGroup>

</Project>
danielmarbach commented 6 years ago

Original issue https://github.com/Particular/NServiceBus.AmazonSQS/issues/168

danielmarbach commented 6 years ago

Btw. the repro sample uses the HTTP URI so that wireshark can be attached. I have a full wireshark capture of a test but since it contains my unencrypted credentials I won't attach it here and it can be easily recaptured with the repro if required.

gokarnm commented 6 years ago

@danielmarbach , thanks for the detailed report and repro. We'll investigate this and get back to you.

danielmarbach commented 6 years ago

@gokarnm let me know if you need help or additional information

danielmarbach commented 6 years ago

Some more information.

I created two more branches. One is https://github.com/danielmarbach/SqsLinuxRepro/tree/client-per-receive which creates a client per concurrent receive loop. The issue still occurs

Another one is https://github.com/danielmarbach/SqsLinuxRepro/tree/multiple-starts-bash where I start the same program with just one receive loop multiple times with a bash script. The issue still occurs.

danielmarbach commented 6 years ago

I was able to narrow it down

if you enable the following in the repro

AWSConfigs.LoggingConfig.LogTo = LoggingOptions.Console;

you can observe

AmazonSQSClient 45|2017-12-19T16:54:41.797Z|ERROR|TaskCanceledException making request ReceiveMessageRequest to http://sqs.eu-central-1.amazonaws.com/. Attempt 1. --> System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at System.Runtime.CompilerServices.ConfiguredTaskAwaitable`1.ConfiguredTaskAwaiter.GetResult()
   at System.Net.Http.HttpClient.<FinishSendAsyncUnbuffered>d__59.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Amazon.Runtime.HttpWebRequestMessage.<GetResponseAsync>d__20.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Amazon.Runtime.Internal.HttpHandler`1.<InvokeAsync>d__9`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Amazon.Runtime.Internal.Unmarshaller.<InvokeAsync>d__3`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Amazon.SQS.Internal.ValidationResponseHandler.<InvokeAsync>d__1`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Amazon.Runtime.Internal.ErrorHandler.<InvokeAsync>d__5`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Amazon.Runtime.Internal.CallbackHandler.<InvokeAsync>d__9`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Amazon.Runtime.Internal.CredentialsRetriever.<InvokeAsync>d__7`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
   at Amazon.Runtime.Internal.RetryHandler.<InvokeAsync>d__10`1.MoveNext()

which seems to be related to

https://github.com/dotnet/corefx/issues/25800

In the cases the exception happens we are never able to receive anything on the client side.

I verified again with wireshark and the service always correctly returns the ReceiveMessageResult

image

danielmarbach commented 6 years ago

Any updates on this? It is a major .NET Core blocker for us. I can try to update latest .NET Core and report back here but I would love to see some interaction happening on this issue

sstevenkang commented 6 years ago

Sorry for the delayed response. We weren't able to reproduce this on Mac or Windows. We are pretty confident that this is a .NET Core framework issue--probably with the linux httpclient it uses.

We've discussed a few workaround that we can provide in the SDK until a new HttpClient is released in .NET Core 2.1. Once we have a workaround solution that is reasonable, we'll ping back on this thread.

danielmarbach commented 6 years ago

@sstevenkang I'm not confident anymore that it is only related to .NET Core. I was able to repro the same problem on Windows consistently on multiple machines with FIFO queues. I updated the repro solution https://github.com/danielmarbach/SqsLinuxRepro/tree/fifo-queues // @seanfeldman

SeanFeldman commented 6 years ago

Running code from fifo-queues branch on Windows

2018-03-05 22:45:35 (0) - Receiving
2018-03-05 22:45:35 (1) - Receiving
2018-03-05 22:45:35 (0) - Received 2 / OK / 1290
2018-03-05 22:45:35 (0) - deleting 2018-03-05 22:23:56 attempt 4 / 1
2018-03-05 22:45:35 (0) - deleting 2018-03-05 22:23:45 attempt 3 / 1
2018-03-05 22:45:35 (0) - deleted 2018-03-05 22:23:56 attempt 4 / 1
2018-03-05 22:45:36 (0) - deleted 2018-03-05 22:23:45 attempt 3 / 1
2018-03-05 22:45:36 (0) - Receiving
2018-03-05 22:45:37 (Main) - Sending attempt 1 1
2018-03-05 22:45:37 (1) - Received 1 / OK / 776
2018-03-05 22:45:37 (1) - deleting 2018-03-05 22:45:37 attempt 1 / 1
2018-03-05 22:45:37 (Main) - Sent attempt 1 1
2018-03-05 22:45:38 (1) - deleted 2018-03-05 22:45:37 attempt 1 / 1
2018-03-05 22:45:38 (1) - Receiving
2018-03-05 22:45:46 (0) - cancelled
2018-03-05 22:45:46 (1) - cancelled
2018-03-05 22:45:47 (0) - Receiving
2018-03-05 22:45:47 (1) - Receiving
2018-03-05 22:45:49 (Main) - Sending attempt 2 1
2018-03-05 22:45:49 (Main) - Sent attempt 2 1
2018-03-05 22:45:58 (0) - cancelled
2018-03-05 22:45:58 (1) - cancelled
2018-03-05 22:45:59 (0) - Receiving
2018-03-05 22:45:59 (1) - Receiving
2018-03-05 22:46:01 (Main) - Sending attempt 3 1
2018-03-05 22:46:01 (Main) - Sent attempt 3 1
2018-03-05 22:46:10 (1) - cancelled
2018-03-05 22:46:10 (0) - cancelled
2018-03-05 22:46:10 (0) - Receiving
2018-03-05 22:46:10 (1) - Receiving
2018-03-05 22:46:12 (Main) - Sending attempt 4 1
2018-03-05 22:46:13 (Main) - Sent attempt 4 1
2018-03-05 22:46:19 (0) - Received 1 / OK / 776
2018-03-05 22:46:19 (0) - deleting 2018-03-05 22:45:49 attempt 2 / 1
2018-03-05 22:46:20 (0) - deleted 2018-03-05 22:45:49 attempt 2 / 1
2018-03-05 22:46:20 (0) - Receiving
2018-03-05 22:46:22 (1) - cancelled
2018-03-05 22:46:22 (0) - cancelled
danielmarbach commented 6 years ago

The outcome is that the messages are not picked up, see

image

3/5/2018 11:51:34 PM (0) - Receiving
3/5/2018 11:51:34 PM (1) - Receiving
3/5/2018 11:51:34 PM (0) - Received 3 / OK / 1747
3/5/2018 11:51:34 PM (0) - deleting 3/5/2018 11:43:58 PM attempt 4 / 1
3/5/2018 11:51:34 PM (0) - deleting 3/5/2018 11:42:41 PM attempt 3 / 1
3/5/2018 11:51:34 PM (0) - deleting 3/5/2018 11:43:46 PM attempt 3 / 1
3/5/2018 11:51:34 PM (0) - deleted 3/5/2018 11:43:58 PM attempt 4 / 1
3/5/2018 11:51:35 PM (0) - deleted 3/5/2018 11:42:41 PM attempt 3 / 1
3/5/2018 11:51:35 PM (0) - deleted 3/5/2018 11:43:46 PM attempt 3 / 1
3/5/2018 11:51:35 PM (0) - Receiving
3/5/2018 11:51:36 PM (Main) - Sending attempt 1 1
3/5/2018 11:51:36 PM (0) - Received 1 / OK / 757
3/5/2018 11:51:36 PM (0) - deleting 3/5/2018 11:51:36 PM attempt 1 / 1
3/5/2018 11:51:36 PM (Main) - Sent attempt 1 1
3/5/2018 11:51:36 PM (0) - deleted 3/5/2018 11:51:36 PM attempt 1 / 1
3/5/2018 11:51:36 PM (0) - Receiving
3/5/2018 11:51:44 PM (1) - cancelled
3/5/2018 11:51:44 PM (0) - cancelled
3/5/2018 11:51:46 PM (0) - Receiving
3/5/2018 11:51:46 PM (1) - Receiving
3/5/2018 11:51:48 PM (Main) - Sending attempt 2 1
3/5/2018 11:51:49 PM (Main) - Sent attempt 2 1
3/5/2018 11:51:56 PM (1) - cancelled
3/5/2018 11:51:56 PM (0) - cancelled
3/5/2018 11:51:59 PM (0) - Receiving
3/5/2018 11:51:59 PM (1) - Receiving
3/5/2018 11:52:01 PM (Main) - Sending attempt 3 1
3/5/2018 11:52:02 PM (Main) - Sent attempt 3 1
3/5/2018 11:52:09 PM (1) - cancelled
3/5/2018 11:52:09 PM (0) - cancelled
3/5/2018 11:52:10 PM (0) - Receiving
3/5/2018 11:52:10 PM (1) - Receiving
3/5/2018 11:52:12 PM (Main) - Sending attempt 4 1
3/5/2018 11:52:12 PM (Main) - Sent attempt 4 1
3/5/2018 11:52:20 PM (1) - cancelled
3/5/2018 11:52:20 PM (0) - cancelled
bording commented 6 years ago

It also doesn't appear to be limited to .NET Core. If the target framework of the repro project is changed to net452, the same problem is demonstrated.

danielmarbach commented 6 years ago

The effect is that the message stays in flight for 30 seconds which means they are actually processed but never handed over to the upper layers of the code.

image

danielmarbach commented 6 years ago

Reproed it again on freshly setup cloud Linux machine with Ubuntu 17.10 and dotnet core version 2.1.4. And like @bording said it is also possible to repro this on full framework.

Guys seriously, this is a major blocker for us. Can we get some turn around here?

danielmarbach commented 6 years ago

For the record NET452 target

image

image

SeanFeldman commented 6 years ago

@sstevenkang any updates?

vellozzi commented 6 years ago

We don't have a definitive solution yet but here's an update on our running theory:

When a message is received but not deleted you have to wait for the message visibility timeout to elapse before receiving it again. That's the well known pattern.

That's what's happening here - but with a twist. It looks like there's a race condition when the request gets canceled. When the CancellationSource gets triggered control returns to the calling code but the socket isn't closed yet.

So even though a message is sent after the previous ReceiveMessage request has been canceled the socket from the previous receive is still open. SQS replies over the old socket with the new message, considers it received, and subjects the message to the visibility timeout.

We're still looking into how the CancellationToken works within the HTTP client, and possible solutions to that.

One possible workaround is to call CancellationSource.CancelAfter with a value greater than the visibility timeout (which is 30 seconds by default). Another (less desirable) is to put a Thread.Sleep after any OperationCanceledExceptions to allow the socket time to be closed.

@danielmarbach Will either of those work for you? Can you tell us more about your use case and how this is causing issues?

danielmarbach commented 6 years ago

@vellozzi thanks for coming back to us and investigating more.

We'll investigate if we can incorporate your proposal with the CancellationTokenSource. What I'm not yet understanding is why that would change anything. The token source we currently pass to the SDK is only canceled from our perspective when we shut down the application and that is not what we are doing when we see that happening in our environments. The Thread.Sleep option is not something I would like to add to our production code TBH.

Can you tell us more about your use case and how this is causing issues?

We detected this in our acceptance testing environment. We are developing an SQS adapter for NServiceBus. We recently added support for .NET Core. Based on that we run a set of acceptance tests on .NET Core on Windows, Linux and also the full framework on Windows. The default test timeout was 20 seconds and we suddenly started seing those tests failing. When we investigated we found out that in certain circumstances the messages that we sent remain not visible for the visibility timeout and thus the tests fail. We got worried because we figured that could also happen to our customers. We got also worried because we figured it might also indicate a potential message loss (which we haven't confirmed yet). For now, we decided to lift the restriction to not run on Linux because we think in the worst case messages will get delayed by 30 seconds (which is something we'll document on our docs site). But of course we'd love to see the SDK just work ;)

vellozzi commented 6 years ago

Are you saying that the CancellationSource is only cancelled on shutdown and your problem isn't occurring at shutdown? If that's the case then I may not be seeing the same problem you are. The issue I'm seeing only happens when the CancellationSource is cancelled.

Can you change the sample code to take the CancellationSource out of the equation and commit that to your GitHub repo? I would do it but I want to make sure the sample accurately represents your production code.

vellozzi commented 6 years ago

I agree that there is no message loss and that the worst case is that you have to wait for the message visibility period to expire.

danielmarbach commented 6 years ago

I extended the program a bit to include the received timespan by doing simple math between ApproximateFirstReceiveTimestamp and SentTimestamp. A run with the existing settings showed again the delay

daniel@DESKTOP-SUIHDT7:/mnt/c/p/SqsLinuxRepro/SqsRepro$ dotnet run SqsRepro.csproj -c Release
3/27/18 7:09:27 AM (0) - Receiving
3/27/18 7:09:28 AM (1) - Receiving
3/27/18 7:09:30 AM (Main) - Sending attempt 1 1
3/27/18 7:09:30 AM (Main) - Sent attempt 1 1
3/27/18 7:09:30 AM (1) - Received 1 / OK / 1131
Elapsed '00:00:00.0030000' (1) - to receive message with body '3/27/18 7:09:30 AM attempt 1 / 1'
3/27/18 7:09:30 AM (1) - deleting 3/27/18 7:09:30 AM attempt 1 / 1
3/27/18 7:09:30 AM (1) - deleted 3/27/18 7:09:30 AM attempt 1 / 1
3/27/18 7:09:30 AM (1) - Receiving
3/27/18 7:09:37 AM (1) - cancelled
3/27/18 7:09:37 AM (0) - cancelled
3/27/18 7:09:38 AM (0) - Receiving
3/27/18 7:09:38 AM (1) - Receiving
3/27/18 7:09:40 AM (Main) - Sending attempt 2 1
3/27/18 7:09:40 AM (Main) - Sent attempt 2 1
3/27/18 7:09:48 AM (0) - cancelled
3/27/18 7:09:48 AM (1) - cancelled
3/27/18 7:09:48 AM (0) - Receiving
3/27/18 7:09:48 AM (1) - Receiving
3/27/18 7:09:50 AM (Main) - Sending attempt 3 1
3/27/18 7:09:50 AM (Main) - Sent attempt 3 1
3/27/18 7:09:50 AM (1) - Received 1 / OK / 1131
Elapsed '00:00:00.0010000' (1) - to receive message with body '3/27/18 7:09:50 AM attempt 3 / 1'
3/27/18 7:09:50 AM (1) - deleting 3/27/18 7:09:50 AM attempt 3 / 1
3/27/18 7:09:50 AM (1) - deleted 3/27/18 7:09:50 AM attempt 3 / 1
3/27/18 7:09:50 AM (1) - Receiving
3/27/18 7:09:58 AM (0) - cancelled
3/27/18 7:09:58 AM (1) - cancelled
3/27/18 7:09:58 AM (0) - Receiving
3/27/18 7:09:58 AM (1) - Receiving
3/27/18 7:10:00 AM (Main) - Sending attempt 4 1
3/27/18 7:10:00 AM (Main) - Sent attempt 4 1
3/27/18 7:10:08 AM (0) - cancelled
3/27/18 7:10:08 AM (1) - cancelled

daniel@DESKTOP-SUIHDT7:/mnt/c/p/SqsLinuxRepro/SqsRepro$ dotnet run SqsRepro.csproj -c Release
3/27/18 7:11:20 AM (0) - Receiving
3/27/18 7:11:21 AM (1) - Receiving
3/27/18 7:11:21 AM (1) - Received 1 / OK / 1131
3/27/18 7:11:21 AM (0) - Received 1 / OK / 1131
Elapsed '00:01:28.4026165' (1) - to receive message with body '3/27/18 7:10:00 AM attempt 4 / 1'
3/27/18 7:11:21 AM (1) - deleting 3/27/18 7:10:00 AM attempt 4 / 1
Elapsed '00:01:48.6386170' (0) - to receive message with body '3/27/18 7:09:40 AM attempt 2 / 1'
3/27/18 7:11:21 AM (0) - deleting 3/27/18 7:09:40 AM attempt 2 / 1
3/27/18 7:11:21 AM (0) - deleted 3/27/18 7:09:40 AM attempt 2 / 1
3/27/18 7:11:21 AM (0) - Receiving
3/27/18 7:11:21 AM (1) - deleted 3/27/18 7:10:00 AM attempt 4 / 1
3/27/18 7:11:21 AM (1) - Receiving
3/27/18 7:11:23 AM (Main) - Sending attempt 1 1
3/27/18 7:11:23 AM (Main) - Sent attempt 1 1
3/27/18 7:11:23 AM (1) - Received 1 / OK / 1131
Elapsed '00:00:00.0030000' (1) - to receive message with body '3/27/18 7:11:23 AM attempt 1 / 1'
3/27/18 7:11:23 AM (1) - deleting 3/27/18 7:11:23 AM attempt 1 / 1
3/27/18 7:11:23 AM (1) - deleted 3/27/18 7:11:23 AM attempt 1 / 1
3/27/18 7:11:23 AM (1) - Receiving
3/27/18 7:11:30 AM (0) - cancelled
3/27/18 7:11:30 AM (1) - cancelled
3/27/18 7:11:30 AM (0) - Receiving
3/27/18 7:11:30 AM (1) - Receiving
3/27/18 7:11:32 AM (Main) - Sending attempt 2 1
3/27/18 7:11:32 AM (Main) - Sent attempt 2 1
3/27/18 7:11:40 AM (1) - cancelled
3/27/18 7:11:40 AM (0) - cancelled
3/27/18 7:11:40 AM (0) - Receiving
3/27/18 7:11:40 AM (1) - Receiving
3/27/18 7:11:42 AM (Main) - Sending attempt 3 1
3/27/18 7:11:43 AM (Main) - Sent attempt 3 1
3/27/18 7:11:43 AM (0) - Received 1 / OK / 1131
Elapsed '00:00:00.0020000' (0) - to receive message with body '3/27/18 7:11:42 AM attempt 3 / 1'
3/27/18 7:11:43 AM (0) - deleting 3/27/18 7:11:42 AM attempt 3 / 1
3/27/18 7:11:43 AM (0) - deleted 3/27/18 7:11:42 AM attempt 3 / 1
3/27/18 7:11:43 AM (0) - Receiving
3/27/18 7:11:50 AM (0) - cancelled
3/27/18 7:11:50 AM (1) - cancelled
3/27/18 7:11:51 AM (0) - Receiving
3/27/18 7:11:51 AM (1) - Receiving
3/27/18 7:11:53 AM (Main) - Sending attempt 4 1
3/27/18 7:11:53 AM (Main) - Sent attempt 4 1
3/27/18 7:12:01 AM (0) - cancelled
3/27/18 7:12:01 AM (1) - cancelled

Then I increased the timeout of the cancellation token source to 40 seconds which is more than the visibility timeout and still I get

daniel@DESKTOP-SUIHDT7:/mnt/c/p/SqsLinuxRepro/SqsRepro$ dotnet run SqsRepro.csproj -c Release
3/27/18 7:15:50 AM (0) - Receiving
3/27/18 7:15:51 AM (1) - Receiving
3/27/18 7:15:53 AM (Main) - Sending attempt 1 1
3/27/18 7:15:53 AM (Main) - Sent attempt 1 1
3/27/18 7:15:53 AM (0) - Received 1 / OK / 1131
Elapsed '00:00:00.0030000' (0) - to receive message with body '3/27/18 7:15:53 AM attempt 1 / 1'
3/27/18 7:15:53 AM (0) - deleting 3/27/18 7:15:53 AM attempt 1 / 1
3/27/18 7:15:53 AM (0) - deleted 3/27/18 7:15:53 AM attempt 1 / 1
3/27/18 7:15:53 AM (0) - Receiving
3/27/18 7:16:11 AM (1) - Received 0 / OK / 240
3/27/18 7:16:11 AM (1) - Receiving
3/27/18 7:16:13 AM (0) - Received 0 / OK / 240
3/27/18 7:16:13 AM (0) - Receiving
3/27/18 7:16:30 AM (1) - cancelled
3/27/18 7:16:30 AM (0) - cancelled
3/27/18 7:16:30 AM (0) - Receiving
3/27/18 7:16:30 AM (1) - Receiving
3/27/18 7:16:32 AM (Main) - Sending attempt 2 1
3/27/18 7:16:32 AM (Main) - Sent attempt 2 1
3/27/18 7:16:32 AM (0) - Received 1 / OK / 1131
Elapsed '00:00:00.0040000' (0) - to receive message with body '3/27/18 7:16:32 AM attempt 2 / 1'
3/27/18 7:16:32 AM (0) - deleting 3/27/18 7:16:32 AM attempt 2 / 1
3/27/18 7:16:32 AM (0) - deleted 3/27/18 7:16:32 AM attempt 2 / 1
3/27/18 7:16:32 AM (0) - Receiving
3/27/18 7:16:50 AM (1) - Received 0 / OK / 240
3/27/18 7:16:50 AM (1) - Receiving
3/27/18 7:16:53 AM (0) - Received 0 / OK / 240
3/27/18 7:16:53 AM (0) - Receiving
3/27/18 7:17:10 AM (1) - cancelled
3/27/18 7:17:10 AM (0) - cancelled
3/27/18 7:17:10 AM (0) - Receiving
3/27/18 7:17:10 AM (1) - Receiving
3/27/18 7:17:12 AM (Main) - Sending attempt 3 1
3/27/18 7:17:12 AM (Main) - Sent attempt 3 1
3/27/18 7:17:30 AM (0) - Received 0 / OK / 240
3/27/18 7:17:30 AM (0) - Receiving
3/27/18 7:17:30 AM (1) - Received 0 / OK / 240
3/27/18 7:17:30 AM (1) - Receiving
3/27/18 7:17:43 AM (1) - Received 1 / OK / 1131
Elapsed '00:00:37.4222483' (1) - to receive message with body '3/27/18 7:17:12 AM attempt 3 / 1'
3/27/18 7:17:43 AM (1) - deleting 3/27/18 7:17:12 AM attempt 3 / 1
3/27/18 7:17:43 AM (1) - deleted 3/27/18 7:17:12 AM attempt 3 / 1
3/27/18 7:17:43 AM (1) - Receiving
3/27/18 7:17:50 AM (0) - cancelled
3/27/18 7:17:50 AM (1) - cancelled
3/27/18 7:17:51 AM (0) - Receiving
3/27/18 7:17:51 AM (1) - Receiving
3/27/18 7:17:53 AM (Main) - Sending attempt 4 1
3/27/18 7:17:53 AM (Main) - Sent attempt 4 1
3/27/18 7:17:53 AM (0) - Received 1 / OK / 1131
Elapsed '00:00:00.0010000' (0) - to receive message with body '3/27/18 7:17:53 AM attempt 4 / 1'
3/27/18 7:17:53 AM (0) - deleting 3/27/18 7:17:53 AM attempt 4 / 1
3/27/18 7:17:53 AM (0) - deleted 3/27/18 7:17:53 AM attempt 4 / 1
3/27/18 7:17:53 AM (0) - Receiving
3/27/18 7:18:11 AM (1) - Received 0 / OK / 240
3/27/18 7:18:11 AM (1) - Receiving
3/27/18 7:18:13 AM (0) - Received 0 / OK / 240
3/27/18 7:18:13 AM (0) - Receiving

Sending attempt 3 was never received. Was only received in the next run of the repro (output not shown here)

When I change

                    var receiveResult = await sqsClient.ReceiveMessageAsync(new ReceiveMessageRequest
                        {
                            MaxNumberOfMessages = 10,
                            QueueUrl = queueUrl,
                            WaitTimeSeconds = 20,
                            AttributeNames = new List<string>
                            {
                                "SentTimestamp",
                                "ApproximateFirstReceiveTimestamp",
                                "ApproximateReceiveCount"
                            },
                        },
                        token).ConfigureAwait(false);

to

                    var receiveResult = await sqsClient.ReceiveMessageAsync(new ReceiveMessageRequest
                        {
                            MaxNumberOfMessages = 10,
                            QueueUrl = queueUrl,
                            WaitTimeSeconds = 20,
                            AttributeNames = new List<string>
                            {
                                "SentTimestamp",
                                "ApproximateFirstReceiveTimestamp",
                                "ApproximateReceiveCount"
                            },
                        },
                        CancellationToken.None).ConfigureAwait(false);

the issue disappears. It looks like when passing in the cancellation token (no matter what the timeout is) under some circumstances things leak which causes messages to be delayed for the visibility timeout period.

One could say that this is an acceptable workaround for us for now. So in the worst case we will delay the shutdown of an endpoint by 20 seconds.

danielmarbach commented 6 years ago

I pushed an update to the repro sample

danielmarbach commented 6 years ago

I had a chat @normj and based on the repro steps we provided we are fairly certain what we are seeing here is what @vellozzi already explained in https://github.com/aws/aws-sdk-net/issues/796#issuecomment-375494537. Since this is not a message loss scenario for us and in the worst case messages will be delayed for the duration of the visibility timeout I'm closing this issue.

We will discuss internally whether we want to pass in the token or not pass in the token.

Thanks all for chiming in!

xhafan commented 3 years ago

I probably have the same issue happening with Rebus: https://stackoverflow.com/questions/67656986/a-message-received-from-sqs-by-a-stopped-aws-ecs-instance-is-handled-by-rebus-on The suggested workaround of not passing cancellationToken into sqsClient.ReceiveMessageAsync() but passing in CancellationToken.None works for me as well.

xhafan commented 3 years ago

@vellozzi Any luck with We're still looking into how the CancellationToken works within the HTTP client, and possible solutions to that.? I have the same issue with Rebus, which I fixed using the workaround you recommended - passing CancellationToken.None into sqsClient.ReceiveMessageAsync(). Cheers.

xhafan commented 3 years ago

I found this: https://stackoverflow.com/a/45562557/379279 HttpClient cancellation doesn't kill underlying TCP call: HttpClient re-uses connections as much as possible, hence the reason why the connection is not closed on cancel. When canceling a request like that, the HttpClient will just stop sending/receiving data to/from the other end. It will not send anything to inform the other end that it was cancelled. So I guess that's it - the solution is to stop using cancellations with HttpClient, and the workaround looks like the proper solution.