dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

HttpHeaders thread safety and behavior of invalid values #61798

Closed · MihaZupan closed this issue 2 years ago

MihaZupan commented 2 years ago

The HttpHeaders collections were never thread-safe. Accessing a header may force lazy parsing of its value, resulting in modifications to the underlying data structures.

Before .NET 6.0, reading from the collection concurrently happened to be thread-safe in most cases. If the collection contained an invalid header value, that value would be removed during enumeration. In rare cases, this could cause problems due to the underlying Dictionary being modified concurrently.

Starting with 6.0, less locking is performed around header parsing as it was no longer needed internally (#54130). Due to this change, multiple examples of users accessing the headers concurrently surfaced. Currently known ones to the team are gRPC's .NET client (#55898, #55896) and NewRelic's HttpClient instrumentation (https://github.com/newrelic/newrelic-dotnet-agent/issues/803). Violating thread safety in 6.0 may result in the header values being duplicated/malformed or various exceptions being thrown during enumeration/header accesses.
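
To illustrate, here is a minimal sketch of the kind of concurrent access that can trigger these failures; the header name and the background logging task are purely illustrative:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

var client = new HttpClient();
var request = new HttpRequestMessage(HttpMethod.Get, "https://example.org/");
request.Headers.TryAddWithoutValidation("X-Custom", "some value");

// Unsafe: another thread (e.g. logging/instrumentation) enumerates the headers
// while the request is in flight. Enumeration can trigger lazy parsing, which
// mutates the underlying store and races with the handler reading the headers.
var logTask = Task.Run(() =>
{
    foreach (var header in request.Headers)
    {
        Console.WriteLine($"{header.Key}: {string.Join(", ", header.Value)}");
    }
});

var sendTask = client.SendAsync(request);
await Task.WhenAll(logTask, sendTask);
```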

I am posting a few known call stacks below to help other users hitting this issue discover the root cause:

IndexOutOfRangeException ``` System.IndexOutOfRangeException: Index was outside the bounds of the array. at System.Net.Http.Headers.HttpHeaders.ReadStoreValues[T](Span`1 values, Object storeValue, HttpHeaderParser parser, Int32& currentIndex) at System.Net.Http.Headers.HttpHeaders.GetStoreValuesAsStringOrStringArray(HeaderDescriptor descriptor, Object sourceValues, String& singleValue, String[]& multiValue) at System.Net.Http.Headers.HttpHeaders.GetStoreValuesAsStringArray(HeaderDescriptor descriptor, HeaderStoreItemInfo info) at System.Net.Http.Headers.HttpHeaders.GetEnumeratorCore()+MoveNext() ```
NullReferenceException ``` System.NullReferenceException: Object reference not set to an instance of an object. at System.Net.Http.Headers.HttpHeaders.ReadStoreValues[T](Span`1 values, Object storeValue, HttpHeaderParser parser, Int32& currentIndex) at System.Net.Http.Headers.HttpHeaders.GetStoreValuesAsStringOrStringArray(HeaderDescriptor descriptor, Object sourceValues, String& singleValue, String[]& multiValue) at System.Net.Http.Headers.HttpHeaders.GetStoreValuesAsStringArray(HeaderDescriptor descriptor, HeaderStoreItemInfo info) at System.Net.Http.Headers.HttpHeaders.GetEnumeratorCore()+MoveNext() ```

InvalidCastException ``` InvalidCastException: Unable to cast object of type 'System.Collections.Generic.List`1[System.Object]' to type 'System.Net.Http.Headers.MediaTypeHeaderValue'. at System.Net.Http.Headers.HttpContentHeaders.get_ContentType() ```

There is precedent for concurrent reads being thread-safe on some collections (namely Dictionary), and users may assume HttpHeaders behaves the same.

A related issue here is that forcing the validation of an invalid value will result in its removal from the collection. These Schrödinger's headers are difficult to reason about as their existence depends on whether/how/when they are observed.
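
As a rough illustration of that behavior (a sketch only; the exact results may vary between versions):

```csharp
using System;
using System.Net.Http;

var request = new HttpRequestMessage(HttpMethod.Get, "https://example.org/");

// Bypass validation so an invalid value lands in the collection.
request.Headers.TryAddWithoutValidation("Date", "not a valid date");

// The raw value is still visible through the non-validated view (.NET 6+).
Console.WriteLine(request.Headers.NonValidated.Contains("Date")); // True

// Accessing the strongly typed property forces parsing; the invalid value fails
// validation and, as described above, is removed from the collection.
Console.WriteLine(request.Headers.Date.HasValue);    // False
Console.WriteLine(request.Headers.Contains("Date")); // False after parsing
```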

We can look to make the behavior more intuitive. We should investigate whether we can (efficiently) implement the collection in a way that satisfies the following:

Should enumerating the collection (with or without validation) while a concurrent write operation is in progress have defined behavior?

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/ncl. See info in area-owners.md if you want to be subscribed.

Issue Details

Author: MihaZupan
Assignees: -
Labels: `area-System.Net.Http`
Milestone: 7.0.0

amin-lu-xero commented 2 years ago

I hit an issue when calling DynamoDB, as shown below. It looks like the same issue with HTTP headers.

[Screenshot: Screen Shot 2021-11-19 at 11.29.17 AM]

I'm using AWSSDK.DynamoDBv2

MihaZupan commented 2 years ago

@amin-lu-xero the root cause is definitely the same, but looking at the AWS SDK code, it doesn't look like the problem is on their side. Are you perhaps using NewRelic for monitoring your application?

amin-lu-xero commented 2 years ago

Yes, we are using NewRelic for APM

MihaZupan commented 2 years ago

They have fixed the issue in their instrumentation (https://github.com/newrelic/newrelic-dotnet-agent/issues/803); please try updating their agent to version 9.2.0, which was released a few hours ago.

amin-lu-xero commented 2 years ago

Thanks a lot!

Strepto commented 2 years ago

I'm seeing this exact issue in a .NET 6 console app run via `dotnet run` on a TeamCity agent. It seems very random and triggers in about 1 in 10 sessions when I upload around 1000 tiny files to the blob container.

I'm using Azure.Storage.Blobs v12.10.0 (Latest at the time of writing), and no other NuGet packages.

Source code (slightly modified), inspired by this official example: https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-scalable-app-upload-files?tabs=dotnet

var blobContainerClient = new BlobContainerClient(new Uri(assetUploadToken.UploadDestination));
BlobUploadOptions uploadOptions = new BlobUploadOptions()
{
    TransferOptions = new StorageTransferOptions()
    {
        MaximumConcurrency = 8
    }
};
var tasks = new Queue<Task<Response<BlobContentInfo>>>();

foreach (string sourceFilePath in Directory.GetFiles(fileDirectoryPath, "*", searchOption)
                .Where(x => regexFileExtensions.Match(x).Success))
{
    var fileToUpload = new FileInfo(sourceFilePath);
    var blobTargetPath = $"my-files";
    var blobClient = blobContainerClient.GetBlobClient(blobTargetPath);
    var uploadTask = blobClient.UploadAsync(fileToUpload.FullName, uploadOptions);
    tasks.Enqueue(uploadTask);
}

// Run all the tasks asynchronously. The `uploadOptions` will handle concurrency.
await Task.WhenAll(tasks);

Console.WriteLine($"Uploaded {tasks.Count} files.");

Stack Trace:

System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at System.Net.Http.Headers.HttpHeaders.ReadStoreValues[T](Span`1 values, Object storeValue, HttpHeaderParser parser, Int32& currentIndex)
-- SNIP (See below for full trace)
Full stack trace ``` System.IndexOutOfRangeException: Index was outside the bounds of the array. at System.Net.Http.Headers.HttpHeaders.ReadStoreValues[T](Span`1 values, Object storeValue, HttpHeaderParser parser, Int32& currentIndex) at System.Net.Http.Headers.HttpHeaders.GetStoreValuesAsStringOrStringArray(HeaderDescriptor descriptor, Object sourceValues, String& singleValue, String[]& multiValue) at System.Net.Http.Headers.HttpHeaders.GetStoreValuesAsStringArray(HeaderDescriptor descriptor, HeaderStoreItemInfo info) at System.Net.Http.Headers.HttpHeaders.TryGetValues(HeaderDescriptor descriptor, IEnumerable`1& values) at System.Net.Http.HttpConnectionPool.EstablishProxyTunnelAsync(Boolean async, HttpRequestHeaders headers, CancellationToken cancellationToken) at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken) at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken) at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(HttpRequestMessage request) at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken) at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken) at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken) at System.Net.Http.HttpClient.g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken) at Azure.Core.Pipeline.HttpClientTransport.ProcessAsync(HttpMessage message, Boolean async) at Azure.Core.Pipeline.ResponseBodyPolicy.ProcessAsync(HttpMessage message, ReadOnlyMemory`1 pipeline, Boolean async) at Azure.Core.Pipeline.HttpPipelineSynchronousPolicy.g__ProcessAsyncInner|4_0(HttpMessage message, ReadOnlyMemory`1 pipeline) at Azure.Core.Pipeline.RedirectPolicy.ProcessAsync(HttpMessage message, ReadOnlyMemory`1 pipeline, Boolean async) at Azure.Core.Pipeline.RetryPolicy.ProcessAsync(HttpMessage message, ReadOnlyMemory`1 pipeline, Boolean async) at Azure.Core.Pipeline.RetryPolicy.ProcessAsync(HttpMessage message, ReadOnlyMemory`1 pipeline, Boolean async) at Azure.Storage.Blobs.BlockBlobRestClient.UploadAsync(Int64 contentLength, Stream body, Nullable`1 timeout, Byte[] transactionalContentMD5, String blobContentType, String blobContentEncoding, String blobContentLanguage, Byte[] blobContentMD5, String blobCacheControl, IDictionary`2 metadata, String leaseId, String blobContentDisposition, String encryptionKey, String encryptionKeySha256, Nullable`1 encryptionAlgorithm, String encryptionScope, Nullable`1 tier, Nullable`1 ifModifiedSince, Nullable`1 ifUnmodifiedSince, String ifMatch, String ifNoneMatch, String ifTags, String blobTagsString, Nullable`1 immutabilityPolicyExpiry, Nullable`1 immutabilityPolicyMode, Nullable`1 legalHold, CancellationToken cancellationToken) at Azure.Storage.Blobs.Specialized.BlockBlobClient.UploadInternal(Stream content, BlobHttpHeaders blobHttpHeaders, IDictionary`2 metadata, IDictionary`2 tags, BlobRequestConditions conditions, Nullable`1 accessTier, BlobImmutabilityPolicy immutabilityPolicy, Nullable`1 legalHold, IProgress`1 progressHandler, String 
operationName, Boolean async, CancellationToken cancellationToken) at Azure.Storage.Blobs.Specialized.BlockBlobClient.<>c__DisplayClass62_0.<b__0>d.MoveNext() --- End of stack trace from previous location --- at Azure.Storage.PartitionedUploader`2.UploadInternal(Stream content, TServiceSpecificArgs args, IProgress`1 progressHandler, Boolean async, CancellationToken cancellationToken) at Azure.Storage.Blobs.BlobClient.StagedUploadInternal(Stream content, BlobUploadOptions options, Boolean async, CancellationToken cancellationToken) at Azure.Storage.Blobs.BlobClient.UploadAsync(String path, BlobUploadOptions options, CancellationToken cancellationToken) at UploadArtifacts.UploadArtifacts.ModelDistributionApiFetcher.UploadFiles(AssetUploadToken assetUploadToken, String fileDirectoryPath, String fileExtensionRegex, Boolean searchAllSubDirectories) ```

Should I also report this to the Azure.Storage.Blobs team? See: https://github.com/Azure/azure-sdk-for-net/issues/27018

MihaZupan commented 2 years ago

@Strepto are you by chance using a NewRelic agent on the machine running this code?

Strepto commented 2 years ago

No, we do not use NewRelic on the machine(s).

We do have TeamCity's Performance Monitor enabled though, in case performance analysis might have an impact. https://www.jetbrains.com/help/teamcity/performance-monitor.html

Edit: I have now disabled TeamCity's Performance Monitor and will monitor the jobs.

MihaZupan commented 2 years ago

@Strepto I opened #65379 as a possible explanation for the behavior you observed. Would it be possible for you to test the scenario without using a proxy?

Strepto commented 2 years ago

> @Strepto I opened #65379 as a possible explanation for the behavior you observed. Would it be possible for you to test the scenario without using a proxy?

I cannot disable the proxy on the machines where the error occurs, but I can try running the scenario on my dev machine, which has no proxy.

karelz commented 2 years ago

Triage: We will keep it open for a few more months to catch other use cases like the one in #65379. Then close it.

toddca commented 2 years ago

I believe I am hitting this issue using RestSharp 107.3.0: https://github.com/restsharp/RestSharp/issues/1772

a-jackson commented 2 years ago

I'm hitting this issue with concurrent requests on the AWS SDK for SQS. Is there any workaround? We are using a proxy where this error is seen, and we are not seeing it on dev machines where there is no proxy.

MihaZupan commented 2 years ago

@a-jackson are you saying you are hitting #65379 specifically?

a-jackson commented 2 years ago

@MihaZupan I would guess that's the problem I'm having. I can't say specifically that it's exactly that; I only have error logs to go on. This is only reproducible on our servers, where I'm unable to debug.

I've seen all the exceptions listed in the top post here, as well as one from AWS: "The request signature we calculated does not match the signature you provided." It appears the headers from one request are being sent with the body of another, or something to that effect, although I can't confirm that.

MihaZupan commented 2 years ago

Are you able to share stack traces when this happens?

a-jackson commented 2 years ago

This is the most recent; it's not exactly one of the ones above, but I think it's the same thing. I have seen the other errors as well, but we don't keep logs that long on our test servers.

Stack trace ``` System.InvalidOperationException: Collection was modified; enumeration operation may not execute. at System.Collections.Generic.List`1.Enumerator.MoveNextRare() at System.Net.Http.Headers.HttpHeaders.ReadStoreValues[T](Span`1 values, Object storeValue, HttpHeaderParser parser, Int32& currentIndex) at System.Net.Http.Headers.HttpHeaders.GetStoreValuesAsStringOrStringArray(HeaderDescriptor descriptor, Object sourceValues, String& singleValue, String[]& multiValue) at System.Net.Http.Headers.HttpHeaders.GetStoreValuesAsStringArray(HeaderDescriptor descriptor, HeaderStoreItemInfo info) at System.Net.Http.HttpConnectionPool.EstablishProxyTunnelAsync(Boolean async, HttpRequestHeaders headers, CancellationToken cancellationToken) at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken) at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken) at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(HttpRequestMessage request) at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken) at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken) at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken) at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken) at System.Net.Http.HttpClient.g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken) at Amazon.Runtime.HttpWebRequestMessage.GetResponseAsync(CancellationToken cancellationToken) at Amazon.Runtime.Internal.HttpHandler`1.InvokeAsync[T](IExecutionContext executionContext) at Amazon.Runtime.Internal.Unmarshaller.InvokeAsync[T](IExecutionContext executionContext) at Amazon.SQS.Internal.ValidationResponseHandler.InvokeAsync[T](IExecutionContext executionContext) at Amazon.Runtime.Internal.ErrorHandler.InvokeAsync[T](IExecutionContext executionContext) at Amazon.Runtime.Internal.ErrorHandler.InvokeAsync[T](IExecutionContext executionContext) at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext) at Amazon.Runtime.Internal.EndpointDiscoveryHandler.InvokeAsync[T](IExecutionContext executionContext) at Amazon.Runtime.Internal.EndpointDiscoveryHandler.InvokeAsync[T](IExecutionContext executionContext) at Amazon.Runtime.Internal.CredentialsRetriever.InvokeAsync[T](IExecutionContext executionContext) at Amazon.Runtime.Internal.RetryHandler.InvokeAsync[T](IExecutionContext executionContext) at Amazon.Runtime.Internal.RetryHandler.InvokeAsync[T](IExecutionContext executionContext) at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext) at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext) at Amazon.Runtime.Internal.ErrorCallbackHandler.InvokeAsync[T](IExecutionContext executionContext) at Amazon.Runtime.Internal.MetricsHandler.InvokeAsync[T](IExecutionContext executionContext) ```
MihaZupan commented 2 years ago

There are two main known issues users are hitting here:

- User code (or instrumentation such as NewRelic's agent) accessing a request's headers concurrently while the request is in flight.
- SocketsHttpHandler itself reading the request headers while establishing a proxy tunnel, racing with other accesses (#65379).

In your case, it looks like you are hitting the latter - #65379.

There are possible mitigations here if you have access to how the HttpClient is created, or to the HttpRequestMessage and its headers before it is sent. Is that the case with the AWS SDK?

a-jackson commented 2 years ago

I agree this looks like the latter case. It looks like it is possible to provide a custom HttpClient factory. What do I need to do?

MihaZupan commented 2 years ago

Hey @a-jackson, sorry I didn't see your reply sooner. If you have access to the handler chain, you can include a custom handler that will prevent the race condition.

var socketsHttpHandler = new SocketsHttpHandler();
var customHandler = new CustomHandler(socketsHttpHandler);
var httpClient = new HttpClient(customHandler);

public sealed class CustomHandler : DelegatingHandler
{
    public CustomHandler(HttpMessageHandler innerHandler) : base(innerHandler) { }

    protected override HttpResponseMessage Send(HttpRequestMessage request, CancellationToken cancellationToken)
    {
        _ = request.Headers.TryGetValues("User-Agent", out _); // Force parsing

        return base.Send(request, cancellationToken);
    }

    protected override Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
    {
        _ = request.Headers.TryGetValues("User-Agent", out _); // Force parsing

        return base.SendAsync(request, cancellationToken);
    }
}
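
For the AWS SDK specifically, here is a rough sketch of wiring this handler in through a custom HTTP client factory. The `HttpClientFactory`, `IClientConfig`, and `AmazonSQSConfig.HttpClientFactory` names are assumptions about the AWS SDK's extension points; verify them against the SDK documentation.

```csharp
using System.Net.Http;
using Amazon.Runtime;
using Amazon.SQS;

// Hypothetical usage: tell the SDK to build its HttpClient through our factory.
var sqsConfig = new AmazonSQSConfig { HttpClientFactory = new WorkaroundHttpClientFactory() };
var sqsClient = new AmazonSQSClient(sqsConfig);

// Assumed SDK extension point: a factory the SDK calls to create its HttpClient,
// letting us wrap the default handler with the CustomHandler shown above.
public sealed class WorkaroundHttpClientFactory : HttpClientFactory
{
    public override HttpClient CreateHttpClient(IClientConfig clientConfig)
        => new HttpClient(new CustomHandler(new SocketsHttpHandler()));
}
```
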
a-jackson commented 2 years ago

@MihaZupan Thanks. I've just done some testing and that seems to work. I'm going to start rolling it out to our projects where needed.

Is there any idea when this will be fixed? I see the milestone of this issue is set for 7.0.0 but we were hoping we could stick to LTS releases. Is there any chance this will be fixed in 6.x? We have several projects still working on upgrades to 6 which we've put on hold due to this.

MihaZupan commented 2 years ago

@a-jackson Is the above fix sufficient to unblock your upgrades to 6.0?

The issue will have to be addressed in main (currently planned for 7.0) first.

Given we've only heard of two users hitting this, and that a workaround exists, I think it's unlikely we would service an eventual fix for this at this time.

cc: @karelz regarding any potential servicing changes here (fix for #65379).

a-jackson commented 2 years ago

I can understand that if it's only two users, although we've had issues on every system running .NET 6.

I don't know yet if we'll continue with the upgrades. We are in the process of phasing out the proxy altogether, so we'll probably hold further upgrades until then.

I have another exception though that may or may not be related, just to give you some more info.

We were getting TimeoutExceptions on around 5% of requests to Dynamo DB when using the proxy. The exception was thrown 100s after the request. We tested without the proxy and didn't see a single error. These requests were not logged at the proxy at all, and there's nothing else that could explain it. Other requests, even from the same process, were being logged through the proxy at the same time, which is why I thought it is probably a different manifestation of the same issue.

Stack trace ``` System.TimeoutException: A task was canceled. ---> System.Threading.Tasks.TaskCanceledException: A task was canceled. at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken) at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken) at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken) at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken) at System.Net.Http.HttpClient.g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken) --- End of inner exception stack trace --- at Amazon.Runtime.HttpWebRequestMessage.GetResponseAsync(CancellationToken cancellationToken) at Amazon.Runtime.Internal.HttpHandler`1.InvokeAsync[T](IExecutionContext executionContext) at Amazon.Runtime.Internal.Unmarshaller.InvokeAsync[T](IExecutionContext executionContext) at Amazon.Runtime.Internal.ErrorHandler.InvokeAsync[T](IExecutionContext executionContext) at Amazon.Runtime.Internal.ErrorHandler.InvokeAsync[T](IExecutionContext executionContext) at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext) at Amazon.Runtime.Internal.EndpointDiscoveryHandler.InvokeAsync[T](IExecutionContext executionContext) at Amazon.Runtime.Internal.EndpointDiscoveryHandler.InvokeAsync[T](IExecutionContext executionContext) at Amazon.Runtime.Internal.CredentialsRetriever.InvokeAsync[T](IExecutionContext executionContext) at Amazon.Runtime.Internal.RetryHandler.InvokeAsync[T](IExecutionContext executionContext) at Amazon.Runtime.Internal.RetryHandler.InvokeAsync[T](IExecutionContext executionContext) at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext) at Amazon.Runtime.Internal.CallbackHandler.InvokeAsync[T](IExecutionContext executionContext) at Amazon.Runtime.Internal.ErrorCallbackHandler.InvokeAsync[T](IExecutionContext executionContext) at Amazon.Runtime.Internal.MetricsHandler.InvokeAsync[T](IExecutionContext executionContext) ```
MihaZupan commented 2 years ago

> System.Threading.Tasks.TaskCanceledException: A task was canceled.
>    at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync
>    at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync

This stack indicates that the request was indeed never sent -- it timed out before it could obtain a working connection. While the headers issue may cause connection attempts to fail more often (if the exception occurs inside System.Net.Http.HttpConnectionPool.EstablishProxyTunnelAsync like in your previous stack trace), it should never cause requests to hang without trying to establish a connection / failing outright.

The presence of this stack trace indicates that either:

a) connection attempts through the proxy are taking a very long time (longer than the configured connect timeout would suggest),
b) MaxConnectionsPerServer is limiting the pool and requests are queuing while waiting for a free connection, or
c) there is a bug in the connection pool.

a-jackson commented 2 years ago

a) Dynamo DB's client code appears to set a Connect Timeout of 3100ms. If it were this, the exception would have been thrown after that long, wouldn't it? Not 100s as we saw. Also, the proxy logs indicated continuous activity throughout this period, and we upgraded the instances it runs on to test that - it made no difference to the frequency of errors.

b) We're not setting MaxConnectionsPerServer, and as far as I can tell neither is the Dynamo DB client. There's an option for it which is not set.

c) It looks like this to me, but I wouldn't know where to begin with proving that.

We have just moved to using VPC Endpoints in AWS, so Dynamo DB traffic no longer goes through the proxy and we are not seeing this error now, so it's not a problem for us any more.

MihaZupan commented 2 years ago

> If it were this, the exception would have been thrown after that long, wouldn't it? Not 100s as we saw

If MaxConnectionsPerServer isn't set and ConnectTimeout is set to ~3 seconds, you shouldn't ever hit the 100s timeout before getting a connection/failing.
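
For reference, a minimal sketch of those settings on `SocketsHttpHandler` (the values are illustrative):

```csharp
using System;
using System.Net.Http;

var handler = new SocketsHttpHandler
{
    // Fail slow connection attempts quickly instead of letting them
    // consume the full request timeout.
    ConnectTimeout = TimeSpan.FromSeconds(3),

    // int.MaxValue is the default; an explicit low value here would make
    // requests queue while waiting for a free connection.
    MaxConnectionsPerServer = int.MaxValue
};

// HttpClient.Timeout defaults to 100 seconds, matching the observed failures.
var client = new HttpClient(handler) { Timeout = TimeSpan.FromSeconds(100) };
```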

It's hard to say what exactly is happening without detailed information about what the connection pool was doing. If you are still interested in investigating this further, would you be able to provide traces from an app hitting such failures (they would potentially include PII)?