Closed MihaZupan closed 2 years ago
Tagging subscribers to this area: @dotnet/ncl See info in area-owners.md if you want to be subscribed.
Author: | MihaZupan |
---|---|
Assignees: | - |
Labels: | `area-System.Net.Http` |
Milestone: | 7.0.0 |
I hit an issue when calling DynamoDB, as shown below. It looks like the same issue with HTTP headers.
I'm using `AWSSDK.DynamoDBv2`.
@amin-lu-xero the root cause is definitely the same, but looking at the AWS SDK code, it doesn't look like the problem is on their side. Are you perhaps using NewRelic for monitoring your application?
Yes, we are using NewRelic for APM
They have fixed the issue in their instrumentation https://github.com/newrelic/newrelic-dotnet-agent/issues/803 - please try updating their agent to the 9.2.0 version that was released a few hours ago.
Thanks a lot !
I'm seeing this exact issue in a .NET 6 console app running via `dotnet run` on a TeamCity agent. It seems very random and triggers in about 1 in 10 sessions when I upload around 1000 tiny files to the blob container.
I'm using `Azure.Storage.Blobs` v12.10.0 (latest at the time of writing), and no other NuGet packages.
Source code (slightly modified), inspired by this official example: https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-scalable-app-upload-files?tabs=dotnet
```csharp
var blobContainerClient = new BlobContainerClient(new Uri(assetUploadToken.UploadDestination));

BlobUploadOptions uploadOptions = new BlobUploadOptions()
{
    TransferOptions = new StorageTransferOptions()
    {
        MaximumConcurrency = 8
    }
};

var tasks = new Queue<Task<Response<BlobContentInfo>>>();
foreach (string sourceFilePath in Directory.GetFiles(fileDirectoryPath, "*", searchOption)
    .Where(x => regexFileExtensions.Match(x).Success))
{
    var fileToUpload = new FileInfo(sourceFilePath);
    var blobTargetPath = $"my-files";
    var blobClient = blobContainerClient.GetBlobClient(blobTargetPath);
    var uploadTask = blobClient.UploadAsync(fileToUpload.FullName, uploadOptions);
    tasks.Enqueue(uploadTask);
}

// Await all the upload tasks; `uploadOptions` handles per-upload concurrency.
await Task.WhenAll(tasks);
Console.WriteLine($"Uploaded {tasks.Count} files.");
```
Stack trace:

```
System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at System.Net.Http.Headers.HttpHeaders.ReadStoreValues[T](Span`1 values, Object storeValue, HttpHeaderParser parser, Int32& currentIndex)
   -- SNIP (See below for full trace)
```
Should I also report this to the Azure.Storage.Blobs team? See: https://github.com/Azure/azure-sdk-for-net/issues/27018
@Strepto are you by chance using a NewRelic agent on the machine running this code?
No, we do not use NewRelic on the machine(s).
We do have TeamCity's Performance Monitor enabled though, in case performance analysis might have an impact: https://www.jetbrains.com/help/teamcity/performance-monitor.html
Edit: I have now disabled TeamCity's Performance Monitor and will monitor the jobs.
@Strepto I opened #65379 as a possible explanation for the behavior you observed. Would it be possible for you to test the scenario without using a proxy?
I cannot disable the proxy on the machines where the error occurs. But I can try running the scenario on my dev machine, which has no proxy.
Triage: We will keep it open for a few more months to catch other use cases like the one in #65379. Then close it.
I believe I am hitting this issue using RestSharp 107.3.0: https://github.com/restsharp/RestSharp/issues/1772
I'm hitting this issue with concurrent requests using the AWS SDK for SQS. Is there any workaround? The error occurs on machines where we use a proxy, but not on my dev machine, which has no proxy.
@a-jackson are you saying you are hitting #65379 specifically?
@MihaZupan I would guess that's the problem I'm having. I can't say specifically that's it exactly; I only have error logs to go on. This is only reproducible on our servers, where I'm unable to debug.
I've seen all the exceptions listed in the top post here, as well as one from AWS: "The request signature we calculated does not match the signature you provided." It appears the headers from one request are sent with the body of another, or something to that effect, although I can't confirm that.
Are you able to share stack traces when this happens?
This is the most recent; it's not exactly one of the ones above, but I think it's the same thing. I have seen the other errors as well, but we don't keep logs that long on our test servers.
There are two main known issues users are hitting here:

- user code or instrumentation accessing the headers concurrently (e.g. the NewRelic agent mentioned above)
- `CONNECT` tunnel bug in `HttpConnectionPool` (#65379) - you will see exceptions around request headers (even before the request is sent)

In your case, it looks like you are hitting the latter - #65379.
There are possible mitigations here if you have access to how the `HttpClient` is created, or to the `HttpRequestMessage` / its headers before it's sent. Is that the case with the AWS SDK?
I agree this looks like the latter case. It looks like it is possible to provide a custom HttpClient factory. What do I need to do?
Hey @a-jackson, sorry I didn't see your reply sooner. If you have access to the handler chain, you can include a custom handler that will prevent the race condition.
```csharp
var socketsHttpHandler = new SocketsHttpHandler();
var customHandler = new CustomHandler(socketsHttpHandler);
var httpClient = new HttpClient(customHandler);

public sealed class CustomHandler : DelegatingHandler
{
    public CustomHandler(HttpMessageHandler innerHandler) : base(innerHandler) { }

    protected override HttpResponseMessage Send(HttpRequestMessage request, CancellationToken cancellationToken)
    {
        _ = request.Headers.TryGetValues("User-Agent", out _); // Force parsing
        return base.Send(request, cancellationToken);
    }

    protected override Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
    {
        _ = request.Headers.TryGetValues("User-Agent", out _); // Force parsing
        return base.SendAsync(request, cancellationToken);
    }
}
```
@MihaZupan Thanks. I've just done some testing and that seems to work. I'm going to start rolling it out to our projects where needed.
Is there any idea when this will be fixed? I see the milestone of this issue is set for 7.0.0 but we were hoping we could stick to LTS releases. Is there any chance this will be fixed in 6.x? We have several projects still working on upgrades to 6 which we've put on hold due to this.
@a-jackson Is the above fix sufficient to unblock your upgrades to 7.0?
The issue will have to be addressed in `main` (currently planned for 7.0) first.
Given we've only heard of two users hitting this, and that a workaround exists, I think it's unlikely we would service an eventual fix for this at this time.
cc: @karelz regarding any potential servicing changes here (fix for #65379).
I can understand that if it's only two users, although we've had issues on every system running .NET 6.
I don't know yet if we'll continue with upgrades. We are in the process of phasing out using a proxy at all so we'll probably hold further upgrades until then.
I have another exception though that may or may not be related, just to give you some more info.
We were getting a TimeoutException on around 5% of requests to DynamoDB when using the proxy. The exception was thrown 100 seconds after the request. We tested without the proxy and didn't see a single error. These requests were not logged at the proxy at all, and there's nothing else we could explain it with. Other requests, even from the same process, were being logged through the proxy at the same time, which is why I thought it was probably a different manifestation of the same issue.
```
System.Threading.Tasks.TaskCanceledException: A task was canceled.
   at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync
   at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync
```
This stack indicates that the request was indeed never sent -- it timed out before it could obtain a working connection.
While the headers issue may cause connection attempts to fail more often (if the exception occurs inside `System.Net.Http.HttpConnectionPool.EstablishProxyTunnelAsync`, like in your previous stack trace), it should never cause requests to hang without trying to establish a connection / failing outright.
The presence of this stack trace indicates that either:

a) connection attempts were taking a long time - you can try lowering `ConnectTimeout` on `SocketsHttpHandler` to confirm (e.g. to 15 seconds)
b) you have `MaxConnectionsPerServer` set and connections weren't able to keep up with incoming requests, resulting in some requests timing out
c) something else entirely is going on

a) DynamoDB's client code appears to set a connect timeout of 3100 ms. If it were this, the exception would have been thrown after that long, wouldn't it? Not the 100 s we saw. Also, the proxy logs indicated continuous activity throughout this period, and we upgraded the instances it runs on to test that; it made no difference to the frequency of errors.

b) We're not setting `MaxConnectionsPerServer`, and as far as I can tell neither is the DynamoDB client. There's an option for it which is not set.

c) It looks like this to me, but I wouldn't know where to begin with proving that.
We have just moved to using VPC Endpoints in AWS, so DynamoDB traffic no longer goes through the proxy. We are not seeing this error now, so it's no longer a problem for us.
> If it were this the exception would have been thrown after that long wouldn't it? Not 100s as we saw

If `MaxConnectionsPerServer` isn't set and `ConnectTimeout` is set to ~3 seconds, you shouldn't ever hit the 100s timeout before getting a connection / failing.
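For reference, a minimal sketch of setting these knobs on `SocketsHttpHandler` (the values are illustrative only, not recommendations from this thread):

```csharp
using System;
using System.Net.Http;

// Illustrative configuration for testing hypotheses (a) and (b) above.
var handler = new SocketsHttpHandler
{
    // (a) Fail fast if a (proxy) connection cannot be established,
    // instead of waiting for the overall 100-second request timeout.
    ConnectTimeout = TimeSpan.FromSeconds(15),

    // (b) Example cap; if connections can't keep up with incoming
    // requests, queued requests eventually time out. Unset by default.
    MaxConnectionsPerServer = 20,
};
var client = new HttpClient(handler);
```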
It's hard to say what exactly is happening without detailed info about what the connection pool was doing. If you were still interested in investigating this further, would you be able to provide traces from an app hitting such failures (would potentially include PII)?
The `HttpHeaders` collections were never thread-safe. Accessing a header may force lazy parsing of its value, resulting in modifications to the underlying data structures.

Before .NET 6.0, reading from the collection concurrently happened to be thread-safe in most cases. If the collection contained an invalid header value, that value would be removed during enumeration. In rare cases, this could cause problems from modifying the underlying Dictionary concurrently.
Starting with 6.0, less locking is performed around header parsing, as it was no longer needed internally (#54130). Due to this change, multiple examples of users accessing the headers concurrently surfaced. The ones currently known to the team are gRPC's .NET client (#55898, #55896) and NewRelic's HttpClient instrumentation (https://github.com/newrelic/newrelic-dotnet-agent/issues/803). Violating thread safety in 6.0 may result in the header values being duplicated/malformed, or in various exceptions being thrown during enumeration/header accesses.
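To illustrate the kind of access pattern that violates this, here is a hypothetical repro sketch (not code from any of the libraries mentioned):

```csharp
using System.Net.Http;
using System.Threading.Tasks;

var request = new HttpRequestMessage();
// Added without validation; the value is lazily parsed on first read.
request.Headers.TryAddWithoutValidation("Accept", "text/html, application/json");

// Two threads enumerating the same headers can both trigger lazy parsing,
// mutating shared state concurrently. On .NET 6 this can intermittently
// throw or produce duplicated/malformed values.
await Task.WhenAll(
    Task.Run(() => { foreach (var header in request.Headers) { } }),
    Task.Run(() => { foreach (var header in request.Headers) { } }));
```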
I am posting a few known call stacks below to help other users hitting this issue discover the root cause:
**IndexOutOfRangeException**

```
System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at System.Net.Http.Headers.HttpHeaders.ReadStoreValues[T](Span`1 values, Object storeValue, HttpHeaderParser parser, Int32& currentIndex)
   at System.Net.Http.Headers.HttpHeaders.GetStoreValuesAsStringOrStringArray(HeaderDescriptor descriptor, Object sourceValues, String& singleValue, String[]& multiValue)
   at System.Net.Http.Headers.HttpHeaders.GetStoreValuesAsStringArray(HeaderDescriptor descriptor, HeaderStoreItemInfo info)
   at System.Net.Http.Headers.HttpHeaders.GetEnumeratorCore()+MoveNext()
```

**NullReferenceException**

```
System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Net.Http.Headers.HttpHeaders.ReadStoreValues[T](Span`1 values, Object storeValue, HttpHeaderParser parser, Int32& currentIndex)
   at System.Net.Http.Headers.HttpHeaders.GetStoreValuesAsStringOrStringArray(HeaderDescriptor descriptor, Object sourceValues, String& singleValue, String[]& multiValue)
   at System.Net.Http.Headers.HttpHeaders.GetStoreValuesAsStringArray(HeaderDescriptor descriptor, HeaderStoreItemInfo info)
   at System.Net.Http.Headers.HttpHeaders.GetEnumeratorCore()+MoveNext()
```

**InvalidCastException**

```
InvalidCastException: Unable to cast object of type 'System.Collections.Generic.List`1[System.Object]' to type 'System.Net.Http.Headers.MediaTypeHeaderValue'.
   at System.Net.Http.Headers.HttpContentHeaders.get_ContentType()
```

There is precedent in concurrent reads being thread-safe on some collections (namely `Dictionary`) and users may assume `HttpHeaders` behaves the same.

A related issue here is that forcing the validation of an invalid value will result in its removal from the collection. These Schrödinger's headers are difficult to reason about as their existence depends on whether/how/when they are observed.
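A sketch of the "Schrödinger's header" behavior described above, assuming an invalid value is added via `TryAddWithoutValidation`:

```csharp
using System.Net.Http;

var request = new HttpRequestMessage();
// "Date" is a strongly typed header; this value cannot be parsed.
request.Headers.TryAddWithoutValidation("Date", "not a date");

// Accessing the typed property forces parsing/validation of the raw
// value (the parse fails, so this is expected to be null)...
_ = request.Headers.Date;

// ...and the failed validation removes the value from the collection,
// so whether the header still "exists" here depends on whether and how
// it was observed earlier.
bool present = request.Headers.TryGetValues("Date", out _);
```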
We can look to make the behavior more intuitive. We should investigate whether we can (efficiently) implement the collection in a way that satisfies the following:

- … the `NonValidated` view of the collection
- Should enumerating the collection (with or without validation) while a concurrent write operation is in progress have defined behavior?
- … `Dictionary`'s thread-safety model