davidni opened 2 years ago
Tagging subscribers to this area: @dotnet/ncl. See info in area-owners.md if you want to be subscribed.

| Author: | davidni |
| --- | --- |
| Assignees: | - |
| Labels: | `area-System.Net.Http`, `untriaged` |
| Milestone: | - |
@CarnaViire could it be related to your recent investigations in the space?
> could it be related to your recent investigations in the space?
I don't think so. This issue is on Windows, and it seems to be for HTTP/2. (While mine were timeouts on Linux for HTTP/1.1)
@davidni what is the typical behavior of your app (in terms of HTTP requests)? Do you only use HTTP/2, or HTTP/1.1 as well? What timeouts do you have set, if any? Are any exceptions happening, or all requests succeed, but the queue keeps growing?
> what is the typical behavior of your app (in terms of HTTP requests)?
We use YARP to proxy requests to multiple (1k+) destinations.
> Do you only use HTTP/2, or HTTP/1.1 as well?
HTTP/2 only in this case. We do both, but looking at the dump, this connection pool was only handling HTTP/2 requests, and the queue that grew was the `_http2RequestQueue`.
> What timeouts do you have set, if any? Are any exceptions happening, or all requests succeed, but the queue keeps growing?

We do see exceptions, for example:
IslandGateway.Core.Service.ErrorHandling.ProxyException: YARP error YarpRequest
System.Net.Http.HttpRequestException: No connection could be made because the target machine actively refused it. ([REDACTED])
System.Net.Sockets.SocketException: No connection could be made because the target machine actively refused it.

IslandGateway.Core.Service.ErrorHandling.ProxyException: YARP error YarpRequest
System.Net.Http.HttpRequestException: The SSL connection could not be established, see inner exception.
System.IO.IOException: Received an unexpected EOF or 0 bytes from the transport stream.
Triage: We need to understand the root cause. Fix at minimum in 7.0, most likely also service 6.0.x.
Before 6.0, cancelling a request would result in it being removed from the request queue immediately. With the 6.0 connection pooling changes, the queue entry will be marked as canceled, but it will only be removed once an established connection starts processing the queue / a failed connection drops its initiating request.
From looking at the dump, the requests sitting in the queue are canceled and the connection attempt never completes. Since there is no connection, nothing cleans up canceled queue entries, resulting in the leak.
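For illustration, here is a minimal hypothetical sketch of that behavior (the type and member names are invented, not the actual HttpConnectionPool internals): cancellation only completes the caller's task, while the queue entry, and the request it roots, lingers until some connection processes the queue.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-ins, not the real System.Net.Http internals.
sealed class QueueEntry
{
    public object Request;  // stands in for the rooted HttpRequestMessage
    public TaskCompletionSource<bool> Waiter = new();
}

sealed class RequestQueueSketch
{
    private readonly Queue<QueueEntry> _queue = new();

    public Task EnqueueAsync(object request, CancellationToken ct)
    {
        var entry = new QueueEntry { Request = request };
        // Cancellation only completes the waiter; the entry stays queued,
        // keeping `request` alive until something dequeues it.
        ct.Register(() => entry.Waiter.TrySetCanceled(ct));
        _queue.Enqueue(entry);
        return entry.Waiter.Task;
    }

    // Runs only when a connection becomes available. If no connection is
    // ever established, canceled entries are never removed: the leak.
    public void ProcessQueue()
    {
        while (_queue.TryDequeue(out var entry))
        {
            if (entry.Waiter.Task.IsCanceled)
                continue; // the canceled entry is finally released here
            // ... hand entry.Request to the connection ...
            break;
        }
    }
}
```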
The main mitigation for this is to set `SocketsHttpHandler.ConnectTimeout` to a reasonable value. We're changing the default from infinite to 15s in 7.0 (#66297); you may choose to go even lower. Doing so gives the pool a chance to retry a connection, potentially processing all the requests, or at least making some forward progress in draining the queue.
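As a concrete example (a sketch; the 15s value is taken from the new 7.0 default mentioned above, and should be tuned to your network):

```csharp
using System;
using System.Net.Http;

var handler = new SocketsHttpHandler
{
    // The default before 7.0 is Timeout.InfiniteTimeSpan, which lets a
    // stuck connection attempt pin the request queue indefinitely.
    ConnectTimeout = TimeSpan.FromSeconds(15),
};
var client = new HttpClient(handler);
```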
There is also a bug here in the connection pool that will make such leaks more common: when injecting a connection to the pool, we only check whether a request exists in the queue, but not whether it's already canceled. This means that we will issue a connection attempt even for already-canceled requests. What's worse is that a failed connection attempt will only remove the one initiating request from the queue before attempting to inject more connections.
As a result, you can end up in a situation like this (I hope I'm misreading something in the code): the queue holds many already-canceled requests; we issue a connection attempt for the first one; when that attempt fails, only that single request is removed before another connection attempt is injected for the next (also canceled) request, and so on, spending one full connection attempt per dead queue entry.
Not only are we keeping the requests and everything associated with them rooted for a very long time, we are also wasting connection attempts on dead requests.
Given that the default socket timeout on Windows is 21s, I'm afraid just setting `ConnectTimeout` to 15s wouldn't help that much. My hunch is that the exact case you've described in the second half is manifesting 😢 But in that case, setting `EnableMultipleHttp2Connections` to true here should help, so all the requests wouldn't wait for the single HTTP/2 connection to be established.
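For example (a sketch; if you create the handler through YARP's cluster configuration rather than directly, the same flag is exposed there):

```csharp
using System;
using System.Net.Http;

var handler = new SocketsHttpHandler
{
    ConnectTimeout = TimeSpan.FromSeconds(15),
    // Lets the pool open additional HTTP/2 connections instead of queueing
    // every request behind the single connection (attempt).
    EnableMultipleHttp2Connections = true,
};
var client = new HttpClient(handler);
```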
On our side, we need to clean up the queue not only on successful attempts but on failed ones too, and maybe even more aggressively.
My suspicion is that, for whatever reason, the connection took longer than 21 seconds to fail, since I saw the queue had never moved at all in the dump: the first connection attempt was still pending while 7k requests were added and canceled on its queue.
`EnableMultipleHttp2Connections` will help once a connection is successfully established, but likely wouldn't mitigate the above problem, since we allow at most one pending connection attempt for HTTP/2 at any given time (though we may change that in the future).
Mitigation might be to remove and throw away any cancelled requests in the `TryPeek` method when we check for connection injection, but this will only help up to the first "active" request in the queue.
Thinking more generally, do we need to keep request references directly in the queue? Do we even need them there - what are they used for after getting picked from the queue?
Opened #66944 to track aggressively opening more connections for HTTP/2.
> Mitigation might be to remove and throw away any cancelled requests in `TryPeek` method when we check for connection injection, but this will only help up to the first "active" request in the queue.
Yes, this wouldn't entirely solve the problem of keeping the requests rooted, but it would be a solution to the issue of issuing connection attempts for pre-canceled requests.
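In terms of the hypothetical queue sketch earlier in the thread, that mitigation could look something like this:

```csharp
// Hypothetical addition to the sketch above: discard canceled entries while
// peeking, so connection attempts are only issued for live requests. As
// noted, this only helps up to the first non-canceled entry.
public bool TryPeekActiveRequest(out QueueEntry? entry)
{
    while (_queue.TryPeek(out entry))
    {
        if (!entry.Waiter.Task.IsCanceled)
            return true;   // a live request still justifies a connection
        _queue.Dequeue();  // eagerly release the canceled entry
    }
    return false;
}
```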
> Thinking more generally, do we need to keep request references directly in the queue? Do we even need them there - what are they used for after getting picked from the queue?
They are currently used to determine whether to fail the first request in the queue on connection failures. A connection attempt stores a reference to the first request in the queue. On failure, it checks whether the same request is still at the start, in which case it fails it to force forward progress. We could probably change the mechanism here to storing something other than the request (e.g. a timestamp/counter).
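A hedged sketch of that alternative, extending the hypothetical QueueEntry above with an invented `Sequence` field stamped at enqueue time (`entry.Sequence = _nextSequence++;`), so a failed attempt can detect "the queue hasn't moved" without rooting the request:

```csharp
// Hypothetical: connection attempts record only a sequence number instead
// of a reference to the head HttpRequestMessage.
private long _nextSequence; // incremented for each enqueued entry

public long PeekHeadSequence() =>
    _queue.TryPeek(out var head) ? head.Sequence : -1;

public void OnConnectionAttemptFailed(long observedHeadSequence, Exception error)
{
    // If the same entry is still at the head, fail it to force forward
    // progress, mirroring the current design but without rooting requests.
    if (_queue.TryPeek(out var head) && head.Sequence == observedHeadSequence)
    {
        _queue.Dequeue();
        head.Waiter.TrySetException(error);
    }
}
```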
The initial fixes that help with this issue have been merged (#66990, #66607). A possible HTTP/2 mitigation by opening multiple connections at a time is tracked by #66944.
We could address this completely by removing queue entries as soon as the request is canceled (.NET 5 behavior). Moving back to untriaged.
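For completeness, a hypothetical sketch of that eager-removal behavior on the earlier QueueEntry type, using a LinkedList so the cancellation callback can unlink its own entry in O(1):

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical variant of the earlier sketch: remove entries eagerly.
sealed class EagerRequestQueueSketch
{
    private readonly LinkedList<QueueEntry> _list = new();

    public Task EnqueueAsync(object request, CancellationToken ct)
    {
        var entry = new QueueEntry { Request = request };
        LinkedListNode<QueueEntry> node;
        lock (_list) { node = _list.AddLast(entry); }

        ct.Register(() =>
        {
            // Unlink immediately: the request stops being rooted the moment
            // it is canceled, instead of waiting for a connection to drain it.
            lock (_list)
            {
                if (node.List != null) _list.Remove(node);
            }
            entry.Waiter.TrySetCanceled(ct);
        });
        return entry.Waiter.Task;
    }
}
```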
It has been mitigated, but the underlying issue still exists: we're not removing entries immediately when they're canceled.
Triage: Mitigated in two ways (#66990, #66607). The remaining part could be done to help reduce memory usage a bit, but it won't cause a memory leak anymore. Moving to Future.
Looks like #66990 is on track to ship in .NET 6.0.5, do we know when it would be released? Thanks all for your work on this and for providing a viable short-term mitigation!
Description
While investigating a large increase in memory usage in one of our services running on .NET 6.0.2, I examined a memory dump and found an HttpConnectionPool containing a RequestQueue object with a very large number of items (>10k). This was problematic because those items were hanging on to HttpRequestMessage objects which in turn had large header payloads, ultimately leading to large memory consumption for this process.
Interestingly, looking at telemetry, I see no evidence of ever having more than ~500 concurrent outstanding requests in total in the entire process, across all of our connection pools. So it seems odd, to say the least, that a single connection pool would be keeping track of this many queued requests.
I haven't figured out why these requests are in the queue and/or why they aren't draining. We can share the memory dump internally if that would help.
Reproduction Steps
Unclear
Expected behavior
HttpConnectionPool shouldn't be holding on to queued requests that aren't in flight.
Actual behavior
HttpConnectionPool seems to hold on to requests that aren't in flight, leading to excessive memory usage.
Regression?
We did not notice this specifically with .NET Core 3.1, but it is possible it was present there too.
Known Workarounds
N/A
Configuration
Other information
No response