davidni opened 2 years ago
Tagging subscribers to this area: @dotnet/ncl. See info in area-owners.md if you want to be subscribed.

| Author: | davidni |
| --- | --- |
| Assignees: | - |
| Labels: | `area-System.Net.Http`, `untriaged` |
| Milestone: | - |
@CarnaViire could it be related to your recent investigations in the space?
> could it be related to your recent investigations in the space?
I don't think so. This issue is on Windows, and it seems to be for HTTP/2. (While mine were timeouts on Linux for HTTP/1.1)
@davidni what is the typical behavior of your app (in terms of HTTP requests)? Do you only use HTTP/2, or HTTP/1.1 as well? What timeouts do you have set, if any? Are any exceptions happening, or all requests succeed, but the queue keeps growing?
> what is the typical behavior of your app (in terms of HTTP requests)?
We use YARP to proxy requests to multiple (1k+) destinations.
> Do you only use HTTP/2, or HTTP/1.1 as well?
HTTP/2 only in this case. We do both, but looking at the dump, this connection pool was only handling HTTP/2 requests, and the queue that grew was the `_http2RequestQueue`.
> What timeouts do you have set, if any? Are any exceptions happening, or all requests succeed, but the queue keeps growing?

We do see exceptions, for example:
IslandGateway.Core.Service.ErrorHandling.ProxyException: YARP error YarpRequest
System.Net.Http.HttpRequestException: No connection could be made because the target machine actively refused it. ([REDACTED])
System.Net.Sockets.SocketException: No connection could be made because the target machine actively refused it.

IslandGateway.Core.Service.ErrorHandling.ProxyException: YARP error YarpRequest
System.Net.Http.HttpRequestException: The SSL connection could not be established, see inner exception.
System.IO.IOException: Received an unexpected EOF or 0 bytes from the transport stream.
Triage: We need to understand the root cause. Fix at minimum in 7.0, most likely also service 6.0.x.
Before 6.0, cancelling a request would result in it being removed from the request queue immediately. With the 6.0 connection pooling changes, the queue entry will be marked as canceled, but it will only be removed once an established connection starts processing the queue / a failed connection drops its initiating request.
From looking at the dump, the requests sitting in the queue are canceled and the connection attempt never completes. Since there is no connection, nothing cleans up canceled queue entries, resulting in the leak.
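For illustration, here is a minimal hypothetical sketch of that behavior (the type and member names are invented, not the actual HttpConnectionPool internals): cancellation only completes the caller's task, while the queue entry, and the request it roots, lingers until some connection processes the queue.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-ins, not the real System.Net.Http internals.
sealed class QueueEntry
{
    public object Request;  // stands in for the rooted HttpRequestMessage
    public TaskCompletionSource<bool> Waiter = new();
}

sealed class RequestQueueSketch
{
    private readonly Queue<QueueEntry> _queue = new();

    public Task EnqueueAsync(object request, CancellationToken ct)
    {
        var entry = new QueueEntry { Request = request };
        // Cancellation only completes the waiter; the entry stays queued,
        // keeping `request` alive until something dequeues it.
        ct.Register(() => entry.Waiter.TrySetCanceled(ct));
        _queue.Enqueue(entry);
        return entry.Waiter.Task;
    }

    // Runs only when a connection becomes available. If no connection is
    // ever established, canceled entries are never removed: the leak.
    public void ProcessQueue()
    {
        while (_queue.TryDequeue(out var entry))
        {
            if (entry.Waiter.Task.IsCanceled)
                continue; // the canceled entry is finally released here
            // ... hand entry.Request to the connection ...
            break;
        }
    }
}
```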
The main mitigation for this is to set `SocketsHttpHandler.ConnectTimeout` to a reasonable value. We're changing the default from infinite to 15s in 7.0 (#66297); you may choose to go even lower. Doing so gives the pool a chance to retry a connection, potentially processing all the requests, or at least making some forward progress in draining the queue.
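As a concrete example (a sketch; the 15s value is taken from the new 7.0 default mentioned above, and should be tuned to your network):

```csharp
using System;
using System.Net.Http;

var handler = new SocketsHttpHandler
{
    // The default before 7.0 is Timeout.InfiniteTimeSpan, which lets a
    // stuck connection attempt pin the request queue indefinitely.
    ConnectTimeout = TimeSpan.FromSeconds(15),
};
var client = new HttpClient(handler);
```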
There is also a bug here in the connection pool that will make such leaks more common: when injecting a connection to the pool, we only check whether a request exists in the queue, but not whether it's already canceled. This means that we will issue a connection attempt even for already-canceled requests. What's worse is that a failed connection attempt will only remove the one initiating request from the queue before attempting to inject more connections.
As a result, you can end up in a situation like this (I hope I'm misreading something in the code): the queue holds many already-canceled requests; we issue a connection attempt for the first one; when that attempt fails, only that single request is removed before another connection attempt is injected for the next (also canceled) request, and so on, spending one full connection attempt per dead queue entry.
Not only are we keeping the requests and everything associated with them rooted for a very long time, we are also wasting connection attempts on dead requests.
Given that the default socket timeout on Windows is 21s, I'm afraid just setting `ConnectTimeout` to 15s wouldn't help that much. My hunch is that the exact case you've described in the second half is manifesting 😢 But in that case, setting `EnableMultipleHttp2Connections` to true here should help, so all the requests wouldn't wait for the single HTTP/2 connection to be established.
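For example (a sketch; if you create the handler through YARP's cluster configuration rather than directly, the same flag is exposed there):

```csharp
using System;
using System.Net.Http;

var handler = new SocketsHttpHandler
{
    ConnectTimeout = TimeSpan.FromSeconds(15),
    // Lets the pool open additional HTTP/2 connections instead of queueing
    // every request behind the single connection (attempt).
    EnableMultipleHttp2Connections = true,
};
var client = new HttpClient(handler);
```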
On our side, we need to clean up the queue not only on successful attempts but on failed ones too, and maybe even more aggressively.
My suspicion is that, for whatever reason, the connection took longer than 21 seconds to fail, since I saw the queue had never moved at all in the dump: the first connection attempt was still pending while 7k requests were added and canceled on its queue.
`EnableMultipleHttp2Connections` will help once a connection is successfully established, but likely wouldn't mitigate the above problem, since we allow at most one pending connection attempt for HTTP/2 at any given time (though we may change that in the future).
Mitigation might be to remove and throw away any cancelled requests in the `TryPeek` method when we check for connection injection, but this will only help up to the first "active" request in the queue.
Thinking more generally, do we need to keep request references directly in the queue? Do we even need them there - what are they used for after getting picked from the queue?
Opened #66944 to track aggressively opening more connections for HTTP/2.
> Mitigation might be to remove and throw away any cancelled requests in `TryPeek` method when we check for connection injection, but this will only help up to the first "active" request in the queue.
Yes, this wouldn't entirely solve the problem of keeping the requests rooted, but it would be a solution to the issue of issuing connection attempts for pre-canceled requests.
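In terms of the hypothetical queue sketch earlier in the thread, that mitigation could look something like this:

```csharp
// Hypothetical addition to the sketch above: discard canceled entries while
// peeking, so connection attempts are only issued for live requests. As
// noted, this only helps up to the first non-canceled entry.
public bool TryPeekActiveRequest(out QueueEntry? entry)
{
    while (_queue.TryPeek(out entry))
    {
        if (!entry.Waiter.Task.IsCanceled)
            return true;   // a live request still justifies a connection
        _queue.Dequeue();  // eagerly release the canceled entry
    }
    return false;
}
```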
> Thinking more generally, do we need to keep request references directly in the queue? Do we even need them there - what are they used for after getting picked from the queue?
They are currently used to determine whether to fail the first request in the queue on connection failures. A connection attempt stores a reference to the first request in the queue. On failure, it checks whether the same request is still at the start, in which case it fails it to force forward progress. We could probably change the mechanism here to storing something other than the request (e.g. a timestamp/counter).
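A hedged sketch of that alternative, extending the hypothetical QueueEntry above with an invented `Sequence` field stamped at enqueue time (`entry.Sequence = _nextSequence++;`), so a failed attempt can detect "the queue hasn't moved" without rooting the request:

```csharp
// Hypothetical: connection attempts record only a sequence number instead
// of a reference to the head HttpRequestMessage.
private long _nextSequence; // incremented for each enqueued entry

public long PeekHeadSequence() =>
    _queue.TryPeek(out var head) ? head.Sequence : -1;

public void OnConnectionAttemptFailed(long observedHeadSequence, Exception error)
{
    // If the same entry is still at the head, fail it to force forward
    // progress, mirroring the current design but without rooting requests.
    if (_queue.TryPeek(out var head) && head.Sequence == observedHeadSequence)
    {
        _queue.Dequeue();
        head.Waiter.TrySetException(error);
    }
}
```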
The initial fixes that help with this issue have been merged (#66990, #66607). A possible HTTP/2 mitigation by opening multiple connections at a time is tracked by #66944.
We could address this completely by removing queue entries as soon as the request is canceled (.NET 5 behavior). Moving back to untriaged.
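For completeness, a hypothetical sketch of that eager-removal behavior on the earlier QueueEntry type, using a LinkedList so the cancellation callback can unlink its own entry in O(1):

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical variant of the earlier sketch: remove entries eagerly.
sealed class EagerRequestQueueSketch
{
    private readonly LinkedList<QueueEntry> _list = new();

    public Task EnqueueAsync(object request, CancellationToken ct)
    {
        var entry = new QueueEntry { Request = request };
        LinkedListNode<QueueEntry> node;
        lock (_list) { node = _list.AddLast(entry); }

        ct.Register(() =>
        {
            // Unlink immediately: the request stops being rooted the moment
            // it is canceled, instead of waiting for a connection to drain it.
            lock (_list)
            {
                if (node.List != null) _list.Remove(node);
            }
            entry.Waiter.TrySetCanceled(ct);
        });
        return entry.Waiter.Task;
    }
}
```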
It has been mitigated, but the underlying issue still exists: we're not removing entries immediately when they're canceled.
Triage: Mitigated in two ways (#66990, #66607). The remaining part could be done to help reduce memory usage a bit, but it won't cause a memory leak anymore. Moving to Future.
Looks like #66990 is on track to ship in .NET 6.0.5, do we know when it would be released? Thanks all for your work on this and for providing a viable short-term mitigation!
Description
While investigating a large increase in memory usage in one of our services running on .NET 6.0.2, I examined a memory dump and found an HttpConnectionPool containing a RequestQueue object with a very large number of items (>10k). This was problematic because those items were hanging on to HttpRequestMessage objects which in turn had large header payloads, ultimately leading to large memory consumption for this process.
Interestingly, looking at telemetry, I see no evidence of ever having more than ~500 concurrent outstanding requests in total in the entire process, across all of our connection pools. So it seems odd, to say the least, that a single connection pool would be keeping track of this many queued requests.
I haven't figured out why these requests are in the queue and/or why they aren't draining. We can share the memory dump internally if that would help.
Reproduction Steps
Unclear
Expected behavior
HttpConnectionPool shouldn't be holding on to queued requests that aren't in flight.
Actual behavior
HttpConnectionPool seems to hold on to requests that aren't in flight, leading to excessive memory usage.
Regression?
We did not notice this specifically with .NET Core 3.1, but it is possible it was present there too.
Known Workarounds
N/A
Configuration
Other information
No response