dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.87k stars 4.63k forks source link

Consider detecting stalled HTTP/2 connections in SocketsHttpHandler #82236

Open MihaZupan opened 1 year ago

MihaZupan commented 1 year ago

In .NET 7, we added a mitigation in the connection pool to tear down stalled connection attempts to prevent a single stalled operation from resulting in many request failures (#71785).

When using HTTP/1.1, if a connection stops responding, the request using said connection will eventually time out and tear down the connection. But with HTTP/2, the connection is shared, and request cancellations/timeouts do not affect its lifetime. As a result, a single stalled connection can continue to live in the connection pool and result in many request failures.

I recently saw a memory dump with an Http2Connection that had 0 currently active streams, but a write channel with thousands of entries. All the requests were failing inside of Http2Connection.SendHeadersAsync as the write channel was not getting drained. There was also a pending 32 KB socket write for that connection, but it had apparently not completed/failed for a long time. Because we store a timestamp every time we read a frame, I can also see that we haven't received anything from the server for 80 hours at the time the dump was taken, despite a PooledConnectionIdleTimeout of 1 minute. Note that these were regular socket read/writes in an otherwise responsive process.

I believe we should consider employing some sort of mitigation to get rid of connections where either the writes aren't draining, or the server hasn't sent us anything for a long time.

The obvious way to achieve that would be to make use of client-initiated HTTP/2 pings that we exposed in .NET 5 (#31198), which are currently OFF by default. Judging by the comments on the original API proposal issue, some servers are very aggressive about closing connections when clients use pings, so just turning them on by default may prove too risky.

On the other hand, we could have logic that did "If FlushOutgoingBytesAsync hasn't completed in X amount of time, take the connection out of the pool" + "If header writes are going through, but we haven't received anything from the server in Y amount of time, take the connection out of the pool as it's effectively idle/unresponsive" or something along those lines.

Any thoughts?

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/ncl See info in area-owners.md if you want to be subscribed.

Issue Details
In .NET 7, we added a mitigation in the connection pool to tear down stalled connection attempts to prevent a single stalled operation from resulting in many request failures (#71785). When using HTTP/1.1, if a connection stops responding, the request using said connection will eventually time out and tear down the connection. But with HTTP/2, the connection is shared, and request cancellations/timeouts do not affect its lifetime. As a result, a single stalled connection can continue to live in the connection pool and result in many request failures. I recently saw a memory dump with an Http2Connection that had 0 currently active streams, but a [write channel](https://github.com/dotnet/runtime/blob/e927c14d162c3693bcc0e3b0e8976cc7f8a6ed4a/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/Http2Connection.cs#L57) with thousands of entries. All the requests were failing inside of `Http2Connection.SendHeadersAsync` as the write channel was not getting drained. There was also a pending [32 KB socket write](https://github.com/dotnet/runtime/blob/e927c14d162c3693bcc0e3b0e8976cc7f8a6ed4a/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/Http2Connection.cs#L1250) for that connection, but it had apparently not completed/failed for a long time. Because we store a timestamp every time we read a frame, I can also see that we haven't received anything from the server for **80 hours** at the time the dump was taken, despite a `PooledConnectionIdleTimeout` of 1 minute. Note that these were regular socket read/writes in an otherwise responsive process. I believe we should consider employing some sort of mitigation to get rid of connections where either the writes aren't draining, or the server hasn't sent us anything for a long time. The obvious way to achieve that would be to make use of client-initiated HTTP/2 pings that we exposed in .NET 5 (#31198), which are currently **OFF** by default. Judging by the [comments on the original API proposal issue](https://github.com/dotnet/runtime/issues/31198#issuecomment-663783734), some servers are very aggressive about closing connections when clients use pings, so just turning them on by default may prove too risky. On the other hand, we could have logic that did "If `FlushOutgoingBytesAsync` hasn't completed in X amount of time, take the connection out of the pool" + "If header writes are going through, but we haven't received anything from the server in Y amount of time, take the connection out of the pool as it's effectively idle/unresponsive" or something along those lines. Any thoughts?
Author: MihaZupan
Assignees: -
Labels: `area-System.Net.Http`, `tenet-reliability`
Milestone: -