apache / trafficserver

Apache Traffic Server™ is a fast, scalable and extensible HTTP/1.1 and HTTP/2 compliant caching proxy server.
https://trafficserver.apache.org/
Apache License 2.0

Count of concurrent connections diverges between ATS H2 client and ATS H2 server #8200

Open shinrich opened 2 years ago

shinrich commented 2 years ago

This is an issue I saw while debugging the HTTP/2 to origin branch between two ATS servers. But the issue could in theory be seen between another busy H2 client and ATS acting as the H2 server.

For ATS, Http2ConnectionState has client_streams_in_count to track how many streams the session has active. It checks this value against the session's max_concurrent_stream setting to make sure creating a new stream does not violate that policy. Similarly, the H2 client (ATS in my branch) must keep a similar count to track the peer's current stream count; it should not send a new HEADERS frame to the peer if that would push the peer over its max_concurrent_stream limit. Therefore, client_streams_in_count on each machine should stay consistent with the peer's view.
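The bookkeeping above can be sketched as follows. This is a deliberately simplified model, not the actual ATS implementation; the struct and method names here are invented for illustration, only the field names client_streams_in_count and the max_concurrent_stream idea come from the issue text.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified model of the check an H2 server performs
// before accepting a new peer-initiated stream.
struct ConnStateSketch {
  uint32_t client_streams_in_count = 0;  // streams the peer currently has open
  uint32_t max_concurrent_stream   = 100; // session policy

  // A new HEADERS frame may open a stream only below the limit.
  bool can_create_stream() const {
    return client_streams_in_count < max_concurrent_stream;
  }

  void on_stream_open()  { ++client_streams_in_count; }
  void on_stream_close() { --client_streams_in_count; }
};
```

The client side must run the mirror image of this check before sending HEADERS, which is why both counters have to move in lockstep.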

This consistency was not always maintained, because client_streams_in_count was not being decremented until the HttpSM was destroyed. Depending on caching and other operations, that may be some time after the frame with the EOS was sent or received.

So the client could dutifully decrement its stream count on receiving the EOS frame, end up with a lower stream count than the peer, and send too many HEADERS frames. While this is only a STREAM error, a DATA frame would likely be sent soon after, resulting in a bad stream ID error, which is a connection error. That would take down the session and all of its active streams (which could be quite a few on a busy session).
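The divergence window can be modeled as a toy timeline: the client decrements its count as soon as the EOS frame is exchanged, while the server defers its decrement until HttpSM teardown. This is a sketch under those assumptions; none of these names are ATS code.

```cpp
#include <cassert>

// Two endpoints tracking the same stream, decremented at different times.
struct Endpoint { int open_streams = 0; };

struct DivergenceSketch {
  Endpoint client, server;

  void open_stream()   { ++client.open_streams; ++server.open_streams; }
  void eos_exchanged() { --client.open_streams; }  // client decrements now
  void sm_destroyed()  { --server.open_streams; }  // server catches up later
  bool diverged() const { return client.open_streams != server.open_streams; }
};
```

Between eos_exchanged() and sm_destroyed() the counts disagree; during that window the client may legally (from its point of view) open a stream that pushes the server past max_concurrent_stream.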

In my test branch I pulled the client_streams_in_count updates into Http2Stream::change_state, and into Http2Stream::do_io_close and Http2Stream::initiating_close for streams that get shut down from an odd state.

shinrich commented 2 years ago

Working with this again yesterday with my very active H2 session between two ATS boxes. I finally put an assert on ATS2 for when the max_concurrent limit is hit so I could look at the state of the stuck streams. The problem appears to be that a number of streams had stalled due to empty session windows. So ATS1 had decremented its peer stream count (presumably the streams timed out or the client aborted), but it did not send a reset to ATS2 so that it could clean up as well. I would assume ATS2 would eventually time those streams out too.

I added the session/stream window split logic and increased the session window and the starvation went away.
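The starvation mechanism follows from HTTP/2 flow control: a DATA frame consumes both the per-stream window and the shared session window (RFC 7540 §6.9), so an exhausted session window stalls every stream at once. A minimal sketch of that accounting, with invented names and no WINDOW_UPDATE handling:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Simplified HTTP/2 send-side flow control: DATA is limited by
// BOTH the session window and the stream window.
struct FlowControlSketch {
  int64_t session_window; // shared by all streams on the connection
  int64_t stream_window;  // per-stream

  // How many bytes of DATA we may send right now.
  int64_t sendable(int64_t want) const {
    return std::min({want, session_window, stream_window});
  }

  void consume(int64_t n) {
    session_window -= n;
    stream_window  -= n;
  }
};
```

With a small session window, one stream's sends can drain it to zero and every other stream on the session starves, which is why enlarging the session window (and splitting the accounting per stream) made the stalls go away.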

Hitting the max_concurrent limit is really bad because the next frame on that stream will take down the entire session. However, @maskit seems to indicate in issue #5675 that we can just ignore such frames instead of sending a connection error.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. Marking it stale to flag it for further consideration by the community.