elixir-mint / mint

Functional HTTP client for Elixir with support for HTTP/1 and HTTP/2 🌱
Apache License 2.0

Mint.HTTP2 streams stuck in `:half_closed_local` state #450

Closed daisyzhou closed 2 weeks ago

daisyzhou commented 2 months ago

Unfortunately we do not have a repro or hypothesis available, but wanted to get your help either fixing the problem in Mint or advising us on how to avoid/handle it.

We are running Mint 1.6.0, and noticed a number of our Mint.HTTP2 connections with max_concurrent_streams streams, all stuck with state: :half_closed_local. This rendered the connection unusable as it wasn't able to handle new requests.

Is this a known failure mode? Can you help figure out what the cause is?

whatyouhide commented 2 months ago

No, this is not a known failure mode and could definitely be a bug.

See the diagram in https://datatracker.ietf.org/doc/html/rfc9113#section-5.1:

[Image: HTTP/2 stream state diagram from RFC 9113, Section 5.1]

We get to the :half_closed_local state in two cases:

  1. We start the stream and eventually send END_STREAM (when we finished the request usually).
  2. The server starts a stream (with a PUSH_PROMISE) and finishes sending headers.

Then, the RFC says:

A stream that is in the "half-closed (local)" state cannot be used for sending frames other than WINDOW_UPDATE, PRIORITY, and RST_STREAM.

A stream transitions from this state to "closed" when a frame is received with the END_STREAM flag set or when either peer sends a RST_STREAM frame.

An endpoint can receive any type of frame in this state. Providing flow-control credit using WINDOW_UPDATE frames is necessary to continue receiving flow-controlled frames. In this state, a receiver can ignore WINDOW_UPDATE frames, which might arrive for a short period after a frame with the END_STREAM flag set is sent.
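
The transitions quoted above can be sketched as a small pure function. This is only an illustrative model of the RFC text, not Mint's internal implementation; the module name `StreamStateSketch` and the `{direction, frame_type, end_stream?}` event shape are made up for this example.

```elixir
defmodule StreamStateSketch do
  # Hypothetical event shape: {:recv | :send, frame_type, end_stream?}
  # Models only the "half-closed (local)" state described in RFC 9113, Section 5.1.

  # Receiving any frame with END_STREAM set closes the stream.
  def next_state(:half_closed_local, {:recv, _frame_type, true}), do: :closed

  # RST_STREAM from either peer closes the stream.
  def next_state(:half_closed_local, {_direction, :rst_stream, _end_stream?}), do: :closed

  # Anything else (e.g. DATA without END_STREAM, WINDOW_UPDATE) leaves it open.
  def next_state(:half_closed_local, {_direction, _frame_type, false}), do: :half_closed_local

  # Folds a list of events over a stream that starts in :half_closed_local.
  def run(events), do: Enum.reduce(events, :half_closed_local, &step/2)

  # Once closed, further events are ignored in this sketch.
  defp step(_event, :closed), do: :closed
  defp step(event, state), do: next_state(state, event)
end
```

In this model, a stream whose peer never sends END_STREAM or RST_STREAM stays in `:half_closed_local` indefinitely unless the local side sends RST_STREAM.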

So, I'm not really sure what's happening, haven't looked at the code for a while. However, it could be that

  1. we send a full request to the server,
  2. we put the stream in :half_closed_local
  3. the server never replies with a frame with the END_STREAM flag set.

I don't think we can send RST_STREAM ourselves here because we might not have gotten a response from the server yet. If you have time to investigate this further with the help of this data, I'd be very grateful, otherwise I'll try to take a closer look at this soon 🙃

daisyzhou commented 2 months ago

Hi @whatyouhide,

Thanks for the context. Unfortunately I can't look into it much further since we don't have a repro either, and it only happened one time (albeit to multiple connections).

If it is indeed that the server never replies with a frame with the END_STREAM flag set, is the connection just broken forever? We saw this happen to all streams in the connections at the same time (for a few different connections), so maybe it was a networking blip that dropped the END_STREAM. If we get into this state, would you suggest just killing and restarting the connection?

ollien commented 2 weeks ago

(I'm one of @daisyzhou's colleagues, hello!)

Just adding another data point that we saw this happen again, so it's definitely not just a one-off. Unfortunately I don't have any confirmation of the theory @whatyouhide set forth above, because we weren't able to packet capture the HTTP2 flow that broke it. Our "fix" has been to kill the process owning the connections. If you have any suggestions on how we could un-stick this connection, or requests for things we can capture if it happens again, please do let us know.

ollien commented 2 weeks ago

This may have ended up being a defect in our code. Our metrics indicate that these events were preceded by a small burst in timeouts, and it seems our timeout code did not call HTTP2.cancel_request/2, which caused us to leak streams.

We'll report back if this recurs, but it may not be a Mint problem after all. Thanks for taking a look!
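
For anyone who lands here with the same symptom, this is a rough sketch of the kind of timeout handling that was missing, not our actual code. It assumes the caller tracks a hypothetical `pending` map of request ref to deadline; the module name `TimeoutSketch` and the `cancel_expired/4` function are invented for illustration. The only Mint call it relies on is `Mint.HTTP2.cancel_request/2`, which returns `{:ok, conn}` or `{:error, conn, reason}`.

```elixir
defmodule TimeoutSketch do
  # `pending` is a hypothetical map of request_ref => deadline (same clock as `now`).
  # `cancel_fun` defaults to Mint.HTTP2.cancel_request/2 and is injectable so the
  # logic can be exercised without a live connection.
  def cancel_expired(conn, pending, now, cancel_fun \\ &Mint.HTTP2.cancel_request/2) do
    {expired, live} =
      Enum.split_with(pending, fn {_ref, deadline} -> deadline <= now end)

    conn =
      Enum.reduce(expired, conn, fn {ref, _deadline}, conn ->
        # Cancelling is what releases the stream on the connection. Either way we
        # keep the returned conn, since Mint connections are immutable data.
        case cancel_fun.(conn, ref) do
          {:ok, conn} -> conn
          {:error, conn, _reason} -> conn
        end
      end)

    # Return the updated connection and only the requests still being waited on.
    {conn, Map.new(live)}
  end
end
```

The point is that dropping a request from your own bookkeeping on timeout is not enough: without the cancel call, the stream stays open on the connection and counts against max_concurrent_streams.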

whatyouhide commented 2 weeks ago

Oh, mh. Yeah that totally makes sense: we don't receive anything from the server and the request just stays open. We can't even really do timeouts within Mint because connections are stateless.

Okay, sounds good. Let's close this out and reopen it if this shows up again.