envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0
24.8k stars 4.77k forks source link

gRPC-Web requests hang over QUIC #33226

Closed qfiard closed 4 months ago

qfiard commented 6 months ago

Title: gRPC-Web requests hang over QUIC

Description: gRPC-Web requests hang over QUIC:

Screenshot 2024-03-30 at 1 54 29 PM

It appears to be due to an empty HEADERS frame being sent by Envoy at the end of the stream instead terminating it:

Screenshot 2024-03-30 at 1 53 54 PM

The bug appears to be that trailers are encoded into a QUIC frame even when they are empty, which is the case with the gRPC-Web filters which clears and converts trailers into a body chunk.

Repro steps:

Configure a gRPC Web endpoint behind Envoy (https://grpc.io/docs/platforms/web/basics/#configure-the-envoy-proxy) Expose it over QUIC to downstream clients (https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/http/http3#http-3-downstream) Send gRPC Web requests from a browser, observe that they hang.

qfiard commented 6 months ago

I have a tentative fix in https://github.com/envoyproxy/envoy/pull/33228 to fix the handling of empty trailers to terminate the stream instead of sending an empty HEADERS frame.

An alternative would be to skip the call to encodeTrailers when headers are empty. That would require broader changes to Envoy and may break the design of filters if we expect empty trailers processed by a first filter to be handled by a subsequent ones. This change is more narrowly scoped but it would be good for a maintainer to confirm that this is the right fix.

soulxu commented 6 months ago

cc @RyanTheOptimist @danzh2010

danzh2010 commented 6 months ago

Hi Quentin, thanks for reporting the issue and proposing the fix! Could you elaborate more about why an empty trailer in response could cause the stream hanging but an empty body with FIN over HTTP/3 won't? Supposedly there is some gRPC contract that HTTP/3 is violating and causing the hang? How does HTTP/2 codec handle this situation?

RyanTheOptimist commented 6 months ago

Interesting. It looks like the HTTP/2 codec also skips the encoding of empty trailers, and instead sends empty data. How odd. Well, if this is the Envoy way of doing this, then it probably makes sense for both the H2 and H3 codecs to behave similarly. But we'd likely need a runtime flag to guard this new (hopefully correct) behavior.

https://github.com/envoyproxy/envoy/blob/main/source/common/http/http2/codec_impl.cc#L630-L640

RyanTheOptimist commented 6 months ago

@danzh2010 raises a good question that I don't think I understand the answer to. It sounds like if the client talks directly to a gRPC-web server, then the empty trailers do not cause a hang. But when the client talks to a Envoy and Envoy talks to the server, then the request hangs. Do we understand what the difference here is? To confirm, is the client sending empty trailers? Or is the gRPC-Web filter creating them even when the client does not?

qfiard commented 6 months ago

Thank you both for looking into this.

It sounds like if the client talks directly to a gRPC-web server, then the empty trailers do not cause a hang.

From https://github.com/grpc/grpc-web, gRPC Web isn't meant to be directly exposed by a web server, so you won't usually find a gRPC-web server that a client can talk to directly. Instead, gRPC Web clients connect to a regular gRPC server with a proxy doing the translation, and Envoy is the default implementation of this.

gRPC-web clients connect to gRPC services via a special proxy; by default, 
gRPC-web uses [Envoy](https://www.envoyproxy.io/).

gRPC uses trailers to communicate the status of the RPC, but gRPC Web doesn't as trailers aren't exposed by web browsers (https://stackoverflow.com/questions/46373013/access-to-http-2-trailers-in-a-browser). So Envoy and the grpc_web filter is responsible for translating the trailers into a body chunk.

When using QUIC, this is happening correctly but Envoy appends an empty HEADERS frame which seems to block Chrome from considering the stream complete. Do you know if the later is expected? It is possible that this is a Chrome issue if a HEADERS frame at the end of a stream is expected to be handled gracefully (from the screenshot above, the HEADERS frame does have FIN=true). But in the meantime removing the HEADERS frame when empty should fix the issue.

Should I go ahead and add the runtime flag to my pull request, or do you prefer to take this over?

danzh2010 commented 6 months ago

Envoy and the grpc_web filter is responsible for translating the trailers into a body chunk.

When using QUIC, this is happening correctly but Envoy appends an empty HEADERS frame which seems to block Chrome from considering the stream complete. Do you know if the later is expected?

Did grpc_web filter fail to translate an empty trailer but passed it down somehow? Or did it translate a non-empty trailer into body and appended an empty one?

RyanTheOptimist commented 6 months ago

Thank you both for looking into this.

It sounds like if the client talks directly to a gRPC-web server, then the empty trailers do not cause a hang.

From https://github.com/grpc/grpc-web, gRPC Web isn't meant to be directly exposed by a web server, so you won't usually find a gRPC-web server that a client can talk to directly. Instead, gRPC Web clients connect to a regular gRPC server with a proxy doing the translation, and Envoy is the default implementation of this.

gRPC-web clients connect to gRPC services via a special proxy; by default, 
gRPC-web uses [Envoy](https://www.envoyproxy.io/).

gRPC uses trailers to communicate the status of the RPC, but gRPC Web doesn't as trailers aren't exposed by web browsers (https://stackoverflow.com/questions/46373013/access-to-http-2-trailers-in-a-browser). So Envoy and the grpc_web filter is responsible for translating the trailers into a body chunk.

When using QUIC, this is happening correctly but Envoy appends an empty HEADERS frame which seems to block Chrome from considering the stream complete. Do you know if the later is expected? It is possible that this is a Chrome issue if a HEADERS frame at the end of a stream is expected to be handled gracefully (from the screenshot above, the HEADERS frame does have FIN=true). But in the meantime removing the HEADERS frame when empty should fix the issue.

Yes, this does sound a like a Chrome bug to me, yes. At the QUIC layer, the stream is closed because a QUIC STREAM frame was received with the fin bit set. There's nothing "wrong" at the HTTP/3 layer, as far as I know, with an empty HEADERS frame. But it makes me wonder if Chrome is somehow treating empty headers as incomplete headers? Have you worked with Chrome team on this at all?

Should I go ahead and add the runtime flag to my pull request, or do you prefer to take this over?

If you could add a flag, that'd be great. I think I'd like to understand the problem a bit more clearly before we land anything, but adding a flag is a good idea.

RyanTheOptimist commented 6 months ago

Thanks for the offline discussion this morning. The server sends non-empty trailers to Envoy, and then the gRPC filters mutate the trailers (by removing the status field) which leaves the trailers empty at the end of the filter chain. It's a bit odd that Envoy chooses to call EncodeTrailers at that point, but it's not crazy :) So let's go ahead and move forward with your PR, subject to runtime-flag protection.

In addition, the Chrome behavior of having a request hang when empty HTTP/3 trailers are received is clearly a bug. I filed https://issues.chromium.org/issues/332587381, and fixed the issue with this commit https://chromium-review.googlesource.com/c/chromium/src/+/5421968 Chrome Canary should pick this up in the next day or so if you would like to confirm.

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] commented 4 months ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.