
RST_STREAM sent to downstream client instead of remaining DATA, HEADERS on end stream #30149

Closed · niloc132 closed this issue 2 months ago

niloc132 commented 1 year ago

Title: RST_STREAM sent to downstream client instead of remaining DATA, HEADERS on end stream

Description: In some circumstances, Envoy can skip sending the remaining DATA and HEADERS frames and instead send RST_STREAM, resulting in errors in the downstream client. In this case, we have a gRPC service returning data to a client, and some client implementations get an error instead of seeing the expected grpc-status: 0 trailers. The stream needs to be bidirectional, with the client not yet having half-closed from its side (which in some cases offers a potential workaround).
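For orientation, the client side of the failing shape looks roughly like the sketch below: a bidirectional grpc-java call in which the client sends one message and deliberately does not half-close while it drains the response. The TestServiceGrpc / HelloRequest / HelloResponse names are assumed stand-ins for the generated stubs of the repro project linked below, not copied from it.

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import io.grpc.stub.StreamObserver;
import java.util.concurrent.CountDownLatch;

public class BidiClientSketch {
  public static void main(String[] args) throws InterruptedException {
    // Plaintext h2c connection through the Envoy listener.
    ManagedChannel channel =
        ManagedChannelBuilder.forAddress("localhost", 8000).usePlaintext().build();
    CountDownLatch done = new CountDownLatch(1);

    // Bidirectional call: the response observer either sees onCompleted()
    // (trailers with grpc-status: 0) or the RST_STREAM-induced error above.
    StreamObserver<HelloRequest> requests =
        TestServiceGrpc.newStub(channel).hello(new StreamObserver<HelloResponse>() {
          @Override public void onNext(HelloResponse value) { /* drain DATA frames */ }
          @Override public void onError(Throwable t) { t.printStackTrace(); done.countDown(); }
          @Override public void onCompleted() { System.out.println("grpc-status: 0 seen"); done.countDown(); }
        });

    // Send one request but do NOT call requests.onCompleted(): the client keeps
    // its send side open (no half-close), the condition under which the bug shows.
    requests.onNext(HelloRequest.getDefaultInstance());

    done.await();
    channel.shutdownNow();
  }
}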

Downstream error reported by netty-grpc:

io.grpc.StatusRuntimeException: UNAVAILABLE: RST_STREAM closed stream. HTTP/2 error code: NO_ERROR
    at io.grpc.Status.asRuntimeException(Status.java:539)
    at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487)
    at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
    at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
    at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)

Downstream error reported by okhttp-grpc:

io.grpc.StatusRuntimeException: INTERNAL: No error: A GRPC status of OK should have been sent
Rst Stream
    at io.grpc.Status.asRuntimeException(Status.java:537)
    at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:481)
    at io.grpc.internal.DelayedClientCall$DelayedListener$3.run(DelayedClientCall.java:489)
    at io.grpc.internal.DelayedClientCall$DelayedListener.delayOrExecute(DelayedClientCall.java:453)
    at io.grpc.internal.DelayedClientCall$DelayedListener.onClose(DelayedClientCall.java:486)
    at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:567)

No error is reported when using the Python gRPC client (which we understand to be a cython wrapper around the C++ client). We speculate that the C++ client might be handling data faster than the Java clients, and that the bug might be rooted in Envoy not being prepared to forward all remaining data for some reason.

Note that I have not validated this issue in non-gRPC contexts, but there is nothing gRPC-specific in the envoy.yaml, so it seems likely that this impacts other h2 upstream+downstream bidirectional streaming use cases as well.

Repro steps:

To try this out, we've built a sample git repository to demonstrate the issue at https://github.com/niloc132/envoy-rst-stream. Instructions are there to run the (default) Java server and Java client, but either can be replaced with a Python implementation.

Admin and Stats Output:

http://localhost:9090/stats - https://gist.github.com/niloc132/d6b5c778ab2b79b8a8c4781dc80ac40b

http://localhost:9090/clusters

server::observability_name::server
server::default_priority::max_connections::1024
server::default_priority::max_pending_requests::1024
server::default_priority::max_requests::1024
server::default_priority::max_retries::3
server::high_priority::max_connections::1024
server::high_priority::max_pending_requests::1024
server::high_priority::max_requests::1024
server::high_priority::max_retries::3
server::added_via_api::false
server::192.168.48.2:8080::cx_active::1
server::192.168.48.2:8080::cx_connect_fail::0
server::192.168.48.2:8080::cx_total::1
server::192.168.48.2:8080::rq_active::0
server::192.168.48.2:8080::rq_error::0
server::192.168.48.2:8080::rq_success::1
server::192.168.48.2:8080::rq_timeout::0
server::192.168.48.2:8080::rq_total::1
server::192.168.48.2:8080::hostname::java-server
server::192.168.48.2:8080::health_flags::healthy
server::192.168.48.2:8080::weight::1
server::192.168.48.2:8080::region::
server::192.168.48.2:8080::zone::
server::192.168.48.2:8080::sub_zone::
server::192.168.48.2:8080::canary::false
server::192.168.48.2:8080::priority::0
server::192.168.48.2:8080::success_rate::-1
server::192.168.48.2:8080::local_origin_success_rate::-1

http://localhost:9090/routes - This doesn't appear to be a valid admin URL.

http://localhost:9090/server_info - https://gist.github.com/niloc132/6516b8fcae91a7f2c4ae23024aeb5b2c

Config: See https://github.com/niloc132/envoy-rst-stream/blob/main/envoy.yaml along with the rest of the repo to reproduce this issue

Logs: Access log entry for the request, showing 4 MB of protobuf payload sent "successfully", despite the client showing an error:

envoy-rst-stream-envoy-1        | [2023-10-12T16:17:40.943Z] "POST /org.example.TestService/Hello HTTP/2" 200 - 0 4000010 190 - "-" "grpc-java-netty/1.58.0" "d02b9201-cb74-4cc6-8b80-012146aafe04" "localhost:8000" "192.168.48.2:8080"

This log was captured with level=trace, upstream=off, dns=off, config=off, and includes the access log with the above line: https://gist.github.com/niloc132/521a45c0eb38c7733593c56e8c06fc02. Here is an excerpt which appears to show that Envoy received the trailers from upstream and sent them to the client:

envoy-rst-stream-envoy-1        | [2023-10-12 16:17:41.133][45][trace][router] [source/common/router/upstream_request.cc:315] [Tags: "ConnectionId":"4","StreamId":"9345982487776441659"] upstream response trailers:
envoy-rst-stream-envoy-1        | 'grpc-status', '0'
envoy-rst-stream-envoy-1        | 
envoy-rst-stream-envoy-1        | [2023-10-12 16:17:41.133][45][debug][router] [source/common/router/upstream_request.cc:469] [Tags: "ConnectionId":"4","StreamId":"9345982487776441659"] resetting pool request
envoy-rst-stream-envoy-1        | [2023-10-12 16:17:41.133][45][debug][client] [source/common/http/codec_client.cc:158] [Tags: "ConnectionId":"5"] request reset
envoy-rst-stream-envoy-1        | [2023-10-12 16:17:41.133][45][trace][main] [source/common/event/dispatcher_impl.cc:251] item added to deferred deletion list (size=1)
envoy-rst-stream-envoy-1        | [2023-10-12 16:17:41.133][45][debug][pool] [source/common/conn_pool/conn_pool_base.cc:215] [Tags: "ConnectionId":"5"] destroying stream: 0 remaining
envoy-rst-stream-envoy-1        | [2023-10-12 16:17:41.133][45][trace][main] [source/common/event/dispatcher_impl.cc:251] item added to deferred deletion list (size=2)
envoy-rst-stream-envoy-1        | [2023-10-12 16:17:41.133][45][trace][main] [source/common/event/dispatcher_impl.cc:251] item added to deferred deletion list (size=3)
envoy-rst-stream-envoy-1        | [2023-10-12 16:17:41.133][45][debug][http] [source/common/http/conn_manager_impl.cc:1759] [Tags: "ConnectionId":"4","StreamId":"9345982487776441659"] encoding trailers via codec:
envoy-rst-stream-envoy-1        | 'grpc-status', '0'
envoy-rst-stream-envoy-1        | 
envoy-rst-stream-envoy-1        | [2023-10-12 16:17:41.133][45][debug][http] [source/common/http/conn_manager_impl.cc:221] [Tags: "ConnectionId":"4","StreamId":"9345982487776441659"] doEndStream() resetting stream
envoy-rst-stream-envoy-1        | [2023-10-12 16:17:41.133][45][debug][http] [source/common/http/conn_manager_impl.cc:1798] [Tags: "ConnectionId":"4","StreamId":"9345982487776441659"] stream reset: reset reason: local reset, response details: -

Wireshark screencap showing the upstream (192.168.48.2:8080) sending Envoy (192.168.48.3) HEADERS with end of stream set, then Envoy sending RST_STREAM to both the downstream client (192.168.48.1) and the upstream. These messages are highlighted; other DATA messages from upstream->envoy and envoy->client can also be seen.

[screenshot857]

Conversely, here's a successful Python client (irritatingly both streams had stream id 1, so be sure to check the source/destination IPs): [screenshot858]

And here's a Java client also succeeding through Envoy: [screenshot859]

I can provide the pcapng file for any of these if desired, but the example project should easily reproduce this as well.

Call Stack: N/A

Other notes: Through Wireshark and by setting additional custom trailers on server responses, we've ruled out the possibility that the Python client also receives the RST_STREAM but incorrectly reports success. We also tested with both the Netty and OkHttp clients, and both exhibit the error, giving a degree of confidence that this is an Envoy issue rather than both implementations having the same bug.
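(For reference, the custom-trailer check mentioned above can be done with a standard grpc-java ServerInterceptor; a minimal sketch follows, with an arbitrary x-debug-marker key name chosen here for illustration.)

import io.grpc.ForwardingServerCall;
import io.grpc.Metadata;
import io.grpc.ServerCall;
import io.grpc.ServerCallHandler;
import io.grpc.ServerInterceptor;
import io.grpc.Status;

/** Adds a marker trailer to every response so clients can confirm whether trailers arrived. */
public class TrailerMarkerInterceptor implements ServerInterceptor {
  private static final Metadata.Key<String> MARKER =
      Metadata.Key.of("x-debug-marker", Metadata.ASCII_STRING_MARSHALLER);

  @Override
  public <ReqT, RespT> ServerCall.Listener<ReqT> interceptCall(
      ServerCall<ReqT, RespT> call, Metadata headers, ServerCallHandler<ReqT, RespT> next) {
    ServerCall<ReqT, RespT> wrapped =
        new ForwardingServerCall.SimpleForwardingServerCall<ReqT, RespT>(call) {
          @Override
          public void close(Status status, Metadata trailers) {
            // close() carries the trailers; tagging them makes it obvious on the
            // client whether the trailing HEADERS frame was actually delivered.
            trailers.put(MARKER, "sent");
            super.close(status, trailers);
          }
        };
    return next.startCall(wrapped, headers);
  }
}

It can be installed with ServerInterceptors.intercept(service, new TrailerMarkerInterceptor()) when registering the service with the server builder.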

niloc132 commented 1 year ago

The 1.27.1 release with a fix for resetting streams was just brought to my attention - I tweaked the linked project to use the v1.27.1 envoy image, and can confirm that this did not fix the issue.

kyessenov commented 1 year ago

CC @yanavlasov @alyssawilk

github-actions[bot] commented 12 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

devinrsmith commented 12 months ago

Adding comment to trigger activity. Hoping project member can take a look. Thanks!

alyssawilk commented 11 months ago

Sorry we haven't been able to get to this yet. It could be a codec issue or an Envoy issue and we're just now changing the underlying codec in https://github.com/envoyproxy/envoy/pull/31086. Can you see if that addresses the problem and if not I'll try to take a look?

devinrsmith commented 11 months ago

Is there an easy way to get a docker image for an outstanding PR? I see https://github.com/envoyproxy/envoy/tree/main/ci#ubuntu-envoy-image, but it looks like envoyproxy/envoy-build-ubuntu hasn't been updated recently.

alyssawilk commented 11 months ago

Maybe just keep an eye on that one - will hopefully merge this week and it'll be easier to test

github-actions[bot] commented 10 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

niloc132 commented 10 months ago

Can someone confirm that the linked PR (merged as https://github.com/envoyproxy/envoy/commit/b58d31272913fa8bb1bd1b69b0102d710c97d431) has no tagged release? It does appear that the issue happens less frequently with the current dev-tagged docker image, 717264bcc05a, but it does still happen.

I made no other changes to the test project linked aside from updating the envoyproxy/envoy tag.

alyssawilk commented 10 months ago

Also worth noting: while in general Envoy will process frames in the order they were sent (headers, data, reset), if you have any filters which delay processing (pausing to buffer, or waiting for an RPC), I think it's possible for the reset to end up being processed before the headers and data. Could that be the issue you're seeing?

niloc132 commented 10 months ago

Can you elaborate on "any filters" - do you mean on the upstream cluster ("instance")? The filters in the sample project which reproduces the issue are pretty simple (and arguably could be simpler still - websockets isn't needed or used), though I can't imagine that removing the access log or something like that is what you're referring to.

https://github.com/niloc132/envoy-rst-stream/blob/117744ac3ba722560d05d62a533a9edc1e6c7af5/envoy.yaml#L11-L41

It has been a few weeks, but unless I misread my own notes above, the first Wireshark screenshot shows the upstream sending DATA, DATA, ..., HEADERS(end stream) and no RST_STREAM, and Envoy dropping the HEADERS and instead sending RST_STREAM to both upstream and client. That is, this shouldn't be a question of Envoy processing a RST_STREAM ahead of the HEADERS.

With that said, my reading of the gRPC sequence diagrams is that an endpoint can send HEADERS(end stream) followed by RST_STREAM - e.g. the server might say "here are the trailers, and I'm not accepting more messages" - but it would not be correct for the trailers to be discarded. To be clear: that is not what is happening here, it was just an early theory we were considering. This case may not exist in practice (and if not, I'm sorry for muddying the waters), but it seems to be expected that servers (gRPC servers, for example) may behave this way. https://httpwg.org/specs/rfc7540.html#StreamStates

alyssawilk commented 10 months ago

Sorry, I may not have been clear on why this is happening, but I do agree there appears to be at least one legitimate bug where a stop-sending condition can result in a response not being sent downstream. Can't make any promises, but I'll see if we can find someone to tackle it. cc @krajshiva

yanavlasov commented 10 months ago

Is there anything else in the log preceding the receipt of the trailers from upstream? Is it possible the request is timing out?

niloc132 commented 10 months ago

@yanavlasov Thanks for taking a look - the entire test fails in a fraction of a second, so I highly doubt there is a default timeout that low. The example linked should fail effectively instantly for each client request that is made.

The gRPC service is bidirectional streaming, but the server's implementation is "send a single 4 MB payload and then half-close", so it is functionally a unary call - this is for the sake of simplicity rather than being required to reproduce the bug. Once all 4 MB have been sent, the bug happens right away, with some dozens of DATA frames not passed to the client (and no HEADERS either). The example implementation sends int32 values in the bytes payload starting from around -4000000 and stopping at zero, so you can see from the last DATA frame the client received that it did not include all of the expected data (i.e. the payload did not end with 0x00000000).
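To make that concrete, below is a minimal grpc-java sketch of a server behaving as described: stream a large payload of counting int32 values in chunks, then half-close without waiting for the client to half-close. The TestServiceGrpc / HelloRequest / HelloResponse names and the setPayload field are assumed stand-ins for the repro project's generated classes, and the sizes are illustrative rather than the repro's exact values.

import com.google.protobuf.ByteString;
import io.grpc.stub.StreamObserver;
import java.nio.ByteBuffer;

/** Bidi handler that streams counting int32 values, then half-closes immediately. */
public class TestServiceImpl extends TestServiceGrpc.TestServiceImplBase {
  @Override
  public StreamObserver<HelloRequest> hello(StreamObserver<HelloResponse> responses) {
    return new StreamObserver<HelloRequest>() {
      @Override
      public void onNext(HelloRequest request) {
        // Send DATA in chunks: int32 values counting up from a large negative
        // start to 0, so the tail of the payload is easy to spot in a capture.
        final int perChunk = 25_000; // 100 KB of int32s per response message
        for (int start = -4_000_000; start <= 0; start += perChunk) {
          ByteBuffer buf = ByteBuffer.allocate(perChunk * 4);
          for (int i = start; i < start + perChunk && i <= 0; i++) {
            buf.putInt(i);
          }
          buf.flip();
          responses.onNext(
              HelloResponse.newBuilder().setPayload(ByteString.copyFrom(buf)).build());
        }
        // Half-close from the server side (DATA..., then trailers with
        // grpc-status: 0) without waiting for the client to half-close first.
        responses.onCompleted();
      }
      @Override public void onError(Throwable t) { }
      @Override public void onCompleted() { }
    };
  }
}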

alyssawilk commented 9 months ago

I'm not sure when I'll have the cycles to sort this one out, but I do think it's worth tackling, so I'm self-assigning mainly so I don't lose track of it.

ianebot commented 7 months ago

Hi @alyssawilk,

we experienced a similar issue with the grpc-js implementation using Istio. Here you can see my comments and tests: https://github.com/grpc/grpc-node/issues/2569#issuecomment-1883015207

Do you think you could take a look?

Thanks.

yanavlasov commented 6 months ago

We are currently looking into this problem and will update this issue once we have more info.

jmarantz commented 6 months ago

/sub

yanavlasov commented 6 months ago

This may take some time to resolve, as adding support for half-closed server streams is a significant change. I do not have an ETA at this point.

sio4 commented 5 months ago

Hey @niloc132, do you still have the pcap from the screenshots? I would like to analyze it if you can share it with me. I have the same (or at least a very similar) issue, and I thought I had identified the root cause, but I still need to confirm one more important point, and the pcap could be very helpful to prove it.

niloc132 commented 5 months ago

Sorry for my delay @sio4 - please email me at colinalworth@deephaven.io and I will forward them to you, otherwise you can run the sample project above to reproduce.

sio4 commented 4 months ago

Thanks @niloc132 for providing the pcap.

Indeed, it seems your issue is similar to one of mine, but not exactly the same. I investigated with trace logs from the grpc-java client and from Envoy, and the pattern looks the same. A pcap makes the issue very clear from a networking perspective, but trace logs from Envoy or the gRPC library can also help a lot, especially for high-level inspection.

The brief symptom is as follows.

The smooth scenario should be:

  1. the upstream server sends a HEADERS frame with END_STREAM at the end of the response, so it becomes half-closed (local)
  2. when Envoy receives the HEADERS frame with END_STREAM from the upstream, it
    1. becomes half-closed (remote) toward the upstream,
    2. sends out the last HEADERS with END_STREAM to the downstream if there are no deferred frames,
    3. becomes half-closed (local) toward the downstream,
    4. then sends RST_STREAM to both upstream and downstream
  3. when the upstream server receives the RST_STREAM while its state is half-closed (local), it closes the stream as normal
  4. when the downstream client receives the RST_STREAM after it became half-closed (remote) by receiving all DATA and HEADERS, it may generate an error depending on the error code

In your pcap, I can see that the downstream data streaming paused after packet number 166 (stream 3 is the downstream and stream 1 is the upstream here). So at this point, I guess Envoy was not able to send data to the downstream anymore for some reason. My guess (based on my own issue) is that it could be the result of the onAboveWriteBufferHighWatermark event, which means the downstream write buffer is full. After that moment, the pcap shows that receiving from the upstream continues until the HEADERS with END_STREAM, and finally Envoy sends out RST_STREAM.

[image]

This is slightly different from my issue. I thought my issue happens when Envoy gets a RST_STREAM while it still has deferred frames: Envoy interprets the RST_STREAM as an abort signal (which is generally true), so it just aborts everything it should do and discards all the deferred frames. However, in your case it had not received a RST_STREAM at that moment, so I guess Envoy just discards the deferred frames when it sends the RST_STREAM itself.

From my case, there are two points to be addressed:

  1. envoy should support half-closing: it should not send RST_STREAM after becoming half-closed (local), but should wait until it receives END_STREAM from the peer.
    • [ ] #34461 is for this point.
  2. envoy should interpret RST_STREAM with error code NO_ERROR as OK, since it does not mean there is any error in the response; the peer just wants to close the stream. Also, it should not discard the remaining deferred frames, since they are still proper data (NO_ERROR). See the sketch after this list.
    1. https://datatracker.ietf.org/doc/html/rfc9113#name-error-codes
    2. https://datatracker.ietf.org/doc/html/rfc9113#name-http-message-framing
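To illustrate point 2, here is a minimal sketch (an assumed model, not Envoy or gRPC source) of how a peer could map a RST_STREAM to a stream outcome per RFC 9113: NO_ERROR arriving after a complete response (END_STREAM already seen) is a clean close rather than a failure, and nothing already received should be discarded.

/** Sketch of the RST_STREAM interpretation suggested in point 2 above. */
public final class RstStreamOutcome {
  // Subset of the RFC 9113 error codes, for illustration only.
  enum Http2Error { NO_ERROR, PROTOCOL_ERROR, INTERNAL_ERROR, CANCEL }

  /**
   * @param endStreamSeen whether the peer already delivered END_STREAM
   *                      (i.e. all DATA and the trailing HEADERS arrived)
   * @param code          the error code carried by the RST_STREAM frame
   * @return true if the stream should be treated as completed successfully
   */
  static boolean completedOk(boolean endStreamSeen, Http2Error code) {
    // RST_STREAM(NO_ERROR) after a complete message only means "no more frames
    // are coming"; frames received before it must not be discarded.
    return endStreamSeen && code == Http2Error.NO_ERROR;
  }
}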

Another buggy behavior visible in your pcap is that Envoy sent RST_STREAM with error code NO_ERROR to the downstream while it had just discarded all the deferred frames. It should not be NO_ERROR, since that means "The associated condition is not a result of an error", while the fact is that the response is incorrect here.

[image]

So I guess Envoy still thinks it did everything it should have done correctly at this moment. IMHO, this is the critical issue here.

<...> the possibility that the Python client also receives the RST_STREAM but incorrectly reports success <...>

For this, I think the Python implementation is half correct. Since it got 200 OK at the beginning, and the RST_STREAM's error code is NO_ERROR, it is a successful response from HTTP/2's perspective. However, since this is gRPC, I think it should also interpret the gRPC status on the trailers, which is not available because Envoy did not proxy the HEADERS.

I hope we can investigate further, especially those uncertain parts. I suggest capturing trace logs on all the components (gRPC server and client, and Envoy) so we can see the events and decisions on each in more detail.

(By the way, the patterns in the screenshot in your opening post and in the pcap you shared with me look somewhat different.)

justinfx commented 2 months ago

When might this get tagged for a release? I just hit an issue in production that looks exactly the same, after raising our Python gRPC client's max receive message size above the default 4 MB and going through our Envoy load balancer.

Update: We have been running the latest envoy-distroless-dev image and still encountered this problem, so it doesn't seem to fix whatever variation of the problem we are seeing.

alyssawilk commented 2 months ago

Did you flip the runtime guard to enable this feature? It won't work unless envoy.reloadable_features.allow_multiplexed_upstream_half_close is set to true.

justinfx commented 2 months ago

@alyssawilk I didn't know it was behind a feature flag! I'm going to apply that!

DerekTBrown commented 2 months ago

I believe I am still observing this issue with the patched version (but would appreciate if someone could validate my method):

alyssawilk commented 2 months ago

Any chance you could provide a repro case and/or logs? There are definitely many reasons things can go south and while I'm unsure if there are other failure modes which would cause this particular error, it'd definitely be helpful to have the additional data to better assess.

justinfx commented 2 months ago

I too would love to see a repro. We still don't know how Envoy manages to get into this state in production. But so far we have been running ok since Sunday with a combination of:

A repro would be cool since we don't know if we have fixed it or are just lucky so far.

It does sound similar to envoy issue #8114

DerekTBrown commented 2 months ago

@justinfx @alyssawilk I spent some time digging into this. It looks like there was more than one reason why we were seeing missing HEADERS in our environment. Enabling the runtime feature fix seems to have reduced the error rate, but I am also seeing grpc-go send RST_STREAM without trailers:

https://github.com/grpc/grpc-go/issues/7623

alyssawilk commented 1 month ago

Glad this has reduced your error rate. I don't think the information in the linked issue is enough for us to repro. Any chance we could get more detailed info and/or logs? I'm happy to supply email addresses if there's info you can share but don't want posted on the public internet.

yanavlasov commented 1 month ago

I'm unable to recreate this problem using the repro case from @niloc132

@DerekTBrown from your last comment it is unclear to me if the problem is related to Envoy or not. If you think this is Envoy related, please provide a repro case, preferably without Istio involved. I think opening a new Issue would be best.

justinfx commented 1 month ago

In my case we were never able to isolate a repro; it would occur at random. At this point we have been running a combination of the new fix (via the Envoy dev container image) and a few configuration tweaks, and it has been solid for 1.5 weeks. I don't know whether it was the fix from this issue or the config tweaks, but if it is no longer happening I am not inclined to investigate further.

DerekTBrown commented 1 month ago

@DerekTBrown from your last comment it is unclear to me if the problem is related to Envoy or not. If you think this is Envoy related, please provide a repro case, preferably without Istio involved. I think opening a new Issue would be best.

I believe this issue has been resolved on the Envoy side. I was observing both this issue, and a separate issue with the same symptoms (i.e. no trailers), but coming directly from the application itself (https://github.com/grpc/grpc-go/issues/7623). Sorry for the false alarm.

justinfx commented 2 weeks ago

Is this part of a recent tagged release? I didn't see it.

alyssawilk commented 1 week ago

Yes, but AFAIK it is still off by default, as the change was high risk and internal testing is not complete.

justinfx commented 1 week ago

I see. FWIW, we have been running the first dev container image that contained the commit since it was pushed, and have not experienced the issue since.