niloc132 closed this issue 2 months ago.
The 1.27.1 release with a fix for resetting streams was just brought to my attention - I tweaked the linked project to use the v1.27.1 envoy image, and can confirm that this did not fix the issue.
CC @yanavlasov @alyssawilk
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
Adding comment to trigger activity. Hoping project member can take a look. Thanks!
Sorry we haven't been able to get to this yet. It could be a codec issue or an Envoy issue and we're just now changing the underlying codec in https://github.com/envoyproxy/envoy/pull/31086. Can you see if that addresses the problem and if not I'll try to take a look?
Is there an easy way to get a docker image for an outstanding PR? I see https://github.com/envoyproxy/envoy/tree/main/ci#ubuntu-envoy-image, but it looks like envoyproxy/envoy-build-ubuntu hasn't been updated recently.
Maybe just keep an eye on that one - will hopefully merge this week and it'll be easier to test
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
Can someone confirm that the linked PR (merged as https://github.com/envoyproxy/envoy/commit/b58d31272913fa8bb1bd1b69b0102d710c97d431) has no tagged release? It does appear that the issue happens less frequently with the current dev-tagged docker image, 717264bcc05a, but it does still happen.
I made no other changes to the test project linked aside from updating the envoyproxy/envoy tag.
Also worth noting, while in general Envoy will process frames in the order they were sent (headers, data, reset), if you have any filters which delay processing (pause to buffer or for an RPC) I think it's possible for the reset to end up being processed before headers and data. Could that be the issue you're seeing?
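(As an illustration of the kind of filter that can delay processing - my own example, not something the repro project below uses - an ext_authz filter pauses the request while it waits on an authorization RPC. A fragment along these lines, with a placeholder authz_cluster, is the sort of thing meant:)

```yaml
# Hypothetical http_filters fragment for an HttpConnectionManager; the
# placeholder cluster name "authz_cluster" is illustrative only. The
# ext_authz filter holds the stream until its gRPC check completes.
http_filters:
  - name: envoy.filters.http.ext_authz
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.ext_authz.v3.ExtAuthz
      grpc_service:
        envoy_grpc:
          cluster_name: authz_cluster
        timeout: 0.25s
  - name: envoy.filters.http.router
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```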
Can you elaborate on "any filters" - do you mean on the upstream cluster ("instance")? The filters in the sample project which reproduces the issue are pretty simple (though arguably they could be simpler still - websockets isn't needed/used), though I can't imagine that removing the access log or something is what you're referring to?
It has been a few weeks, but unless I misread my own notes above, the first screenshot of wireshark shows the upstream sending DATA, DATA,... HEADERS(close), and no RST_STREAM, and envoy drops the HEADERS, instead sending RST_STREAM to both upstream and client. That is, this shouldn't be a question of envoy processing the RST_STREAM ahead of HEADERS.
With that said, my read of the grpc sequence diagrams is that an endpoint can send HEADERS(close), RST_STREAM, e.g. the server might say "here's the trailers, and I'm not accepting more messages" - but it would not be correct for the trailers to be discarded. To be clear: that is not what is happening here, but was an early theory we were considering. This may not practically exist (and if not, I'm sorry for muddying the waters here), but it seems to be expected that servers (grpc, for example) behave this way. https://httpwg.org/specs/rfc7540.html#StreamStates
Sorry, I may not have been clear on why this is happening, but I do agree there appears to be at least one legit bug where stopping sending can result in a response not being sent downstream. Can't make any promises but I'll see if we can find someone to tackle it. cc @krajshiva
Is there anything else in the log preceding the receipt of the trailer from upstream? Is it possible the request is timing out?
@yanavlasov Thanks for taking a look - the entire test fails in a fraction of a second, so I highly doubt there is a default timeout that low. The example linked should fail effectively instantly for each client request that is made.
The grpc service is bi-di streaming, but the server's implementation is "send a single 4MB payload and then half-close", so it functionally is a unary call - this is for the sake of simplicity, rather than required to reproduce the bug. Once all 4MB has been sent, the bug happens right away, with some dozens of DATA frames not passed to the client (and no HEADERS either, etc). The example implementation sends int32 values in the bytes payload starting from -4000000 or so and stopping at zero, so you can see in the last DATA frame that it did not include all of the expected data (i.e. the last byte wasn't 0x00000000).
I'm not sure when I'll have the cycles to sort this one out but I do think it's worth tackling so self-assigning mainly so I don't lose track of it
Hi @alyssawilk,
we experienced a similar issue with the grpc-js implementation using Istio. Here you can see my comments and tests: https://github.com/grpc/grpc-node/issues/2569#issuecomment-1883015207
Do you think you could take a look?
Thanks.
We are currently looking into this problem and will update this issue once we have more info.
/sub
This may take some time to resolve, as adding support for half-closed server streams is a significant change. I do not have an ETA at this point.
Hey @niloc132, Do you still have the pcap of the screenshot? I would like to analyze the pcap if you can share that with me. I have the same (or at least very close) issue, and I thought I identified the root cause, but today I need to confirm one more important point and the pcap could be very helpful to prove my point.
Sorry for my delay @sio4 - please email me at colinalworth@deephaven.io and I will forward them to you, otherwise you can run the sample project above to reproduce.
Thanks @niloc132 for providing the pcap.
Indeed, it seems your issue is similar to one of mine but not exactly the same. I did an investigation with trace logs from the grpc-java client and Envoy's trace log, and the pattern looks the same. A pcap makes it very clear to see the issue from a networking perspective, but sometimes trace logs from Envoy or the gRPC library can help a lot, especially for high-level inspection.
The brief symptom is as follows. The smooth scenario should be:

- The upstream sends a HEADER frame with END_STREAM at the end of the response, so it becomes the half-closed (local) state.
- Envoy receives the HEADER frame with END_STREAM from the upstream and becomes half-closed (remote) for the upstream.
- Envoy sends the HEADER with END_STREAM to the downstream if there are no deferred frames, and becomes half-closed (local) for the downstream.
- Envoy sends RST_STREAM to both upstream and downstream.
- If a peer receives RST_STREAM while its state is half-closed (local), it closes the stream as normal.
- If a peer receives RST_STREAM after it became half-closed (remote) by receiving all DATA and HEADER frames, it could generate an error depending on the error code.

In your pcap, I can see the downstream data streaming paused after packet number 166 (stream 3 is the downstream and stream 1 is the upstream here). So at this point, I guess Envoy was not able to send data to the downstream anymore for some reason. My guess (based on my own issue) is that it could be the result of the onAboveWriteBufferHighWatermark event, which means the downstream write buffer is full. After that moment, the pcap shows that receiving from the upstream continued until the HEADER with END_STREAM, and finally Envoy sent out RST_STREAM.
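(Side note, and an assumption on my part rather than something established in this thread: the onAboveWriteBufferHighWatermark event fires when a connection's write buffer exceeds its configured limit, which is set by per_connection_buffer_limit_bytes on the listener and cluster (default around 1 MiB). Not suggesting tuning it as a fix, just showing where that buffer is configured - a fragment only, with placeholder names:)

```yaml
# Fragment/sketch: per_connection_buffer_limit_bytes is a real field on both
# listeners and clusters; names and values here are placeholders, and the
# rest of the listener/cluster definitions are elided.
static_resources:
  listeners:
    - name: grpc_listener
      per_connection_buffer_limit_bytes: 4194304   # downstream write buffer limit
      # ... address and filter_chains as in the actual envoy.yaml ...
  clusters:
    - name: grpc_backend
      per_connection_buffer_limit_bytes: 4194304   # upstream connection buffer limit
      # ... endpoints and protocol options as in the actual envoy.yaml ...
```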
This is slightly different from my issue. I thought my issue happens when Envoy gets RST_STREAM while it still has deferred frames: Envoy interprets RST_STREAM as an abort signal (which is generally true), so it just aborts all the actions it should take and discards all the deferred frames. However, in your case it has not received RST_STREAM at that moment, so I guess Envoy just discards the deferred frames when it sends RST_STREAM.
From my case, there are two points to be addressed:

- Envoy should not send RST_STREAM right after becoming half-closed (local); it should wait until receiving END_STREAM from the peer.
- Envoy should treat RST_STREAM with error code NO_ERROR as OK, since it does not mean there is any error on the response; the peer just wants to close the stream. Also, it should not discard the remaining deferred frames, since they are still proper data (NO_ERROR).
One other buggy behavior from your pcap is that Envoy sent RST_STREAM with error code NO_ERROR to the downstream while it had just discarded all deferred frames. It should not be NO_ERROR, since that means "The associated condition is not a result of an error", but the fact is that the response is incorrect here. So I guess Envoy still thinks it did everything it should do correctly at this moment. IMHO, the critical issue here is this point.
<...> the python client getting the RST_STREAM as well but incorrectly reporting success. <...>
For this, I think the Python implementation is half correct. Since it gets 200 OK at the beginning, and the RST_STREAM's error code is NO_ERROR, it is a successful response from the HTTP/2 perspective. However, since it is gRPC, I think it should interpret the gRPC status on the trailers too, which is not available because Envoy did not proxy the HEADER.
I hope we can investigate further, especially those uncertain parts. I suggest you try getting trace logs on all the components (gRPC server and client, and Envoy) so we can see more detailed events and decisions on them.
(By the way, the patterns in the screenshot in your opening post and in the pcap you shared with me look somewhat different.)
When might this get tagged for release? I just hit an issue in production that looks exactly the same, after having raised our python grpc client max receive message size higher than the default 4MB, and going through our Envoy load balancer.
Update: We have been running the latest envoy-distroless-dev image and still encountered this problem. So it doesn't seem to fix whatever variation of the problem we are seeing.
Did you flip the runtime guard to enable this feature? It won't work unless envoy.reloadable_features.allow_multiplexed_upstream_half_close is set to true.
@alyssawilk I didn't know it was behind a feature flag! I'm going to apply that!
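(For anyone else who missed that this is behind a runtime guard: with a plain Envoy deployment the flag can be set via a static runtime layer in the bootstrap, e.g. something like the sketch below - only the flag name comes from this thread, the layer name is arbitrary. In Istio the same key goes into the proxy's static_layer, as in the next comment.)

```yaml
# Minimal runtime-layer sketch for the bootstrap (envoy.yaml); the layer
# name "static_layer_0" is arbitrary.
layered_runtime:
  layers:
    - name: static_layer_0
      static_layer:
        envoy.reloadable_features.allow_multiplexed_upstream_half_close: true
```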
I believe I am still observing this issue with the patched version (but would appreciate if someone could validate my method):

- I am running istio-testing/proxyv2:1.24-alpha.ceeeab0e900812087dae52d851584a73d5939290-distroless (a version of the Istio proxy containing the patched version of Envoy).
- I configured a static_layer with "envoy.reloadable_features.allow_multiplexed_upstream_half_close": true.
- I still see client errors of the form 13 INTERNAL: Received RST_STREAM with code 0 (Call ended without gRPC status).

Any chance you could provide a repro case and/or logs? There are definitely many reasons things can go south and while I'm unsure if there are other failure modes which would cause this particular error, it'd definitely be helpful to have the additional data to better assess.
I too would love to see a repro. We still don't know how Envoy manages to get into this state in production. But so far we have been running ok since Sunday with a combination of:
A repro would be cool since we don't know if we have fixed it or are just lucky so far.
It does sound similar to envoy issue #8114
@justinfx @alyssawilk I spent some time digging into this. It looks like there was more than one reason why we were seeing missing HEADERS in our environment. Enabling the runtime feature fix seems to have reduced the error rate, but I am seeing grpc-go send RST_STREAM without trailers as well:
Glad this has reduced your error rate. I don't think the information in the attached issue is enough for us to repro. Any chance again we could get more detailed info and/or logs? I'm happy to supply you with email addresses if there's info you can share but don't want posted on the public internet
I'm unable to recreate this problem using the repro case from @niloc132
@DerekTBrown from your last comment it is unclear to me if the problem is related to Envoy or not. If you think this is Envoy related, please provide a repro case, preferably without Istio involved. I think opening a new Issue would be best.
In my case we were never able to actually isolate a repro. It would occur at random. But at this point we have been running a combination of this new fix feature via the envoy dev container image, and also a few configuration tweaks, and it has been running solid for 1.5 weeks. I do not know if it was the fix in this issue, or the tweaks we made to the config, but if it is no longer happening, I am not inclined to investigate any further
@DerekTBrown from your last comment it is unclear to me if the problem is related to Envoy or not. If you think this is Envoy related, please provide a repro case, preferably without Istio involved. I think opening a new Issue would be best.
I believe this issue has been resolved on the Envoy side. I was observing both this issue, and a separate issue with the same symptoms (i.e. no trailers), but coming directly from the application itself (https://github.com/grpc/grpc-go/issues/7623). Sorry for the false alarm.
Is this part of a recent tagged release? I didn't see it.
Yes, but AFAIK it is still off by default, as the change was high risk and internal testing is not complete.
I see. FWIW we have been running the first dev container that contained the commit, since it was pushed, and have not experienced the issue since.
Title: RST_STREAM sent to downstream client instead of remaining DATA, HEADERS on end stream
Description: In some circumstances, it is possible for envoy to skip sending remaining DATA, HEADERS frames and instead send RST_STREAM, resulting in errors in the downstream client. In this case, we have a gRPC service returning data to a client, and some client implementations get an error instead of seeing the expected grpc-status: 0 trailers. The stream needs to be bidirectional, with the client having not half-closed from its side yet (which in some cases offers a potential workaround).

Downstream error reported by netty-grpc:
Downstream error reported by okhttp-grpc:
No error is reported when using the Python gRPC client (which we understand to be a cython wrapper around the C++ client). We speculate that the C++ client might be handling data faster than the Java clients, and that the bug might be rooted in Envoy not being prepared to forward all remaining data for some reason.
Note that I have not validated this issue in non-gRPC contexts, but there is nothing specific to gRPC in the envoy.yaml, so it seems likely that this impacts other h2 upstream+downstream bidirectional streaming use cases as well.

Repro steps:
To try this out, we've built a sample git repository to demonstrate the issue at https://github.com/niloc132/envoy-rst-stream. Instructions are there to run the (default) Java server and Java client, but either can be replaced with a Python implementation.
Admin and Stats Output:
http://localhost:9090/clusters
http://localhost:9090/routes - This doesn't appear to be a valid admin URL.
http://localhost:9090/server_info - https://gist.github.com/niloc132/6516b8fcae91a7f2c4ae23024aeb5b2c

Config: See https://github.com/niloc132/envoy-rst-stream/blob/main/envoy.yaml along with the rest of the repo to reproduce this issue
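(For readers who don't open the repo: the general shape of an envoy.yaml that proxies h2 downstream traffic to an h2/gRPC upstream is roughly the sketch below. This is not the linked file - names, ports, and addresses are placeholders - it is only meant to show that nothing gRPC-specific is required.)

```yaml
# Placeholder sketch of an h2-in / h2-out proxy, not the actual repro config.
# The listener accepts HTTP/2 and routes everything to an HTTP/2 upstream.
static_resources:
  listeners:
    - name: ingress
      address:
        socket_address: { address: 0.0.0.0, port_value: 10000 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress_http
                codec_type: AUTO
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: backend
                      domains: ["*"]
                      routes:
                        - match: { prefix: "/" }
                          route: { cluster: grpc_backend }
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
  clusters:
    - name: grpc_backend
      type: STRICT_DNS
      load_assignment:
        cluster_name: grpc_backend
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address: { address: grpc-server, port_value: 8080 }
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          explicit_http_config:
            http2_protocol_options: {}
```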
Logs: Access log entry for the request, showing 4MB of pb payload sent "successfully", despite the client showing an error:
This log has level=trace, upstream=off, dns=off, config=off (and includes the access logs with the above line): https://gist.github.com/niloc132/521a45c0eb38c7733593c56e8c06fc02

Excerpt which appears to show that envoy says it received the trailers from upstream and sent them to the client:
Wireshark screencap showing upstream (192.168.48.2:8080) sending envoy (192.168.48.3) HEADERS with end of stream set to true, then envoy sending RST_STREAM to both downstream client (192.168.48.1) and upstream. These messages are highlighted - other DATA messages from upstream->envoy and envoy->client can also be seen.
Conversely, here's a successful Python client (irritatingly both streams had id 1, so be sure to check source/destination IP)
And here's a Java client also succeeding through envoy
I can provide the pcapng file for any of these if desired, but the example project should easily reproduce this as well.
Call Stack: N/A
Other notes: Through wireshark and through setting additional custom trailers on server responses, we've ruled out the possibility that the python client is getting the RST_STREAM as well but incorrectly reporting success. We also tested with both the Netty and OkHttp clients and both exhibit the error, giving a degree of confidence that this is an envoy issue rather than both implementations having the same bug.