Open · shulin-sq opened this issue 1 month ago
I don't see any other reports of similar behavior. Can you provide some details on your Envoy configuration? Can you correlate the request/response on the upstream server with request/response from Envoy? That is, where is the 30 seconds being spent? A packet trace might be useful.
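To help answer the "where is the 30 seconds being spent" question, here is a minimal access-log sketch (not taken from the reporter's config) that breaks a request into per-phase durations using standard Envoy substitution format commands; the HTTP connection manager it attaches to is assumed.

```yaml
# Sketch: attach to the HTTP connection manager in front of the gRPC upstream.
access_log:
- name: envoy.access_loggers.stdout
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
    log_format:
      json_format:
        start_time: "%START_TIME%"
        method: "%REQ(:METHOD)%"
        path: "%REQ(:PATH)%"
        response_flags: "%RESPONSE_FLAGS%"
        response_code_details: "%RESPONSE_CODE_DETAILS%"
        grpc_status: "%GRPC_STATUS%"
        # start -> last downstream request byte received
        request_duration_ms: "%REQUEST_DURATION%"
        # start -> first upstream response byte received
        response_duration_ms: "%RESPONSE_DURATION%"
        # start -> last downstream response byte sent
        total_duration_ms: "%DURATION%"
        upstream_host: "%UPSTREAM_HOST%"
```

If the upstream has already answered, the affected requests should show a small response_duration_ms while total_duration_ms sits near 30000.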
@zuercher unfortunately this issue has been very hard to reproduce. It only occurs after some time in one of our prod applications and I have been unable to create a test environment that does the same thing.
for now I can share a few more suspicions in case it helps with triaging
Our traces show that the application responds quickly and that the 30s is spent in Envoy: the app has already responded, but Envoy holds the request until the timeout fires. The 30s timeout is set in the gRPC client.
It's possible this is related. It would be useful to know if upgrading to 1.31.2 (or disabling oghttp2 in your build using the runtime flag) resolves the issue.
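For reference, the runtime override would look roughly like the sketch below, assuming the reloadable feature flag gating the codec is named `envoy.reloadable_features.http2_use_oghttp2`; adjust it to whatever your bootstrap already uses for runtime layers.

```yaml
# Bootstrap sketch: force the HTTP/2 codec flag off via a static runtime layer.
layered_runtime:
  layers:
  - name: static_layer_0
    static_layer:
      # Assumed flag name gating the oghttp2 codec; false keeps the nghttp2
      # codec regardless of the release default.
      envoy.reloadable_features.http2_use_oghttp2: false
```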
Assigning back to the reporter to address @zuercher's question. Based on that we can re-triage.
@zuercher do you have any tips for triggering the issues seen with oghttp2? I read the previous reports and it seems the error rate people found was very low.
I'm only able to get repros hours after an app has been running in production, but it does affect some of our production traffic. I ran a sample app over the weekend at around 20 rps of HTTP/2 traffic to one pod and only got 1 repro request over 3 days, which feels too low to say with confidence that turning off the flag will resolve the issue.
Given the CVE and potential crash, I'd turn oghttp2 off in production even if this isn't related. It seems that you see this error frequently enough in production to be able to definitively say if it's related or not once oghttp2 is disabled.
@zuercher over 3 days we reproduced the issue on 1.31.0 only with the oghttp2 flag enabled, so we conclude that this is related to the flag.
Since it's been turned off in 1.31.2, I'd guess there's less of a chance that this will get reported again.
Unfortunately I was unable to reproduce this with a test app that I could upload here and had to use a prod app, but the conditions were:
I will let @zuercher decide if this issue should be closed
I'm wondering if there are plans to re-enable oghttp2 and whether there's any way we can confirm that this doesn't recur.
@birenroy @paul-r-gall @RyanTheOptimist fyi.
Just to clarify, @shulin-sq , your observation is that in prod this problem repros occasionally with oghttp2 enabled, and never with oghttp2 disabled. Correct?
Thanks!
@yanavlasov as well
@jmarantz that's correct. It would eventually repro after a couple of hours on all nodes with the 1.31.0 release with no modification to the flags. I then ran 1.31.0 with the oghttp2 flag turned off for 24 hours and could not get a repro under the same conditions.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
Title: GRPC requests erroring with upstream_reset_after_response_started{remote_reset} after timeout in envoy v1.31.0
Description: Apologies if the repro steps here are a bit bare; we are still trying to find a reliable repro, but I wanted to open an issue in case there is an obviously related discussion that I missed. My ask is this: did something go out in 1.31.0 that may have caused issues with how Envoy handles gRPC?
We recently upgraded from envoy 1.30.3 to 1.31.0. After this release we saw this behavior in our mesh:
gRPC responses were erroring with upstream_reset_after_response_started{remote_reset} and timing out at the max timeout of 30000 ms.
Normally these requests take <10ms and are not very large in bandwidth.
Reverting to version 1.30.3 seems to have "fixed" the issue. Our current running theory is that
Here's the log line of a "normal" request (relevant parts):
The client is grpc-java-netty/1.62.2.