envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

grpc: requests erroring with upstream_reset_after_response_started{remote_reset} after timeout in envoy v1.31.0 #36188

Open shulin-sq opened 1 month ago

shulin-sq commented 1 month ago

Title: GRPC requests erroring with upstream_reset_after_response_started{remote_reset} after timeout in envoy v1.31.0

Description: Apologies if the repro steps here are a bit bare; we are still trying to find a reasonable repro, but I wanted to open an issue in case there is already an obvious related discussion that I missed. My ask is this: is there something that went out in 1.31.0 that may have caused issues with how Envoy handles gRPC?

We recently upgraded from envoy 1.30.3 to 1.31.0. After this release we saw this behavior in our mesh:

{
        "attributes": {
            "envoy_status_code": "200",
            "bytes_received": 99,
            "timing": {
                "duration": 30001,
                "request_duration": 0,
                "response_duration": 7
            },
            "bytes_sent": 0,
            "size": 2612,
            "response": {
                "response_code_details": "upstream_reset_after_response_started{remote_reset}",
                "grpc-status": "-/-", # this format is "grpc-status": "%RESP(GRPC-STATUS)%/%TRAILER(GRPC-STATUS)%",
                "response_code": 200,
                "x-envoy-upstream-service-time": "6",
                "flags": "UR"
            },
        }
}

gRPC responses were erroring with upstream_reset_after_response_started{remote_reset} and timing out at the max timeout of 30000 ms.

Normally these requests take <10 ms and do not use much bandwidth.

Reverting version to 1.30.3 seems to have "fixed" the issue. Our current running theory is that


Here's the log line of a "normal" request (relevant parts):

{
        "attributes": {
            "envoy_status_code": "200",
            "bytes_received": 100,
            "timing": {
                "duration": 11,
                "response_tx_duration": 0,
                "request_duration": 0,
                "response_duration": 11
            },
            "bytes_sent": 80,
            "size": 2565,
            "response": {
                "response_code_details": "via_upstream",
                "grpc-status": "-/0", # this format is "grpc-status": "%RESP(GRPC-STATUS)%/%TRAILER(GRPC-STATUS)%",
                "response_code": 200,
                "x-envoy-upstream-service-time": "10",
                "flags": "-"
            },
        }
}

The client is grpc-java-netty/1.62.2.
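
For reference, an access-log json_format along the lines below would emit fields like the ones quoted above. The logger choice and field names here are illustrative assumptions rather than the actual logging config from this deployment; only the %RESP(GRPC-STATUS)%/%TRAILER(GRPC-STATUS)% format string is taken from the log comments above.

# Illustrative sketch only: an access logger emitting JSON fields similar
# to the excerpts above. The actual logging configuration is not shown in
# this issue; the grpc-status format is the one quoted in the log comments.
access_log:
- name: envoy.access_loggers.stdout
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
    log_format:
      json_format:
        envoy_status_code: "%RESPONSE_CODE%"
        response_code_details: "%RESPONSE_CODE_DETAILS%"
        "grpc-status": "%RESP(GRPC-STATUS)%/%TRAILER(GRPC-STATUS)%"
        "x-envoy-upstream-service-time": "%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%"
        flags: "%RESPONSE_FLAGS%"
        duration: "%DURATION%"
        request_duration: "%REQUEST_DURATION%"
        response_duration: "%RESPONSE_DURATION%"
        bytes_received: "%BYTES_RECEIVED%"
        bytes_sent: "%BYTES_SENT%"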

zuercher commented 1 month ago

I don't see any other reports of similar behavior. Can you provide some details on your Envoy configuration? Can you correlate the request/response on the upstream server with request/response from Envoy? That is, where is the 30 seconds being spent? A packet trace might be useful.

shulin-sq commented 1 month ago

@zuercher unfortunately this issue has been very hard to reproduce. It only occurs after some time in one of our prod applications and I have been unable to create a test environment that does the same thing.

For now I can share a few more suspicions in case they help with triaging.

Our traces show that the application responds quickly and the 30 s is spent in Envoy: the app has already responded, but Envoy waits until the timeout, which is set in the gRPC client, expires.
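
For context, one way Envoy ends up bounded by a client-set deadline is by honoring the grpc-timeout request header that grpc-java sends when a deadline is set. The route fragment below is a hypothetical sketch of that setup; the virtual host and cluster names are made up, and this is not the actual config from this mesh.

# Hypothetical RouteConfiguration fragment: rely on the client's grpc-timeout
# header instead of a fixed route timeout. Names are made up for illustration.
virtual_hosts:
- name: grpc_services                    # assumed virtual host name
  domains: ["*"]
  routes:
  - match: { prefix: "/" }
    route:
      cluster: grpc_backend              # assumed cluster name
      timeout: 0s                        # disable the fixed route timeout
      max_stream_duration:
        grpc_timeout_header_max: 0s      # 0s = honor the client's grpc-timeout header as-is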

zuercher commented 1 month ago

It's possible this is related to the previously reported oghttp2 issues. It would be useful to know whether upgrading to 1.31.2 (or disabling oghttp2 in your build using the runtime flag) resolves the issue.
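
A minimal sketch of turning oghttp2 off via a static runtime layer in the bootstrap is shown below. The guard name envoy.reloadable_features.http2_use_oghttp2 is an assumption here; verify it against the runtime guards documented for the Envoy version in use.

# Bootstrap fragment: disable the oghttp2 HTTP/2 codec through a static
# runtime layer. The guard name below is assumed; confirm it for your
# Envoy version before relying on it.
layered_runtime:
  layers:
  - name: static_layer_0
    static_layer:
      envoy.reloadable_features.http2_use_oghttp2: false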

jmarantz commented 1 month ago

Assigning back to the reporter to address @zuercher's question. Based on that we can re-triage.

shulin-sq commented 1 month ago

@zuercher do you have any tips for triggering the issues seen with oghttp2? I read the previous reports, and it seems the error rate people found was very low.

I'm only able to get repros hours after an app has been running in production, but this affects some of our production traffic. I tried running a sample app over the weekend at around 20 rps of HTTP/2 traffic to one pod, but only got 1 repro request over 3 days. I feel that is too low to confidently say that turning off the flag would resolve the issue.

zuercher commented 1 month ago

Given the CVE and potential crash, I'd turn oghttp2 off in production even if this isn't related. It seems that you see this error frequently enough in production to be able to definitively say if it's related or not once oghttp2 is disabled.

shulin-sq commented 1 month ago

@zuercher over 3 days we reproduced the issue on 1.31.0 only with the oghttp2 flag enabled, so we surmise that this is related to the flag.

Since it's been turned off in 1.31.2, I would guess there's less of a chance that this will get reported again.

Unfortunately I was unable to reproduce this with a test app that I could upload here and had to use a prod app, but the conditions were:

shulin-sq commented 1 month ago

I will let @zuercher decide if this issue should be closed.

I'm wondering if there are plans to re-enable oghttp2 and whether there's any way we can confirm that this doesn't recur.

jmarantz commented 1 month ago

@birenroy @paul-r-gall @RyanTheOptimist fyi.

Just to clarify, @shulin-sq, your observation is that in prod this problem repros occasionally with oghttp2 enabled, and never with oghttp2 disabled. Correct?

Thanks!

jmarantz commented 1 month ago

@yanavlasov as well

shulin-sq commented 1 month ago

@jmarantz that's correct. It would eventually repro after a couple of hours on all nodes running the 1.31.0 release with no modification to the flags. I then ran 1.31.0 with the oghttp2 flag turned off for 24 hours and could not get a repro under the same conditions.

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] commented 1 week ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.