envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0
25.14k stars 4.83k forks source link

Inconsistent timeout handling between max_stream_timeout and router global timeout #16129

Open snowp opened 3 years ago

snowp commented 3 years ago

Due to recent changes to the timeout logic (https://github.com/envoyproxy/envoy/pull/15585, https://github.com/envoyproxy/envoy/pull/13302/files) there seems to be some inconsistencies in how timeouts are enforced, leading to surprising behavior when updating Envoy:

The max_stream_timeout falls back to using the route timeout under certain conditions: https://github.com/envoyproxy/envoy/blob/main/source/common/http/conn_manager_impl.cc#L1178-L1195

Prior to all these changes, users were able to override the route timeout by setting https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/router_filter#x-envoy-upstream-rq-timeout-ms. This breaks with the above changes: even if the router takes this into account when setting its global timeout, the max_stream_timeout fires and ends the request early.

It seems to me like we should either 1) not check route->timeout on the HCM level, and only read this as part of the router where it can be overriden by the header. To address @ramaraochavali's bug, I think we would have to make sure that we configure a global timeout on the router level, or 2) include checks for x-envoy-upstream-rq-timeout-ms in the HCM refreshDurationTimeout computation.

The second option seems to be the right thing directionally (continues moving this timeout handling to the HCM), but has a few drawbacks: the semantics of x-envoy-upstream-rq-timeout-ms setting the timeout used for the upstream processing is lost, and we'd require a refreshCache in order to apply the new value of the header (previously this would not be required since it gets read by the router). The first option preserves the previous behavior better, but it becomes less clear how we'd unify these the two code paths.

@ramaraochavali @antoniovicente @alyssawilk @mattklein123

xjtian commented 3 years ago

Additionally, #13302 introduces a new behavior where the grpc-timeout header is not respected if grpc_timeout_header_max is not set on the route: link. We observe this on bidirectional GRPC streams where setting grpc-timeout on the client side to something greater than the route's timeout results in the stream closing once the route timeout is hit.

Edit: this case was uncovered by #15585 which added the logic to fall back to route timeout. Before that, the stream was not timebound.

antoniovicente commented 3 years ago

Regarding use of x-envoy-upstream-rq-timeout-ms as an override, is the intent to allow this header to adjust the timeout downwards, upwards or both? Effectively respecting the min of these timeouts may be the most tractable option.

antoniovicente commented 3 years ago

This comment thread may be relevant: https://github.com/envoyproxy/envoy/pull/15585#discussion_r607312415

If we had unified logic to determine the timeout and were to use timeout_.globaltimeout as a source of a timeout value we'ld get more consistent behavior. FilterUtility::trySetGlobalTimeout sets timeout_.globaltimeout based on the contents of the relevant request header.

mattklein123 commented 3 years ago

Per offline convo in the maintainers chat, it makes me very uncomfortable that we keep having issues in this area. I think we a) don't have a clear understanding of the behavior we want and b) don't have enough tests.

I think @snowp is going to analyze the change history but my vote is to revert as much as possible and then come back with a holistic plan on what we need to do to not break any existing users. If we do go with a runtime flag I think we need to default it to off vs. on if we know there are breaking issues for users.

snowp commented 3 years ago

Regarding use of x-envoy-upstream-rq-timeout-ms as an override, is the intent to allow this header to adjust the timeout downwards, upwards or both? Effectively respecting the min of these timeouts may be the most tractable option.

@mattklein123 might be able to fill in the historical bits here but I believe at one point the only (?) per stream timeout mechanism was enforced by the router (the "global" timeout), and this is the timeout that was modifiable by the header. This is the total amount of time that the router may spend trying to send the stream upstream, including retries. As a result this isn't really the timeout for an upstream to handle the stream (this is given by x-envoy-upstream-rq-per-try-timeout-ms instead), but also not quite the downstream timeout as it doesn't include time spent in filters before the router (e.g. time waiting for ext_authz).

I think the intent of the changes in https://github.com/envoyproxy/envoy/pull/13302 was to move this global timeout handling to the HCM and remove it from the router to make it a real downstream timeout, but this has several consequences:

Based on these I think there are a few things we need to reach consensus on:

jmarantz commented 3 years ago

@yanjunxiang-google is going to explore this; not ready to assign it yet.

markdroth commented 2 weeks ago

It looks like no one from gRPC was tagged on this issue -- I only discovered it because of #37134. In the future, it might be a good idea to loop us in on questions that affect gRPC semantics. @dfawley @ejona86

I just want to note that whatever we decide here, we need to make sure not to break gRPC timeout semantics, as documented here:

https://github.com/grpc/proposal/blob/master/A31-xds-timeout-support-and-config-selector.md#supported-fields