snowp opened this issue 3 years ago
Additionally, #13302 introduces a new behavior where the `grpc-timeout` header is not respected if `grpc_timeout_header_max` is not set on the route: link. We observe this on bidirectional gRPC streams, where setting `grpc-timeout` on the client side to something greater than the route's timeout results in the stream closing once the route timeout is hit.
Edit: this case was uncovered by #15585, which added the logic to fall back to the route timeout. Before that change, the stream was not time-bound.
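For reference, a minimal route sketch that keeps `grpc-timeout` honored by setting `grpc_timeout_header_max` (field names are from the v3 `RouteAction.MaxStreamDuration` API; the cluster name and values are illustrative):

```yaml
routes:
- match: { prefix: "/" }
  route:
    cluster: grpc_backend        # illustrative cluster name
    timeout: 15s                 # router-level route timeout
    max_stream_duration:
      # 0s means the grpc-timeout header is honored without being capped;
      # a non-zero value caps whatever the client requests.
      grpc_timeout_header_max: 0s
```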
Regarding the use of `x-envoy-upstream-rq-timeout-ms` as an override: is the intent to allow this header to adjust the timeout downwards, upwards, or both? Effectively respecting the min of these timeouts may be the most tractable option.
This comment thread may be relevant: https://github.com/envoyproxy/envoy/pull/15585#discussion_r607312415
If we had unified logic to determine the timeout and were to use `timeout_.global_timeout_` as the source of the timeout value, we'd get more consistent behavior. `FilterUtility::trySetGlobalTimeout` sets `timeout_.global_timeout_` based on the contents of the relevant request header.
Per offline convo in the maintainers chat: it makes me very uncomfortable that we keep having issues in this area. I think we (a) don't have a clear understanding of the behavior we want, and (b) don't have enough tests.
I think @snowp is going to analyze the change history, but my vote is to revert as much as possible and then come back with a holistic plan for what we need to do to avoid breaking any existing users. If we do go with a runtime flag, I think we need to default it to off rather than on, given that we know there are breaking issues for users.
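For illustration, this is roughly how a default-off runtime guard for the new behavior would be wired up in the bootstrap; the flag name below is hypothetical:

```yaml
layered_runtime:
  layers:
  - name: static_layer
    static_layer:
      # Hypothetical flag name for illustration only; a real flag would
      # live under envoy.reloadable_features and stay off by default
      # until the behavior is deemed safe.
      envoy.reloadable_features.hcm_stream_timeout: false
```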
> Regarding the use of `x-envoy-upstream-rq-timeout-ms` as an override: is the intent to allow this header to adjust the timeout downwards, upwards, or both? Effectively respecting the min of these timeouts may be the most tractable option.
@mattklein123 might be able to fill in the historical bits here, but I believe at one point the only (?) per-stream timeout mechanism was the one enforced by the router (the "global" timeout), and this is the timeout that was modifiable by the header. It is the total amount of time that the router may spend trying to send the stream upstream, including retries. As a result, it isn't really the timeout for an upstream to handle the stream (that is given by `x-envoy-upstream-rq-per-try-timeout-ms` instead), but it also isn't quite a downstream timeout, since it doesn't include time spent in filters before the router (e.g. time waiting for ext_authz).
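To make the distinction concrete, here is a sketch of the two router-enforced timeouts side by side (real `RouteAction` fields; values are illustrative):

```yaml
route:
  cluster: grpc_backend          # illustrative cluster name
  # "Global" timeout: total time the router may spend on the request,
  # across all retries; overridable via x-envoy-upstream-rq-timeout-ms.
  timeout: 10s
  retry_policy:
    retry_on: "5xx"
    num_retries: 2
    # Per-attempt upstream timeout; overridable via
    # x-envoy-upstream-rq-per-try-timeout-ms.
    per_try_timeout: 3s
```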
I think the intent of the changes in https://github.com/envoyproxy/envoy/pull/13302 was to move this global timeout handling to the HCM and remove it from the router in order to make it a real downstream timeout, but this has several consequences:

- the HCM's stream timeout does not respect `x-envoy-upstream-rq-timeout-ms`, which the router did
- the router would still honor the `grpc-timeout` and `x-envoy-upstream-rq-timeout-ms` headers as-is once the request hit the router filter

Based on these, I think there are a few things we need to reach consensus on:
- What is the relationship between `route.max_stream_timeout` and `route.timeout`? How do the timeout headers (the gRPC one and the Envoy one) fit into the picture? (see the config sketch below)
- Previously `route.timeout` could be overridden via either of the timeout headers, but now only `grpc-timeout` sets the HCM's stream timeout.

@yanjunxiang-google is going to explore this; not ready to assign it yet.
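As context for the first question above, a sketch of where the stream-duration and route-timeout knobs live in the config today (the config field corresponding to the stream timeout is `max_stream_duration`; the paths are real, values illustrative):

```yaml
http_connection_manager:
  common_http_protocol_options:
    max_stream_duration: 30s     # HCM-wide bound on any stream's lifetime
  route_config:
    virtual_hosts:
    - name: backend
      domains: ["*"]
      routes:
      - match: { prefix: "/" }
        route:
          cluster: grpc_backend
          timeout: 15s           # route.timeout (router "global" timeout)
          max_stream_duration:
            max_stream_duration: 20s  # per-route override of the HCM bound
```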
It looks like no one from gRPC was tagged on this issue -- I only discovered it because of #37134. In the future, it might be a good idea to loop us in on questions that affect gRPC semantics. @dfawley @ejona86
I just want to note that whatever we decide here, we need to make sure not to break gRPC timeout semantics, as documented here:
Due to recent changes to the timeout logic (https://github.com/envoyproxy/envoy/pull/15585, https://github.com/envoyproxy/envoy/pull/13302/files) there seem to be some inconsistencies in how timeouts are enforced, leading to surprising behavior when updating Envoy:
- The `max_stream_timeout` falls back to using the route timeout under certain conditions: https://github.com/envoyproxy/envoy/blob/main/source/common/http/conn_manager_impl.cc#L1178-L1195
- Prior to all these changes, users were able to override the route timeout by setting https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/router_filter#x-envoy-upstream-rq-timeout-ms. This breaks with the above changes: even if the router takes this header into account when setting its global timeout, the `max_stream_timeout` fires and ends the request early (see the sketch below).
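A minimal sketch of the breaking scenario described above (names and values illustrative):

```yaml
route:
  cluster: backend
  timeout: 15s                   # route timeout
# Client sends: x-envoy-upstream-rq-timeout-ms: 30000
# Before the changes: the router honored the header and the request was
# given 30s. After: the HCM stream timeout falls back to route.timeout
# and fires at 15s, ending the request early despite the header.
```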
It seems to me like we should either:

1) not check `route->timeout` at the HCM level, and only read it as part of the router, where it can be overridden by the header. To address @ramaraochavali's bug, I think we would have to make sure that we configure a global timeout at the router level; or
2) include checks for `x-envoy-upstream-rq-timeout-ms` in the HCM's `refreshDurationTimeout` computation.

The second option seems to be the right thing directionally (it continues moving this timeout handling to the HCM), but it has a few drawbacks: the semantics of `x-envoy-upstream-rq-timeout-ms` setting the timeout used for upstream processing are lost, and we'd require a `refreshCache` in order to apply the new value of the header (previously this was not required, since the header gets read by the router). The first option preserves the previous behavior better, but it becomes less clear how we'd unify the two code paths.

@ramaraochavali @antoniovicente @alyssawilk @mattklein123