Inconsistent timeout handling between max_stream_timeout and router global timeout

snowp commented 3 years ago

Due to recent changes to the timeout logic (https://github.com/envoyproxy/envoy/pull/15585, https://github.com/envoyproxy/envoy/pull/13302/files) there seems to be some inconsistencies in how timeouts are enforced, leading to surprising behavior when updating Envoy:

The max_stream_timeout falls back to using the route timeout under certain conditions: https://github.com/envoyproxy/envoy/blob/main/source/common/http/conn_manager_impl.cc#L1178-L1195

Prior to all these changes, users were able to override the route timeout by setting https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/router_filter#x-envoy-upstream-rq-timeout-ms. This breaks with the above changes: even if the router takes this into account when setting its global timeout, the max_stream_timeout fires and ends the request early.

It seems to me like we should either 1) not check route->timeout on the HCM level, and only read this as part of the router where it can be overriden by the header. To address @ramaraochavali's bug, I think we would have to make sure that we configure a global timeout on the router level, or 2) include checks for x-envoy-upstream-rq-timeout-ms in the HCM refreshDurationTimeout computation.

The second option seems to be the right thing directionally (continues moving this timeout handling to the HCM), but has a few drawbacks: the semantics of x-envoy-upstream-rq-timeout-ms setting the timeout used for the upstream processing is lost, and we'd require a refreshCache in order to apply the new value of the header (previously this would not be required since it gets read by the router). The first option preserves the previous behavior better, but it becomes less clear how we'd unify these the two code paths.

@ramaraochavali @antoniovicente @alyssawilk @mattklein123

xjtian commented 3 years ago

Additionally, #13302 introduces a new behavior where the grpc-timeout header is not respected if grpc_timeout_header_max is not set on the route: link. We observe this on bidirectional GRPC streams where setting grpc-timeout on the client side to something greater than the route's timeout results in the stream closing once the route timeout is hit.

Edit: this case was uncovered by #15585 which added the logic to fall back to route timeout. Before that, the stream was not timebound.

antoniovicente commented 3 years ago

Regarding use of x-envoy-upstream-rq-timeout-ms as an override, is the intent to allow this header to adjust the timeout downwards, upwards or both? Effectively respecting the min of these timeouts may be the most tractable option.

antoniovicente commented 3 years ago

This comment thread may be relevant: https://github.com/envoyproxy/envoy/pull/15585#discussion_r607312415

If we had unified logic to determine the timeout and were to use timeout_.globaltimeout as a source of a timeout value we'ld get more consistent behavior. FilterUtility::trySetGlobalTimeout sets timeout_.globaltimeout based on the contents of the relevant request header.

mattklein123 commented 3 years ago

Per offline convo in the maintainers chat, it makes me very uncomfortable that we keep having issues in this area. I think we a) don't have a clear understanding of the behavior we want and b) don't have enough tests.

I think @snowp is going to analyze the change history but my vote is to revert as much as possible and then come back with a holistic plan on what we need to do to not break any existing users. If we do go with a runtime flag I think we need to default it to off vs. on if we know there are breaking issues for users.

snowp commented 3 years ago

Regarding use of x-envoy-upstream-rq-timeout-ms as an override, is the intent to allow this header to adjust the timeout downwards, upwards or both? Effectively respecting the min of these timeouts may be the most tractable option.

@mattklein123 might be able to fill in the historical bits here but I believe at one point the only (?) per stream timeout mechanism was enforced by the router (the "global" timeout), and this is the timeout that was modifiable by the header. This is the total amount of time that the router may spend trying to send the stream upstream, including retries. As a result this isn't really the timeout for an upstream to handle the stream (this is given by x-envoy-upstream-rq-per-try-timeout-ms instead), but also not quite the downstream timeout as it doesn't include time spent in filters before the router (e.g. time waiting for ext_authz).

I think the intent of the changes in https://github.com/envoyproxy/envoy/pull/13302 was to move this global timeout handling to the HCM and remove it from the router to make it a real downstream timeout, but this has several consequences:

the returned error code is different from the HCM managed timeout (408) and the router managed timeout (504)
the HCM managed timeout does not honor x-envoy-upstream-rq-timeout-ms, which the router did
it requires route resolution to change the timeout vs the old router code that would read the grpc-timeout and x-envoy-upstream-rq-timeout-ms header as-is once it hit the router filter
some edge cases where both the HCM timeout and the router timeout was disabled (the issue raised by @ramaraochavali)

Based on these I think there are a few things we need to reach consensus on:

What should be the interaction between route.max_stream_timeout and route.timeout? How does the timeout headers (grpc and the Envoy one) fit into the picture?
- Part of the confusion here comes from that we previously used to be able to update the route.timeout via either of the timeout headers, but now only the grpc-timeout sets the HCM's stream timeout
- Should the two timeout headers be treated differently? I think grpc-timeout is intended to dictate the downstream timeout so it makes sense to move this to the HCM logic, but this has implications as it changes the impact of the grpc-timeout being hit (different status codes)
As we're blurring the lines somewhat between the two different timeouts at this point, should the status code be reconciled? Should both code paths return the same kind of status code?
How should timeout header modifications during the filter chain affect the timeout? The old behavior means they take effect no matter what, with the new timeout scheme it requires a cache refresh
Does it make sense to keep a max_stream_duration timeout separate from the route timeout? Or will the max_stream_duration timeout replace the route timeout?

jmarantz commented 3 years ago

@yanjunxiang-google is going to explore this; not ready to assign it yet.

markdroth commented 2 weeks ago

It looks like no one from gRPC was tagged on this issue -- I only discovered it because of #37134. In the future, it might be a good idea to loop us in on questions that affect gRPC semantics. @dfawley @ejona86

I just want to note that whatever we decide here, we need to make sure not to break gRPC timeout semantics, as documented here:

https://github.com/grpc/proposal/blob/master/A31-xds-timeout-support-and-config-selector.md#supported-fields

envoyproxy / envoy

Inconsistent timeout handling between max_stream_timeout and router global timeout #16129