envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0
24.98k stars 4.81k forks source link

behavior of upstream request timeout #37005

Open itsLucario opened 2 days ago

itsLucario commented 2 days ago

I was testing a use case where I have a lua filter which sleeps for 4 secs, and added before the envoy router. We have route timeout configured for 3 secs.

The expectation was the timer begins in router and the lua 4s sleep, which is before router, shouldn't affect the route timeout and request should work. But I see the upstream request timeout and sleep-in filter is affecting the upstream timeout. Upstream timeout should be for the time request sent out to time taken by service right. Time taken in envoy should not affect upstream timeout.

Example: bootstrap.yaml

admin:
  address:
    socket_address:
      protocol: TCP
      address: 127.0.0.1
      port_value: 9901
dynamic_resources:
  cds_config:
    path: ./cds.yaml
  lds_config:
    path: ./lds.yaml
node:
  cluster: test-cluster
  id: test-id

cds.yaml

resources:
- "@type": type.googleapis.com/envoy.config.cluster.v3.Cluster
  name: dynamic_forward_proxy_cluster_http
  lb_policy: CLUSTER_PROVIDED
  cluster_type:
    name: envoy.clusters.dynamic_forward_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.clusters.dynamic_forward_proxy.v3.ClusterConfig
      sub_clusters_config:
        lb_policy: LEAST_REQUEST
        sub_cluster_ttl: 600s

lds.yaml

resources:
- "@type": type.googleapis.com/envoy.config.listener.v3.Listener
  name: listener_0
  address:
    socket_address:
      protocol: TCP
      address: 0.0.0.0
      port_value: 10000
  filter_chains:
  - filters:
    - name: envoy.filters.network.http_connection_manager
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
        stat_prefix: ingress_http
        route_config:
          name: local_route
          virtual_hosts:
          - name: local_service
            domains: ["httpbin.org"]
            routes:
            - match:
                prefix: "/"
              route:
                cluster: dynamic_forward_proxy_cluster_http
                timeout: 3s
        http_filters:
        - name: envoy.filters.http.dynamic_forward_proxy
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.dynamic_forward_proxy.v3.FilterConfig
            sub_cluster_config:
              cluster_init_timeout: 5s
        - name: envoy.filters.http.lua
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
            default_source_code:
              inline_string: |
                function envoy_on_request(request_handle)
                  request_handle:logInfo("Sleeping for 4 seconds")
                  local start = os.time()
                  repeat until os.time() > start + 4
                end
        - name: envoy.filters.http.router
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

Apply the above yaml and run curl -vL -H 'Host: httpbin.org' http://localhost:10000/status/200

The expectation was that requests shouldn't timeout as upstream isn't taking time, but envoy's lua filter was. Am I assuming something wrong? Can someone explain this in detail

I see the enable timer happens on router decodeHeaders on end_stream (i.e., downstream request received completely) but behaviour is not the same.

shreemaan-abhishek commented 1 day ago

I was testing a use case where I have a lua filter which sleeps for 4 secs, and added before the envoy router.

There is no guarantee mentioned in the doc that the lua code would run before the router, no?

I just discovered that removing the dynamic_forward_proxy http filter (that contains cluster_init_timeout) fixed the 504 Gateway Timeout error. Maybe this is what you are looking for?

image

KBaichoo commented 1 day ago

But I see the upstream request timeout and sleep-in filter is affecting the upstream timeout. Upstream timeout should be for the time request sent out to time taken by service right.

What does the access logs say, specifically the response flags? https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/access_log/usage#:~:text=%25RESPONSE_FLAGS%25%20/%20%25RESPONSE_FLAGS_LONG%25,Possible%20values%20are%3A

I suspect that you're running into another error as the upstream timeout occurs after the full request is complete.

Simplifying the experiment to just lua + router should help tease this out.