cilium / cilium

eBPF-based Networking, Security, and Observability
https://cilium.io
Apache License 2.0
19.7k stars 2.88k forks source link

CFP: Configurable HTTP/gRPC request timeout options for north-south traffic (GatewayAPI/Ingress) #23798

Open chancez opened 1 year ago

chancez commented 1 year ago

Cilium Feature Proposal

Is your feature request related to a problem?

Currently I'm using nginx-ingress and I when I want to expose hubble-relay outside the cluster, I need to configure the Nginx grpc_read_timeout option in order to avoid hubble observe --follow from disconnecting when there are no flows being transmitted. The same is true if I was writing data to a gRPC connection (I don't have an example for this to point at).

Another similar example is for websockets. Some internal web applications I'm using maintain long lived websocket connections where no data is transmitted for a long period of time. In these cases, if no data is transmitted for long enough, the connection gets closed. With nginx-ingress I can adjust the read/write timeouts to handle this better by using the annotations nginx.ingress.kubernetes.io/proxy-read-timeout and nginx.ingress.kubernetes.io/proxy-send-timeout, which map to the proxy_read_timeout and proxy_send_timeout options in upstream Nginx.

Describe the feature you'd like

Currently Cilium does not support exposing gRPC services via Ingress, and needs #21928 for gRPC support in GatewayAPI, so this is potentially preemptive.

I'd like the ability to tune the "read/write" timeouts of an HTTP or gRPC request. I know that envoy uses different settings, and breaks things down a bit more, but being able to tune these options is important for connections that aren't always transmitting data.

youngnick commented 1 year ago

I agree timeouts are super important, and am in the process of working on a GEP (GEP-1742, kubernetes-sigs/gateway-api#1742, kubernetes-sigs/gateway-api#1744 is the PR to add the actual design.)

Once I've got that a bit further along (I'm hoping to define a request timeout, analogous to the nginx proxy-send-timeout value, and an idle timeout, probably for the whole connection), then I think we can look at adding timeouts to Ingress that use the same concepts.

For reference, (and because I'm proud of my diagram), here's all the configurable Envoy timeouts we have available, on a standard HTTP request flow, where CM is the Connection Manager level settings, R is the Route level settings, and Cluster is the cluster level settings:

sequenceDiagram
    participant C as Client
    participant P as Envoy
    participant U as Upstream
    C->>P: Connection Started
    activate P
    Note left of P: transport_socket_connect_timeout for TLS
    deactivate P
    C->>P: Starts sending Request
    activate C
    activate P
    activate P
    C->>P: Finishes Headers
    note left of P: CM request_headers_timeout
        C->>P: Finishes request
    deactivate P
    activate U
    note left of U: Cluster connect_timeout
    deactivate U
    P->>U: Connection Started
        activate U
    note right of U: CM idle_timeout<br />CM max_connection_duration
    P->>U: Starts sending Request
    P->>U: Finishes Headers
    note left of P: CM request_timeout
        P->>U: Finishes request
    deactivate P
    activate U
    U->>P: Starts Response
    U->>P: Finishes Headers
        note right of U: R timeout<br/>R per_try_timeout<br/>R per_try_idle_timeout
    U->>P: Finishes Response
    deactivate U
    P->>C: Starts Response
    P->>C: Finishes Headers
    P->>C: Finishes Response
    Note left of C: CM stream_idle_timeout<br />R idle_timeout<br />CM,R max_stream_duration<br/>TCP proxy idle_timeout<br />TCP protocol idle_timeout
    deactivate C
    Note right of P: Repeat if connection sharing
    U->>C: Connection ended
    deactivate U
github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

youngnick commented 1 year ago

The upstream effort is still ongoing here, but this is definitely still relevant.

nadavbuc commented 4 months ago

Hey - is there an east-west max_connection_duration config per service/route? i saw it can be configured globally, but couldn't find any example to use Envoy CRDs to set it up

youngnick commented 4 months ago

Not currently. Once we support GAMMA (see #22512 for that work), that will include the request timeout, which functions as a max connection duration.

nadavbuc commented 4 months ago

Actually it's achievable with envoyConfig as well example: (this is based on the circuit break example, just added timeout on the route) Can probably be configured in a different way to achieve exactly same functionially as max connection duration

apiVersion: cilium.io/v2
kind: CiliumClusterwideEnvoyConfig
metadata:
  name: envoy-circuit-breaker
spec:
  services:
    - name: cilium-target
      namespace: default
  resources:
    - "@type": type.googleapis.com/envoy.config.listener.v3.Listener
      name: envoy-lb-listener
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: envoy-lb-listener
                rds:
                  route_config_name: lb_route
                use_remote_address: true
                skip_xff_append: true
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
    - "@type": type.googleapis.com/envoy.config.route.v3.RouteConfiguration
      name: lb_route
      virtual_hosts:
        - name: "lb_route"
          domains: [ "*" ]
          routes:
            - match:
                prefix: "/"
              route:
                timeout: 30s
                weighted_clusters:
                  clusters:
                    - name: "default/cilium-target"
                      weight: 100
    - "@type": type.googleapis.com/envoy.config.cluster.v3.Cluster
      name: "default/cilium-target"
      connect_timeout: 5s
      lb_policy: ROUND_ROBIN
      type: EDS
      circuit_breakers:
        thresholds:
        - priority: "DEFAULT"
          max_requests: 2
          max_pending_requests: 1
      outlier_detection:
        split_external_local_origin_errors: true
        consecutive_local_origin_failure: 2