envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Envoy is Not Retrying Requests on 429 Response with Multiple Clusters #35877

Open ronyrv13 opened 3 weeks ago

ronyrv13 commented 3 weeks ago

I am using Envoy to route traffic between multiple clusters. I've configured Envoy to retry requests when a 429 (Too Many Requests) response is received. However, despite configuring the retry_policy with retriable_status_codes: [429], Envoy does not appear to retry the request to the other cluster when a 429 response is received from the first cluster.

Envoy Configuration

Below is a simplified version of the configuration I am using:

static_resources:
  listeners:
  - name: https
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 8443
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          access_log:
          - name: envoy.access_loggers.stdout
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              retry_policy:
                retriable_status_codes: [429]
                num_retries: 1
                retry_host_predicate:
                - name: envoy.retry_host_predicates.previous_hosts
                  typed_config:
                    "@type": type.googleapis.com/envoy.extensions.retry.host.previous_hosts.v3.PreviousHostsPredicate
              routes:
              - match:
                  prefix: "/"
                route:
                  weighted_clusters:
                    clusters:
                    - name: aoai_gpt35_cluster_1
                      weight: 100
                      host_rewrite_literal: ckgpt35-01.openai.azure.com
                      request_headers_to_remove:
                      - "api-key"
                      request_headers_to_add:
                      - header:
                          key: "api-key"
                          value: "f8d706XXXX"
                    - name: aoai_gpt35_cluster_2
                      weight: 100
                      host_rewrite_literal: ckgpt35-02.openai.azure.com
                      request_headers_to_remove:
                      - "api-key"
                      request_headers_to_add:
                      - header:
                          key: "api-key"
                          value: "c322eXXX"
      transport_socket:
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
          common_tls_context:
            tls_certificates:
            - certificate_chain: {filename: "certs/server.pem"}
              private_key: {filename: "certs/server.pem"}
            alpn_protocols: [ "h2", "http/1.1" ]

  clusters:
    - name: aoai_gpt35_cluster_1
      type: LOGICAL_DNS
      dns_lookup_family: V4_ONLY
      per_connection_buffer_limit_bytes: 512000
      load_assignment:
        cluster_name: aoai_gpt35_cluster_1
        endpoints:
        - lb_endpoints:
          - endpoint:
              address:
                socket_address:
                  address: ckgpt35-01.openai.azure.com
                  port_value: 443
      transport_socket:
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext

    - name: aoai_gpt35_cluster_2
      type: LOGICAL_DNS
      dns_lookup_family: V4_ONLY
      per_connection_buffer_limit_bytes: 512000
      load_assignment:
        cluster_name: aoai_gpt35_cluster_2
        endpoints:
        - lb_endpoints:
          - endpoint:
              address:
                socket_address:
                  address: ckgpt35-02.openai.azure.com
                  port_value: 443
      transport_socket:
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext

Expected Behavior

When the primary cluster (aoai_gpt35_cluster_1) returns a 429 status code, Envoy should retry the request on the secondary cluster (aoai_gpt35_cluster_2).

Actual Behavior

Envoy does not retry the request on the secondary cluster after receiving a 429 response from the primary cluster. Instead, the request fails with the 429 response.

If this is a configuration issue, I would appreciate guidance on how to properly configure retries for multiple clusters. Otherwise, it may be a bug with retry handling in the presence of multiple clusters.

ggreenway commented 2 weeks ago

I don't think a retry policy will cause a different choice from the weighted_clusters. I believe one of the clusters is chosen, and then all retries will happen within that cluster.

You could probably work around this by having a single cluster with both DNS names, and have different weights within the cluster endpoints.
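A minimal sketch of what such a single cluster might look like, reusing the two hostnames from the configuration above. The cluster name aoai_gpt35_combined is hypothetical, and STRICT_DNS is used on the assumption that a LOGICAL_DNS cluster accepts only one endpoint. Per-endpoint Host rewriting, SNI, and API keys are not addressed here.

```yaml
clusters:
- name: aoai_gpt35_combined        # hypothetical name, not from the original config
  type: STRICT_DNS                 # assumption: needed to list more than one DNS name
  dns_lookup_family: V4_ONLY
  per_connection_buffer_limit_bytes: 512000
  load_assignment:
    cluster_name: aoai_gpt35_combined
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: ckgpt35-01.openai.azure.com
              port_value: 443
        load_balancing_weight: 1   # equal weights give a roughly even split
      - endpoint:
          address:
            socket_address:
              address: ckgpt35-02.openai.azure.com
              port_value: 443
        load_balancing_weight: 1
  transport_socket:
    name: envoy.transport_sockets.tls
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
```

With both hosts in one cluster, the previous_hosts retry predicate from the original retry_policy has a second host to pick on a retry.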

ronyrv13 commented 2 weeks ago

Thank you, @ggreenway, for your response. I did check that option, but I couldn't test it because each of my backend endpoints requires a different API key in the headers.

Essentially, we need to manage different API keys for different endpoints, and I didn't find any filter or directive that supports this with a single cluster in Envoy.

If we can address this issue, I could easily switch back to a single cluster with multiple endpoints. Do you have any suggestions for overcoming this header blocker?

vikaschoudhary16 commented 2 weeks ago

You can create two aggregate clusters, for example AgCluster1 and AgCluster2.

In an aggregate cluster, the sub-clusters are listed in priority order. For example, in AgCluster1, aoai_gpt35_cluster_1 has the higher priority, so all traffic is sent to it; if it fails, traffic is forwarded to aoai_gpt35_cluster_2. Similarly, in AgCluster2, aoai_gpt35_cluster_2 has the higher priority, with failover to aoai_gpt35_cluster_1.
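The priority ordering described above could be written roughly as follows. This is a sketch based on Envoy's aggregate cluster extension, not the configuration originally posted in this comment. It assumes the two clusters aoai_gpt35_cluster_1 and aoai_gpt35_cluster_2 stay defined as in the original configuration. Failover in an aggregate cluster appears to be driven by the health of the sub-clusters rather than by individual response codes, so whether a 429 alone triggers it has not been verified here.

```yaml
clusters:
- name: AgCluster1
  lb_policy: CLUSTER_PROVIDED      # aggregate clusters supply their own load balancer
  cluster_type:
    name: envoy.clusters.aggregate
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.clusters.aggregate.v3.ClusterConfig
      clusters:                    # listed in priority order, highest first
      - aoai_gpt35_cluster_1
      - aoai_gpt35_cluster_2
- name: AgCluster2
  lb_policy: CLUSTER_PROVIDED
  cluster_type:
    name: envoy.clusters.aggregate
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.clusters.aggregate.v3.ClusterConfig
      clusters:                    # same sub-clusters, priority reversed
      - aoai_gpt35_cluster_2
      - aoai_gpt35_cluster_1
```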

You can then put these two aggregate clusters in weighted_clusters, each with a weight of 100.

Traffic is then split 50-50 between them, and within each aggregate cluster the highest-priority sub-cluster has a failover sub-cluster.
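A sketch of the corresponding route, assuming aggregate clusters named AgCluster1 and AgCluster2 exist. The retry_policy here is an addition for illustration: it sets retry_on: retriable-status-codes, which Envoy's router documentation describes as the condition under which retriable_status_codes is consulted, and which the original configuration does not set. Header options placed on a weighted_clusters entry would, as far as can be told, stay tied to that entry and not follow a failover to the other sub-cluster, so the per-endpoint API key question remains open.

```yaml
routes:
- match:
    prefix: "/"
  route:
    retry_policy:
      retry_on: retriable-status-codes   # enables the status code list below
      retriable_status_codes: [429]
      num_retries: 1
    weighted_clusters:
      clusters:
      - name: AgCluster1                 # prefers aoai_gpt35_cluster_1
        weight: 100
      - name: AgCluster2                 # prefers aoai_gpt35_cluster_2
        weight: 100
```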