envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Dynamic forward proxy with sub_clusters_config gets stuck waiting for warmup #35171

Open itsLucario opened 1 month ago

itsLucario commented 1 month ago

Title: DynamicForwardProxy with sub_clusters_config gets stuck with a Sub cluster warming timeout after a dynamic CDS add/update

Description: I am trying to use dynamic_forward_proxy with sub_clusters_config for the benefits it provides. Things work fine at startup and Envoy is able to forward requests to the upstream, but the moment a CDS update arrives and cluster initialization runs, the DFPClusters get removed. Once a DFPCluster is removed after CDS, all further requests fail with a Sub cluster warming timeout.

We initially noticed this with Istio on CDS add/update and then reproduced it on plain Envoy with a filesystem-based dynamic config.

We were able to identify the root cause and a potential fix! The dynamic_forward_proxy cluster maintains a cluster_map_ of its own. When a CDS update is processed, the DFPClusters aren't part of the CDS and the cluster manager removes them, but a stale entry is left behind in the dynamic_forward_proxy cluster's cluster_map_. From then on, dynamic_forward_proxy assumes the cluster is already present, waits for it to warm up, and eventually times out because the cluster has already been removed (Ref). This keeps happening until dynamic_forward_proxy drops the entry from its map once the TTL expires.

Potential Fix: We tried a potential fix: when we enter the wait-for-warmup condition (Ref), we check again with the cluster manager whether the cluster is present. If it is not, we clear the cluster from the DFP map and trigger the sub-cluster add again so the cluster gets added back to both the cluster manager and the DFP map. A minimal standalone sketch of the idea follows.
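The snippet below is only a standalone illustration of that re-check, not the actual Envoy source: SubClusterInfo, the map, and the two callbacks are stand-ins for the DFP cluster's cluster_map_ and the cluster manager calls, and the DFPCluster:example.com:443 name is purely illustrative.

// Standalone sketch of the proposed re-check (not the actual Envoy code).
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>

struct SubClusterInfo {};  // TTL bookkeeping elided.

// Returns true if the caller should wait for warmup, false if a fresh
// sub cluster was (re)requested from the cluster manager.
bool shouldWaitForWarmup(
    std::unordered_map<std::string, SubClusterInfo>& cluster_map,
    const std::function<bool(const std::string&)>& cluster_manager_has,
    const std::function<void(const std::string&)>& add_sub_cluster,
    const std::string& name) {
  auto it = cluster_map.find(name);
  if (it != cluster_map.end()) {
    if (cluster_manager_has(name)) {
      return true;  // Known and still present: wait for warmup as today.
    }
    // Stale entry: a CDS push removed the sub cluster from the cluster
    // manager but the DFP map still references it. Drop it and re-add
    // instead of waiting for a warmup that can never finish.
    cluster_map.erase(it);
  }
  cluster_map.emplace(name, SubClusterInfo{});
  add_sub_cluster(name);  // Hand a fresh sub cluster config to the manager.
  return false;
}

int main() {
  std::unordered_map<std::string, SubClusterInfo> map{
      {"DFPCluster:example.com:443", {}}};
  // Simulate the post-CDS state: the manager no longer knows the cluster.
  const bool wait = shouldWaitForWarmup(
      map, [](const std::string&) { return false; },
      [](const std::string& n) { std::cout << "re-adding " << n << "\n"; },
      "DFPCluster:example.com:443");
  std::cout << (wait ? "wait for warmup" : "sub cluster re-created") << "\n";
}

In the real code the re-check would of course have to run under the same lock that protects cluster_map_, so a concurrent CDS removal and an incoming request don't race on the entry.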

If you feel the above is a bug and not intended behavior, I can try to work on my first contribution to Envoy. (Super excited 😄 )

Repro steps:


Below is the configuration I am using to reproduce the issue:

# bootstrap.yaml
admin:
  address:
    socket_address:
      protocol: TCP
      address: 127.0.0.1
      port_value: 9901
dynamic_resources:
  cds_config:
    path: ./cds.yaml
  lds_config:
    path: ./lds.yaml
node:
  cluster: test-cluster
  id: test-id
# cds.yaml
resources:
- "@type": type.googleapis.com/envoy.config.cluster.v3.Cluster
  name: dynamic_forward_proxy_cluster_http
  lb_policy: CLUSTER_PROVIDED
  cluster_type:
    name: envoy.clusters.dynamic_forward_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.clusters.dynamic_forward_proxy.v3.ClusterConfig
      sub_clusters_config:
        lb_policy: LEAST_REQUEST
        sub_cluster_ttl: 600s
# lds.yaml
resources:
- "@type": type.googleapis.com/envoy.config.listener.v3.Listener
  name: listener_0
  address:
    socket_address:
      protocol: TCP
      address: 0.0.0.0
      port_value: 10000
  filter_chains:
  - filters:
    - name: envoy.filters.network.http_connection_manager
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
        stat_prefix: ingress_http
        route_config:
          name: local_route
          virtual_hosts:
          - name: local_service
            domains: ["*"]
            routes:
            - match:
                prefix: "/"
              route:
                cluster: dynamic_forward_proxy_cluster_http
        http_filters:
        - name: envoy.filters.http.dynamic_forward_proxy
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.dynamic_forward_proxy.v3.FilterConfig
            sub_cluster_config:
              cluster_init_timeout: 5s
        - name: envoy.filters.http.router
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

Config:

The configuration used is shown above under the repro steps. Tried with the versions below and the bug exists in both:

346cc3385269016f7c36ad15a23a7b382348f7af/1.28.6-dev/Modified/RELEASE/BoringSSL
10e1e425fd84cc2cde985179655da08f68c9cb30/1.29.3/Distribution/RELEASE/BoringSSL

Logs:

Access logs and Envoy logs attached: accesslogs.log

doujiang24 commented 1 month ago

@itsLucario Thanks for your report; this seems like a bug.

DFPClusters aren't part of the CDS and the cluster manager removes them

Or maybe we could make the cluster manager not remove those DFPClusters?

itsLucario commented 1 month ago

@doujiang24 Yeah, that should work! We can avoid removing clusters during onConfigUpdate if their names match the DFPCluster prefix. This would help prevent frequent removal of DFPClusters when there are continuous CDS pushes. A rough sketch of where such a skip would sit is included below.
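The snippet below is a minimal, standalone sketch of that idea, assuming the sub clusters can be recognized by a DFPCluster: name prefix; it is not the actual cluster manager or cds_api_helper code, and the function and container names are illustrative only.

// Standalone sketch of the cluster-manager-side idea (not the actual code).
#include <iostream>
#include <set>
#include <string>
#include <vector>

// Given the currently active clusters and the clusters present in the new
// CDS snapshot, compute which ones to remove, keeping DFP sub clusters
// alive even though they never appear in CDS.
std::vector<std::string> clustersToRemove(const std::set<std::string>& active,
                                          const std::set<std::string>& in_cds) {
  std::vector<std::string> to_remove;
  for (const auto& name : active) {
    // Assumed naming convention for illustration: DFP sub clusters carry a
    // "DFPCluster:" prefix. A dedicated flag on the cluster could replace
    // this string match.
    const bool is_dfp_sub_cluster = name.rfind("DFPCluster:", 0) == 0;
    if (in_cds.count(name) == 0 && !is_dfp_sub_cluster) {
      to_remove.push_back(name);
    }
  }
  return to_remove;
}

int main() {
  const std::set<std::string> active{"dynamic_forward_proxy_cluster_http",
                                     "DFPCluster:example.com:443", "old_cluster"};
  const std::set<std::string> in_cds{"dynamic_forward_proxy_cluster_http"};
  for (const auto& name : clustersToRemove(active, in_cds)) {
    std::cout << "removing " << name << "\n";  // prints only "removing old_cluster"
  }
}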

Handling it in the DFP filter by checking whether the cluster still exists in the TLS would avoid DFP-specific changes in the cluster manager or cds_api_helper. However, it would also mean DFP clusters get removed on every CDS update, which would increase DNS queries and hurt performance.

I think it would be better to fix this at the cluster manager level, as you mentioned. That way we avoid constantly deleting and recreating DFP clusters, and there is no performance impact.

Either way, kindly let me know; I would be happy to try to contribute the fix. Thank you.

doujiang24 commented 1 month ago

We can avoid removing clusters during onConfigUpdate if their names match the DFPCluster prefix

Maybe we can add a new flag on the cluster instead of a prefix match.

From my side, as the first contributor of sub_clusters_config, it's okay to file a PR, but the final decision is up to the OWNERS @mattklein123 @alyssawilk.

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] commented 2 weeks ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

itsLucario commented 1 week ago

Can this issue be reopened?