emissary-ingress / emissary

Open source Kubernetes-native API gateway for microservices built on the Envoy Proxy
https://www.getambassador.io
Apache License 2.0

Enabling Active HealthCheck causes `ClusterLoadAssignment` to be empty. #5792

Open alanprot opened 2 weeks ago

alanprot commented 2 weeks ago

Describe the bug

Hi,

I'm trying to enable active health checks on a specific Mapping, which uses a KubernetesEndpointResolver. Upon configuring the health check, we can see that in some Ambassador pods all upstream hosts for the cluster associated with the Mapping disappear, and they stay missing until something else changes in the cluster (a pod scale-up, for instance). This causes those Ambassador pods to return 503 errors, as no upstream targets are found.

Here's the configuration I'm trying to add to the Mapping:

  health_checks:
  - unhealthy_threshold: 50
    healthy_threshold: 1
    interval: "15s"
    timeout: "10s"
    health_check:
      http:
        path: /ready
        expected_statuses:
          - max: 300
            min: 199

This issue only affects some Emissary pods. When comparing pods that experience the issue with those that work correctly, the Envoy config dump (with EDS info) reveals that the list of hosts is empty for the faulty pods:

Bad:

    {
     "endpoint_config": {
      "@type": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment",
      "cluster_name": "k8s/namespace-1/service-a/443",
      "policy": {
       "overprovisioning_factor": 140
      }
     }
    },

Good:

endpoint_config": {
      "@type": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment",
      "cluster_name": "k8s/namespace-1/service-a/443",
      "endpoints": [
       {
        "locality": {},
        "lb_endpoints": [
         {
          "endpoint": {
           "address": {
            "socket_address": {
             "address": "10.1.130.52",
             "port_value": 8080
            }
           },
           "health_check_config": {}
          },
          "health_status": "HEALTHY",
          "load_balancing_weight": 1
         },
         {
          "endpoint": {
           "address": {
            "socket_address": {
             "address": "10.1.165.169",
             "port_value": 8080
            }
           },
           "health_check_config": {}
          },
          "health_status": "HEALTHY",
          "load_balancing_weight": 1
         },
         {
          "endpoint": {
           "address": {
            "socket_address": {
             "address": "10.1.196.153",
             "port_value": 8080
            }
           },
           "health_check_config": {}
          },
          "health_status": "HEALTHY",
          "load_balancing_weight": 1
         }
        ]
       }
      ],
      "policy": {
       "overprovisioning_factor": 140
      }
     }
    }

I haven't been able to pinpoint why some pods end up with an empty ClusterLoadAssignment, but it seems like a race condition, possibly in the service that populates the assignment via EDS. The issue occurs randomly in different pods if I keep removing and re-adding the health check config.

To Reproduce

Steps to reproduce the behavior:

  1. Create a Mapping with a KubernetesEndpointResolver and a service pointing to a k8s Service. Ex:
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: mapping-1
spec:
  hostname: "*"
  ambassador_id: [ emissary ]
  load_balancer:
    policy: round_robin
  prefix: /prefix
  service: https://service-a.default:443
  2. Modify the Mapping to add the health check:
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: mapping-1
spec:
  hostname: "*"
  ambassador_id: [ emissary ]
  load_balancer:
    policy: round_robin
  prefix: /prefix
  service: https://service-a.default:443
  health_checks:
  - unhealthy_threshold: 50
    healthy_threshold: 1
    interval: "15s"
    timeout: "10s"
    health_check:
      http:
        path: /ready
        expected_statuses:
          - max: 300
            min: 199
  3. Port-forward to the Ambassador pod and check that the instances of the service-a.default cluster are no longer registered in the ClusterLoadAssignment (http://localhost:8001/config_dump?resource=&mask=&name_regex=&include_eds=on); a small check sketch follows below.
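For convenience, here is a minimal check sketch, assuming the port-forward from step 3 exposes the Envoy admin interface on localhost:8001 and using only the Python standard library. It lists each cluster in the EDS config dump together with how many hosts its ClusterLoadAssignment currently carries; the field names mirror the dumps pasted above and may need adjusting for other Envoy versions.

# Rough check sketch (not an official Emissary tool): fetch the Envoy admin
# config dump with EDS included and report how many hosts each
# ClusterLoadAssignment carries. Assumes the admin port is forwarded to 8001.
import json
import urllib.request

DUMP_URL = "http://localhost:8001/config_dump?include_eds=on"

with urllib.request.urlopen(DUMP_URL) as resp:
    dump = json.load(resp)

for section in dump.get("configs", []):
    # Only the EndpointsConfigDump section carries the ClusterLoadAssignments.
    if not section.get("@type", "").endswith("EndpointsConfigDump"):
        continue
    for key in ("static_endpoint_configs", "dynamic_endpoint_configs"):
        for entry in section.get(key, []):
            cla = entry.get("endpoint_config", {})
            name = cla.get("cluster_name", "<unknown>")
            hosts = sum(
                len(ep.get("lb_endpoints", []))
                for ep in cla.get("endpoints", [])
            )
            status = "EMPTY" if hosts == 0 else f"{hosts} host(s)"
            print(f"{name}: {status}")

On an affected pod, the faulty cluster should show up as EMPTY even though its backing Endpoints object still has addresses.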

Expected behavior

Configuring the health check should not wipe the instances of the cluster associated with the Mapping resource.

Versions (please complete the following information):

alanprot commented 2 weeks ago

It seems that this may have the same root cause as https://github.com/emissary-ingress/emissary/pull/4447.

It seems that when we add the health check, we create a new cluster with the same name, which can trigger this bug...

alanprot commented 2 weeks ago

Ok...

It does indeed seem to be the same root cause as https://github.com/emissary-ingress/emissary/pull/4447.

It seems that anything that changes the cluster object can trigger this bug:

I kept changing connect_timeout_ms or cluster_idle_timeout_ms on the Mapping and could reproduce this problem as well, with no health check configured at all (see the example Mapping change below).
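To illustrate, a Mapping edit along these lines (no health_checks at all; the timeout value is only an example) was enough to trigger the same empty ClusterLoadAssignment on some pods:

apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: mapping-1
spec:
  hostname: "*"
  ambassador_id: [ emissary ]
  load_balancer:
    policy: round_robin
  prefix: /prefix
  service: https://service-a.default:443
  connect_timeout_ms: 2000   # example value; changing this (or cluster_idle_timeout_ms) recreates the cluster object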

Cluster Object:

"cluster": {
      "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
      "name": "cluster_https___servoce-1_namespace-a_443_o-32E2014365DD7432-0",
      "type": "EDS",
      "eds_cluster_config": {
       "eds_config": {
        "ads": {},
        "resource_api_version": "V3"
       },
       "service_name": "k8s/servoce-1/namespace-a/443"
      },
      "connect_timeout": "2s",
      "dns_lookup_family": "V4_ONLY",
      "transport_socket": {
       "name": "envoy.transport_sockets.tls",
       "typed_config": {
        "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext",
        "common_tls_context": {}
       }
      },
      "alt_stat_name": "distributor_cortex_443",
      "common_http_protocol_options": {
       "idle_timeout": "90s"
      }
     },
     "last_updated": "2024-10-08T00:52:01.553Z"
    },