envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

TCP active health checks fail during application deploy #35719

Open jan-machacek-kosik opened 3 weeks ago

jan-machacek-kosik commented 3 weeks ago

Title: TCP active health checks fail during application deploy

Description: TCP active health checks fail during every application deploy in Kubernetes. Is it possible that the Envoy health check is probing pods that are not ready yet, or pods that are no longer ready?

In the Envoy OpenTelemetry metrics, you can only see how many health checks failed, but not which pods or IP addresses they were targeting.

Is there a way to determine from any metric which member (IP address or pod name) failed the health check?

Here is our health check configuration:

"health_checks": [
   {
      "timeout":"10s",
      "interval":"30s",
      "unhealthy_threshold":3,
      "healthy_threshold":1,
      "tcp_health_check":{

      }
   }
]

adisuissa commented 3 weeks ago

In Envoy, you can get the current health status of an endpoint through the admin interface. Envoy also supports an event_logger for health checks, which may provide more information.
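As a rough illustration of the admin-interface approach, the sketch below lists hosts whose health flags are set in the output of the admin `/clusters?format=json` endpoint. The admin address `http://127.0.0.1:9901` is an assumption, and the JSON field names (`cluster_statuses`, `host_statuses`, `health_status`) reflect my understanding of the admin response format, so check them against your Envoy version.

```python
import json
import urllib.request


def unhealthy_hosts(clusters_json):
    """Return (cluster, "ip:port", [flags]) for each host with a health flag set.

    Only boolean flags that are true (for example failed_active_health_check)
    are reported; string fields such as eds_health_status are ignored.
    """
    results = []
    for cluster in clusters_json.get("cluster_statuses", []):
        for host in cluster.get("host_statuses", []):
            sock = host.get("address", {}).get("socket_address", {})
            addr = f"{sock.get('address')}:{sock.get('port_value')}"
            status = host.get("health_status", {})
            flags = sorted(k for k, v in status.items() if v is True)
            if flags:
                results.append((cluster.get("name"), addr, flags))
    return results


def fetch_clusters(admin_url="http://127.0.0.1:9901"):
    """Fetch the cluster status document from the admin interface.

    The default admin_url is a placeholder; use your own admin listener address.
    """
    with urllib.request.urlopen(f"{admin_url}/clusters?format=json") as resp:
        return json.load(resp)
```

Calling `unhealthy_hosts(fetch_clusters())` during a deploy should show which IP addresses are currently failing, which the aggregate metrics alone do not reveal.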

jan-machacek-kosik commented 3 weeks ago

Hi @adisuissa, thank you for your reply.

We enabled event_logger, and it turns out those failures were on the old members that were in the process of being removed. We solved this by setting ignore_health_on_host_removal to true.
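For anyone wanting to repeat this diagnosis, here is a minimal sketch that counts failure events per host from the event log. It assumes the file sink writes one JSON object per line and that the field names (`cluster_name`, `host.socket_address`, `eject_unhealthy_event`, `health_check_failure_event`) match the `HealthCheckEvent` message; both are assumptions to verify against your own log output.

```python
import json
from collections import Counter

# Event keys treated as failures; assumed from the HealthCheckEvent message.
FAILURE_EVENTS = ("health_check_failure_event", "eject_unhealthy_event")


def failures_by_host(lines):
    """Count failure events per (cluster_name, "ip:port") from log lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip non-JSON lines that may share stdout
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(event, dict):
            continue
        if not any(key in event for key in FAILURE_EVENTS):
            continue  # e.g. add_healthy_event is not a failure
        sock = event.get("host", {}).get("socket_address", {})
        addr = f"{sock.get('address')}:{sock.get('port_value')}"
        counts[(event.get("cluster_name"), addr)] += 1
    return counts
```

Feeding the container's stdout into `failures_by_host` and comparing the addresses with the pod IPs of the old and new ReplicaSets shows whether the failures come from members that are being removed.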

We then ran into another problem: we get 503 errors during deployment because all members are temporarily in an unhealthy state. We tried setting ignore_new_hosts_until_first_hc to false, but it didn't help.

We know the pods are successfully deployed because we have rigorous startup/liveness/readiness probes and Argo Rollouts analysis before we switch blue-green versions.

When we disable active health checking, everything works fine.

Any idea what could be the cause?

We are trying to achieve zero-downtime blue-green deployment with active healthchecks for our microservices using Envoy as a Gateway.

Cluster configuration:


{
   "version_info":"b0c07ee734f23d9f889ff74e00fd0ac27133e95240cc3c103c69158a217c561a",
   "cluster":{
      "@type":"type.googleapis.com/envoy.config.cluster.v3.Cluster",
      "name":"httproute/jm2-eoc-k3w/int-https/rule/3",
      "type":"EDS",
      "eds_cluster_config":{
         "eds_config":{
            "ads":{

            },
            "resource_api_version":"V3"
         },
         "service_name":"httproute/jm2-eoc-k3w/int-https/rule/3"
      },
      "connect_timeout":"60s",
      "per_connection_buffer_limit_bytes":32768,
      "lb_policy":"LEAST_REQUEST",
      "health_checks":[
         {
            "timeout":"1s",
            "interval":"30s",
            "unhealthy_threshold":3,
            "healthy_threshold":1,
            "tcp_health_check":{

            },
            "event_logger":[
               {
                  "name":"stdout-log",
                  "typed_config":{
                     "@type":"type.googleapis.com/envoy.extensions.health_check.event_sinks.file.v3.HealthCheckEventFileSink",
                     "event_log_path":"/proc/self/fd/1"
                  }
               }
            ]
         }
      ],
      "circuit_breakers":{
         "thresholds":[
            {
               "max_retries":1024
            }
         ]
      },
      "dns_lookup_family":"V4_ONLY",
      "outlier_detection":{

      },
      "common_lb_config":{
         "healthy_panic_threshold":{

         },
         "locality_weighted_lb_config":{

         }
      },
      "ignore_health_on_host_removal":true
   },
   "last_updated":"2024-08-21T04:25:40.519Z"
}
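To make the health-related settings in a dump like the one above easier to scan, the following sketch extracts them into a flat summary. It rests on two assumptions that should be verified against the Envoy documentation: that an absent `healthy_panic_threshold` defaults to 50%, and that a present but empty object reads as 0% under proto3 JSON conventions, which would disable panic mode. The function name and output keys are hypothetical.

```python
def _seconds(duration):
    """Convert a proto JSON duration such as "30s" to float seconds."""
    return float(str(duration).rstrip("s"))


def summarize_health_settings(entry):
    """Summarize health-check-related settings from a config dump entry."""
    cluster = entry.get("cluster", entry)
    lb = cluster.get("common_lb_config", {})
    panic = lb.get("healthy_panic_threshold")
    if panic is None:
        panic_pct = 50.0  # assumed default when the field is absent
    else:
        panic_pct = float(panic.get("value", 0.0))  # assumed: empty object means 0
    checks = [
        {
            "interval_s": _seconds(hc.get("interval", "0s")),
            "timeout_s": _seconds(hc.get("timeout", "0s")),
            "unhealthy_threshold": hc.get("unhealthy_threshold"),
            "healthy_threshold": hc.get("healthy_threshold"),
        }
        for hc in cluster.get("health_checks", [])
    ]
    return {
        "panic_threshold_percent": panic_pct,
        "panic_mode_enabled": panic_pct > 0,
        "ignore_health_on_host_removal": bool(
            cluster.get("ignore_health_on_host_removal", False)
        ),
        "ignore_new_hosts_until_first_hc": bool(
            lb.get("ignore_new_hosts_until_first_hc", False)
        ),
        "health_checks": checks,
    }
```

If the assumption about the empty object holds, this cluster would report panic mode as disabled, so requests would not be routed to hosts that active health checking currently marks unhealthy. That may be worth checking when investigating 503 responses during a switch-over.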