envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Failure percentage based outlier detection time does not increase on successive ejections #26758

Closed wmouchere closed 11 months ago

wmouchere commented 1 year ago

Title: Failure percentage based outlier detection time does not increase on successive ejections

Description: In a cluster configured to detect outliers based on failure percentage, when a host is ejected multiple times in a row, Envoy does not increase the ejection time on successive ejections.

Here is an example of what the membership_healthy looks like over time in my setup. In this case, interval is 30s, baseEjectionTime is 60s and maxEjectionTime is 300s.

[image: membership_healthy over time]
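For reference, here is a minimal sketch of the ejection times one would expect with these settings, assuming the documented rule that a host is ejected for the base ejection time multiplied by the number of consecutive ejections, capped at the maximum ejection time. The function name `expected_ejection_seconds` is made up for illustration and is not part of Envoy.

```python
def expected_ejection_seconds(consecutive_ejections, base=60, maximum=300):
    """Expected ejection duration in seconds.

    Assumes the documented rule: base_ejection_time multiplied by the
    number of consecutive ejections, capped at max_ejection_time.
    Defaults match the values from this report (60s base, 300s max).
    """
    return min(base * consecutive_ejections, maximum)


# Durations for the first six consecutive ejections.
durations = [expected_ejection_seconds(n) for n in range(1, 7)]
print(durations)  # [60, 120, 180, 240, 300, 300]
```

In my setup the host instead comes back after roughly the same amount of time on every ejection.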

Repro steps: Set up a cluster with a host that only returns 500 codes, and configure outlier detection to use failure percentage. Then apply load for a few minutes and observe the host being ejected and un-ejected at a regular pace.

Config:

       {
        "version_info": "2023-04-14T14:54:23Z/1608",
        "cluster": {
         "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
         "name": "outbound|80|ivts-31677c-piv|egressgateway.istio-gateways.svc.cluster.local",
         "type": "EDS",
         "eds_cluster_config": {
          "eds_config": {
           "ads": {},
           "initial_fetch_timeout": "0s",
           "resource_api_version": "V3"
          },
          "service_name": "outbound|80|ivts-31677c-piv|egressgateway.istio-gateways.svc.cluster.local"
         },
         ...
         "outlier_detection": {
          "consecutive_5xx": 5,
          "interval": "30s",
          "base_ejection_time": "60s",
          "max_ejection_percent": 100,
          "enforcing_consecutive_5xx": 0,
          "enforcing_success_rate": 0,
          "enforcing_consecutive_gateway_failure": 0,
          "failure_percentage_threshold": 25,
          "enforcing_failure_percentage": 100,
          "enforcing_failure_percentage_local_origin": 100,
          "failure_percentage_minimum_hosts": 0,
          "failure_percentage_request_volume": 50,
          "max_ejection_time": "300s",
          "max_ejection_time_jitter": "0s"
         },
         ...
       }

Version: I am using Envoy in an Istio 1.15.1 setup. This is the Envoy version I could retrieve: "94bd57194ed66b70e231dbf22a7771a9e9e43a74/1.23.2-dev/Clean/RELEASE/BoringSSL"

wmouchere commented 1 year ago

I am not very familiar with the code base, but I would say that the issue lies around here:

https://github.com/envoyproxy/envoy/blob/e15b814095c59ae6f195575369b9300846eed47d/source/common/upstream/outlier_detection_impl.cc#L742-L760

I think this is because checkHostForUneject decreases ejectTimeBackoff() before the ejection that then happens in processSuccessRateEjections.
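To illustrate the suspected ordering problem, here is a simplified model of the interval timer. It is a sketch of my hypothesis, not Envoy's actual code: `simulate`, `backoff`, and the `decrement_before_eject` switch are hypothetical names, and the host is assumed to fail every request so it is re-ejected on the first tick where it is healthy.

```python
def simulate(ticks, interval=30, base=60, maximum=300,
             decrement_before_eject=True):
    """Return ejection durations (seconds) for a host that always fails.

    Simplified, hypothetical model of one host across interval ticks.
    `backoff` stands in for ejectTimeBackoff().
    """
    backoff = 0
    ejected_at = None  # time of the current ejection, None if healthy
    durations = []
    for tick in range(1, ticks + 1):
        now = tick * interval
        if ejected_at is not None:
            # Host is ejected: un-eject once the ejection time has elapsed.
            if now - ejected_at >= min(base * backoff, maximum):
                ejected_at = None
            continue
        # Host is healthy on this tick.
        if decrement_before_eject and backoff > 0:
            # Suspected problem: the un-eject check lowers the backoff
            # right before the failure percentage check ejects again.
            backoff -= 1
        # The host still fails, so it is ejected again on this tick.
        if base * backoff < maximum:
            backoff += 1
        ejected_at = now
        durations.append(min(base * backoff, maximum))
    return durations


print(simulate(20))                                # [60, 60, 60, 60, 60, 60, 60]
print(simulate(20, decrement_before_eject=False))  # [60, 120, 180, 240]
```

With the decrement applied first, the model never gets past the base ejection time, which matches the regular pattern in the graph. Skipping the decrement is only there for comparison and is not a proposed fix.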

cpakulski commented 1 year ago

It may be related to https://github.com/envoyproxy/envoy/issues/21142. Let me look at this.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

cpakulski commented 1 year ago

Update: I replicated the issue using one endpoint that constantly returned 503. I noticed the node was declared unhealthy immediately after it was un-ejected. According to the algorithm, the node should stay healthy for the length of the interval (30s) and only then be ejected based on failure percentage. Working on the fix.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

cpakulski commented 1 year ago

Still planning to fix it!