Closed wmouchere closed 11 months ago
I am not very familiar with the code base, but I would say that the issue lies around here, because checkHostForUneject decreases ejectTimeBackoff() before the ejection that will happen in processSuccessRateEjections.
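To make the suspected ordering concrete, here is a toy Python model of a single host that always fails. It is only a sketch of the behavior described in the comment above, not Envoy's actual code: the function name simulate, its parameters, and the tick loop are all made up for illustration. It assumes the ejection time is the base time multiplied by a backoff counter, capped at the maximum, and that the counter is decremented for a host found in rotation at the start of a tick. The decrement_before_eject=False variant exists purely for contrast and is not meant as the fix.

```python
def simulate(ticks, base=60, max_eject=300, interval=30, decrement_before_eject=True):
    """Toy model of one always-failing host; returns each ejection duration in seconds."""
    backoff = 0            # multiplier applied to the base ejection time
    ejected_until = None   # time at which the current ejection ends, or None
    durations = []
    for tick in range(ticks):
        now = tick * interval
        if ejected_until is not None:
            if now >= ejected_until:
                # Un-eject: the host goes back into rotation until the next tick.
                ejected_until = None
            continue
        # The host was in rotation at the start of this tick.
        if decrement_before_eject and backoff > 0:
            # Assumed ordering: the backoff is lowered first, because the host
            # "was not ejected since the last check".
            backoff -= 1
        # The failure-percentage check then ejects the host again (it always fails).
        backoff += 1
        duration = min(base * backoff, max_eject)
        ejected_until = now + duration
        durations.append(duration)
    return durations


print(simulate(20))                                # every ejection lasts 60s
print(simulate(36, decrement_before_eject=False))  # durations grow up to the 300s cap
```

With the decrement applied first, the counter goes 1, 0, 1, 0, ... and every ejection lasts the base time, which matches the regular eject/un-eject pattern reported in this issue.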
It may be related to https://github.com/envoyproxy/envoy/issues/21142. Let me look at this.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
Update. I replicated the issue using one endpoint which constantly returned 503. I noticed the node was immediately declared unhealthy after it was un-ejected. According to the algorithm, the node should stay healthy for the length of the interval (30s) and only then be ejected based on failure percentage. Working on the fix.
Still planning to fix it!
Title: Failure percentage based outlier detection time does not increase on successive ejections
Description: In a cluster configured to detect outliers based on failure percentage, when a host gets ejected multiple times in a row, Envoy does not increase the ejection time on each successive ejection.
Here is an example of what the membership_healthy looks like over time in my setup. In this case, interval is 30s, baseEjectionTime is 60s and maxEjectionTime is 300s.
Repro steps: Set up a cluster with a host that only sends back 500 codes, and configure the outlier detection to use failure percentage. Then put some load for a few minutes and observe as the host is ejected and un-ejected at a regular pace.
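For anyone trying to reproduce this, a host that only sends back 500 codes can be sketched with the Python standard library. This is just one possible stand-in for the failing host, not the setup used in this report; the names AlwaysFail and make_server, the bind address, and the default port 8080 are assumptions chosen for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class AlwaysFail(BaseHTTPRequestHandler):
    """Answers every GET with an empty 500 response."""

    def do_GET(self):
        self.send_response(500)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, fmt, *args):
        # Suppress per-request logging so the output stays quiet under load.
        pass


def make_server(host="127.0.0.1", port=8080):
    """Builds (but does not start) the always-failing server."""
    return HTTPServer((host, port), AlwaysFail)
```

Calling make_server().serve_forever() starts it. The address would need to be changed to one that the proxy can actually reach before adding it as an endpoint of the cluster.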
Config:
Version: I am using Envoy in an Istio 1.15.1 setup; this is the Envoy version that I could retrieve: "94bd57194ed66b70e231dbf22a7771a9e9e43a74/1.23.2-dev/Clean/RELEASE/BoringSSL"