Outlier detection healthcheck UNEJECT is confusing

NersesAM commented 3 months ago

Title: Outlier detection healthcheck UNEJECT is confusing

Description: I had the following cluster configuration, where I wanted to disable consecutive_5xx completely and only enable it for consecutive_gateway_failure :

     "outlier_detection": {
       "consecutive_5xx": 100000, // this is deliberate. I wanted to turn it off completely as I don't want it to trigger on 500s
       "interval": "2s",
       "base_ejection_time": "10s",
       "max_ejection_percent": 33,
       "enforcing_consecutive_5xx": 0, // This is also supposed to turn off consecutive_5xx
       "success_rate_minimum_hosts": 2,
       "success_rate_request_volume": 20,
       "consecutive_gateway_failure": 3,
       "enforcing_consecutive_gateway_failure": 100,
       "max_ejection_time": "30s",
       "max_ejection_time_jitter": "1s"
      },

After trying to kill a instance in the cluster I would see these kind of logs in the outlier detection logs

{
   action: EJECT
   cluster_name: bounty
   eject_consecutive_event: {
   }
   enforced: true
   num_ejections: 1
   timestamp: 2024-07-08T18:00:50.740Z
   type: CONSECUTIVE_GATEWAY_FAILURE
   upstream_url: 10.0.209.198:24467
}

{
   action: UNEJECT
   cluster_name: bounty
   enforced: false
   num_ejections: 1
   secs_since_last_action: 7
   timestamp: 2024-07-08T18:00:58.521Z
   type: CONSECUTIVE_5XX
   upstream_url: 10.0.209.198:24467
}

{
   action: EJECT
   cluster_name: bounty
   eject_consecutive_event: {
   }
   enforced: true
   num_ejections: 2
   secs_since_last_action: 0
   timestamp: 2024-07-08T18:00:58.536Z
   type: CONSECUTIVE_GATEWAY_FAILURE
   upstream_url: 10.0.209.198:24467
}

The UNEJECT coming from type CONSECUTIVE_5XX was very confusing as:

EJECT was triggered by Gateway failure and not 500
It unejected from CONSECUTIVE_5XX even though the config is set to a very high threshold and enforcing_consecutive_5xx is set to 0.
The log line also had enforced: false which should have meant it didn't action but in fact it actually did UNEJECT. The documentation is also confusing as it states to only be relevant for action eject, but is logged for uneject also If action is eject, specifies if the ejection was enforced. true means the host was ejected. false means the event was logged but the host was not actually ejected.

It took me a while to figure out why was this happening and that in fact this was because I had my healthcheck configuration set with unhealthy_threshold: 2 and it was the healthcheck(even if first failed, cluster was still considered healthy) that was triggering the UNEJECT and not the CONSECUTIVE_5XX. I managed to get my desired configuration by setting successful_active_health_check_uneject_host to false. It would probably have helped if the section about health checking in the docs were under ejection algorithm and not under gRPC (proposing that change under #35185)

Expected Behaviour Can the logging be changed so that UNEJECT events triggered by an active health check are of type HEALTH_CHECK instead of defaulting to CONSECUTIVE_5XX?

Relevant Links: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/outlier#ejection-algorithm https://www.envoyproxy.io/docs/envoy/latest/api-v3/data/cluster/v3/outlier_detection_event.proto#data-cluster-v3-outlierdetectionevent

cpakulski commented 3 months ago

I agree that this is confusing, but unfortunately it is a "feature" of the current implementation. The same type of event is used for EJECT and UNEJECT and some fields are completely not necessary for UNEJECT, one of them is ejection type. When UNEJECT event is sent, type is left with default value and because it is enum, it is interpreted as CONSECUTIVE_5XX.

The second problem is that internally, outlier detection does not store info about which type caused ejection.

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

NersesAM commented 2 months ago

I think it still needs to be addressed

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

NersesAM commented 1 month ago

․

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

NersesAM commented 1 week ago

keep it

envoyproxy / envoy

Outlier detection healthcheck UNEJECT is confusing #35186