Open NersesAM opened 3 months ago
I agree that this is confusing, but unfortunately it is a "feature" of the current implementation. The same type of event is used for EJECT and UNEJECT, and some fields are not needed for UNEJECT at all; one of them is the ejection type. When an UNEJECT event is sent, type is left at its default value, and because it is an enum, that default is interpreted as CONSECUTIVE_5XX.
The second problem is that internally, outlier detection does not store info about which type caused ejection.
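To illustrate the default-value behaviour described above, here is a minimal Python sketch. It is not Envoy's code: the class names are made up for the example, and the assumption is only that CONSECUTIVE_5XX is the zero value of the ejection type enum, as the comment above implies.

```python
from dataclasses import dataclass
from enum import IntEnum


class Action(IntEnum):
    EJECT = 0
    UNEJECT = 1


class EjectionType(IntEnum):
    # Assumption: CONSECUTIVE_5XX is the zero value, so an unset field reads back as it.
    CONSECUTIVE_5XX = 0
    CONSECUTIVE_GATEWAY_FAILURE = 1


@dataclass
class OutlierEvent:
    """Hypothetical stand-in for the shared EJECT/UNEJECT event."""
    action: Action
    # Not assigned on the UNEJECT path, so it keeps the zero default.
    type: EjectionType = EjectionType(0)


eject = OutlierEvent(Action.EJECT, EjectionType.CONSECUTIVE_GATEWAY_FAILURE)
uneject = OutlierEvent(Action.UNEJECT)  # type deliberately left unset

print(eject.type.name)    # CONSECUTIVE_GATEWAY_FAILURE
print(uneject.type.name)  # CONSECUTIVE_5XX
```

The UNEJECT event never chose CONSECUTIVE_5XX; the name is simply how an unset enum field is rendered.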
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
I think it still needs to be addressed
keep it
Title: Outlier detection healthcheck UNEJECT is confusing
Description: I had the following cluster configuration, where I wanted to disable consecutive_5xx completely and enable ejection only for consecutive_gateway_failure:
After killing an instance in the cluster, I would see entries like these in the outlier detection logs:
The UNEJECT coming from type CONSECUTIVE_5XX was very confusing, as it was logged with:
enforced: false
which should have meant that no action was taken, yet the host actually was unejected. The documentation is also confusing: it states that the field is only relevant for the eject action, but it is logged for uneject as well:
If action is eject, specifies if the ejection was enforced. true means the host was ejected. false means the event was logged but the host was not actually ejected.
It took me a while to figure out why this was happening. It turned out to be because I had my healthcheck configuration set with
unhealthy_threshold: 2
and it was the healthcheck (even though the first check failed, the cluster was still considered healthy) that was triggering the UNEJECT, not CONSECUTIVE_5XX. I managed to get my desired configuration by setting successful_active_health_check_uneject_host to false. It would probably have helped if the section about health checking in the docs were under ejection algorithm and not under gRPC (proposing that change under #35185).
Expected Behaviour: Can the logging be changed so that UNEJECT events triggered by an active health check are of type HEALTH_CHECK instead of defaulting to CONSECUTIVE_5XX?
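In the meantime, anyone reading the event log could simply ignore type and enforced on UNEJECT entries. A rough Python sketch of that idea follows. It assumes the log is written as one JSON object per line, with field names matching the OutlierDetectionEvent proto; the sample lines are invented for illustration and are not the logs from this report.

```python
import json

# Hypothetical sample entries; real logs carry more fields (cluster_name, timestamp, ...).
SAMPLE_LOG = """\
{"type": "CONSECUTIVE_GATEWAY_FAILURE", "action": "EJECT", "enforced": true, "upstream_url": "10.0.0.1:8080"}
{"type": "CONSECUTIVE_5XX", "action": "UNEJECT", "enforced": false, "upstream_url": "10.0.0.1:8080"}
"""


def summarize(line: str) -> str:
    """Render one event, hiding fields that carry no meaning for UNEJECT."""
    event = json.loads(line)
    host = event.get("upstream_url", "?")
    if event.get("action") == "UNEJECT":
        # type and enforced are left at their defaults on UNEJECT, so drop them.
        return f"{host} UNEJECT"
    enforced = "enforced" if event.get("enforced") else "logged only"
    return f"{host} EJECT {event.get('type')} ({enforced})"


summaries = [summarize(line) for line in SAMPLE_LOG.splitlines() if line.strip()]
for summary in summaries:
    print(summary)
```

This only changes how the log is read. It cannot tell whether an UNEJECT came from an active health check, since that information is not in the event.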
Relevant Links: https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/upstream/outlier#ejection-algorithm https://www.envoyproxy.io/docs/envoy/latest/api-v3/data/cluster/v3/outlier_detection_event.proto#data-cluster-v3-outlierdetectionevent