envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Add additional health status to Endpoints #22120

Open ramaraochavali opened 2 years ago

ramaraochavali commented 2 years ago

When a control plane has to send "Unready" Kubernetes endpoints to Envoy and Envoy goes into panic mode, Envoy sends traffic to those "Unready" endpoints. This breaks the Kubernetes contract that "Unready" endpoints should never receive traffic.

In Istio, we send "Unready" endpoints with "Unhealthy" status, so the panic threshold algorithm cannot distinguish between "Unready" and "Unhealthy".

Please refer to https://github.com/istio/istio/issues/18367 for why we send "Unready" endpoints. This is also needed for warmup duration (slow start mode), so that intermittent readiness failures (after the service has become ready for the first time) do not cause spurious warmups.

The feature request is to add an additional status for "UNREADY" endpoints that the load balancer can exclude when considering hosts in panic threshold scenarios.
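To make the request concrete, here is a minimal sketch of the selection behavior being asked for. It is a simplified model and not Envoy's actual load balancer code: the `Host` type, the `select_hosts` function, the `exclude_unready` flag, and the `UNREADY` value are all hypothetical names used for illustration, and the 50% threshold is just the example value used here.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    HEALTHY = "HEALTHY"
    UNHEALTHY = "UNHEALTHY"
    # Hypothetical status proposed in this issue; not an existing Envoy value.
    UNREADY = "UNREADY"


@dataclass(frozen=True)
class Host:
    address: str
    health: Health


def select_hosts(hosts, panic_threshold_percent=50.0, exclude_unready=True):
    """Return the hosts eligible for load balancing (simplified model)."""
    if not hosts:
        return []
    healthy = [h for h in hosts if h.health is Health.HEALTHY]
    healthy_percent = 100.0 * len(healthy) / len(hosts)
    if healthy_percent >= panic_threshold_percent:
        # Normal mode: only healthy hosts receive traffic.
        return healthy
    # Panic mode: health is ignored, but with the proposed status
    # the UNREADY hosts can still be kept out of rotation.
    if exclude_unready:
        return [h for h in hosts if h.health is not Health.UNREADY]
    return list(hosts)
```

With one healthy, one unhealthy, and two unready hosts, the healthy share is 25%, so this model enters panic mode. With `exclude_unready=False` all four hosts are eligible, which mirrors the behavior described above; with `exclude_unready=True` only the healthy and unhealthy hosts are.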

ramaraochavali commented 2 years ago

@howardjohn fyi

wbpcode commented 2 years ago

cc @snowp

wbpcode commented 2 years ago

Thanks for your suggestions. I have some questions. 🤔 For an endpoint, when will you treat it as unready? If there are some intermittent readiness failures, how does Istio distinguish unready and unhealthy? If an endpoint is killed by the liveness probe and then restarted, how does Istio distinguish unready and unhealthy?

ramaraochavali commented 2 years ago

> For an endpoint, when will you treat it as unready?

It is based on the readiness probe that the user has defined in Kubernetes.

> If there are some intermittent readiness failures, how does Istio distinguish unready and unhealthy?

It does not distinguish between unready and unhealthy. All unready endpoints are sent to Envoy as unhealthy (due to the lack of a separate status, which is what this feature request is for). Based on passive health checking (outlier detection), Envoy marks them as Unhealthy.
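As a rough illustration of the mapping described here, the sketch below shows how readiness collapses into the health status today and how a separate status would keep it distinct. The function name `to_health_status` and the `has_unready_status` flag are hypothetical, `UNREADY` is the value proposed in this issue rather than an existing one, and this is not Istio's actual code.

```python
def to_health_status(ready: bool, has_unready_status: bool = False) -> str:
    """Map a Kubernetes readiness result to the status sent to Envoy."""
    if ready:
        return "HEALTHY"
    # Without a dedicated status, "not ready" is reported as UNHEALTHY,
    # so the receiver cannot tell the two conditions apart.
    return "UNREADY" if has_unready_status else "UNHEALTHY"
```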

> If an endpoint is killed by the liveness probe and then restarted, how does Istio distinguish unready and unhealthy?

Same as above

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

ramaraochavali commented 2 years ago

not stale

ramaraochavali commented 2 years ago

This is still needed.

ramaraochavali commented 2 years ago

@wbpcode Do you think it is reasonable to add this status?

ramaraochavali commented 2 years ago

@mattklein123 can you please mark this "help wanted"?