Open ramaraochavali opened 2 years ago
@howardjohn fyi
cc @snowp
Thanks for your suggestions. I have some questions. 🤔 For an endpoint, when will you treat it as unready? If there are some intermittent readiness failures, how does Istio distinguish unready from unhealthy? If an endpoint is killed by the liveness probe and then restarted, how does Istio distinguish unready from unhealthy?
For an endpoint, when will you treat it as unready?
It is based on the readiness probe that the user has defined in K8s.
If there are some intermittent readiness failures, how does Istio distinguish unready from unhealthy?
It does not distinguish between unready and unhealthy. All unready endpoints are sent to Envoy as unhealthy, because there is no separate status (which is what this feature request is for). Envoy also marks endpoints as unhealthy based on passive health checking (outlier detection).
If an endpoint is killed by the liveness probe and then restarted, how does Istio distinguish unready from unhealthy?
Same as above
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
not stale
Still is needed
@wbpcode Do you think it is reasonable to add this status?
@mattklein123 can you please mark this "help wanted"?
When a control plane has to send "Unready" K8s endpoints to Envoy and Envoy goes into panic mode, Envoy sends traffic to the "Unready" endpoints. This breaks the Kubernetes contract that "Unready" endpoints should never receive traffic.
In Istio, we send "Unready" endpoints with "Unhealthy" status, so the panic threshold algorithm cannot distinguish between "Unready" and "Unhealthy".
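To make the loss of information concrete, here is a minimal sketch of that mapping. It is not Istio's actual code: `to_health_status` is a hypothetical helper, and the enum only mirrors a subset of the values in Envoy's `core.v3.HealthStatus`.

```python
from enum import Enum


class HealthStatus(Enum):
    # Subset of Envoy's core.v3.HealthStatus values; note there is no UNREADY.
    UNKNOWN = 0
    HEALTHY = 1
    UNHEALTHY = 2


def to_health_status(ready: bool) -> HealthStatus:
    """Hypothetical control-plane helper: K8s readiness -> Envoy health status."""
    # An unready endpoint has nowhere to go except UNHEALTHY, so Envoy can no
    # longer tell it apart from an endpoint that is unhealthy for other reasons.
    return HealthStatus.HEALTHY if ready else HealthStatus.UNHEALTHY
```

Once the endpoint reaches Envoy as `UNHEALTHY`, the fact that it was specifically unready is gone.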
Please refer to https://github.com/istio/istio/issues/18367 for why we send "Unready" endpoints. This is also needed for warmup duration (slow start mode), so that intermittent readiness failures (after the service is ready for the first time) do not cause spurious warmups.
The feature request is to add an additional status for "UNREADY" endpoints that the load balancer can exclude when considering hosts in panic threshold scenarios.
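A rough sketch of the requested behavior, under stated assumptions: `Status.UNREADY` is the proposed status and does not exist in Envoy today, `eligible_hosts` is a hypothetical function rather than Envoy's load balancer code, and counting unready hosts in the denominator of the healthy percentage is just one possible choice. The 50% default matches Envoy's default panic threshold.

```python
from enum import Enum


class Status(Enum):
    HEALTHY = "HEALTHY"
    UNHEALTHY = "UNHEALTHY"
    UNREADY = "UNREADY"  # proposed status; hypothetical, not in Envoy today


def eligible_hosts(hosts: dict, panic_threshold: float = 50.0) -> list:
    """Return the hosts a load balancer would pick from.

    hosts maps a host name to its Status.
    """
    if not hosts:
        return []
    healthy = [name for name, status in hosts.items() if status is Status.HEALTHY]
    healthy_pct = 100.0 * len(healthy) / len(hosts)
    if healthy_pct >= panic_threshold:
        # Normal mode: only healthy hosts receive traffic.
        return healthy
    # Panic mode: today every host is eligible regardless of health.
    # The proposal is to keep UNREADY hosts excluded even here.
    return [name for name, status in hosts.items() if status is not Status.UNREADY]
```

For example, with one healthy, one unhealthy, and two unready hosts, the healthy percentage is 25%, so the sketch enters panic mode and returns only the healthy and unhealthy hosts.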