Our current health check implementation is flawed. The external load balancer should not be responsible for health checking the pods, since Kubernetes is already doing that; it should instead be health checking the worker nodes themselves. The correct target depends on the externalTrafficPolicy set on the Service itself.
If externalTrafficPolicy=Local, Kubernetes allocates a dedicated health check node port to serve health checks for the node. If one or more active pods are running on the node, the health check returns 200 and reports how many healthy pods are running; if there are no active pods, it returns 503 and fails the check. The LB must probe this dedicated endpoint so that we respect the lifecycle of both the worker node and the pods running on it.
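As a sketch of the Local case, a Service like the following (the name, selector, and ports are illustrative) causes Kubernetes to allocate spec.healthCheckNodePort automatically; probing /healthz on that port on a given node returns 200 with a JSON body that includes a localEndpoints count when the node has ready pods, and 503 when it does not:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app            # illustrative name
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # traffic only goes to pods on the receiving node
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
  # healthCheckNodePort is allocated by Kubernetes when omitted; the external
  # LB should probe http://<node-ip>:<healthCheckNodePort>/healthz
```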
If externalTrafficPolicy=Cluster, we should instead health check kube-proxy running on the node. This endpoint is controlled by --healthz-bind-address, which defaults to 0.0.0.0:10256. Health checking kube-proxy lets us follow the lifecycle of the node itself: whether it is up and ready to serve traffic, or whether we should stop routing to it because it is tainted, being removed by a scale-down, or being drained for maintenance. In this mode, kube-proxy is responsible for tracking the health of each individual pod and deciding which pods receive traffic. Under our current behaviour, a single unhealthy pod can incorrectly mark an entire worker node as unhealthy, which isn't the case. If/when we make the switch to full Cilium, we will need to make sure we set --kube-proxy-replacement-healthz-bind-address so the same endpoint remains available.
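For reference, the same setting expressed via the kube-proxy configuration file (healthzBindAddress is the config-file equivalent of the --healthz-bind-address flag; the value shown is the default, so this fragment only pins the existing behaviour explicitly):

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
# Address:port for the /healthz endpoint the external LB should probe.
healthzBindAddress: 0.0.0.0:10256
```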
This change delegates pod healthiness to Kubernetes, where it should always have lived. The load balancer will instead health check the Kubernetes components themselves, so we keep sending traffic only to nodes that are up and ready, and stop routing to them when they become unhealthy or are being drained for any number of reasons. Pod unhealthiness will no longer mark nodes as unhealthy. This should also fix a number of edge cases customers are experiencing when using cluster autoscaling or pod autoscalers.
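The LB-side logic described above can be sketched as follows. This is a minimal illustration, not our implementation: the function names are hypothetical, and the only facts it encodes are from this document (Local uses the Service's health check node port, Cluster uses kube-proxy's default 10256, and 200 means keep routing while 503 means drain):

```python
from typing import Optional

# Default port behind kube-proxy's --healthz-bind-address (0.0.0.0:10256).
KUBE_PROXY_HEALTHZ_PORT = 10256


def probe_url(policy: str, node_ip: str,
              health_check_node_port: Optional[int] = None) -> str:
    """Return the URL the external LB should health check for a node."""
    if policy == "Local":
        # Local: probe the per-Service health check node port allocated
        # by Kubernetes (spec.healthCheckNodePort).
        if health_check_node_port is None:
            raise ValueError(
                "externalTrafficPolicy=Local requires the Service's "
                "healthCheckNodePort")
        return f"http://{node_ip}:{health_check_node_port}/healthz"
    # Cluster: probe kube-proxy itself, never the pods directly.
    return f"http://{node_ip}:{KUBE_PROXY_HEALTHZ_PORT}/healthz"


def node_is_routable(status_code: int) -> bool:
    # 200 -> keep routing to this node; 503 (or anything else) -> drain it.
    return status_code == 200
```

For example, `probe_url("Cluster", "10.0.0.5")` yields the kube-proxy endpoint on port 10256, and a 503 from either endpoint drains the node without touching pod-level health, which stays Kubernetes' job.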