digitalocean / digitalocean-cloud-controller-manager

Kubernetes cloud-controller-manager for DigitalOcean (beta)
Apache License 2.0

Fix Load Balancer Health Checks #724

Closed · bbassingthwaite closed this 1 month ago

bbassingthwaite commented 2 months ago

Our current health check implementation is flawed. The external load balancer should not be responsible for health checking the pods since Kubernetes is already doing that. It should instead health check the worker nodes themselves. The correct behaviour also depends on the externalTrafficPolicy set on the Service itself.

If externalTrafficPolicy=Local, then a health check node port is created to serve health checks for the node. If one or more active pods are running on the node, the health check returns a 200 and indicates how many healthy pods are running. If there are no active pods, it returns a 503 and the health check fails. It is necessary for the LB to health check this dedicated endpoint so that we respect the lifecycle of the worker node and the pods running on it.
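For illustration, here is a minimal sketch of what probing that dedicated endpoint could look like. It assumes the port comes from the Service's spec.healthCheckNodePort field and that kube-proxy serves it over plain HTTP at /healthz; the node address and port in main are hypothetical:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// probeLocalHealth checks the per-node health check port that kube-proxy
// opens for a Service with externalTrafficPolicy=Local. A 200 means the
// node has at least one ready local endpoint; a 503 means it has none.
// nodeAddr and healthCheckNodePort are illustrative inputs: in practice
// the port would come from the Service's spec.healthCheckNodePort field.
func probeLocalHealth(nodeAddr string, healthCheckNodePort int32) (bool, error) {
	client := &http.Client{Timeout: 3 * time.Second}
	url := fmt.Sprintf("http://%s:%d/healthz", nodeAddr, healthCheckNodePort)
	resp, err := client.Get(url)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK, nil
}

func main() {
	// Hypothetical node IP and healthCheckNodePort, for illustration only.
	healthy, err := probeLocalHealth("10.0.0.5", 32145)
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println("node has local endpoints:", healthy)
}
```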

If externalTrafficPolicy=Cluster, then we should health check kube-proxy running on the node. This is controlled by --healthz-bind-address, which defaults to 0.0.0.0:10256. Health checking kube-proxy lets us follow the lifecycle of the node: whether it is up and ready to serve traffic, or whether we should stop routing traffic to it because it is tainted, is being removed due to scaling down, or is being removed for maintenance. In this case, kube-proxy is responsible for understanding the health of each individual pod and managing which pods should receive traffic. With our current behaviour, a single unhealthy pod can incorrectly mark an entire worker node as unhealthy. If/when we make the switch to full Cilium, we will need to make sure we set --kube-proxy-replacement-healthz-bind-address.
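As a companion sketch, probing kube-proxy on its default healthz port could look like the following. The /healthz path and the node address in main are assumptions for illustration:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// kubeProxyHealthzPort matches kube-proxy's default --healthz-bind-address port.
const kubeProxyHealthzPort = 10256

// probeKubeProxy checks kube-proxy's healthz endpoint on a worker node.
// A 200 indicates the node-level proxy is up and able to route traffic;
// anything else, or a connection error, means the LB should stop sending
// traffic to this node.
func probeKubeProxy(nodeAddr string) bool {
	client := &http.Client{Timeout: 3 * time.Second}
	url := fmt.Sprintf("http://%s:%d/healthz", nodeAddr, kubeProxyHealthzPort)
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	// Hypothetical node IP, for illustration only.
	fmt.Println("kube-proxy healthy:", probeKubeProxy("10.0.0.5"))
}
```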

This change delegates pod healthiness to Kubernetes, where it should always have lived. The load balancer will instead health check the Kubernetes components themselves to ensure they are up and ready to process traffic, and we can stop sending traffic to nodes when they are no longer healthy or are being drained for any number of reasons. Pod unhealthiness will no longer mark nodes as unhealthy. This should also fix a number of edge cases customers are experiencing when using cluster autoscaling or pod autoscalers.

timoreimann commented 2 months ago

Forgot one point: can you please also update the annotations documentation accordingly?