carvel-dev / kapp-controller

Continuous delivery and package management for Kubernetes.
https://carvel.dev/kapp-controller
Apache License 2.0

Health Check (Liveness and Readiness probe for kapp controller) #771

Open leonard520 opened 2 years ago

leonard520 commented 2 years ago

Describe the problem/challenge you have
I am wondering whether a health check is needed to maintain kapp-controller in a production environment, to help detect whether the controller still works.

Describe the solution you'd like
Like other Kubernetes operators, add meaningful liveness and readiness probes.

Anything else you would like to add:


Vote on this request

This is an invitation to the community to vote on issues, to help us prioritize our backlog. Use the "smiley face" up to the right of this comment to vote.

👍 "I would like to see this addressed as soon as possible" 👎 "There are other more important things to focus on right now"

We are also happy to receive and review Pull Requests if you want to help work on this issue.

joe-kimmel-vmw commented 2 years ago

@leonard520 thanks for filing this issue!

Do you have any specific situations or use-cases where you've been impacted?

For instance, as I understand it, the general point of a liveness probe is that the deployment can be self-healing: the control plane will restart the container when it crashes. Just to make sure this behavior was working, I added a panic to one of our reconcilers and redeployed. I created a resource to trigger the panic, and was able to observe that we crashed and then recovered from the crash:

kubectl get pods -n kapp-controller --watch                                                      [1/07/22| 3:53PM]
NAME                              READY   STATUS    RESTARTS   AGE
kapp-controller-7dfb84b86-98lgt   2/2     Running   0          3m31s
kapp-controller-7dfb84b86-98lgt   1/2     Error     0          4m24s
kapp-controller-7dfb84b86-98lgt   2/2     Running   1 (2s ago)   4m25s

I hope you can share more with us about how you're using kapp-controller, and how we can improve the experience!

leonard520 commented 2 years ago

Hi @joe-kimmel-vmw, from my understanding, even without a defined liveness probe k8s will restart the pod when the PID 1 process crashes. However, I am not sure whether there could be a case where the main thread is still running but the controller is no longer able to respond. I have seen other k8s operators provide a default health check, so I am just wondering whether kapp-controller needs a similar enhancement.

I think they are using https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/healthz#CheckHandler
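
For context, here is a minimal sketch of how operators built on controller-runtime commonly wire these endpoints. This is illustrative only, not kapp-controller's actual setup; the bind address ":8081" and the check names are assumptions:

```go
// Minimal sketch (not kapp-controller code): exposing /healthz and /readyz
// from a controller-runtime manager so Deployment probes have something to hit.
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

func main() {
	// HealthProbeBindAddress tells the manager to serve /healthz and /readyz.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		HealthProbeBindAddress: ":8081", // assumed port, for illustration only
	})
	if err != nil {
		os.Exit(1)
	}

	// healthz.Ping simply succeeds while the HTTP server is responsive; a
	// more meaningful check could verify cache sync or reconciler progress.
	if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
		os.Exit(1)
	}
	if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
		os.Exit(1)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

The Deployment would then point its livenessProbe/readinessProbe at those paths on the probe port.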

cppforlife commented 2 years ago

However, I am not sure whether there could be a case where the main thread is still running but the controller is no longer able to respond.

We haven't observed a direct need for a liveness probe yet, but in theory, yeah, it might be helpful if something makes controller-runtime goroutines "stuck". I think it's probably worth waiting for somebody to report a problem where this would have been useful, so that we understand the underlying conditions.
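
To make that hypothetical concrete, a check that catches "stuck" goroutines could look roughly like the sketch below. Everything here (the health package, Touch, StuckChecker, and the progress-tracking approach) is made up for illustration and does not exist in kapp-controller:

```go
// Hypothetical sketch of a "stuck reconciler" liveness check; none of these
// names exist in kapp-controller.
package health

import (
	"fmt"
	"net/http"
	"sync/atomic"
	"time"
)

// lastActivity records the unix-nano timestamp of the most recent progress.
var lastActivity atomic.Int64

// Touch would be called from reconcilers (or a periodic worker) whenever
// they make progress, including once at startup.
func Touch() { lastActivity.Store(time.Now().UnixNano()) }

// StuckChecker matches controller-runtime's healthz.Checker signature
// (func(*http.Request) error) and reports unhealthy when no progress has
// been recorded within maxAge, letting a livenessProbe restart a controller
// whose goroutines are wedged even though PID 1 is still alive.
func StuckChecker(maxAge time.Duration) func(*http.Request) error {
	return func(_ *http.Request) error {
		last := time.Unix(0, lastActivity.Load())
		if since := time.Since(last); since > maxAge {
			return fmt.Errorf("no reconcile activity for %s", since)
		}
		return nil
	}
}
```

Such a checker would be registered with mgr.AddHealthzCheck like the Ping example above; whether the extra restarts would actually help is exactly the open question in this thread.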

srm09 commented 2 years ago

Coming from the CAPI/CAPV side of the table, kapp-controller is used to deploy the CPI/CSI pods, which eventually move the nodes to a Ready state. We have a use case wherein a MachineHealthCheck can delete a node that is not in the Ready state. Because of how this is handled (CAPI does not drain this node), the kapp-controller pod shows up as Running even when the underlying VM for the node has been deleted. The side effect of this is that the CPI/CSI packages never get deployed on the cluster, which keeps it in a stalled state forever (since the kapp-controller pod is reported as Running) when in reality there is no pod running.

joe-kimmel-vmw commented 2 years ago

thanks @srm09 - tbh this sounds like buggy behavior elsewhere in the system, but you're saying it could be fixed if we had health checks?

srm09 commented 2 years ago

I wouldn't say buggy behavior, but sometimes, due to environment delays/network latency, node provisioning might take time. CAPI uses the provider ID as a unique identifier for the node, which in the case of vSphere is set by CPI. Due to the above delays, we have seen situations arise wherein the node gets deleted but the kapp-controller pod is still shown as Running, yet when we query the logs the api-server shows an unreachable error, indicating that the reported state of the kapp pod is stale/misleading. Even setting aside pedantry about the reported state, the more important point is that package reconciliation never happens because the pod does not exist.

srm09 commented 2 years ago

I can probably simulate a similar scenario this way: delete the VM on which kapp-controller is running. I am pretty sure this would cause a similar situation to arise.

cppforlife commented 2 years ago

the kapp-controller pod shows up as Running even when the underlying VM for the node has been deleted.

I'm not sure how a readiness/liveness probe would help with that. If the node is deleted, then in theory all pods on that node should be deleted as well. I believe the kubelet is what maintains the state of readiness/liveness probes, and it would be gone, so there would be nothing left to update them on the pod resource.

srm09 commented 2 years ago

I have seen issues during testing wherein the VM corresponding to a node got deleted without the node actually being drained/cordoned or removed from the cluster. This causes the kapp-controller pod to show as Running, even though when you query for, say, the logs of the pod, the api-server shows an error that the pod IP isn't reachable. At least in our case, kapp is being used to deploy some critical cluster addons, and a non-existent/zombie kapp-controller keeps the cluster in a stalled state.

cppforlife commented 2 years ago

^ Agreed and understood, but I don't see how a readinessProbe or livenessProbe can help with this particular scenario, since afaik they are executed by the VM itself (the kubelet), and here the VM is gone. Overall this sounds like a general Kubernetes problem: what is responsible for deleting pods and nodes for which the underlying VMs are gone?