Undeploy hangs because pods are stuck in Terminating state on NotReady nodes

NearNodeFlash / NearNodeFlash.github.io


Undeploy hangs because pods are stuck in Terminating state on NotReady nodes #123

Closed roehrich-hpe closed 5 months ago

roehrich-hpe commented 5 months ago

If our software is running on the system and one of the rabbit nodes disappears from the network and eventually transitions to NotReady, any NNF pods that were on that node will be stuck in Terminating state. In this condition, an nnf-deploy undeploy hangs while trying to delete the DaemonSets. To unstick it, we have to manually force delete all of the pods that belong to our DaemonSets.

We need a better way to handle this.

- More smarts in the deploy.sh scripts in each submodule?
- Investigate whether the 'force' option in ArgoCD's Application delete step will already handle this.
- ?

bdevcich commented 5 months ago

Something to consider here: we need to make sure we're only targeting Terminating pods that are on nodes that no longer exist. If we're reaching too far and killing pods that may be stuck in a Terminating state for other reasons (mounts) on a healthy node, then we may be impacting the kernel of that host.
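To make that targeting concrete, here is a minimal sketch of the selection logic, not something that exists in any of our repos. It assumes the pod and node lists come from the `items` arrays of 'kubectl get pods -A -o json' and 'kubectl get nodes -o json'. The function names are hypothetical. It treats a pod as Terminating when `metadata.deletionTimestamp` is set, and only selects it if its node is either gone or not Ready. Including NotReady nodes (rather than only nodes that no longer exist) is an assumption that may be too broad for some sites.

```python
def not_ready_node_names(nodes):
    """Return the names of nodes whose Ready condition is missing or not "True"."""
    names = set()
    for node in nodes:
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        if ready is None or ready.get("status") != "True":
            names.add(node["metadata"]["name"])
    return names


def stuck_pods(pods, nodes):
    """Select (namespace, name) for Terminating pods on missing or NotReady nodes.

    Terminating pods on healthy nodes are deliberately skipped, since they may
    be waiting on something like an unmount that should not be interrupted.
    """
    known = {n["metadata"]["name"] for n in nodes}
    not_ready = not_ready_node_names(nodes)
    selected = []
    for pod in pods:
        meta = pod["metadata"]
        if "deletionTimestamp" not in meta:
            continue  # pod is not Terminating
        node_name = pod.get("spec", {}).get("nodeName")
        if node_name is None:
            continue  # pod was never scheduled onto a node
        if node_name not in known or node_name in not_ready:
            selected.append((meta["namespace"], meta["name"]))
    return selected


def force_delete_commands(selected):
    """Build the kubectl commands for review; nothing is executed here."""
    return [
        f"kubectl delete pod {name} -n {ns} --force --grace-period=0"
        for ns, name in selected
    ]
```

Returning the commands as strings instead of running them keeps a person in the loop, which seems appropriate given the risk described above.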

roehrich-hpe commented 5 months ago

We've addressed this by not removing k8s namespaces when we undeploy the services. It was the attempt to delete the namespace that was causing 'kubectl delete' to hang, because of the Pod stuck in Terminating state within that namespace.

By leaving the namespace, the 'kubectl delete' does not hang, and k8s is left to clean up the pod whenever it is able to do so.

We've put this same fix into each of our repos. For reference, here's the one for nnf-sos: https://github.com/NearNodeFlash/nnf-sos/pull/251
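For anyone reading this later, the idea behind the fix can be sketched roughly as follows. This is an illustration only and is not the code from the PR above. It assumes the resources to undeploy are available as a list of parsed manifests (dicts), and the helper names are hypothetical. The Namespace objects are dropped before the list is handed to 'kubectl delete', so the delete never waits on a namespace that still holds a Terminating pod.

```python
import json


def without_namespaces(resources):
    """Drop Namespace objects so the undeploy leaves the namespaces in place."""
    return [r for r in resources if r.get("kind") != "Namespace"]


def as_delete_input(resources):
    """Wrap the remaining resources in a v1 List, serialized as JSON."""
    return json.dumps(
        {"apiVersion": "v1", "kind": "List", "items": without_namespaces(resources)},
        indent=2,
    )
```

The JSON could then be piped to 'kubectl delete -f -'. Everything else in the namespace is still deleted, and k8s cleans up the stuck pod whenever it is able to.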