gardener / machine-controller-manager

Declarative way of managing machines for Kubernetes cluster
Apache License 2.0

Check, update node label on machine obj prior to drain,termination #887

Closed elankath closed 9 months ago

elankath commented 9 months ago

What this PR does / why we need it:

When a Node is never associated with its Machine, i.e. the machine object never gets machine.Labels[v1alpha1.NodeLabelKey] set after machine creation, the Node object is not deleted during the machine deletion flow. (The label update can be missed if the machine object update fails transiently.)

After some time, the dangling Node object then gets the NotManagedByMCM annotation.

This PR checks for the node label on the machine object and sets it if it is missing, before drain and termination.
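The check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the Machine struct, ensureNodeLabel, and nodeToDelete are hypothetical names, the label key "node" is taken from the log output in the comments below, and the node name is assumed to come from the provider's VM status lookup. Overwriting a label that differs from the reported node name is also an assumption of the sketch.

```go
package main

import "fmt"

// nodeLabelKey mirrors the "node" label key seen in the logs.
// In MCM it is referenced as v1alpha1.NodeLabelKey.
const nodeLabelKey = "node"

// Machine is a hypothetical, stripped-down stand-in for the MCM Machine object.
type Machine struct {
	Name   string
	Labels map[string]string
}

// ensureNodeLabel sets the node label on the machine when it is missing or
// differs from nodeName. nodeName is assumed to come from the provider's
// VM status lookup. It reports whether the machine was changed, so the
// caller knows an update must be persisted before drain starts.
func ensureNodeLabel(m *Machine, nodeName string) (bool, error) {
	if nodeName == "" {
		return false, fmt.Errorf("node name unknown for machine %q", m.Name)
	}
	if m.Labels[nodeLabelKey] == nodeName {
		return false, nil // label already correct, nothing to do
	}
	if m.Labels == nil {
		m.Labels = map[string]string{}
	}
	m.Labels[nodeLabelKey] = nodeName
	return true, nil
}

// nodeToDelete returns the node name the deletion flow would act on.
// Without the label there is nothing to delete, which is how the Node
// object ends up dangling.
func nodeToDelete(m *Machine) (string, bool) {
	name := m.Labels[nodeLabelKey]
	return name, name != ""
}
```

In this sketch, calling nodeToDelete on a machine without the label yields no node, while calling ensureNodeLabel first makes the same lookup return the node name.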

Which issue(s) this PR fixes: Fixes #875

Special notes for your reviewer:

Release note:

Fix for edge case of Node object deletion missed during machine termination.

elankath commented 9 months ago

Tested the fix on GCP (for this provider, the node label is the same as the machine name). I removed the node label from the machine object and initiated machine deletion. The node label is now set again prior to drain and deletion, so node deletion succeeds even when the node label was missing.

I1226 10:08:31.120914   86315 machine.go:128] reconcileClusterMachine: Start for "shoot--i034796--g1-w1-z1-788d9-hlgnx" with phase:"Terminating", description:"Set machine status to termination. Now, getting VM Status"
I1226 10:08:33.742750   86315 machine_util.go:1685] Updating "node" label on machine "shoot--i034796--g1-w1-z1-788d9-hlgnx" to "shoot--i034796--g1-w1-z1-788d9-hlgnx"
I1226 10:08:33.926901   86315 machine_util.go:1696] Updated "node" label on machine "shoot--i034796--g1-w1-z1-788d9-hlgnx" to "shoot--i034796--g1-w1-z1-788d9-hlgnx"
I1226 10:08:41.699711   86315 machine_util.go:1104] Normal delete/drain has been triggerred for machine "shoot--i034796--g1-w1-z1-788d9-hlgnx"
...
I1226 10:11:05.334091   86315 machine_controller.go:131] VM "gce:///OMITTED/shoot--i034796--g1-w1-z1-788d9-hlgnx" for Machine "shoot--i034796--g1-w1-z1-788d9-hlgnx" was terminated succesfully
I1226 10:11:10.681394   86315 machine_util.go:1357] Deleting node "shoot--i034796--g1-w1-z1-788d9-hlgnx" associated with machine "shoot--i034796--g1-w1-z1-788d9-hlgnx"
I1226 10:11:16.055535   86315 machine.go:648] Machine "shoot--i034796--g1-w1-z1-788d9-hlgnx" with providerID "gce:///OMITTED/shoot--i034796--g1-w1-z1-788d9-hlgnx" and nodeName "shoot--i034796--g1-w1-z1-788d9-hlgnx" deleted successfully

elankath commented 9 months ago

Tested the fix on AWS (for this provider, the node label is different from the machine name). I removed the node label from the machine object and initiated machine deletion. The node label is now set again prior to drain and deletion, so node deletion succeeds even when the node label was missing.

I1226 10:37:06.233292   90959 machine_util.go:1685] Updating "node" label on machine "shoot--i034796--aw3-a-z1-9cc57-q6qbl" to "ip-10-180-29-167.eu-west-1.compute.internal"
I1226 10:37:06.405762   90959 machine_util.go:1696] Updated "node" label on machine "shoot--i034796--aw3-a-z1-9cc57-q6qbl" to "ip-10-180-29-167.eu-west-1.compute.internal"
I1226 10:37:55.994989   90959 core.go:285] Machine deletion request has been recieved for "shoot--i034796--aw3-a-z1-9cc57-q6qbl"
I1226 10:37:56.382539   90959 core.go:311] VM "aws:///eu-west-1/i-078071299a1bce4ca" for Machine "shoot--i034796--aw3-a-z1-9cc57-q6qbl" was terminated successfully
I1226 10:38:01.724111   90959 machine_util.go:1357] Deleting node "ip-10-180-29-167.eu-west-1.compute.internal" associated with machine "shoot--i034796--aw3-a-z1-9cc57-q6qbl"
I1226 10:38:01.724126   90959 machine_util.go:1365] Deletion of Node Object "ip-10-180-29-167.eu-west-1.compute.internal" is successful. Initiate machine object finalizer removal
I1226 10:38:07.079213   90959 machine.go:648] Machine "shoot--i034796--aw3-a-z1-9cc57-q6qbl" with providerID "aws:///eu-west-1/i-078071299a1bce4ca" and nodeName "ip-10-180-29-167.eu-west-1.compute.internal" deleted successfully