Edge case where Node Deletion is missed if machine 'node' label is not present

elankath commented 11 months ago

How to categorize this issue?

/area robustness /kind bug /priority 2

What happened:

When a Node is never associated with its Machine, i.e. the machine object never has machine.Labels[v1alpha1.NodeLabelKey] set after machine creation, the Node object is not deleted during the deletion flow. (The label update can be missed if the machine object update transiently fails.)

Then after some time, the dangling Node object gets the NotManagedByMCM annotation.
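To make the failure mode concrete, here is a minimal sketch of a deletion step that relies only on the label. The Machine type, the deleteNodeFor helper and the in-memory node set are hypothetical stand-ins for illustration, not the actual MCM code; the label key "node" is taken from the issue title.

```go
package main

import "fmt"

// NodeLabelKey stands in for v1alpha1.NodeLabelKey; the value "node" is
// assumed from the issue title and the log output.
const NodeLabelKey = "node"

// Machine is a hypothetical, stripped-down stand-in for the Machine object.
type Machine struct {
	Name   string
	Labels map[string]string
}

// deleteNodeFor models a deletion step that looks up the Node purely via the
// machine's node label. It reports whether a Node deletion was attempted.
func deleteNodeFor(m Machine, nodes map[string]bool) bool {
	nodeName := m.Labels[NodeLabelKey]
	if nodeName == "" {
		// No association recorded on the machine: the Node is silently skipped.
		return false
	}
	delete(nodes, nodeName)
	return true
}

func main() {
	nodes := map[string]bool{"node-a": true}

	// The label was never set (or was removed), so the Node is left dangling.
	m := Machine{Name: "machine-a", Labels: map[string]string{}}
	attempted := deleteNodeFor(m, nodes)

	fmt.Println(attempted, len(nodes)) // prints: false 1
}
```

With the label in place the same call removes the Node, which is why the problem only shows up when the association was missed.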

What you expected to happen: Node object should always be deleted prior to the instance VM Termination and Machine object deletion, even if the association was missed during instance creation.

How to reproduce it (as minimally and precisely as possible):

Launch a Machine and then remove its node label.
Then delete the machine, triggering the delete flow.
After the machine object is deleted, the corresponding Node is still present.

Anything else we need to know?:

Environment:

Kubernetes version (use kubectl version): any
Cloud provider or hardware configuration: any
Others:

gardener-robot commented 11 months ago

@elankath You have mentioned internal references in the public. Please check.

elankath commented 10 months ago

Tested the fix on GCP (for this provider the node label is the same as the machine name). Removed the node label and initiated machine deletion. The node label is now set again prior to drain and deletion, so Node deletion succeeds even when the node label is missing.

I1226 10:08:31.120914   86315 machine.go:128] reconcileClusterMachine: Start for "shoot--i034796--g1-w1-z1-788d9-hlgnx" with phase:"Terminating", description:"Set machine status to termination. Now, getting VM Status"
I1226 10:08:33.742750   86315 machine_util.go:1685] Updating "node" label on machine "shoot--i034796--g1-w1-z1-788d9-hlgnx" to "shoot--i034796--g1-w1-z1-788d9-hlgnx"
I1226 10:08:33.926901   86315 machine_util.go:1696] Updated "node" label on machine "shoot--i034796--g1-w1-z1-788d9-hlgnx" to "shoot--i034796--g1-w1-z1-788d9-hlgnx
I1226 10:08:41.699711   86315 machine_util.go:1104] Normal delete/drain has been triggerred for machine "shoot--i034796--g1-w1-z1-788d9-hlgnx"
...
I1226 10:11:05.334091   86315 machine_controller.go:131] VM "gce:///OMITTED/shoot--i034796--g1-w1-z1-788d9-hlgnx" for Machine "shoot--i034796--g1-w1-z1-788d9-hlgnx" was terminated succesfully
I1226 10:11:10.681394   86315 machine_util.go:1357] Deleting node "shoot--i034796--g1-w1-z1-788d9-hlgnx" associated with machine "shoot--i034796--g1-w1-z1-788d9-hlgnx"
I1226 10:11:16.055535   86315 machine.go:648] Machine "shoot--i034796--g1-w1-z1-788d9-hlgnx" with providerID "gce:///OMITTED/shoot--i034796--g1-w1-z1-788d9-hlgnx" and nodeName "shoot--i034796--g1-w1-z1-788d9-hlgnx" deleted successfully
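The behaviour seen in the log above (label restored before drain, then Node deleted) can be sketched roughly as follows. This is only an illustration under stated assumptions: the Machine type, the nodeNameLookup hook and ensureNodeLabel are hypothetical names, not the functions in machine_util.go, and the lookup simply returns the machine name because that is how the node is named on GCP in this test.

```go
package main

import (
	"errors"
	"fmt"
)

// NodeLabelKey stands in for v1alpha1.NodeLabelKey (assumed value "node").
const NodeLabelKey = "node"

// Machine is a hypothetical, stripped-down stand-in for the Machine object.
type Machine struct {
	Name   string
	Labels map[string]string
}

// nodeNameLookup is a hypothetical hook that asks the provider which node
// backs the given machine.
type nodeNameLookup func(machineName string) (string, error)

// ensureNodeLabel restores the node label when it is missing, so that the
// later drain and Node deletion steps can find the Node. It reports whether
// the machine was changed.
func ensureNodeLabel(m *Machine, lookup nodeNameLookup) (bool, error) {
	if m.Labels[NodeLabelKey] != "" {
		return false, nil // association already present, nothing to do
	}
	nodeName, err := lookup(m.Name)
	if err != nil {
		return false, fmt.Errorf("resolving node for machine %q: %w", m.Name, err)
	}
	if nodeName == "" {
		return false, errors.New("provider returned an empty node name")
	}
	if m.Labels == nil {
		m.Labels = map[string]string{}
	}
	m.Labels[NodeLabelKey] = nodeName
	return true, nil
}

func main() {
	// Assumption for the demo: node name equals machine name, as on GCP.
	sameAsMachine := func(machineName string) (string, error) { return machineName, nil }

	m := &Machine{Name: "machine-a"} // label missing, as in the reproduction
	changed, err := ensureNodeLabel(m, sameAsMachine)
	fmt.Println(changed, err, m.Labels[NodeLabelKey])
}
```

Running the check at the start of the delete flow, before drain, means the later steps do not need a separate code path for the unlabelled case.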

gardener / machine-controller-manager

Edge case where Node Deletion is missed if machine 'node' label is not present #875