gardener / machine-controller-manager

Declarative way of managing machines for Kubernetes cluster
Apache License 2.0

Add a way to temporarily prevent node deletion a.k.a Freeze machine #818

Open etiennnr opened 1 year ago

etiennnr commented 1 year ago

How to categorize this issue?

/area quality robustness /kind enhancement /priority 3

What would you like to be added: A way to temporarily prevent a node from being deleted. For example, when we cordon/drain a node to investigate it, it sometimes gets deleted automatically because it is not healthy. It would be really useful to be able to keep a node alive so we can investigate it and find the root cause of a given problem.

It could be something like an annotation added to a node resource (ideally not the machine, since a shoot owner might also find this useful). I also think this should come with another annotation holding something like a timeout threshold (which can be increased if need be), so that people do not forget a node in that state.

Update (2 Aug meeting with Etienne)

Investigation would be needed in the following phases:

Terminating WON'T need any investigation, as the resources are in the deletion phase and could already have been partly deleted by the time the machine is marked to be ignored for deletion.

Why is this needed: This would be useful for troubleshooting nodes that suddenly stop working as expected (RCA purposes).

rishabh-11 commented 8 months ago

Post Grooming Decision:-

The annotation will have a timer. The machine will be deleted after the timer expires. Setting the annotation during a rolling update is allowed. If the rolling update is cancelled/paused (option not yet available), the machine will still be considered frozen until the annotation is removed. We won't drain machines before freezing them. No option to unfreeze the machine will be made available. Once the timer expires, the machine will be terminated.

Two options:-

  1. Have a separate machine deployment per worker pool dedicated to hosting frozen machines. It will not have a corresponding node group and will not be part of the rolling update. CA won't play a part in this, as no node group will be associated with the special machine deployment.
  2. Include the frozen machine in the machine deployment's replica count and suspend any life cycle operations on it. This may cause the rolling update to be blocked. In this approach, CA will have to be adapted to ignore the frozen machines that are part of this machine deployment.

We need to check the code to figure out which option is more feasible.