Open etiennnr opened 1 year ago
Post-grooming decision:
- The annotation will have a timer. Once the timer expires, the machine will be terminated.
- Setting the annotation during a rolling update is allowed. If the rolling update is cancelled/paused (option not yet available), the machine will still be considered frozen until the annotation is removed.
- We won't drain machines before freezing.
- No option to unfreeze the machine will be made available.
Two options:-
We need to check the code to figure out which option is more feasible.
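The timer semantics from the decision above could be sketched roughly as follows. This is only an illustration, not the machine-controller-manager implementation: the annotation key `example.invalid/freeze-until`, the RFC 3339 value format, and the names `Decision`, `decide` and `decideAt` are all assumptions made for this sketch, since the issue does not specify any of them.

```go
package main

import (
	"fmt"
	"time"
)

// freezeUntilAnnotation is a HYPOTHETICAL annotation key; the issue does not name one.
// Its value is assumed to be an RFC 3339 timestamp at which the freeze expires.
const freezeUntilAnnotation = "example.invalid/freeze-until"

// Decision is what a controller would do with a machine that is due for deletion.
type Decision int

const (
	DeleteNow  Decision = iota // no freeze annotation, or the timer has expired
	KeepFrozen                 // annotation present and the timer is still running
)

// decide applies the timer rule: a machine stays frozen only while the
// annotation is present and its expiry time lies in the future.
// Assumption: an unparsable value is reported as an error and is not
// treated as a freeze.
func decide(annotations map[string]string, now time.Time) (Decision, error) {
	raw, ok := annotations[freezeUntilAnnotation]
	if !ok {
		return DeleteNow, nil
	}
	until, err := time.Parse(time.RFC3339, raw)
	if err != nil {
		return DeleteNow, fmt.Errorf("invalid %s value %q: %w", freezeUntilAnnotation, raw, err)
	}
	if now.Before(until) {
		return KeepFrozen, nil
	}
	return DeleteNow, nil
}

// decideAt is a small convenience wrapper for examples: it takes the
// current time as an RFC 3339 string instead of a time.Time.
func decideAt(annotations map[string]string, nowRFC3339 string) (Decision, error) {
	now, err := time.Parse(time.RFC3339, nowRFC3339)
	if err != nil {
		return DeleteNow, fmt.Errorf("invalid current time %q: %w", nowRFC3339, err)
	}
	return decide(annotations, now)
}

func main() {
	annotations := map[string]string{freezeUntilAnnotation: "2024-08-02T12:00:00Z"}
	// Evaluate once before the expiry and once exactly at the expiry.
	for _, now := range []string{"2024-08-02T11:00:00Z", "2024-08-02T12:00:00Z"} {
		d, err := decideAt(annotations, now)
		fmt.Println(now, "frozen:", d == KeepFrozen, "err:", err)
	}
}
```

Storing an absolute expiry time rather than a duration means the check needs no extra state, and extending the freeze is just a matter of overwriting the annotation value. Whether the real feature would do it this way is one of the points that still needs checking against the code.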
How to categorize this issue?
/area quality robustness /kind enhancement /priority 3
What would you like to be added: A way to temporarily prevent a node from being deleted. For example, when we cordon/drain a node to investigate it, it is sometimes deleted automatically because it is unhealthy. It would be really useful to be able to keep a node alive in order to investigate it and find the root cause of a given problem.
It could be something like an annotation added to a `node` resource (ideally not `machine`, since shoot owners might also find this useful). I also think there should be a second annotation carrying something like a timeout threshold (which can be increased if need be) so that people do not forget a node left in that state.

Update: 2 Aug meeting with Etienne
Investigation would be needed in the following phases:
- `Pending` (cases where the machine is not joining)
- `Unknown` machine (pods not working, so cordon/drain the node and then inspect)
- `Running` machine (pods not working but the machine is `Running`, probably because the issue couldn't be tracked through a node condition)

`Terminating` WON'T need any investigation, as the resources are in the deletion phase and could already have been partly deleted by the time the machine is marked to be ignored for deletion.

Why is this needed: This would be useful for troubleshooting nodes that suddenly stop working as expected (RCA purposes).
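As a rough illustration of the phase list from the 2 Aug update, a controller could gate the freeze on the machine phase along these lines. The `Phase` type and the `freezeApplies` function are hypothetical names for this sketch. Only the four phases named in the update are modelled, and treating every other phase as not eligible is an assumption.

```go
package main

import "fmt"

// Phase is a HYPOTHETICAL type for this sketch; the values are the
// phase names mentioned in the issue.
type Phase string

const (
	Pending     Phase = "Pending"
	Unknown     Phase = "Unknown"
	Running     Phase = "Running"
	Terminating Phase = "Terminating"
)

// freezeApplies reports whether a freeze request should be honoured for a
// machine in the given phase. Terminating machines are excluded because
// their resources may already be partly deleted. Phases not named in the
// issue fall through to false, which is an assumption of this sketch.
func freezeApplies(p Phase) bool {
	switch p {
	case Pending, Unknown, Running:
		return true
	default:
		return false
	}
}

func main() {
	for _, p := range []Phase{Pending, Unknown, Running, Terminating} {
		fmt.Printf("%s: freeze applies = %t\n", p, freezeApplies(p))
	}
}
```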