Open himanshu-kun opened 2 years ago
The machines go to `Unknown` state only from `Running` state, as also mentioned here: https://github.com/gardener/machine-controller-manager/pull/180#discussion_r228443386

So marking `Unknown` to `Running` makes sense, but doing it just to reset the health check doesn't. The health check could be reset by keeping the state as `Unknown` and just changing the `LastUpdateTime` to `time.Now()`. That way, no harm would be done to the `Unknown` machine.
We discussed this in the grooming; our observations and a proposed solution are noted below. The design will need to be refined when we pick this up for implementation.
This code to mark `Unknown` machines back to `Running` was introduced in the machine safety controller reconcile function to address a specific corner case.

In the regular health check reconciliation of the machine controller, say that a node was detected as not healthy (node ready condition false, or one of the problem node conditions true). Then the machine controller moves the Machine phase to `Unknown`. Note that the Machine phase is moved back to `Running` if the node becomes healthy again. However, if the Machine phase remains `Unknown` for the duration of the configured health-check-timeout, then it is transitioned to phase: `Failed`. (`Failed` machines subsequently go through the node drain, VM-deletion and node-deletion flows, which will not be described here.)
Expression: `if time.Now()-machineStatusLastUpdateTime > timeoutDuration && unhealthy then mark phase as Failed`
The corner case: say the API server was unreachable for a while, so the MC was frozen and the KCM could not update Node conditions. Then:

1. `M1` was just transitioned to the `Unknown` phase.
2. `M1` is seen to be in `Unknown` state, and the Node associated with this machine still appears un-healthy since the KCM is yet to update the node conditions. But the backing VM instance is actually healthy; the KCM simply hasn't had the chance yet to update the latest Node conditions.
3. `M1` remains in `Unknown` state for a time exceeding the configured health-check-timeout.
4. `M1` is transitioned to phase: `Failed`. The drain and deletion flows are subsequently initiated.

To address this, the current machine safety controller also moves phase: `Unknown` machines back to phase: `Running` after un-freezing the MC. So, effectively, the health check is reset. When the next health check runs, any machines that are healthy remain in `Running`; any machines that are not healthy are transitioned back to `Unknown`.
Our default health-check-timeout is 10min.
The problem with the above hack: if a machine went un-healthy at time `T1`, was unhealthy for 9 min (`T2 = T1 + 9`), and the API server went down and came back up after 3 minutes (`T3 = T2 + 3`), then we again wait for 10 min (`T4 = T3 + 10`) before moving the machine to `Failed` (in case it didn't pass the health check). We effectively double the wait for a truly un-healthy machine before draining and deleting it. This causes a lot of delay.
Proposed solution:

- `reconcileMachineSafetyAPIserver` should only toggle freeze or unfreeze; it should not set the `Unknown` machines to `Running`. Phase transition is the job of the health check. Individual actors should have segregated responsibilities.
- `reconcileHealth`, instead of blindly moving all `Unknown` and un-healthy machines to the `Failed` phase, also checks whether the MC was frozen and un-frozen since its last run. If so, it grants an `unfrozenGracePeriod` so that the KCM gets the opportunity to update the Node conditions.
OLD: `if time.Now()-machineStatusLastUpdateTime > timeoutDuration && unhealthy then mark phase as Failed`
NEW: `if time.Now()-machineStatusLastUpdateTime-unfrozenGracePeriod > timeoutDuration && unhealthy then mark phase as Failed`
This ensures that a truly un-healthy machine is still moved to the `Failed` phase, while also ensuring that we do not spuriously move a truly `Running` machine to the `Failed` phase.
How to categorize this issue?
/area robustness /kind bug /priority 3
**What would you like to be added**:
Currently the machine controller gets frozen if we can't reach the KAPI using the domain provided in the kubeconfig. It gets unfrozen again once the KAPI is reachable, but we then mark all the `Unknown` machines as `Running` again. That logic needs to be removed, or maybe enhanced.

**Why is this needed**:
This state change, without knowing why or when the machines were marked `Unknown`, is not good practice. Although it is there to prevent a scenario where we start marking machines as `Failed` because they had been `Unknown`, the reason for `Unknown` was the KAPI being unreachable, not the kubelet being down or the node being unhealthy.

**NOTE**: If this is done, the MCM FAQs and docs need to be updated as well, and unit tests need to be adapted. Also, if possible, an improvement to the FAQ explaining the difference between the API-server-unreachable freezing of the machine controller and the overshooting freezing of machineset and machinedeployment is needed.