gardener / machine-controller-manager

Declarative way of managing machines for Kubernetes cluster
Apache License 2.0

MCM marking Unknown machines Running on Unfreezing #680

Open himanshu-kun opened 2 years ago

himanshu-kun commented 2 years ago

How to categorize this issue?

/area robustness /kind bug /priority 3

What would you like to be added: Currently the machine controller gets frozen if we can't reach KAPI using the domain provided in the kubeconfig. It gets unfrozen again once KAPI is reachable, but on unfreezing we mark all Unknown machines as Running again. That logic needs to be removed, or at least enhanced.

Why is this needed:

Changing machine state without knowing why or when the machines were marked Unknown is not good practice. The logic exists to prevent a scenario where we start marking machines as Failed because they had been Unknown, when the reason for Unknown was KAPI being unreachable and not the kubelet being down or the node being unhealthy.

NOTE

If this is done, the MCM FAQs and docs need to be updated as well, and the unit tests need to be adapted. If possible, the FAQ should also be improved to explain the difference between the freezing of the machine controller when the API server is unreachable and the freezing of a machineset/machinedeployment on overshooting.

himanshu-kun commented 2 years ago

The machines go to the Unknown state only from the Running state, as also mentioned here: https://github.com/gardener/machine-controller-manager/pull/180#discussion_r228443386

So marking Unknown machines back to Running can make sense, but doing it just to reset the health check doesn't. The health check could be reset while keeping the phase as Unknown, simply by changing the LastUpdateTime to time.Now().

That way, no harm would be done to the Unknown machine.

elankath commented 1 year ago

We discussed this in the grooming session; below are our observations and a proposed solution. The design will need to be refined when we pick this up for implementation.

Justification for Current Behaviour

This code to mark Unknown Machines back to Running was introduced in the machine safety controller reconcile function to address a specific corner case:

To address this, the current machine safety controller also moves phase: Unknown machines back to phase: Running after freezing the MC controller, effectively resetting the health check. When the next health check runs, any machines that are healthy remain Running; any machines that are not healthy are transitioned back to Unknown.

Our default health-check-timeout is 10min.

Issue

The problem with the above hack is that if a machine went unhealthy at time T1, was unhealthy for 9 min (T2 = T1 + 9m), and the API server went down and came back up after 3 minutes (T3 = T2 + 3m), then we again wait a full 10 min (T4 = T3 + 10m) before moving the machine to Failed (in case it still fails the health check). We effectively double the wait before draining and deleting a truly unhealthy machine. This causes a lot of delay.

Solution Proposal

  1. reconcileMachineSafetyAPIserver should only toggle freeze/unfreeze; it should not set Unknown machines to Running. Phase transitions are the job of the health check. Individual actors should have segregated responsibilities.
  2. reconcileHealth, instead of blindly moving all Unknown and unhealthy machines to the Failed phase, also checks whether the MC was frozen and unfrozen since its last run. If so, it grants an unfrozenGracePeriod so that KCM gets the opportunity to update the Node conditions. OLD: `if time.Now() - machineStatusLastUpdateTime > timeoutDuration && unhealthy, then mark phase as Failed`. NEW: `if time.Now() - machineStatusLastUpdateTime - unfrozenGracePeriod > timeoutDuration && unhealthy, then mark phase as Failed`.
  3. The above ensures that we do not needlessly delay moving a truly failed machine to the Failed phase. It also ensures that we do not spuriously move a truly Running machine to the Failed phase.