Open himanshu-kun opened 2 years ago
The machines go to `Unknown` state only from `Running` state, as also mentioned here: https://github.com/gardener/machine-controller-manager/pull/180#discussion_r228443386

So marking `Unknown` to `Running` makes sense, but doing it just to reset the health check doesn't. The health check could be reset by keeping the state as `Unknown` and just changing the `LastUpdateTime` to `time.Now()`. That way, no harm would be done to the `Unknown` machine.
We discussed this in the grooming; our observations and a proposed solution are noted below. The design will need to be refined when we pick this up for implementation.
This code to mark `Unknown` machines back to `Running` was introduced in the machine safety controller reconcile function to address a specific corner case.

In the regular health check reconciliation of the machine controller, say that a node was detected as not healthy (node ready condition false, or one of the problem node conditions true). Then the machine controller moves the Machine phase to `Unknown`. Note that the Machine phase is moved back to `Running` if the node becomes healthy again. However, if the Machine phase remains `Unknown` for the duration of the configured health-check-timeout, then it is transitioned to phase: `Failed`. (`Failed` machines subsequently go through the node drain, VM-deletion and node-deletion flows, which will not be described here.)
Expression: `if time.Now()-machineStatusLastUpdateTime > timeoutDuration && unhealthy then mark phase as Failed`
The corner case: say the API server was unreachable for a while, so the MC was frozen and the KCM could not update Node conditions. Then:

1. `M1` was just transitioned to the `Unknown` phase.
2. `M1` is seen to be in `Unknown` state, and the Node associated with this machine still appears un-healthy since the KCM is yet to update the node conditions. But the backing VM instance is actually healthy; the KCM simply hasn't had the chance yet to update the latest Node conditions.
3. `M1` remains in `Unknown` state for a time exceeding the configured health-check-timeout.
4. `M1` is transitioned to phase: `Failed`. The drain and deletion flows are subsequently initiated.

To address this, the current machine safety controller also moves phase: `Unknown` machines back to phase: `Running` after un-freezing the MC. So, effectively, the health check is reset. When the next health check runs, any machines that are healthy remain in `Running`; any machines that are not healthy are transitioned back to `Unknown`.
Our default health-check-timeout is 10min.
The problem with the above hack: if a machine went un-healthy at time `T1`, was unhealthy for 9 min (`T2 = T1 + 9`), and the API server went down and came back up after 3 minutes (`T3 = T2 + 3`), then we again wait for 10 min (`T4 = T3 + 10`) before moving the machine to `Failed` (in case it didn't pass the health check). We effectively double the wait for a truly un-healthy machine before draining and deleting it. This causes a lot of delay.
Proposed solution:

- `reconcileMachineSafetyAPIserver` should only toggle freeze or unfreeze; it should not set the `Unknown` machines to `Running`. Phase transition is the job of the health check. Individual actors should have segregated responsibilities.
- `reconcileHealth`, instead of blindly moving all `Unknown` and un-healthy machines to the `Failed` phase, also checks whether the MC was frozen and un-frozen since its last run. If so, it grants an `unfrozenGracePeriod` so that the KCM gets the opportunity to update the Node conditions.
OLD: `if time.Now()-machineStatusLastUpdateTime > timeoutDuration && unhealthy then mark phase as Failed`
NEW: `if time.Now()-machineStatusLastUpdateTime-unfrozenGracePeriod > timeoutDuration && unhealthy then mark phase as Failed`
This ensures that a truly un-healthy machine is still moved to the `Failed` phase, while also ensuring that we do not spuriously move a truly `Running` machine to the `Failed` phase.
How to categorize this issue?
/area robustness /kind bug /priority 3
**What would you like to be added**:
Currently the machine controller gets frozen if we can't reach the KAPI using the domain provided in the kubeconfig. It gets unfrozen again once the KAPI is reachable, but we then mark all the `Unknown` machines as `Running` again. That logic needs to be removed, or maybe enhanced.

**Why is this needed**:
This state change, without knowing why or when the machines were marked `Unknown`, is not good practice. Although it is there to prevent a scenario where we start marking machines as `Failed` because they had been `Unknown`, the reason for `Unknown` was the KAPI being unreachable, not the kubelet being down or the node being unhealthy.

**NOTE**: If this is done, the MCM FAQs and docs need to be updated as well, and unit tests need to be adapted. Also, if possible, an improvement to the FAQ explaining the difference between the API-server-unreachable freezing of the machine controller and the overshooting freezing of machineset and machinedeployment is needed.