himanshu-kun opened this issue 1 year ago (status: Open)
Can you provide more details on why CA would think that the node is under-utilised and therefore the respective MachineDeployment needs to be scaled down? In general, having two actors working on the same set of resources can have unintended consequences.

What happens if the pods that were originally scheduled on the unhealthy node are scheduled elsewhere after the node drain? When MCM brings up a replacement node, CA will then try to scale it down, and in this case scale-down by CA is actually correct. Only in the case where pods could not be re-scheduled due to taints or affinity/topology-spread-constraint rules will the new node be utilised.
Yes, it's stated in upstream CA as well that they won't guarantee that the scheduler will schedule pods on the node that CA brought up.
The aim of the autoscaler is to make enough resources available for unschedulable pods to get scheduled. Keeping this in mind, I think we should try improving MCM's handling of unhealthy machines.
One way is to enhance `MachineStatus` and introduce another field which captures relevant information about the machine that the new machine is replacing. This information can then be used to annotate the new node which gets created for this machine. Once the old machine is removed, the annotation can be removed from the new node. The status field will remain, as that is historical information which will never change for this machine and can be used for further diagnostics.
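A minimal sketch of the annotation bookkeeping this describes (the annotation key and function names here are hypothetical, not from the MCM codebase; the real names would be decided during implementation):

```go
package main

import "fmt"

// Hypothetical annotation key recording which unhealthy machine this
// node's machine replaced. Not an actual MCM annotation.
const replacedMachineAnnotation = "machine.sapcloud.io/replaced-machine"

// annotateNewNode returns a copy of the node's annotations with the name of
// the replaced machine added. Once the old machine object is gone, the
// caller would strip this annotation again; the machine's status keeps the
// historical record for diagnostics.
func annotateNewNode(existing map[string]string, oldMachine string) map[string]string {
	out := map[string]string{}
	for k, v := range existing { // ranging over a nil map is a no-op
		out[k] = v
	}
	out[replacedMachineAnnotation] = oldMachine
	return out
}

func main() {
	anns := annotateNewNode(map[string]string{"existing": "value"}, "machine-old-1")
	fmt.Println(anns[replacedMachineAnnotation])
}
```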
@elankath and I went through the code and we think that @unmarshall's solution would work. There's a tweak which we suggest:
What? Instead of noting down the new machine (`machine-new`) info in the old machine, we note the old machine (`machine-old`) info in the new machine as an annotation.

How?
- In the `manageReplicas` method, we already have a slice of `staleMachines` computed before we create new machine objects.
- In `createMachinesWithControllerRef` (called by `manageReplicas`), when we create a new machine object using `GetMachineFromTemplate`, we pop a `staleMachine` from the `staleMachines` slice and add the stale machine name as the `replacement-for` annotation in the new machine object. We also put the `"cluster-autoscaler.kubernetes.io/scale-down-disabled": "true"` annotation on the new machine object.
- While reconciling, we check whether the old machine named in the `replacement-for` annotation is still present. If it is not, we can remove the `cluster-autoscaler.kubernetes.io/scale-down-disabled` annotation.

Today in the code, all stale machine objects are removed using `c.terminateMachines` (badly named) BEFORE replacement machine objects are created using `createMachinesWithControllerRef`. Hence, the solution proposed above will break down in the case where MCM crashes before it can create the new machine objects. To mitigate this, we also propose that we first create the new machine objects before removing the old stale machine objects in `manageReplicas` of the machine set controller. This needs to be discussed in a bit more detail with Madhav before implementation.
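The pop-and-annotate step above could be sketched roughly as follows. The `Machine` struct, helper names, and the `replacement-for` key are illustrative; only `cluster-autoscaler.kubernetes.io/scale-down-disabled` is a real upstream CA annotation:

```go
package main

import "fmt"

const (
	// Real upstream cluster-autoscaler annotation that protects a node
	// from scale-down consideration.
	scaleDownDisabledAnnotation = "cluster-autoscaler.kubernetes.io/scale-down-disabled"
	// Hypothetical key for this proposal, recording which stale machine
	// the new machine replaces.
	replacementForAnnotation = "machine.sapcloud.io/replacement-for"
)

// Machine is a simplified stand-in for the MCM machine object.
type Machine struct {
	Name        string
	Annotations map[string]string
}

// popStaleMachine removes and returns the first stale machine, mirroring the
// "pop a staleMachine from the staleMachines slice" step of the proposal.
func popStaleMachine(stale []Machine) (Machine, []Machine) {
	return stale[0], stale[1:]
}

// newReplacementMachine creates a new machine object annotated with the name
// of the stale machine it replaces, and marked so CA skips it for scale-down.
func newReplacementMachine(name string, stale Machine) Machine {
	return Machine{
		Name: name,
		Annotations: map[string]string{
			replacementForAnnotation:    stale.Name,
			scaleDownDisabledAnnotation: "true",
		},
	}
}

func main() {
	stale := []Machine{{Name: "machine-old-1"}}
	old, remaining := popStaleMachine(stale)
	m := newReplacementMachine("machine-new-1", old)
	fmt.Println(m.Annotations[replacementForAnnotation], len(remaining))
}
```

Once the old machine named in `replacement-for` no longer exists, a reconcile would delete both annotations from the new machine.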
We would go ahead with first creating the new machine objects and then deleting the old stale ones, but we will use the `slowStartBatch` function when creating the machine objects.
Also, instead of storing the info of the old machine as an annotation, we would store it in the `Spec` of the machine object, under a field named `ReplacementRef`, which could store the name and provider ID of the old machine. This will later be used for diagnostic purposes.
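A possible shape for such a field, assuming only the name and provider ID mentioned above (this struct is illustrative, not the actual MCM API):

```go
package main

import "fmt"

// ReplacementRef is a hypothetical spec field identifying the unhealthy
// machine that this machine was created to replace.
type ReplacementRef struct {
	Name       string // name of the replaced machine object
	ProviderID string // cloud provider ID of that machine, kept for diagnostics
}

// MachineSpec is a simplified stand-in for the MCM machine spec.
type MachineSpec struct {
	// ...existing spec fields elided...
	ReplacementRef *ReplacementRef // nil for machines that replace nothing
}

func main() {
	spec := MachineSpec{ReplacementRef: &ReplacementRef{
		Name:       "machine-old-1",
		ProviderID: "provider:///zone/instance-id",
	}}
	fmt.Println(spec.ReplacementRef.Name)
}
```

Unlike an annotation, a spec field survives as structured, typed history on the machine object, which fits the diagnostic goal stated above.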
> instead of noting down the new machine (`machine-new`) info in the old machine, we note the old machine (`machine-old`) info in the new machine as an annotation

This is also what I meant in the above suggestion when I mentioned "This information can then be used to annotate the new node which gets created for this machine." So this sounds good.
How to categorize this issue?
/area high-availability
/area performance
/kind bug
/priority 3
What happened: When using cluster autoscaler with MCM, if MCM replaces an unhealthy machine with a new one, the autoscaler could remove the new machine after `--scale-down-unneeded-time` (default 30 min). Removal is done by scaling down the MachineDeployment, so no new machine is created and the pods that were drained stay pending.

What you expected to happen: The autoscaler should not interfere with the health-recovery mechanism of MCM.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?: This happens when draining takes a long time, so not enough pods get scheduled on the new machine.
Possible solution -> add the `"cluster-autoscaler.kubernetes.io/scale-down-disabled": "true"` annotation and remove it after the old machine is removed.
Possible problem -> currently we don't keep track of which machine is created as a replacement for the unhealthy machine, so we will need to find a way.

cc @unmarshall @rishabh-11 @elankath
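For reference, a minimal sketch of the protection this annotation gives (the helper name is made up; the annotation key is the real upstream CA one, which CA honours when set to `"true"` on a node):

```go
package main

import "fmt"

// Real upstream cluster-autoscaler annotation: a node carrying this
// annotation with value "true" is skipped during scale-down evaluation.
const scaleDownDisabledAnnotation = "cluster-autoscaler.kubernetes.io/scale-down-disabled"

// protectedFromScaleDown reports whether CA would skip this node when
// looking for under-utilised nodes to remove.
func protectedFromScaleDown(nodeAnnotations map[string]string) bool {
	return nodeAnnotations[scaleDownDisabledAnnotation] == "true"
}

func main() {
	anns := map[string]string{scaleDownDisabledAnnotation: "true"}
	fmt.Println(protectedFromScaleDown(anns))
}
```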
Environment:
- Kubernetes version (use `kubectl version`):