gardener / machine-controller-manager

Declarative way of managing machines for Kubernetes cluster
Apache License 2.0

Autoscaler could scale down new machine created as replacement #758

Open himanshu-kun opened 1 year ago

himanshu-kun commented 1 year ago

How to categorize this issue?

/area high-availability /area performance /kind bug /priority 3

What happened: When using cluster autoscaler (CA) with MCM, if MCM replaces an unhealthy machine with a new one, the autoscaler could remove the new machine after --scale-down-unneeded-time (default 30 min). Removal is done by scaling down the MachineDeployment, so no new machine is created and the pods that were drained stay pending.

What you expected to happen: The autoscaler should not interfere with MCM's health recovery mechanism.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?: This happens when draining takes a long time, so not enough pods get scheduled on the new machine.

Possible solution -> add the "cluster-autoscaler.kubernetes.io/scale-down-disabled": "true" annotation to the new node and remove it after the old machine is removed (a sketch is given below). Possible problem -> currently we don't keep track of which machine was created as a replacement for the unhealthy machine, so we will need to find a way to do that.
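
For illustration, a minimal sketch of how such an annotation could be applied to the replacement node via client-go. The annotation key/value come from upstream CA; the helper name and wiring are assumptions, not existing MCM code.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

const scaleDownDisabledAnnotation = "cluster-autoscaler.kubernetes.io/scale-down-disabled"

// protectNodeFromScaleDown is a hypothetical helper that annotates a node so
// that cluster-autoscaler does not consider it for scale-down.
func protectNodeFromScaleDown(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	patch, err := json.Marshal(map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]string{
				scaleDownDisabledAnnotation: "true",
			},
		},
	})
	if err != nil {
		return err
	}
	// Merge-patch only the annotation so other node metadata is left untouched.
	_, err = client.CoreV1().Nodes().Patch(ctx, nodeName, types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		return fmt.Errorf("failed to annotate node %s: %w", nodeName, err)
	}
	return nil
}
```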

cc @unmarshall @rishabh-11 @elankath

Environment:

unmarshall commented 1 year ago

Can you provide more details on why CA would consider the node underutilised and therefore scale down the respective MachineDeployment? In general, having two actors working on the same set of resources can have unintended consequences. What happens if the pods that were originally scheduled on the unhealthy node are scheduled/deployed elsewhere after the node drain? When MCM brings up a replacement node, CA will then try to scale it down, and in that case the scale-down by CA is actually correct. Only when the pods could not be rescheduled elsewhere due to taints or affinity/TSC rules will the new node be utilised.

himanshu-kun commented 1 year ago

Yes, upstream CA also documents that it does not guarantee that the scheduler will schedule pods on the node that CA brought up.

The aim of the autoscaler is to make enough resources available for unschedulable pods to get scheduled. Keeping this in mind, I think we should try improving MCM's handling of unhealthy machines.

unmarshall commented 1 year ago

One way is to enhance MachineStatus and introduce another field which captures relevant information about the machine that the new machine is replacing. This information can then be used to annotate the new node which gets created for this machine. Once the old machine is removed, the annotation can be removed from the new node. The status field will remain, as that is historical information which will never change for this machine and can be used for further diagnostics.
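
For completeness, a hedged sketch of the second half of this flow: once the old machine is gone, the annotation is dropped from the replacement node again. The helper name is hypothetical; only the client-go calls and the CA annotation key are real.

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// unprotectNodeFromScaleDown is a hypothetical counterpart to the earlier
// sketch: after the old machine has been removed, the scale-down-disabled
// annotation is cleared so CA can manage the replacement node normally.
func unprotectNodeFromScaleDown(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	// With a JSON merge patch (RFC 7386), setting a key to null deletes it.
	patch := []byte(`{"metadata":{"annotations":{"cluster-autoscaler.kubernetes.io/scale-down-disabled":null}}}`)
	_, err := client.CoreV1().Nodes().Patch(ctx, nodeName, types.MergePatchType, patch, metav1.PatchOptions{})
	return err
}
```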

himanshu-kun commented 1 year ago

@elankath and I went through the code and we think that @unmarshall's solution would work. There's a tweak which we suggest:

How?

Today in the code, all stale machine objects are removed using c.terminateMachines (badly named) BEFORE the replacement machine objects are created using createMachinesWithControllerRef. Hence, the solution proposed above breaks down if MCM crashes before it can create the new machine objects.

To mitigate this, we also propose that we first create the new machine objects before removing the old stale machine objects in manageReplicas of the MachineSet controller (see the sketch below). This needs to be discussed in a bit more detail with Madhav before implementation.
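
A rough sketch of the proposed reordering. The real manageReplicas has a different signature; the two callbacks stand in for createMachinesWithControllerRef and terminateMachines mentioned above, and everything else is assumed.

```go
package machineset

import (
	"context"
	"fmt"
)

// reorderedManageReplicas illustrates the proposed ordering only: create the
// replacement machines first, then terminate the stale ones.
func reorderedManageReplicas(
	ctx context.Context,
	staleMachines []string,
	createReplacements func(ctx context.Context, count int) error,
	terminateStale func(ctx context.Context, names []string) error,
) error {
	// 1. Create the replacement machine objects first. If MCM crashes between
	//    the two steps, the cluster is left with surplus machines that the
	//    controller can reconcile away, instead of missing capacity and
	//    pending pods.
	if err := createReplacements(ctx, len(staleMachines)); err != nil {
		return fmt.Errorf("creating replacement machines: %w", err)
	}

	// 2. Only after successful creation remove the old, stale machine objects.
	return terminateStale(ctx, staleMachines)
}
```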

Update 17 Feb 2023 (Post-grooming)

We will go ahead with first creating the new machine objects and then deleting the old stale objects, but we will use the slowStartBatch function when creating the machine objects. Also, instead of storing the info about the old machine as an annotation, we would store it in the Spec of the machine object, under a field named ReplacementRef which could store the name and provider ID of the old machine. This will later be used for diagnostic purposes. A rough sketch of the field is shown below.
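
Illustrative shape of the proposed ReplacementRef field; the exact field names and its placement in the MCM API types would be settled during implementation.

```go
package v1alpha1

// ReplacementRef identifies the old (unhealthy) machine that a new machine
// was created to replace. It would hang off MachineSpec as an optional
// pointer, e.g. ReplacementRef *ReplacementRef `json:"replacementRef,omitempty"`,
// and would never be mutated afterwards, serving purely as historical,
// diagnostic information.
type ReplacementRef struct {
	// Name of the old machine.
	Name string `json:"name,omitempty"`
	// ProviderID of the old machine, retained for later diagnostics.
	ProviderID string `json:"providerID,omitempty"`
}
```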

unmarshall commented 1 year ago

instead of noting down the new machine (machine-new) info in the old machine, we note the old machine (machine-old) info in the new machine as an annotation

This is also what I meant in the above suggestion when I mentioned "This information can then be used to annotate the new node which gets created for this machine." So this sounds good.