Autoscaler interfering in meltdown scenario solution

himanshu-kun commented 2 years ago

How to categorize this issue?

/area performance /kind bug /priority 2

What happened: Autoscaler's fixNodeGroupSize logic interferes with the meltdown logic, in which we remove only maxReplacement machines per machinedeployment; as a result, the other Unknown machines are removed as well.

What you expected to happen: Even when the autoscaler takes a DecreaseTargetSize decision, it should not be able to remove Unknown machines, because the node object is actually present for them.

How to reproduce it (as minimally and precisely as possible):

Create a machinedeployment with 2 replicas (it's assumed the autoscaler is enabled for the cluster).
Block all traffic to/from the zone the machinedeployment is for.
With the default maxReplacement, 1 machine will stay in the Pending state.
After around 20 min, the Unknown machine is deleted when the autoscaler fixes the node group size by reducing the machinedeployment replicas to 1.

Anything else we need to know?: This happens because of the way machineSet prioritizes machines for deletion based on their status: https://github.com/gardener/machine-controller-manager/blob/d7e3c5dffeb33abe2c30b435075fb050301da4fa/pkg/controller/controller_utils.go#L769-L776

*We need to look into any other implications of prioritizing Pending machines over Unknown machines as a solution.
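To make the ordering problem concrete, here is a minimal sketch of a deletion ordering that behaves the way this issue describes. It is not the code behind the link above: the `Machine` type, the `currentPhaseRank` values, and the "lower priority value is deleted first" convention are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Machine is a hypothetical, simplified stand-in for the MCM machine object.
type Machine struct {
	Name     string
	Priority int    // value of the priority annotation; lower is deleted first (assumed)
	Phase    string // "Unknown", "Pending" or "Running"
}

// currentPhaseRank is an assumed ranking that mirrors the behaviour described
// in this issue: Unknown machines are picked for deletion before Pending ones.
var currentPhaseRank = map[string]int{"Unknown": 0, "Pending": 1, "Running": 2}

// deletionOrder returns a sorted copy of machines; the first element is the
// first candidate for deletion on scale-down. The priority annotation is
// compared first, the phase rank only breaks ties.
func deletionOrder(machines []Machine, phaseRank map[string]int) []Machine {
	out := append([]Machine(nil), machines...)
	sort.SliceStable(out, func(i, j int) bool {
		if out[i].Priority != out[j].Priority {
			return out[i].Priority < out[j].Priority
		}
		return phaseRank[out[i].Phase] < phaseRank[out[j].Phase]
	})
	return out
}

func main() {
	machines := []Machine{
		{Name: "machine-a", Priority: 3, Phase: "Pending"},
		{Name: "machine-b", Priority: 3, Phase: "Unknown"},
	}
	// With equal priority, the Unknown machine is chosen first.
	fmt.Println(deletionOrder(machines, currentPhaseRank)[0].Name) // machine-b
}
```

When the replica count is reduced from 2 to 1 without a priority annotation pointing at a specific machine, an ordering like this picks the Unknown machine, which is what the reproduction steps show.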

Environment:

Kubernetes version (use kubectl version):
Cloud provider or hardware configuration:
Others: CA version 1.23.1

himanshu-kun commented 2 years ago

cc @unmarshall

himanshu-kun commented 1 year ago

fixNodeGroupSize only understands Registered and Non-Registered nodes; it doesn't do anything even if a node joins and stays NotReady for a long time. Here the intention of fixNodeGroupSize was to remove the Pending machine, but because of our preference in the machineSet controller, the Unknown machine is removed instead. Currently, fixNodeGroupSize acts only when the RemoveLongUnregistered logic is unable to remove the long-unregistered nodes because the node group is already at its minimum size. Also, if we are not at the minimum node group size, the RemoveLongUnregistered logic wouldn't delete the Unknown machine, because it uses the priority annotation to pinpoint the machine it wants to delete, and as per the machineSet preference the priority annotation is preferred over the machine phase.

himanshu-kun commented 1 year ago

Prioritizing Pending machine removal over Unknown would make more sense because:

Unknown means the node object is there (in most cases) and pods are mostly still running on it.
- There is a case where the node object is not there and the machine is marked Unknown, but once https://github.com/gardener/machine-controller-manager/issues/755 is addressed, this case will be gone.
A Pending machine means the node object is not hosting any pods as of now, or the node object is not there at all. Also, with a recent development in the Gardener context, node objects are created with the node.gardener.cloud/critical-components-not-ready taint, which also keeps the machine object in the Pending state and stops workload from getting scheduled on the node (https://github.com/gardener/machine-controller-manager/pull/778).
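The proposed preference can be sketched as a comparator. This is only an illustration under assumptions: the `Phase` constants, the `Machine` type, and `deleteBefore` are hypothetical names, and the priority annotation is assumed to keep taking precedence over the phase, with a lower value deleted first.

```go
package main

import "fmt"

// Phase is a hypothetical, simplified machine phase. The constant order
// encodes the proposed preference: Pending is deleted before Unknown.
type Phase int

const (
	Pending Phase = iota
	Unknown
	Running
)

// Machine is a hypothetical stand-in for the MCM machine object.
type Machine struct {
	Name     string
	Priority int // priority annotation value; lower is deleted first (assumed)
	Phase    Phase
}

// deleteBefore reports whether a should be deleted before b.
// The priority annotation still wins; the phase only breaks ties.
func deleteBefore(a, b Machine) bool {
	if a.Priority != b.Priority {
		return a.Priority < b.Priority
	}
	return a.Phase < b.Phase
}

func main() {
	pending := Machine{"machine-a", 3, Pending}
	unknown := Machine{"machine-b", 3, Unknown}
	fmt.Println(deleteBefore(pending, unknown)) // true

	// A lower priority annotation on the Unknown machine overrides the phase.
	unknown.Priority = 1
	fmt.Println(deleteBefore(pending, unknown)) // false
}
```

With such an ordering, a plain replica reduction would take the Pending machine, which hosts no workload, while a machine explicitly marked through the priority annotation would still be removed first.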

himanshu-kun commented 1 year ago

Also, if we are not at the minimum node group size, the RemoveLongUnregistered logic wouldn't delete the Unknown machine, because it uses the priority annotation to pinpoint the machine it wants to delete, and as per the machineSet preference the priority annotation is preferred over the machine phase.

But the RemoveLongUnregistered/RemoveOldUnregistered logic will remove the Pending machine once the autoscaler's maxNodeProvisionTimeout runs out, considering it long unregistered. This kicks off a loop in which the machine deployment size is reduced and the meltdown logic again turns maxReplacement machines Pending. The loop finally stops when the node group's minimum size is reached, because at that point the RemoveLongUnregistered logic stops.

The ideal solution is to make the autoscaler aware that the meltdown logic is in play because of an outage in the zone, so that it doesn't interfere.
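The loop described above can be illustrated with a small simulation. It is a rough model under assumptions, not autoscaler or MCM code: `simulateShrink` is a hypothetical helper, each iteration stands for one maxNodeProvisionTimeout period, and all machines in the zone are assumed to stay unreachable for the whole run.

```go
package main

import "fmt"

// simulateShrink models the loop: in each round the meltdown logic turns up
// to maxReplacement machines Pending, and after the provision timeout the
// autoscaler removes them as long-unregistered, but never below minSize.
// It returns the machine deployment size after each round.
func simulateShrink(replicas, minSize, maxReplacement int) []int {
	sizes := []int{replicas}
	for replicas > minSize {
		// Machines turned Pending by the meltdown logic in this round.
		pending := maxReplacement
		if pending > replicas {
			pending = replicas
		}
		// The autoscaler cannot go below the node group's minimum size.
		removable := replicas - minSize
		if pending < removable {
			removable = pending
		}
		if removable <= 0 {
			break // nothing can be removed, the loop stalls here
		}
		replicas -= removable
		sizes = append(sizes, replicas)
	}
	return sizes
}

func main() {
	fmt.Println(simulateShrink(5, 2, 1)) // [5 4 3 2]
}
```

In this model the deployment loses maxReplacement machines per timeout period until it sits at the minimum size, even though the underlying cause is a zone outage rather than excess capacity.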

gardener / machine-controller-manager

Autoscaler interfering in meltdown scenario solution #741