gardener / machine-controller-manager

Declarative way of managing machines for Kubernetes cluster
Apache License 2.0

React faster if VM instance is gone (i.e. don’t wait until full machineHealthTimeout/machineCreationTimeout lapses) #755

Open vlerenc opened 1 year ago

vlerenc commented 1 year ago

What happened:

While investigating issues with pending workload during a rolling update that looked as if drain started on nodes without sufficient new capacity (see also issue #754), we found that VM instances that will never come up are not recognized as such, so the full machineHealthTimeout/machineCreationTimeout must lapse. This slows down auto-scaling/self-healing/rolling updates and may lead to prolonged pending pod times.

What was expected:

When a VM instance backing a machine is in a fatal error state or even terminated, we should probably act faster. While it was observed that hyperscaler API endpoints may temporarily serve stale, broken, or no data at all (these systems are highly distributed and only eventually consistent), we should still act if, e.g.:

This should help to recover much faster in the future. WDYT?

How to reproduce it:

Terminate a running VM instance backing a machine resource, or create a new machine resource (e.g. scale-out) but block the new VM instance from coming up (e.g. terminate it immediately). You will see that the machineHealthTimeout/machineCreationTimeout have to lapse before anything happens, even though the machines are in a terminal state from which they will never recover and the kubelets no longer report in.

```shell
while true; do
  sleep 0.1
  aws ec2 describe-instances \
    --filters Name=tag-key,Values=kubernetes.io/cluster/shoot--project--cluster \
              Name=instance-state-code,Values=0 \
              Name=availability-zone,Values=eu-central-1c \
    --no-paginate --output json \
    | jq -r '.Reservations[].Instances[].InstanceId' \
    | tr '\n' ' ' \
    | xargs -I {} sh -c 'echo "Terminating instances: {}"; aws ec2 terminate-instances --instance-ids {} | cat'
done
```
sskrustev commented 1 year ago

Hello Vedran Lerenc,

This is a kind reminder. Are there any updates here?

Kind regards, Stanimir

elankath commented 1 year ago

Grooming Results

himanshu-kun commented 1 year ago

One possible solution is to rely on the CCM, as it deletes the node object if the VM is gone. Then we wouldn't have to check the VM on the infrastructure, at least in the machineHealthTimeout case. But the node object can also be explicitly deleted by an operator/user; in that case we need to verify that the node object doesn't come up again, because if it does, checking the VM status would be the safest thing to do.

vlerenc commented 1 year ago

> we should check the error code of the Driver.GetMachineStatus call to see if the VM instance is gone.

The problem with that is that infrastructures such as Azure are notoriously inconsistent for a very long time window (much more than the 10 minutes we try to shave off here). They show the machine, don't show it, show it again; it takes a while until all API endpoints return the same result. Then again, if the machine is older than 2h (the time window in which we see this issue, e.g. with Azure) and one API endpoint reports it gone, it's probably gone (unless that API endpoint is broken in some other way). We could safeguard this somewhat more and wait to see this result 3x (ignoring other results where the machine is reported as present). Still, risky. We simply cannot trust the cloud provider API to return the correct result due to eventual consistency.

elankath commented 1 year ago

> The problem with that is that infrastructures such as Azure are notoriously inconsistent for a very long time window (much more than the 10 minutes we try to shave off here). They show the machine, don't show it, show it - it takes a while until all API endpoints return the same result.

Agreed. We don't need to take the decision to move the Machine to the Failed phase immediately. Our health check runs in every reconcile cycle. So if for 3 invocations (or 1 min) we continuously confirm the absence of the machine, the confidence should be sufficient to move the Machine to the Failed phase. We should also log this (and emit an event) when we do so.