Update CrashLoopBackOff condition as soon as timeout expires

himanshu-kun commented 1 year ago

How to categorize this issue?

/area robustness /kind bug /priority 2

What happened:

Currently MCM doesn't turn CrashLoopBackoff(CLBF) machines to Failed as soon as creationTimeout expires , but delays have been observed which can range to 2min to any time.

There are 2 parts of the problem: 1) timeout check for CLBF machine is done after making CreateMachine() driver call . CreateMachine() itself could take any amount of time as it provisions VM on the cloudprovider (in Azure big delays like 30min or more have been seen before) 2) even if the 1) doesn't exist to contribute to the delay, the retry/re-push to the queue can also introduce the delay of ShortRetry(3min). This happens because the retry period is not calculated considering the time left before timeout but is just a constant value of ShortRetry(3min)

What you expected to happen: Turn to Failed as soon as timeout expires.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

We could take inspiration from machineDeployment logic which turns Progressing condition to False with reason ProgressDeadlineExceeded as soon as the deadline exceeds.

Environment:

Kubernetes version (use kubectl version):
Cloud provider or hardware configuration:
Others:

elankath commented 1 year ago

Grooming Discussion Results

Our retry handling in machine creation is present in machineCreateErrorHandler in machine_util.go. This method specifies a medium retry (3min) for enqueuing the machine object if the VM instance failed to be created.
The error handler method mentioned above leverages the helper method getCreateFailurePhase which transitions the the machine phase: CrashLoopBackoff ➡ Failed on creation timeout.
We should ideally transition the machine phase to Failed faster if we are very close to the creation timeout and if the next retry added to the current time exceeds the creation timeout.
To ensure that the above works fine, we should establish that Driver.CreateMachine returns within a bounded time. We can do this by making sure there is a deadline on the Context passed to the CreateMachine method. The deadline should be the time pending for us to reach machine creation timeout.

aaronfern commented 1 year ago

On further testing we saw that although this change is fine if MCM runs standalone, there is a scenario that when CA runs alongside this change can cause CAs backoff mechanism to not work.

The case we focussed on was CA backing off from a MachineDeployment if the Machine does not get ready and CA knows that there is another MachineDeployment whose Machine spec is sufficient and adequate for incoming workload. In normal circumstances, CA would backoff, scale down the MachineDeployment it had originally scaled up, and scale up a different MachineDeployment whose Machines would then be used.

With this change however (assuming MCMs --machine-creation-timeout and CAs --max-node-provision-time are the same), MCM will replace the Machine before CA has a chance to look at it and calculate it's backoff parameters. So from CAs point of view it won't see a node not in the Running state for longer than required for it to backoff from the MachineDeployment

Solution:

~* Set the --machine-creation-timeout in MCM to 0 by default so that unless set, MCM will not have a timeout for machine creation. Field can be set by the user in the shoot spec~

~* Document this behaviour and advice that MCMs --machine-creation-timeout should be a value greater then CAs --max-node-provision-time~

Updating the solution to be more specific and descriptive

Update MCM to be able to interpret a negative or zero timeout value for --machine-creation-timeout. This means that if this flag is set to 0 or some negative value, then MCM will not have a timeout for machine creation. This does not change the default value for this flag, and if no value is passed, MCM will still use it's current default
Update gardenlet code to set a negative value to --machine-creation-timeout for MCM be default if no value is set via the shoot spec. If a value is specified in the shoot spec, then the value specified is always used This behaviour needs to be documented and there needs to be a recommendation to set MCMs --machine-creation-timeout value to be greater than CAs --max-node-provision-time value.
Validation needs to be added to the shoot to not allow machineCreationTimeout to be greater than maxNodeProvisionTime
No additional change from CA is needed

elankath commented 1 year ago

Regarding Set the --machine-creation-timeout in MCM to 0 by default - is this OK for standalone consumers of the MCM ? Do we have any stakeholders where folks use the MCM but not our fork of the CA?

unmarshall commented 1 year ago

We should just add a doc that the default value is set to 0. There are 2 different recommendations:

If MCM is used standalone without using CA then consumers should use a value > 0.
If MCM is used along with CA then:
1. It is recommended that you leave the default value of machine-creation-timeout to 0.
2. If there is a real need to set this value then machine-creation-timeout should be > CA's max-node-provision-time

gardener / machine-controller-manager

Update CrashLoopBackOff condition as soon as timeout expires #790

Grooming Discussion Results

Solution: