gardener / machine-controller-manager

Declarative way of managing machines for Kubernetes cluster
Apache License 2.0
257 stars 117 forks source link

Update CrashLoopBackOff condition as soon as timeout expires #790

Open himanshu-kun opened 1 year ago

himanshu-kun commented 1 year ago

How to categorize this issue?

/area robustness /kind bug /priority 2

What happened:

Currently MCM doesn't turn CrashLoopBackoff(CLBF) machines to Failed as soon as creationTimeout expires , but delays have been observed which can range to 2min to any time.

There are 2 parts of the problem: 1) timeout check for CLBF machine is done after making CreateMachine() driver call . CreateMachine() itself could take any amount of time as it provisions VM on the cloudprovider (in Azure big delays like 30min or more have been seen before) 2) even if the 1) doesn't exist to contribute to the delay, the retry/re-push to the queue can also introduce the delay of ShortRetry(3min). This happens because the retry period is not calculated considering the time left before timeout but is just a constant value of ShortRetry(3min)

What you expected to happen: Turn to Failed as soon as timeout expires.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

We could take inspiration from machineDeployment logic which turns Progressing condition to False with reason ProgressDeadlineExceeded as soon as the deadline exceeds.

Environment:

elankath commented 1 year ago

Grooming Discussion Results

aaronfern commented 1 year ago

On further testing we saw that although this change is fine if MCM runs standalone, there is a scenario that when CA runs alongside this change can cause CAs backoff mechanism to not work.

The case we focussed on was CA backing off from a MachineDeployment if the Machine does not get ready and CA knows that there is another MachineDeployment whose Machine spec is sufficient and adequate for incoming workload. In normal circumstances, CA would backoff, scale down the MachineDeployment it had originally scaled up, and scale up a different MachineDeployment whose Machines would then be used.

With this change however (assuming MCMs --machine-creation-timeout and CAs --max-node-provision-time are the same), MCM will replace the Machine before CA has a chance to look at it and calculate it's backoff parameters. So from CAs point of view it won't see a node not in the Running state for longer than required for it to backoff from the MachineDeployment


Solution:

~* Set the --machine-creation-timeout in MCM to 0 by default so that unless set, MCM will not have a timeout for machine creation. Field can be set by the user in the shoot spec~

~* Document this behaviour and advice that MCMs --machine-creation-timeout should be a value greater then CAs --max-node-provision-time~

Updating the solution to be more specific and descriptive

elankath commented 1 year ago

Regarding Set the --machine-creation-timeout in MCM to 0 by default - is this OK for standalone consumers of the MCM ? Do we have any stakeholders where folks use the MCM but not our fork of the CA?

unmarshall commented 1 year ago

We should just add a doc that the default value is set to 0. There are 2 different recommendations:

  1. If MCM is used standalone without using CA then consumers should use a value > 0.
  2. If MCM is used along with CA then:
    1. It is recommended that you leave the default value of machine-creation-timeout to 0.
    2. If there is a real need to set this value then machine-creation-timeout should be > CA's max-node-provision-time