Open himanshu-kun opened 1 year ago
`machineCreateErrorHandler` in `machine_util.go`: this method specifies a medium retry (3 min) for re-enqueuing the machine object if the VM instance failed to be created. `getCreateFailurePhase` transitions the machine phase `CrashLoopBackoff` ➡ `Failed` on creation timeout.

We could move the machine to `Failed` faster if we are very close to the creation timeout and the next retry added to the current time would exceed the creation timeout. We should also make sure `Driver.CreateMachine` returns within a bounded time. We can do this by placing a deadline on the `Context` passed to the `CreateMachine` method. The deadline should be the time remaining until we reach the machine creation timeout.

On further testing we saw that although this change is fine if MCM runs standalone, there is a scenario where, when CA runs alongside, this change can cause CA's backoff mechanism to not work.
The case we focussed on was CA backing off from a `MachineDeployment` if the `Machine` does not get ready, while CA knows that there is another `MachineDeployment` whose `Machine` spec is adequate for the incoming workload. In normal circumstances, CA would back off, scale down the `MachineDeployment` it had originally scaled up, and scale up a different `MachineDeployment` whose `Machine`s would then be used.
With this change, however (assuming MCM's `--machine-creation-timeout` and CA's `--max-node-provision-time` are the same), MCM will replace the `Machine` before CA has a chance to look at it and calculate its backoff parameters. So from CA's point of view it won't see a node stay outside the `Running` state long enough to trigger its backoff from the `MachineDeployment`.
- ~~Set the `--machine-creation-timeout` in MCM to 0 by default so that unless set, MCM will not have a timeout for machine creation. The field can be set by the user in the shoot spec.~~
- ~~Document this behaviour and advise that MCM's `--machine-creation-timeout` should be a value greater than CA's `--max-node-provision-time`.~~
Updating the solution to be more specific and descriptive:

- Update MCM to interpret a negative or zero value for `--machine-creation-timeout`. If this flag is set to 0 or a negative value, MCM will not have a timeout for machine creation. This does not change the default value of the flag; if no value is passed, MCM still uses its current default.
- Update gardenlet code to set a negative value for `--machine-creation-timeout` in MCM by default if no value is set via the shoot spec. If a value is specified in the shoot spec, that value is always used.
- Document this behaviour and recommend that MCM's `--machine-creation-timeout` be set to a value greater than CA's `--max-node-provision-time`.
- Add validation to the shoot to not allow `machineCreationTimeout` to be greater than `maxNodeProvisionTime`.
- No additional change from CA is needed.
Regarding *Set the `--machine-creation-timeout` in MCM to 0 by default*: is this OK for standalone consumers of the MCM? Do we have any stakeholders who use the MCM but not our fork of the CA?

We should just add a doc noting that the default value is set to 0. There are 2 different recommendations:

- Set `machine-creation-timeout` to 0.
- `machine-creation-timeout` should be > CA's `max-node-provision-time`.
How to categorize this issue?
/area robustness /kind bug /priority 2
What happened:
Currently MCM doesn't turn `CrashLoopBackoff` (CLBF) machines to `Failed` as soon as `creationTimeout` expires; delays have been observed ranging from 2 min to an arbitrarily long time. There are 2 parts to the problem:

1) The timeout check for a CLBF machine is done after making the `CreateMachine()` driver call, and `CreateMachine()` itself could take any amount of time as it provisions the VM on the cloud provider (on Azure, big delays of 30 min or more have been seen before).
2) Even if 1) didn't contribute to the delay, the retry/re-push to the queue can also introduce a delay of `ShortRetry` (3 min). This happens because the retry period is not calculated considering the time left before the timeout, but is just the constant value `ShortRetry` (3 min).

What you expected to happen:
Turn to `Failed` as soon as the timeout expires.

How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
We could take inspiration from the machineDeployment logic, which turns the `Progressing` condition to `False` with reason `ProgressDeadlineExceeded` as soon as the deadline is exceeded.

Environment:
- Kubernetes version (use `kubectl version`):