rishabh-11 opened this issue 1 month ago (status: Open)
ok, so we need to work around bugs where a cloud provider says a VM was not found even after it was successfully created. 😅
Yes, right from the Neo days we have learnt the hard way that none of the infra providers gets the distributed cache implementation right. Even when the resource has been created and confirmed by the infra provider, the subsequent provider GET call does not return that instance. We saw the same issue on Azure as well. @rishabh-11 and I discussed this, and we have a proposal to improve this holistically. We can discuss it in a dedicated meeting.
After discussing with @ScheererJ, we have decided to move forward with the following solution:

- Add a taint representing `vm-not-initialised` to the kubelet configuration. This will create the Node object with the taint and none of the components will get scheduled on it till this taint is removed.
- Remove the taint only once `Driver.InitialiseMachine` is successfully run (or returns an `Unimplemented` error code).
- Retry the initialisation if `GetMachineStatus` returns a `NotFound` error.
- Change `InitialiseMachine` in provider-aws to always return the `Uninitialised` error code only.

After doing the MCM changes, providers will have to upgrade the MCM dependency and will have to be released. After the provider releases, the corresponding GEP will have to be updated with the correct image. Once all GEPs are released, we can make the g/g change to add the taint.
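For illustration, the gating described above could look roughly like the sketch below. This is not the actual MCM implementation: the `driver` interface, the error values, the taint key `node.machine.sapcloud.io/vm-not-initialised` and the helper names are all assumptions made for the example.

```go
// Illustrative sketch of the proposed flow; not MCM code.
package sketch

import (
	"context"
	"errors"
	"fmt"
)

// Stand-ins for the machine error codes referenced above.
var (
	errNotFound      = errors.New("NotFound")
	errUnimplemented = errors.New("Unimplemented")
)

// driver is a minimal stand-in for the MCM Driver interface.
type driver interface {
	GetMachineStatus(ctx context.Context, machineName string) error
	InitialiseMachine(ctx context.Context, machineName string) error
}

// Assumed taint key that the kubelet would register the Node with
// (NoSchedule effect), to be removed only after successful initialisation.
const vmNotInitialisedTaint = "node.machine.sapcloud.io/vm-not-initialised"

// reconcileInitialisation retries the initialisation even on NotFound and
// removes the taint only after InitialiseMachine succeeds (or is unimplemented).
func reconcileInitialisation(ctx context.Context, d driver, machineName string) error {
	statusErr := d.GetMachineStatus(ctx, machineName)
	if statusErr != nil && !errors.Is(statusErr, errNotFound) {
		return statusErr // unrelated failure, requeue
	}
	// A NotFound here must NOT cause the initialisation to be skipped:
	// the VM may exist but be invisible because of a stale provider read.
	if err := d.InitialiseMachine(ctx, machineName); err != nil && !errors.Is(err, errUnimplemented) {
		// Keep the taint and retry in the next reconciliation.
		return fmt.Errorf("initialisation of %s failed: %w", machineName, err)
	}
	// Only now is it safe to remove the taint so that pods get scheduled.
	return removeNodeTaint(ctx, machineName, vmNotInitialisedTaint)
}

// removeNodeTaint is a placeholder for a client-go patch that drops the taint
// from the Node object.
func removeNodeTaint(ctx context.Context, nodeName, taintKey string) error {
	fmt.Printf("removing taint %s from node %s\n", taintKey, nodeName)
	return nil
}
```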
> - Add a taint representing `vm-not-initialised` to the kubelet configuration. This will create the Node object with the taint and none of the components will get scheduled on it till this taint is removed.
How will this work when machines are not managed via MCM (e.g., in the context of https://github.com/gardener/gardener/issues/2906)?
@rfranzke That is a valid question. Do you have clarity on who will manage virtual machines in an autonomous cluster?
We found out that the `DescribeInstancesInput` is constructed differently in `Driver.GetMachineStatus`, which uses filters on the machine name, versus `Driver.CreateMachine`, which directly uses the VM `instanceID`. This leads to the VM instance unfortunately being found by AWS in one case but not in the other, despite existing. We will now revise the logic in `Driver.GetMachineStatus` to also fall back to obtaining the VM instance via the simple, direct `instanceID` in the `DescribeInstancesInput`.
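Sketched with the AWS SDK for Go (v1), such a fallback could look like the snippet below. The tag filter key and the function names are assumptions for illustration, not the actual provider-aws code.

```go
// Illustrative lookup with a fallback from the tag filter to the instance ID.
package sketch

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/aws/aws-sdk-go/service/ec2/ec2iface"
)

func findInstance(svc ec2iface.EC2API, machineName, instanceID string) (*ec2.Instance, error) {
	// Lookup as Driver.GetMachineStatus does it today: filter on the machine name.
	byName := &ec2.DescribeInstancesInput{
		Filters: []*ec2.Filter{
			{Name: aws.String("tag:Name"), Values: []*string{aws.String(machineName)}},
		},
	}
	if inst := firstInstance(svc, byName); inst != nil {
		return inst, nil
	}

	// Fallback as Driver.CreateMachine does it: direct lookup by instance ID.
	// This can still succeed when the tag-filtered query transiently returns nothing.
	byID := &ec2.DescribeInstancesInput{
		InstanceIds: []*string{aws.String(instanceID)},
	}
	if inst := firstInstance(svc, byID); inst != nil {
		return inst, nil
	}
	return nil, fmt.Errorf("instance %s (%s) not found", machineName, instanceID)
}

func firstInstance(svc ec2iface.EC2API, in *ec2.DescribeInstancesInput) *ec2.Instance {
	out, err := svc.DescribeInstances(in)
	if err != nil || out == nil {
		return nil
	}
	for _, r := range out.Reservations {
		if len(r.Instances) > 0 {
			return r.Instances[0]
		}
	}
	return nil
}
```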
**How to categorize this issue?**

/area robustness
/kind bug
/priority 1
**What happened**:
According to https://github.com/gardener/machine-controller-manager/blob/ff8261398277c3e5a481f06cfb57c417dfd07754/pkg/util/provider/machinecontroller/machine.go#L609-#L611, if a `NotFound` error code is returned by `driver.InitialiseMachine`, the initialisation of the VM is skipped. This can lead to problems in the following case:

Here, for machine `shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm`, the VM was successfully created, but the initialisation failed because the VM was not found at a later instant. We know that this is an issue on the cloud provider side, but it can happen. In this case, the initialisation is skipped, and the machine object is updated with the `providerID` and the `NodeName` label.

In the next reconciliation, the `GetMachineStatus` call also hits the same transient issue on the cloud provider, and the VM has still not been found. But because the node label was set in the previous reconciliation, https://github.com/gardener/machine-controller-manager/blob/ff8261398277c3e5a481f06cfb57c417dfd07754/pkg/util/provider/machinecontroller/machine.go#L390 is never executed and hence the VM is never initialised, yet the machine is moved to the `Pending` state (and eventually to `Running` once the Node is registered).

This leads to a problem because settings like `sourceDestCheck` can be enabled/disabled during the initialisation of the VM. If the initialisation is not done, the pods running on the node can go into CrashLoopBackOff (CLBF), as seen in canary issue no. 5533.
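To make the impact concrete: `sourceDestCheck` can only be changed on an already existing instance, so a call along the lines of the hedged sketch below (AWS SDK for Go v1; the function name is made up) simply never happens when the initialisation is skipped.

```go
// Illustrative only: the kind of attribute change performed during VM
// initialisation, e.g. disabling source/destination checking so the node can
// forward pod traffic.
package sketch

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
	"github.com/aws/aws-sdk-go/service/ec2/ec2iface"
)

func disableSourceDestCheck(svc ec2iface.EC2API, instanceID string) error {
	_, err := svc.ModifyInstanceAttribute(&ec2.ModifyInstanceAttributeInput{
		InstanceId:      aws.String(instanceID),
		SourceDestCheck: &ec2.AttributeBooleanValue{Value: aws.Bool(false)},
	})
	return err
}
```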
Another problem is that if the `driver.InitialiseMachine` method keeps on failing, it is still possible for the kubelet to run on the created VM and register the corresponding Node object. The scheduler will then see the node and schedule pods on it, and those pods will go into CLBF because the VM has not been properly initialised.

**What you expected to happen**:
VM initialisation should be retried in case of `NotFound` errors, and pods should not be scheduled on the node until the initialisation of the corresponding VM is successful.
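Assuming the taint-based approach from the proposal above, the "no pods until initialised" part could be enforced by letting the kubelet register the Node already tainted, as in the hedged sketch below; the taint key is an assumption.

```go
// Illustrative only: a KubeletConfiguration that registers the Node with an
// assumed vm-not-initialised taint, so the scheduler places no pods on it
// until MCM removes the taint after successful initialisation.
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kubeletconfigv1beta1 "k8s.io/kubelet/config/v1beta1"
)

func kubeletConfigWithInitTaint() kubeletconfigv1beta1.KubeletConfiguration {
	return kubeletconfigv1beta1.KubeletConfiguration{
		TypeMeta: metav1.TypeMeta{
			APIVersion: "kubelet.config.k8s.io/v1beta1",
			Kind:       "KubeletConfiguration",
		},
		// Assumed taint key; NoSchedule keeps workloads off the Node until removal.
		RegisterWithTaints: []corev1.Taint{{
			Key:    "node.machine.sapcloud.io/vm-not-initialised",
			Effect: corev1.TaintEffectNoSchedule,
		}},
	}
}
```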
**How to reproduce it (as minimally and precisely as possible)**:

**Anything else we need to know?**: