gardener / machine-controller-manager

Declarative way of managing machines for Kubernetes cluster
Apache License 2.0
253 stars 116 forks

Machine is never initialised if `Driver.InitializeMachine` returns `NotFound` error code for VM #933

Open rishabh-11 opened 1 month ago

rishabh-11 commented 1 month ago

How to categorize this issue?

/area robustness /kind bug /priority 1

What happened: According to https://github.com/gardener/machine-controller-manager/blob/ff8261398277c3e5a481f06cfb57c417dfd07754/pkg/util/provider/machinecontroller/machine.go#L609-#L611, if NotFound error code is returned by driver.InitialiseMachine, the initialisation of the VM is skipped. This can lead to problems in the following case:-

2024-08-08 07:00:21 | {"log":"Creating a VM for machine \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\", please wait!","pid":"1","severity":"INFO","source":"machine.go:392"}
2024-08-08 07:00:21 | {"log":"Machine creation request has been recieved for \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\"","pid":"1","severity":"INFO","source":"core.go:82"}
2024-08-08 07:00:22 | {"log":"Waiting for VM with Provider-ID \"aws:///eu-west-1/i-0cf8cb1aa8a61dc8d\", for machine \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\" to be visible to all AWS endpoints","pid":"1","severity":"INFO","source":"core.go:238"}
2024-08-08 07:00:22 | {"log":"VM with Provider-ID \"aws:///eu-west-1/i-0cf8cb1aa8a61dc8d\", for machine \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\" should be visible to all AWS endpoints now","pid":"1","severity":"INFO","source":"core.go:249"}
2024-08-08 07:00:22 | {"log":"VM with Provider-ID: \"aws:///eu-west-1/i-0cf8cb1aa8a61dc8d\" created for Machine: \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\"","pid":"1","severity":"INFO","source":"core.go:250"}
2024-08-08 07:00:22 | {"log":"Created new VM for machine: \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\" with ProviderID: \"aws:///eu-west-1/i-0cf8cb1aa8a61dc8d\" and backing node: \"\"","pid":"1","severity":"INFO","source":"machine.go:405"}
2024-08-08 07:00:22 | {"log":"Initializing VM instance for Machine \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\"","pid":"1","severity":"INFO","source":"machine.go:596"}
2024-08-08 07:00:22 | {"log":"Error occurred while initializing VM instance for machine \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\": machine codes error: code = [NotFound] message = [AWS plugin is returning no VM instances backing this machine object]","pid":"1","severity":"ERR","source":"machine.go:604"}
2024-08-08 07:00:22 | {"log":"No VM instance found for machine \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\". Skipping VM instance initialization.","pid":"1","severity":"WARN","source":"machine.go:610"}
2024-08-08 07:00:22 | {"log":"Machine labels/annotations UPDATE for \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\"","pid":"1","severity":"INFO","source":"machine.go:552"}
2024-08-08 07:00:22 | {"log":"reconcileClusterMachine: Stop for \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\"","pid":"1","severity":"INFO","source":"machine.go:178"}

Here, for machine shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm, the VM was successfully created, but the initialisation failed because the VM was not found a moment later. We know this is an issue on the cloud provider side, but it can happen. In this case, the initialisation is skipped, yet the machine object is still updated with the providerID and the node name label.

In the next reconciliation, GetMachineStatus hits the same transient issue on the cloud provider side and the VM is still not found. But because the node label was set in the previous reconciliation, https://github.com/gardener/machine-controller-manager/blob/ff8261398277c3e5a481f06cfb57c417dfd07754/pkg/util/provider/machinecontroller/machine.go#L390 is never executed, so the VM is never initialised; the machine is nevertheless moved to the Pending state (and eventually to Running once the Node registers).
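The two-reconciliation trap described above can be sketched as follows. This is a heavily simplified stand-in for the real MCM reconciler (the `machine` struct, `initializeVM`, and the function names are illustrative, not the actual types), showing how a skipped initialization plus an eagerly written node label leaves the VM permanently uninitialized:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound stands in for the machine codes NotFound error.
var ErrNotFound = errors.New("NotFound")

type machine struct {
	nodeLabelSet bool // the node label on the Machine object
	initialized  bool
}

// initializeVM stands in for Driver.InitializeMachine; here it always fails
// with NotFound, mimicking the transient cloud-provider cache miss.
func initializeVM(m *machine) error { return ErrNotFound }

// creationFlow mirrors the behaviour at machine.go#L609-L611: a NotFound
// error from initialization is merely logged and skipped.
func creationFlow(m *machine) {
	if err := initializeVM(m); err == nil {
		m.initialized = true
	} else if errors.Is(err, ErrNotFound) {
		// VM "not found": initialization is skipped...
	}
	// ...but the node label is still written to the Machine object.
	m.nodeLabelSet = true
}

// reconcile mirrors the check at machine.go#L390: the creation flow (which
// contains the initialization) only runs while the node label is absent.
func reconcile(m *machine) {
	if !m.nodeLabelSet {
		creationFlow(m)
	}
	// Otherwise the machine moves towards Pending/Running,
	// and the VM stays uninitialized forever.
}

func main() {
	m := &machine{}
	reconcile(m) // 1st reconciliation: NotFound, init skipped, label set
	reconcile(m) // later reconciliations: creation flow never re-entered
	fmt.Printf("labelSet=%v initialized=%v\n", m.nodeLabelSet, m.initialized)
	// → labelSet=true initialized=false
}
```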

2024-08-08 07:00:22 | {"log":"reconcileClusterMachine: Start for \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\" with phase:\"\", description:\"\"","pid":"1","severity":"INFO","source":"machine.go:116"}
2024-08-08 07:00:23 | {"log":"Get request has been recieved for \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\"","pid":"1","severity":"INFO","source":"core.go:411"}
2024-08-08 07:00:23 | {"log":"For machine \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\", obtained VM error status as: machine codes error: code = [NotFound] message = [AWS plugin is returning no VM instances backing this machine object]","pid":"1","severity":"WARN","source":"machine.go:382"}
2024-08-08 07:00:23 | {"log":"Machine/status UPDATE for \"shoot--garden--aws-ha-eu4-cpu-worker-z1-6dc96-9j7pm\" during creation","pid":"1","severity":"INFO","source":"machine.go:578"}

This leads to a problem because settings such as sourceDestCheck can be enabled/disabled during VM initialisation. If the initialisation is not done, the pods running on the node can go into CrashLoopBackOff (CLBF), as seen in canary issue no. 5533.

Another problem is that even if `Driver.InitializeMachine` keeps failing, the kubelet can still come up on the created VM and register the corresponding Node object. The scheduler will then see the node and schedule pods on it, and those pods will go into CLBF because the VM has not been properly initialised.

What you expected to happen: VM initialisation should be retried in case of NotFound errors, and pods should not be scheduled on the node until the initialisation of the corresponding VM has succeeded.
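The expected retry semantics could look like the following sketch. This is not the real MCM retry machinery; the error values and the retry period are illustrative, the point being that NotFound from initialization is treated as retryable rather than skipped:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Illustrative sentinel errors standing in for the machine error codes.
var (
	errNotFound      = errors.New("NotFound")
	errUnimplemented = errors.New("Unimplemented")
)

// handleInitError decides whether the machine should be requeued for
// another initialization attempt.
func handleInitError(err error) (requeueAfter time.Duration, done bool) {
	switch {
	case err == nil, errors.Is(err, errUnimplemented):
		// Initialized, or the provider has nothing to initialize.
		return 0, true
	default:
		// NotFound and every other error: retry after a short period
		// instead of skipping, so the VM is eventually initialized
		// once it becomes visible to the provider API.
		return 5 * time.Second, false
	}
}

func main() {
	d, done := handleInitError(errNotFound)
	fmt.Println(d, done) // → 5s false
}
```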

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

elankath commented 1 month ago

ok, so we need to work around bugs where a cloud provider says a VM was not found even after it was successfully created. 😅

unmarshall commented 1 month ago

Yes, right from the Neo days we have learnt the hard way that none of the infra providers gets the distributed cache implementation right. Even when a resource has been created and confirmed by the infra provider, a subsequent GET call to the provider may not return that instance. We saw the same issue in Azure as well. @rishabh-11 and I discussed this and we have a proposal to improve this holistically. We can discuss this in a dedicated meeting.

rishabh-11 commented 4 weeks ago

After discussing with @ScheererJ, we have decided to move forward with the following solution:-

  1. Add a taint representing vm-not-initialised to the kubelet configuration. This creates the Node object with the taint, and no components will be scheduled on it until the taint is removed.
  2. Adapt MCM to remove the above-mentioned taint once `Driver.InitializeMachine` runs successfully (or returns the Unimplemented error code).
  3. Adapt MCM to add the node label before initialising the VM. This ensures that we do not create multiple VMs for the same machine object if GetMachineStatus returns a NotFound error.
  4. Change the implementation of InitializeMachine in provider-aws to always return the Uninitialized error code only.
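Steps 1 and 2 above form a small handshake that can be sketched as follows. The taint key below is illustrative of the proposed "vm-not-initialised" taint (the real key and effect may differ), and the `node` type is a stand-in for the Kubernetes Node object:

```go
package main

import "fmt"

// Illustrative taint key; the actual key chosen by the proposal may differ.
const uninitializedTaint = "node.machine.sapcloud.io/vm-not-initialised"

// node stands in for a Kubernetes Node. Step 1: the kubelet registers it
// with the taint from the start (e.g. via the --register-with-taints flag),
// so nothing is scheduled on it while the VM is uninitialized.
type node struct{ taints []string }

// removeTaint is what MCM would do in step 2, once Driver.InitializeMachine
// succeeds or returns Unimplemented.
func removeTaint(n *node, key string) {
	kept := n.taints[:0]
	for _, t := range n.taints {
		if t != key {
			kept = append(kept, t)
		}
	}
	n.taints = kept
}

func main() {
	n := &node{taints: []string{uninitializedTaint}}
	// ... Driver.InitializeMachine succeeds ...
	removeTaint(n, uninitializedTaint)
	fmt.Println(len(n.taints)) // → 0: the node is now schedulable
}
```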

After the MCM changes are done, providers will have to upgrade the MCM dependency and be released. After the provider releases, the corresponding GEP will have to be updated with the correct image. Once all GEPs are released, we can make the g/g change to add the taint.

rfranzke commented 4 weeks ago
> 1. Add a taint representing vm-not-initialised to the kubelet configuration. This will create the Node object with the taint and none of the components will get scheduled on it till this taint is removed.

How will this work when machines are not managed via MCM (e.g., in the context of https://github.com/gardener/gardener/issues/2906)?

unmarshall commented 4 weeks ago

> How will this work when machines are not managed via MCM (e.g., in the context of https://github.com/gardener/gardener/issues/2906)?

@rfranzke That is a valid question. Do you have clarity on who will manage virtual machines in an autonomous cluster?

elankath commented 3 weeks ago

We found out that the DescribeInstancesInput is constructed differently in Driver.GetMachineStatus, which uses filters on the machine name, versus Driver.CreateMachine, which directly uses the VM instanceID. As a result, the VM instance is unfortunately found by AWS in one case but not in the other, despite existing. We will now revise the logic in Driver.GetMachineStatus to also fall back to obtaining the VM instance using the simple, direct instanceID in DescribeInstancesInput.
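The proposed fallback can be sketched as below. To keep the example self-contained, the AWS SDK types are replaced by minimal stand-ins (`filter`, `describeInstancesInput`); the real code builds an `ec2.DescribeInstancesInput` with either `Filters` or `InstanceIds` set:

```go
package main

import (
	"errors"
	"fmt"
)

// Minimal stand-ins for the AWS SDK types, just to show the two lookup shapes.
type filter struct{ name, value string }
type describeInstancesInput struct {
	filters     []filter // tag/name-based lookup (GetMachineStatus today)
	instanceIDs []string // direct lookup by instance ID (CreateMachine)
}

var errNotFound = errors.New("NotFound")

// getMachineStatus sketches the proposed fallback: try the name filter
// first, then fall back to the direct instance-ID lookup before reporting
// NotFound. describe stands in for the EC2 DescribeInstances call.
func getMachineStatus(describe func(describeInstancesInput) (string, error),
	machineName, instanceID string) (string, error) {

	byName := describeInstancesInput{filters: []filter{{"tag:Name", machineName}}}
	if vm, err := describe(byName); err == nil {
		return vm, nil
	}
	// Fallback: the VM may exist but not yet be visible to the filter query.
	byID := describeInstancesInput{instanceIDs: []string{instanceID}}
	return describe(byID)
}

func main() {
	// Simulated provider where the filter query misses but the ID query hits,
	// i.e. the eventually-consistent cache behaviour described in this issue.
	describe := func(in describeInstancesInput) (string, error) {
		if len(in.instanceIDs) == 1 {
			return in.instanceIDs[0], nil
		}
		return "", errNotFound
	}
	vm, err := getMachineStatus(describe, "worker-z1", "i-0cf8cb1aa8a61dc8d")
	fmt.Println(vm, err) // → i-0cf8cb1aa8a61dc8d <nil>
}
```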