Closed elankath closed 1 year ago
Because of dealing with NIC separately , when we face problem in deleting them then issue like this can happen
This has happened because of two reasons:
problem in our driver response design
CrashLoopbackoff
machinesAzure API sending unexpected responses
We can leverage resource-graph API to get this cheaply.
resources
| extend tagKeys = bag_keys(tags)
| where tagKeys hasprefix "kubernetes.io-cluster-" and tagKeys hasprefix "kubernetes.io-role-"
| where name startswith "<vm-name>"
| project name, type
In a single query you can get all resources created and associated to a VM.
The above query leverages the existing convention that we have for all resources created:
Also currently all resources - NICs and Disks have 2 tags set on them which start with "kubernetes.io-cluster- and "kubernetes.io-role-".
The above issue doesn't occur anymore. It has been solved by the following changes in this https://github.com/gardener/machine-controller-manager-provider-azure/pull/105 :
1) GetMachineStatus()
has been updated to be in sync with the contract and return NotFound
error only if VM is not found. Means if the nics, disks are there , and VM not there, then also it'll return NotFound
. This allows for re-create operation being tried in every reconcile
Further CreateMachine()
has been impoved to not delete NIC if VM creation fails, so that in the next reconcile VM creation is retried directly. This allows to avoid errors due to NIC deletion
But there still is a scope of orphan resources being left even after machine deletion. This is due to the updated behavior of GetMachineStatus()
and how its used in deletion flow.
Will open another issue to deal with that corner case. Issue opened -> https://github.com/gardener/machine-controller-manager/issues/850
/close
What happened:
The creation of VM, Disk and NIC is done by mcm-provider-azure in createVMNicDisk. This method is called in CreateMachine method which is called by the MachineController's triggerCreationFlow. If there is any error in the creation of VM, NIC or disk, then all the three are removed and will be recreated in the subsequent reconciliation.
However, we have an edge case when NIC deletion fails. In that case, the MC changes the machine phase to
CrashLoopBackOff
. In the next reconcile cycle: thetriggerCreationFlow
invokes Driver.GetMachineStatus to get the VM status. Unfortunately, the azure driver impl of this method populates the node name in the response GetMachineStatusResponse if there is a NIC still existing for the VM even if the VM doesn't exist. This cases thetriggerCreationFlow
to think that the VM associated with the Machine is already created and populate thenode
label on the machine object with the value in the driver response and skip the machine creation. The machine is then permanently stuck inCrashLoopBackoff
RECENT UPDATES IN BEHAVIOUR
CreateMachine
implementation no longer deletes NIC in case the VM creation failed. It will still return an error but will no delete created NIC.What you expected to happen:
The Driver.GetMachineStatus should not return the node name for a non-existent VM in the GetMachineStatusResponse.
The current driver.GetMachineStatusResponse only has a
ProviderID
(VM id at cloud provider) and aNodeName
This should be enhanced to also contain aDiskNames
andNicNames
slice. In other words, the newdriver.GetMachineStatusResponse
should look like the belowDriver.GetMachineStatus()
should return anil
response and astatus.Error
withcodes.NotFound
.Driver.GetMachineStatus
should return a populatedGetMachineStatusResponse
struct with anil
error. If the VM is present thenProviderID
is populated, else it is empty.The
triggerCreationFlow
currently just checks whether there is an error returned byDriver.GetMachineStatus()
in order to determine whether to invoke a call toDriver.CreateMachine
. It should should additionally check if the check whetherGetMachineStatusResponse.ProviderID
is empty.The
controller.getVMStatus
helper method invoked in thetriggerDeletionFlow
currently makes a call toDriver.GetMachineStatus
and ignores the response and only checks the error code. It must now capture and check the response.ProviderID
it should change the operation stage toInitiateDrain
.ProviderID
is empty, but some other resource is found in the response, it should change the operation stage toInitiateVMDeletion
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know:
Environment: