[Edge case] When NIC deletion fails machine stays `CrashLoopBackoff` until `CreationTimeout` without trying re-create

What happened:

The creation of VM, Disk and NIC is done by mcm-provider-azure in createVMNicDisk. This method is called in CreateMachine method which is called by the MachineController's triggerCreationFlow. If there is any error in the creation of VM, NIC or disk, then all the three are removed and will be recreated in the subsequent reconciliation.

However, we have an edge case when NIC deletion fails. In that case, the MC changes the machine phase to CrashLoopBackOff. In the next reconcile cycle: the triggerCreationFlow invokes Driver.GetMachineStatus to get the VM status. Unfortunately, the azure driver impl of this method populates the node name in the response GetMachineStatusResponse if there is a NIC still existing for the VM even if the VM doesn't exist. This cases the triggerCreationFlow to think that the VM associated with the Machine is already created and populate the node label on the machine object with the value in the driver response and skip the machine creation. The machine is then permanently stuck in CrashLoopBackoff

RECENT UPDATES IN BEHAVIOUR

In the recent Azure SDK upgrade PR. CreateMachine implementation no longer deletes NIC in case the VM creation failed. It will still return an error but will no delete created NIC.

What you expected to happen:

The Driver.GetMachineStatus should not return the node name for a non-existent VM in the GetMachineStatusResponse.
The current driver.GetMachineStatusResponse only has a ProviderID (VM id at cloud provider) and a NodeName This should be enhanced to also contain a DiskNamesand NicNames slice. In other words, the new driver.GetMachineStatusResponse should look like the below
```
type GetMachineStatusResponse struct {
ProviderID  string
NodeName    string
DiskNames   []string
NicNames    []string
}
```
- If there are utterly no resources found (no vm, disks or nics), then Driver.GetMachineStatus() should return a nil response and a status.Error with codes.NotFound.
- If there is even a single resource found, then Driver.GetMachineStatus should return a populated GetMachineStatusResponse struct with a nil error. If the VM is present then ProviderID is populated, else it is empty.
- We will need to check this in other provider implementations too and create relevant issues.
The triggerCreationFlow currently just checks whether there is an error returned by Driver.GetMachineStatus() in order to determine whether to invoke a call to Driver.CreateMachine. It should should additionally check if the check whether GetMachineStatusResponse.ProviderID is empty.
The controller.getVMStatus helper method invoked in the triggerDeletionFlow currently makes a call to Driver.GetMachineStatus and ignores the response and only checks the error code. It must now capture and check the response.
1. If there is a populated ProviderID it should change the operation stage to InitiateDrain.
2. if ProviderID is empty, but some other resource is found in the response, it should change the operation stage to InitiateVMDeletion

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know:

Environment:

Post grooming discussion

Because of dealing with NIC separately , when we face problem in deleting them then issue like this can happen

This has happened because of two reasons:

problem in our driver response design
- our driver response not sending separate resources (nic, disks, vm) , so presence of even one is considered presence of all and we don't retry creation for CrashLoopbackoff machines
Azure API sending unexpected responses
- ideally Azure API should allow deletion of NIC , but in this case its not
- this can be solved by updating azure-go-sdk , because then we don't have deal with NICs separately and only azure deals with them(cascade delete) and this way presence of 1 resource can be safely considered presence of all.

We can leverage resource-graph API to get this cheaply.

resources
| extend tagKeys = bag_keys(tags)
| where tagKeys hasprefix "kubernetes.io-cluster-" and tagKeys hasprefix "kubernetes.io-role-"
| where name startswith "<vm-name>"
| project name, type

In a single query you can get all resources created and associated to a VM.

The above query leverages the existing convention that we have for all resources created:

NIC has a name: "-nic"
OSDisk has a name: "-os-disk"
DataDisk has a name: "-data-disk"

Also currently all resources - NICs and Disks have 2 tags set on them which start with "kubernetes.io-cluster- and "kubernetes.io-role-".

The above issue doesn't occur anymore. It has been solved by the following changes in this https://github.com/gardener/machine-controller-manager-provider-azure/pull/105 : 1) GetMachineStatus() has been updated to be in sync with the contract and return NotFound error only if VM is not found. Means if the nics, disks are there , and VM not there, then also it'll return NotFound. This allows for re-create operation being tried in every reconcile

Further CreateMachine() has been impoved to not delete NIC if VM creation fails, so that in the next reconcile VM creation is retried directly. This allows to avoid errors due to NIC deletion

But there still is a scope of orphan resources being left even after machine deletion. This is due to the updated behavior of GetMachineStatus() and how its used in deletion flow.

Will open another issue to deal with that corner case. Issue opened -> https://github.com/gardener/machine-controller-manager/issues/850

/close

gardener / machine-controller-manager-provider-azure

[Edge case] When NIC deletion fails machine stays `CrashLoopBackoff` until `CreationTimeout` without trying re-create #88

Post grooming discussion