gardener / machine-controller-manager

Declarative way of managing machines for Kubernetes cluster
Apache License 2.0

Improve deletion flow to eliminate the possibility of leaving orphan resources #850

Open himanshu-kun opened 1 year ago

himanshu-kun commented 1 year ago

How to categorize this issue?

/area performance /area usability /area productivity /kind bug /priority 1

What happened:

The MCM deletion flow currently works as follows:

[image: current deletion flow]

But according to the contract of GetMachineStatus(), NOTFOUND should be returned only if the VM is not found; the contract doesn't mention NICs or disks. As a result, in some cases the deletion flow deletes the machine object but does NOT clean up orphan NICs and disks.

So the **proposed flow** is to call DeleteMachine() even after NOTFOUND is returned. This will ensure removal of orphan NICs and disks.

[image: proposed deletion flow]
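A minimal Go sketch of the proposed behaviour, for illustration only. The `Driver` interface, `ErrNotFound`, `deleteMachineFlow` and `fakeDriver` are hypothetical stand-ins and not the actual MCM types; drain, node deletion and finalizer handling are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is a hypothetical stand-in for the driver's NOTFOUND code.
var ErrNotFound = errors.New("NOTFOUND")

// Driver is a hypothetical, trimmed-down view of a provider driver.
type Driver interface {
	GetMachineStatus(name string) error
	DeleteMachine(name string) error
}

// deleteMachineFlow sketches the proposed flow: DeleteMachine is attempted
// even when GetMachineStatus reports NOTFOUND, so that leftover NICs and
// disks get a chance to be cleaned up by the provider.
func deleteMachineFlow(d Driver, name string) error {
	err := d.GetMachineStatus(name)
	if err != nil && !errors.Is(err, ErrNotFound) {
		// Any error other than NOTFOUND aborts this attempt.
		return fmt.Errorf("get status of %q: %w", name, err)
	}
	// Proposed change: do not short-circuit on NOTFOUND.
	if err := d.DeleteMachine(name); err != nil {
		return fmt.Errorf("delete %q: %w", name, err)
	}
	return nil
}

// fakeDriver models the corner case: the VM is gone but resources remain.
type fakeDriver struct {
	vmExists bool
	orphans  []string
}

func (f *fakeDriver) GetMachineStatus(string) error {
	if !f.vmExists {
		return ErrNotFound
	}
	return nil
}

// DeleteMachine follows the contract assumed here: it removes whatever is
// left and returns no error when the VM is already absent.
func (f *fakeDriver) DeleteMachine(string) error {
	f.vmExists = false
	f.orphans = nil
	return nil
}
```

With a `fakeDriver` whose VM is already gone but which still holds `nic-0` and `disk-0`, calling `deleteMachineFlow` returns no error and leaves `orphans` empty, which is the outcome the proposed flow is after.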

Note: We can't rely on the orphan collection logic, because that logic runs only inside MCM. So in cases where MCM is removed after the last machine object in a shoot cluster is deleted (as happens in Gardener), and that last machine object hit the corner case described above, its disks and NICs would remain and further block infra deletion (subnet / resource group deletion, for example).

What you expected to happen:

The delete flow should not leave any orphan resources.

The following changes are required:

1) The mcm-providers should be updated so that the DeleteMachine() driver implementation also follows the contract. For example, gcp returns a NOTFOUND error if the VM is not there, but according to the contract it shouldn't return any error; it should be a no-op.

2) The MCM delete flow should be updated to match the proposed flow above.

NOTE: An MCM provider should vendor the MCM version containing the proposed change only after its DeleteMachine() follows the contract; otherwise the delete flow could get stuck at the DeleteMachine() step.
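To illustrate what "follows the contract" could look like on the provider side, here is a hedged Go sketch of an idempotent DeleteMachine(). The `cloud` type and the `-nic` / `-disk` naming scheme are assumptions made for the example, not any real provider API.

```go
package main

import "errors"

// cloud is a hypothetical in-memory stand-in for a provider's infra API.
type cloud struct {
	vms   map[string]bool
	nics  map[string]bool
	disks map[string]bool
}

// DeleteMachine removes the VM and its attached resources. A missing VM is
// treated as a no-op rather than a NOTFOUND error, so the call still cleans
// up any NIC or disk left behind and can be retried safely.
func (c *cloud) DeleteMachine(name string) error {
	if name == "" {
		return errors.New("machine name must not be empty")
	}
	// delete on a Go map does nothing when the key is absent.
	delete(c.vms, name)
	delete(c.nics, name+"-nic")   // assumed naming scheme
	delete(c.disks, name+"-disk") // assumed naming scheme
	return nil
}
```

Calling this twice for the same machine, or for a machine whose VM was already removed, returns nil both times; that is what keeps the delete flow from getting stuck once MCM calls DeleteMachine() after a NOTFOUND status.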

How to reproduce it (as minimally and precisely as possible):

1) Create a single machine object.

2) Delete the VM such that its disks and NICs still remain. This can be achieved by setting the cascade delete option to false.

3) Put a deletion timestamp on the machine object.

4) As soon as the machine object is deleted, scale down MCM. This makes sure orphan collection doesn't run, and it is how higher-level Gardener controllers scale down MCM.

5) Query the infra for disks and NICs.

Anything else we need to know?:

There are many canary and live tickets where such orphan disks and NICs are seen. The cause may not be the same as the one described above, but since this is one such code path, we need to fix it.

canary # 3637 live # 730 live # 2263 live # 2273

Environment: mcm <= 0.49.3

gardener-robot commented 1 year ago

@himanshu-kun Label area/productivity does not exist.