gardener / machine-controller-manager

Declarative way of managing machines for Kubernetes cluster
Apache License 2.0
257 stars 117 forks source link

Allow deletion to proceed in case of VM initialization error #928

Closed rishabh-11 closed 4 months ago

rishabh-11 commented 4 months ago

What this PR does / why we need it: This PR fixes the triggerDeletionFlow, specifically the getVMStatus function to allow deletion of Unitialized VMs to proceed

Which issue(s) this PR fixes: Fixes #926

Special notes for your reviewer:

Release note:

Fixed a bug where the `Unitialised` error code was blocking machine deletion
kon-angelo commented 4 months ago

@rishabh-11 what changes are required on the provider e.g. mcm-provider-aws to return the appropriate error ?

rishabh-11 commented 4 months ago

No change required on the provider side. The problem was with the triggerDeletionFlow in MCM. It is not handling the Unitialised error code correctly.

rishabh-11 commented 4 months ago

Once this PR merges, I'll make a patch release of MCM. The providers will need to be updated with this patch version of MCM

kon-angelo commented 4 months ago

What about this scenario: User is doing some experiment and changes the something like the srcAndDstCheck (sorry don't remember the exact flag name), after a successful initialization. You still end up in a scenario where the VM can't be deleted by the MCM.

I am positing that checking the VM status during deletion and blocking the deletion only on that may not be the best thing to do. Especially when these post-init checks are involved in healthchecks of the VM. Or maybe we can have 2 "versions" of the check depending on if we are deleting the VM.

rishabh-11 commented 4 months ago

What about this scenario: User is doing some experiment and changes the something like the srcAndDstCheck (sorry don't remember the exact flag name), after a successful initialization. You still end up in a scenario where the VM can't be deleted by the MCM.

This case will be handled in this PR. Let's consider that the user is experimenting and changes something on the VM, causing the GetMachineStatus to return a Unitialized error. In this case the deletion will not be blocked and still proceed.

rishabh-11 commented 4 months ago

Consider the case for errors apart from Unitialized for GetMachineStatus. This can happen if the validation of providerSpec fails or if there are errors in fetching the VM itself. I agree that the providerSpec validation done in GetMachineStatus is incorrect as it is a full-fledged check which is not required and can cause problems like the one we saw with GCP. This can be corrected in the provider. But if I cannot get the VM for reasons apart from NotFound errors, shouldn't the deletion be blocked?

elankath commented 4 months ago

But if I cannot get the VM for reasons apart from NotFound errors, shouldn't the deletion be blocked?

Point raised by @kon-angelo feels right. Perhaps it shouldn't be blocked ? (who cares about health at this point?). Ideally, we should just go ahead with driain->deletion in all circumstances except for NotFound. But for now, this fix addresses the current problem.

kon-angelo commented 4 months ago

@rishabh-11 Maybe first, I know that the PR solves the issue for provider-aws - so in that regard it is a /lgtm.

I still do not think that it is a nice experience to not be able to delete a VM for something silly like flipping a boolean flag on the VM. @elankath summarised perfectly: why do the full blown healthcheck in this case. You particularly only care if the machine exists or not.

his can happen if the validation of providerSpec fails or if there are errors in fetching the VM itself.

Our implementations already have their own validations before doing a delete call.

This can be corrected in the provider

How exactly ? In this case the provider does not know if it should do the "full-check". You could change the interface to note if the machine is in deletion if you want to go that route.

But if I cannot get the VM for reasons apart from NotFound errors, shouldn't the deletion be blocked?

Take this as an anecdote, but for provider-openstack I do not implement GetMachineStatus (ref]) because there are also post-creation operations to be done (and there was not InitializeMachine until recently). With that the MCM flow just goes ahead and calls the deleteMachine, which generally works and I haven't seen a provider where this wouldn't work.

I don't really see a case where the delete call itself cannot and should not handle this case. Either the delete would fail, or maybe the delete call must do a get check beforehand - but in either case everything necessary would be handled by the delete call.

rishabh-11 commented 4 months ago

@kon-angelo I agree that getVMStatus might not be required in the deletion flow, and we can discuss removing it and raise a separate PR for that.

But, the GetMachineStatus method is still required in the creation flow for VM. The creation flow uses this method to check if the VM is already present in the cloud provider and create one only if it is not present. It works in OpenStack because it relies on the nodeLabel that we put on the machine object once the VM is created. It can work like this for other providers as well, but i don't think relying on a label instead of the cloud provider is the right way to go. wdyt?

How exactly ? In this case the provider does not know if it should do the "full-check".

The "check" here is a check of the providerSpec in the machine class. I meant that we should only check those fields in the providerSpec that are needed for the particular driver method to work and not the entire providerSpec for every call.

kon-angelo commented 4 months ago

But, the GetMachineStatus method is still required in the creation flow for VM. The creation flow uses this method to check if the VM is already present in the cloud provider and create one only if it is not present

Sure. Technically without the GetMachineStatus MCM will just call CreateMachine and the provider has to implement more of a "reconcile" function and for openstack the equivalent of GetMachineStatus is just handled internally in the create call to make it idempotent. But i am not arguing on the functionality of GetMachineStatus on reconcile/create - it does make things easier. My argument was more just in the delete case.

The "check" here is a check of the providerSpec in the machine class.

Maybe I misunderstood the point above. The issue with provider-aws currently is not that the spec validation does not match - what happened with gcp and the "static" spec checking. The machine is indeed "unhealthy" because the VM in the hyperscaler does not match its expected spec. The healthcheck is correct - but you don't need a "healthy" machine to move over to delete. There is no way from the GetMachineStatus to know if you need to perform a healthcheck or just say that the machine exists.

Anyway, we can have this discussion offline. I think we all agree that we should merge the PR and go ahead with the fix and optimisations can come later.

rishabh-11 commented 4 months ago

Maybe I misunderstood the point above. The issue with provider-aws currently is not that the spec validation does not match - what happened with gcp and the "static" spec checking. The machine is indeed "unhealthy" because the VM in the hyperscaler does not match its expected spec. The healthcheck is correct - but you don't need a "healthy" machine to move over to delete. There is no way from the GetMachineStatus to know if you need to perform a healthcheck or just say that the machine exists.

The idea behind GetMachineStatus should not be to perform a health check, but I think it has evolved that way. We can relook at that offline.

elankath commented 4 months ago

Test Log Before Fix.

  1. Get Instance ID of running machine
k get mc                                                                               
NAME                                    STATUS    AGE   NODE
shoot--i034796--aw1-w1-z1-54c64-n2mjz   Running   22m   ip-10-180-20-144.eu-west-1.compute.internal

aws ec2 describe-instances --query "Reservations[*].Instances[*].{Name:Tags[?Key=='Name']|[0].Value,InstanceId:InstanceId}" --output table | grep aw1-w1-z1-54c64-n2mjz
|  i-020497d720e898696 |  shoot--i034796--aw1-w1-z1-54c64-n2mjz                      |
  1. Enable Source/Dest check.

    aws ec2 describe-instances --instance-ids i-020497d720e898696 >| /tmp/i1.json
    aws ec2 modify-instance-attribute --source-dest-check --instance-id i-020497d720e898696
  2. Check describe-instances differences

    
    aws ec2 describe-instances --instance-ids i-020497d720e898696 >| /tmp/i2.json
    diff -U0 -u /tmp/i1.json /tmp/i2.json

--- /tmp/i1.json 2024-07-12 15:15:45 +++ /tmp/i2.json 2024-07-12 15:18:30 @@ -82 +82 @@

  1. Force Delete the Machine

    k label mc shoot--i034796--aw1-w1-z1-54c64-n2mjz  force-deletion=true
    k delete mc shoot--i034796--aw1-w1-z1-54c64-n2mjz  force-deletion=true
  2. Machine is stuck in terminating phase.


k get mc shoot--i034796--aw1-w1-z1-54c64-n2mjz                                         
NAME                                    STATUS        AGE   NODE
shoot--i034796--aw1-w1-z1-54c64-n2mjz   Terminating   29m   ip-10-180-20-144.eu-west-1.compute.internal

k get mc shoot--i034796--aw1-w1-z1-54c64-n2mjz -oyaml | grep -C2 -i error             
    phase: Terminating
  lastOperation:
    description: 'Error occurred with decoding machine error status while getting
      VM status, aborting without retry. machine code: machine codes error: code =
      [Uninitialized] message = [VM "i-020497d720e898696" associated with machine
      "shoot--i034796--aw1-w1-z1-54c64-n2mjz" has SourceDestCheck=true despite providerSpec.SrcAndDstChecksEnabled=false]

k get mc shoot--i034796--aw1-w1-z1-54c64-n2mjz                                         
NAME                                    STATUS        AGE   NODE
shoot--i034796--aw1-w1-z1-54c64-n2mjz   Terminating   29m   ip-10-180-20-144.eu-west-1.compute.internal

k get mc shoot--i034796--aw1-w1-z1-54c64-n2mjz -oyaml | grep -C2 -i error             
    phase: Terminating
  lastOperation:
    description: 'Error occurred with decoding machine error status while getting
      VM status, aborting without retry. machine code: machine codes error: code =
      [Uninitialized] message = [VM "i-020497d720e898696" associated with machine
      "shoot--i034796--aw1-w1-z1-54c64-n2mjz" has SourceDestCheck=true despite providerSpec.SrcAndDstChecksEnabled=false]
elankath commented 4 months ago

Test Log Post Fix

  1. Get Instance ID of running machine
k get mc                                                                               
NAME                                    STATUS    AGE   NODE
shoot--i034796--aw1-w1-z1-54c64-84gm8   Running   32m   ip-10-180-1-180.eu-west-1.compute.internal

 aws ec2 describe-instances --query "Reservations[*].Instances[*].{Name:Tags[?Key=='Name']|[0].Value,InstanceId:InstanceId}" --output text | grep aw1-w1-z1-54c64-84gm8
i-01dfe71e6bea2d80a shoot--i034796--aw1-w1-z1-54c64-84gm8
  1. Enable Source/Dest check.

    aws ec2 describe-instances --instance-ids i-01dfe71e6bea2d80a >| /tmp/i1.json
    aws ec2 modify-instance-attribute --source-dest-check --instance-id i-01dfe71e6bea2d80a
  2. Check describe-instances differences

    
    aws ec2 describe-instances --instance-ids i-01dfe71e6bea2d80a >| /tmp/i2.json

diff -U0 -u /tmp/i1.json /tmp/i2.json
--- /tmp/i1.json 2024-07-12 16:00:22 +++ /tmp/i2.json 2024-07-12 16:01:02 @@ -82 +82 @@

  1. Force Delete the Machine
    
    k label mc shoot--i034796--aw1-w1-z1-54c64-84gm8 force-deletion=true                   
    machine.machine.sapcloud.io/shoot--i034796--aw1-w1-z1-54c64-84gm8 labeled

k delete mc shoot--i034796--aw1-w1-z1-54c64-84gm8
machine.machine.sapcloud.io "shoot--i034796--aw1-w1-z1-54c64-84gm8" deleted


5.  Machine goes to `Terminating` phase, `Uninitialized` is ignored, goes to node drain.

k get mc shoot--i034796--aw1-w1-z1-54c64-84gm8
NAME STATUS AGE NODE shoot--i034796--aw1-w1-z1-54c64-84gm8 Terminating 38m ip-10-180-1-180.eu-west-1.compute.internal

k get mc shoot--i034796--aw1-w1-z1-54c64-84gm8 -oyaml | grep -C1 description git:fix-del-flow lastOperation: description: VM instance was not initalized. Moving forward to node drain. Initiate node drain


6. SUCCESS: VM is deleted and Machine obj  disappears

k get mc shoot--i034796--aw1-w1-z1-54c64-84gm8 git:fix-del-flow Error from server (NotFound): machines.machine.sapcloud.io "shoot--i034796--aw1-w1-z1-54c64-84gm8" not found

elankath commented 4 months ago

Test Post-Fix with Kubelet Crash Simulation (checking that Failed flow is unaffected by fix)

  1. See Node Status
 k get no                                                                      
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-180-14-52.eu-west-1.compute.internal   Ready    <none>   29m   v1.29.4
  1. Disable and stop the gardener-node-agent and kubelet services after SSH'ing into node. (use a privileged pod spec for this)
kubectl exec -it priv-pod -- chroot /host /bin/bash
root@ip-10-180-14-52:/# systemctl disable gardener-node-agent.service
Removed "/etc/systemd/system/multi-user.target.wants/gardener-node-agent.service".
root@ip-10-180-14-52:/# systemctl disable kubelet
systemctl stop gardener-node-agent
systemctl stop kubelet
  1. Node goes into NotReady status, Machine goes into Unknown Phase
    
    k get no                                                                      
    NAME                                         STATUS     ROLES    AGE   VERSION
    ip-10-180-14-52.eu-west-1.compute.internal   NotReady   <none>   33m   v1.29.4
    kubectl exec -it priv-pod -- chroot /host /bin/bash

k get mc
NAME STATUS AGE NODE shoot--i034796--aw1-w1-z1-54c64-dp49r Failed 45m ip-10-180-14-52.eu-west-1.compute.internal


4. `Machine` goes to `Failed` state after `machine-health-timeout` and then goes into `Terminating` and then disappears

k get mc
NAME STATUS AGE NODE shoot--i034796--aw1-w1-z1-54c64-dp49r Failed 47m ip-10-180-14-52.eu-west-1.compute.internal

k get mc shoot--i034796--aw1-w1-z1-54c64-dp49r
Error from server (NotFound): machines.machine.sapcloud.io "shoot--i034796--aw1-w1-z1-54c64-dp49r" not found

elankath commented 4 months ago

@rishabh-11 Tests complete. Please merge and release whenever ready.