Closed: rishabh-11 closed this 4 months ago
@rishabh-11 what changes are required on the provider, e.g. mcm-provider-aws, to return the appropriate error?
No change is required on the provider side. The problem was with the triggerDeletionFlow in MCM: it was not handling the Uninitialized error code correctly.
Once this PR merges, I'll make a patch release of MCM. The providers will then need to be updated to this patch version of MCM.
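For context, here is a minimal Go sketch of the kind of branching the deletion flow needs, assuming MCM's machinecodes status/codes helpers behave like their gRPC counterparts; the helper name and return values are illustrative only, not the actual code in triggerDeletionFlow:

```go
package deletionflow

import (
	"github.com/gardener/machine-controller-manager/pkg/util/provider/machinecodes/codes"
	"github.com/gardener/machine-controller-manager/pkg/util/provider/machinecodes/status"
)

// decideOnGetMachineStatusError is an illustrative helper (not the real MCM
// function) showing how the deletion flow can branch on the decoded error
// code returned by GetMachineStatus.
func decideOnGetMachineStatusError(err error) string {
	machineErr, ok := status.FromError(err)
	if !ok {
		// The error could not be decoded into a machine-codes status;
		// keep the previous behaviour and retry the status call.
		return "retry"
	}
	switch machineErr.Code() {
	case codes.NotFound:
		// The VM is already gone: skip drain and finish the deletion.
		return "skip-drain-and-finalize"
	case codes.Uninitialized:
		// The VM exists but never finished initialization: do not block,
		// move forward to node drain and DeleteMachine.
		return "proceed-to-drain"
	default:
		// Any other error (spec validation, transient cloud errors):
		// retry as before.
		return "retry"
	}
}
```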
What about this scenario: the user is doing some experiment and changes something like the srcAndDstCheck (sorry, don't remember the exact flag name) after a successful initialization. You still end up in a scenario where the VM can't be deleted by the MCM.
I am positing that checking the VM status during deletion and blocking the deletion on that alone may not be the best thing to do, especially when these post-init checks are involved in the healthchecks of the VM. Or maybe we can have two "versions" of the check depending on whether we are deleting the VM.
What about this scenario: the user is doing some experiment and changes something like the srcAndDstCheck (sorry, don't remember the exact flag name) after a successful initialization. You still end up in a scenario where the VM can't be deleted by the MCM.
This case will be handled in this PR. Let's consider that the user is experimenting and changes something on the VM, causing GetMachineStatus to return an Uninitialized error. In this case the deletion will not be blocked and will still proceed.
Consider the case of errors other than Uninitialized from GetMachineStatus. This can happen if the validation of providerSpec fails or if there are errors in fetching the VM itself. I agree that the providerSpec validation done in GetMachineStatus is incorrect, as it is a full-fledged check which is not required and can cause problems like the one we saw with GCP. This can be corrected in the provider. But if I cannot get the VM for reasons apart from NotFound errors, shouldn't the deletion be blocked?
But if I cannot get the VM for reasons apart from NotFound errors, shouldn't the deletion be blocked?
The point raised by @kon-angelo feels right. Perhaps it shouldn't be blocked? (Who cares about health at this point?) Ideally, we should just go ahead with drain->deletion in all circumstances except for NotFound. But for now, this fix addresses the current problem.
@rishabh-11 Maybe first: I know that the PR solves the issue for provider-aws, so in that regard it is a /lgtm.
I still do not think that it is a nice experience to not be able to delete a VM for something silly like flipping a boolean flag on the VM. @elankath summarised it perfectly: why do the full-blown healthcheck in this case? You only really care whether the machine exists or not.
This can happen if the validation of providerSpec fails or if there are errors in fetching the VM itself.
Our implementations already have their own validations before doing a delete call.
This can be corrected in the provider
How exactly? In this case the provider does not know if it should do the "full-check". You could change the interface to note whether the machine is in deletion, if you want to go that route.
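To make that interface idea concrete, a purely hypothetical Go sketch; none of these types or fields exist in the current MCM driver API, they only illustrate passing a "why are you asking" hint to the provider:

```go
package driverext

// StatusPurpose is a hypothetical hint telling the provider why the machine
// status is being fetched.
type StatusPurpose string

const (
	// PurposeReconcile asks for the full check, including post-init health.
	PurposeReconcile StatusPurpose = "Reconcile"
	// PurposeDeletion only asks "does the VM exist?", nothing more.
	PurposeDeletion StatusPurpose = "Deletion"
)

// GetMachineStatusRequestExt mirrors the shape of a status request and adds
// the purpose hint; the field names are illustrative only.
type GetMachineStatusRequestExt struct {
	MachineName string
	ProviderID  string
	Purpose     StatusPurpose
}
```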
But if I cannot get the VM for reasons apart from NotFound errors, shouldn't the deletion be blocked?
Take this as an anecdote, but for provider-openstack I do not implement GetMachineStatus (ref), because there are also post-creation operations to be done (and there was no InitializeMachine until recently). With that, the MCM flow just goes ahead and calls deleteMachine, which generally works, and I haven't seen a provider where this wouldn't work.
I don't really see a case where the delete call itself cannot and should not handle this case. Either the delete would fail, or maybe the delete call must do a get check beforehand - but in either case everything necessary would be handled by the delete call.
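A rough Go sketch of this idempotent-delete idea, under the assumption that the provider's delete call does its own existence check; the client interface and error value are placeholders, not provider-openstack code:

```go
package deletesketch

import (
	"context"
	"errors"
)

// ErrInstanceNotFound is a placeholder for whatever "VM does not exist"
// error the cloud SDK returns.
var ErrInstanceNotFound = errors.New("instance not found")

// instanceClient is a stand-in for the provider's cloud client.
type instanceClient interface {
	Delete(ctx context.Context, providerID string) error
}

// deleteMachineIdempotent treats "already gone" as success, so the deletion
// flow never needs a healthy GetMachineStatus result before deleting.
func deleteMachineIdempotent(ctx context.Context, c instanceClient, providerID string) error {
	err := c.Delete(ctx, providerID)
	if errors.Is(err, ErrInstanceNotFound) {
		// Nothing to do: the VM was never created or is already deleted.
		return nil
	}
	return err
}
```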
@kon-angelo I agree that getVMStatus might not be required in the deletion flow, and we can discuss removing it and raise a separate PR for that.
But the GetMachineStatus method is still required in the creation flow for the VM. The creation flow uses this method to check if the VM is already present in the cloud provider and create one only if it is not present. It works in OpenStack because it relies on the nodeLabel that we put on the machine object once the VM is created. It can work like this for other providers as well, but I don't think relying on a label instead of the cloud provider is the right way to go. wdyt?
How exactly? In this case the provider does not know if it should do the "full-check".
The "check" here is a check of the providerSpec in the machine class. I meant that we should only check those fields in the providerSpec that are needed for the particular driver method to work and not the entire providerSpec for every call.
But, the GetMachineStatus method is still required in the creation flow for VM. The creation flow uses this method to check if the VM is already present in the cloud provider and create one only if it is not present
Sure. Technically, without GetMachineStatus, MCM will just call CreateMachine and the provider has to implement more of a "reconcile" function; for openstack the equivalent of GetMachineStatus is just handled internally in the create call to make it idempotent. But I am not arguing about the functionality of GetMachineStatus on reconcile/create - it does make things easier. My argument was more just about the delete case.
The "check" here is a check of the providerSpec in the machine class.
Maybe I misunderstood the point above. The issue with provider-aws currently is not that the spec validation does not match - what happened with gcp and the "static" spec checking. The machine is indeed "unhealthy" because the VM in the hyperscaler does not match its expected spec. The healthcheck is correct - but you don't need a "healthy" machine to move over to delete. There is no way from the GetMachineStatus to know if you need to perform a healthcheck or just say that the machine exists.
Anyway, we can have this discussion offline. I think we all agree that we should merge the PR and go ahead with the fix and optimisations can come later.
Maybe I misunderstood the point above. The issue with provider-aws currently is not that the spec validation does not match - what happened with gcp and the "static" spec checking. The machine is indeed "unhealthy" because the VM in the hyperscaler does not match its expected spec. The healthcheck is correct - but you don't need a "healthy" machine to move over to delete. There is no way from the GetMachineStatus to know if you need to perform a healthcheck or just say that the machine exists.
The idea behind GetMachineStatus should not be to perform a health check, but I think it has evolved that way. We can relook at that offline.
k get mc
NAME STATUS AGE NODE
shoot--i034796--aw1-w1-z1-54c64-n2mjz Running 22m ip-10-180-20-144.eu-west-1.compute.internal
aws ec2 describe-instances --query "Reservations[*].Instances[*].{Name:Tags[?Key=='Name']|[0].Value,InstanceId:InstanceId}" --output table | grep aw1-w1-z1-54c64-n2mjz
| i-020497d720e898696 | shoot--i034796--aw1-w1-z1-54c64-n2mjz |
Enable Source/Dest check.
aws ec2 describe-instances --instance-ids i-020497d720e898696 >| /tmp/i1.json
aws ec2 modify-instance-attribute --source-dest-check --instance-id i-020497d720e898696
Check describe-instances differences
aws ec2 describe-instances --instance-ids i-020497d720e898696 >| /tmp/i2.json
diff -U0 -u /tmp/i1.json /tmp/i2.json
--- /tmp/i1.json 2024-07-12 15:15:45
+++ /tmp/i2.json 2024-07-12 15:18:30
@@ -82 +82 @@
Force Delete the Machine
k label mc shoot--i034796--aw1-w1-z1-54c64-n2mjz force-deletion=true
k delete mc shoot--i034796--aw1-w1-z1-54c64-n2mjz force-deletion=true
Machine is stuck in terminating phase.
k get mc shoot--i034796--aw1-w1-z1-54c64-n2mjz
NAME STATUS AGE NODE
shoot--i034796--aw1-w1-z1-54c64-n2mjz Terminating 29m ip-10-180-20-144.eu-west-1.compute.internal
k get mc shoot--i034796--aw1-w1-z1-54c64-n2mjz -oyaml | grep -C2 -i error
phase: Terminating
lastOperation:
description: 'Error occurred with decoding machine error status while getting
VM status, aborting without retry. machine code: machine codes error: code =
[Uninitialized] message = [VM "i-020497d720e898696" associated with machine
"shoot--i034796--aw1-w1-z1-54c64-n2mjz" has SourceDestCheck=true despite providerSpec.SrcAndDstChecksEnabled=false]
k get mc
NAME STATUS AGE NODE
shoot--i034796--aw1-w1-z1-54c64-84gm8 Running 32m ip-10-180-1-180.eu-west-1.compute.internal
aws ec2 describe-instances --query "Reservations[*].Instances[*].{Name:Tags[?Key=='Name']|[0].Value,InstanceId:InstanceId}" --output text | grep aw1-w1-z1-54c64-84gm8
i-01dfe71e6bea2d80a shoot--i034796--aw1-w1-z1-54c64-84gm8
Enable Source/Dest check.
aws ec2 describe-instances --instance-ids i-01dfe71e6bea2d80a >| /tmp/i1.json
aws ec2 modify-instance-attribute --source-dest-check --instance-id i-01dfe71e6bea2d80a
Check describe-instances differences
aws ec2 describe-instances --instance-ids i-01dfe71e6bea2d80a >| /tmp/i2.json
diff -U0 -u /tmp/i1.json /tmp/i2.json
--- /tmp/i1.json 2024-07-12 16:00:22
+++ /tmp/i2.json 2024-07-12 16:01:02
@@ -82 +82 @@
k label mc shoot--i034796--aw1-w1-z1-54c64-84gm8 force-deletion=true
machine.machine.sapcloud.io/shoot--i034796--aw1-w1-z1-54c64-84gm8 labeled
k delete mc shoot--i034796--aw1-w1-z1-54c64-84gm8
machine.machine.sapcloud.io "shoot--i034796--aw1-w1-z1-54c64-84gm8" deleted
5. Machine goes to `Terminating` phase, `Uninitialized` is ignored, goes to node drain.
k get mc shoot--i034796--aw1-w1-z1-54c64-84gm8
NAME STATUS AGE NODE
shoot--i034796--aw1-w1-z1-54c64-84gm8 Terminating 38m ip-10-180-1-180.eu-west-1.compute.internal
k get mc shoot--i034796--aw1-w1-z1-54c64-84gm8 -oyaml | grep -C1 description
lastOperation:
    description: VM instance was not initalized. Moving forward to node drain. Initiate node drain
6. SUCCESS: VM is deleted and Machine obj disappears
k get mc shoot--i034796--aw1-w1-z1-54c64-84gm8
Error from server (NotFound): machines.machine.sapcloud.io "shoot--i034796--aw1-w1-z1-54c64-84gm8" not found
(`Failed` flow is unaffected by fix)
Node Status
k get no
NAME STATUS ROLES AGE VERSION
ip-10-180-14-52.eu-west-1.compute.internal Ready <none> 29m v1.29.4
Disable and stop the gardener-node-agent and kubelet services after SSH'ing into the node (use a privileged pod spec for this).
kubectl exec -it priv-pod -- chroot /host /bin/bash
root@ip-10-180-14-52:/# systemctl disable gardener-node-agent.service
Removed "/etc/systemd/system/multi-user.target.wants/gardener-node-agent.service".
root@ip-10-180-14-52:/# systemctl disable kubelet
systemctl stop gardener-node-agent
systemctl stop kubelet
Node goes into `NotReady` status, Machine goes into `Unknown` phase.
k get no
NAME STATUS ROLES AGE VERSION
ip-10-180-14-52.eu-west-1.compute.internal NotReady <none> 33m v1.29.4
k get mc
NAME STATUS AGE NODE
shoot--i034796--aw1-w1-z1-54c64-dp49r Failed 45m ip-10-180-14-52.eu-west-1.compute.internal
4. `Machine` goes to `Failed` state after `machine-health-timeout` and then goes into `Terminating` and then disappears
k get mc
NAME STATUS AGE NODE
shoot--i034796--aw1-w1-z1-54c64-dp49r Failed 47m ip-10-180-14-52.eu-west-1.compute.internal
k get mc shoot--i034796--aw1-w1-z1-54c64-dp49r
Error from server (NotFound): machines.machine.sapcloud.io "shoot--i034796--aw1-w1-z1-54c64-dp49r" not found
@rishabh-11 Tests complete. Please merge and release whenever ready.
What this PR does / why we need it: This PR fixes the triggerDeletionFlow, specifically the getVMStatus function, to allow deletion of Uninitialized VMs to proceed.
Which issue(s) this PR fixes: Fixes #926
Special notes for your reviewer:
Release note: