gardener / machine-controller-manager-provider-aws

Gardener machine controller manager provider for AWS
Apache License 2.0

Race occurring between MCM, OOT provider and Orphan Safety Controller #127

Closed mattburgess closed 4 months ago

mattburgess commented 1 year ago

What happened:

Having upgraded to mcm-0.48.3 and mcm-provider-aws-0.17.0 we're seeing odd behaviour during a rolling update of a machinedeployment. Specifically:

  1. A VM was adopted with no prior logs about it occurring during the rollout
  2. A VM that was provisioned during the rolling update was seen as being an orphan and therefore deleted; the machine/VM was then stuck in a perpetually 'Pending' state

What you expected to happen:

All VMs created during a rolling update reach Running. No VMs created during a rolling update are classed as Orphans.

How to reproduce it (as minimally and precisely as possible):

  1. Create an AWSMachineClass
  2. Create a MachineDeployment referencing the AWSMachineClass
  3. (note that MCM will migrate AWSMachineClass to MachineClass)
  4. Scale the MachineDeployment up to 2 or 3 machines
  5. Update the AWSMachineClass & MachineDeployment so that a rollingUpdate is triggered

Anything else we need to know:

We're trying to carry out a migration from AWSMachineClasses to MachineClasses but need to be in a position where we have a single deployment of MCM that can handle deployments of AWSMachineClasses until we can update the deployments to roll out MachineClasses instead.

Our pod config looks like this (taken from deployment.yaml, because we could not find any documentation on either the MCM or mcm-provider-aws side about how to carry out this migration):

containers:
  - command:
      - ./machine-controller-manager
      - --safety-up=2
      - --safety-down=1
      - --machine-safety-overshooting-period=1m
      - --namespace=$machineDeploymentNamespace
      - --leader-elect=true
      - --v=2
    image: machine-controller-manager:0.48.3
  - command:
      - ./machine-controller
      - --control-kubeconfig=inClusterConfig
      - --namespace=$machineDeploymentNamespace
      - --machine-creation-timeout=20m
      - --machine-drain-timeout=10m
      - --machine-health-timeout=$nodeHealthTimeout
      - --machine-safety-orphan-vms-period=30m
      - --node-conditions=ReadonlyFilesystem,KernelDeadlock,DiskPressure,NetworkUnavailable,NodeInitializing
      - --v=3
    image: machine-controller-manager-provider-aws:v0.17.0-0.2.0

And finally, here's a breakdown of the logs that we saw:

After the initial MachineDeployment scale up we see 2 healthy running nodes:

NAME                                                  STATUS    AGE     NODE                                          PROVIDERID
mcm-immutable-node-az-a-integrated-test-59b49-snjpg   Running   4m55s   ip-10-50-192-106.eu-west-1.compute.internal   aws:///eu-west-1/i-00f2b6e3691412937
mcm-immutable-node-az-b-integrated-test-6b8bc-kprnn   Running   2m47s   ip-10-50-199-120.eu-west-1.compute.internal   aws:///eu-west-1/i-0d0f1b84e879cd812

After the rollout of the updated AWSMachineClass/MachineDeployment we see the following:

mcm-immutable-node-az-a-integrated-test-5db48-g95b4   Running   5m56s   ip-10-50-194-38.eu-west-1.compute.internal    aws:///eu-west-1/i-0d407a47ea28e7e6c
mcm-immutable-node-az-b-integrated-test-5bc67-vj4br   Pending   3m42s   ip-10-50-196-240.eu-west-1.compute.internal   aws:///eu-west-1/i-0ee191ba68be62801

Note that AZ A rolled out fine, but AZ B's node is still Pending.

The start of the logs for the AZ B rollout all look good; MCM has seen the incoming AWSMachineClass and spun up a new machine:

immutable-machine-controller-manager-8677546c59-b74qz machine-controller-manager I0627 13:26:23.907365       1 deployment_util.go:522] Observed a change in classKind of Machine Deployment mcm-immutable-node-az-b-integrated-test. Changing classKind from MachineClass to AWSMachineClass.
immutable-machine-controller-manager-8677546c59-b74qz machine-controller-manager I0627 13:26:24.264514       1 deployment_util.go:522] Observed a change in classKind of Machine Deployment mcm-immutable-node-az-b-integrated-test. Changing classKind from MachineClass to AWSMachineClass.
immutable-machine-controller-manager-8677546c59-b74qz machine-controller-manager I0627 13:26:25.960746       1 event.go:282] Event(v1.ObjectReference{Kind:"MachineDeployment", Namespace:"machine-controller-manager-int", Name:"mcm-immutable-node-az-b-integrated-test", UID:"d3210af5-da77-420c-85e1-693364965216", APIVersion:"machine.sapcloud.io/v1alpha1", ResourceVersion:"3950408426", FieldPath:""}): type: 'Normal' reason: 'ScalingMachineSet' Scaled up machine set mcm-immutable-node-az-b-integrated-test-5bc67 to 1
immutable-machine-controller-manager-8677546c59-b74qz machine-controller-manager I0627 13:26:26.207697       1 machineset.go:371] Too few replicas for MachineSet mcm-immutable-node-az-b-integrated-test-5bc67, need 1, creating 1
immutable-machine-controller-manager-8677546c59-b74qz machine-controller-manager I0627 13:26:26.410655       1 controller_utils.go:601] Controller mcm-immutable-node-az-b-integrated-test-5bc67 created machine mcm-immutable-node-az-b-integrated-test-5bc67-vj4br
immutable-machine-controller-manager-8677546c59-b74qz machine-controller-manager I0627 13:26:26.410725       1 event.go:282] Event(v1.ObjectReference{Kind:"MachineSet", Namespace:"machine-controller-manager-int", Name:"mcm-immutable-node-az-b-integrated-test-5bc67", UID:"b8651201-e547-4217-8d1e-d62916b428e9", APIVersion:"machine.sapcloud.io/v1alpha1", ResourceVersion:"3950408472", FieldPath:""}): type: 'Normal' reason: 'SuccessfulCreate' Created Machine: mcm-immutable-node-az-b-integrated-test-5bc67-vj4br
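The log lines in this report come from two containers in the same pod (machine-controller-manager and machine-controller), so it can help to split each line into container, timestamp and message before reading them as one timeline. Below is a minimal Python sketch, assuming every line has the shape shown here: pod name, container name, then a klog header. The regular expression and the `parse_line` helper are illustrative only and are not part of MCM.

```python
import re
from datetime import datetime

# Pod name, container name, then the klog header:
# severity letter + MMDD, wall-clock time, PID, "file:line]", message.
LINE_RE = re.compile(
    r"^(?P<pod>\S+)\s+(?P<container>\S+)\s+"
    r"(?P<sev>[IWEF])(?P<mmdd>\d{4})\s+(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<pid>\d+)\s+(?P<src>[\w.]+:\d+)\]\s(?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict for a klog-style line, or None for lines such as 'Buf:'."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return {
        "container": m["container"],
        "severity": m["sev"],
        "time": datetime.strptime(m["time"], "%H:%M:%S.%f").time(),
        "source": m["src"],
        "message": m["msg"],
    }


# Three lines copied from this report, deliberately out of order.
LINES = [
    "immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:26.414931       1 core.go:532] Migrate request has been recieved for mcm-immutable-node-az-b-integrated-test",
    "immutable-machine-controller-manager-8677546c59-b74qz machine-controller Buf:",
    "immutable-machine-controller-manager-8677546c59-b74qz machine-controller-manager I0627 13:26:26.207697       1 machineset.go:371] Too few replicas for MachineSet mcm-immutable-node-az-b-integrated-test-5bc67, need 1, creating 1",
]

# Drop unparseable lines and order the rest by timestamp.
timeline = sorted(filter(None, map(parse_line, LINES)), key=lambda e: e["time"])
for entry in timeline:
    print(entry["time"], entry["container"], entry["source"])
```

Sorting by the klog timestamp rather than by arrival order makes it easier to see which container acted first on a given machine.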

Next we see MCM needing to migrate from AWSMachineClass to MachineClass:

immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:26.414931       1 core.go:532] Migrate request has been recieved for mcm-immutable-node-az-b-integrated-test
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:26.414981       1 core.go:543] Migrate request has been processed for mcm-immutable-node-az-b-integrated-test
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:26.414987       1 migrate_machineclass.go:147] Generated generic machineClass for class AWSMachineClass/mcm-immutable-node-az-b-integrated-test
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:26.422698       1 migrate_machineclass.go:155] Create/Apply successful for MachineClass mcm-immutable-node-az-b-integrated-test
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:26.428448       1 migrate_machineclass.go:177] Updated class reference for machine machine-controller-manager-int/mcm-immutable-node-az-b-integrated-test-5bc67-vj4br
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:26.461724       1 migrate_machineclass.go:196] Updated class reference for machineset machine-controller-manager-int/mcm-immutable-node-az-b-integrated-test-5bc67
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:26.484119       1 migrate_machineclass.go:215] Updated class reference for machinedeployment machine-controller-manager-int/mcm-immutable-node-az-b-integrated-test
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:26.493563       1 migrate_machineclass.go:384] Set migrated annotation for ProviderSpecificMachineClass AWSMachineClass/mcm-immutable-node-az-b-integrated-test
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:26.493578       1 migrate_machineclass.go:426] Migration successful for class AWSMachineClass/mcm-immutable-node-az-b-integrated-test

The following also looks good; MCM has now determined that there are too many nodes in AZ B, because both the new vj4br node and the old kprnn node exist, and it chooses to delete the latter:

immutable-machine-controller-manager-8677546c59-b74qz machine-controller E0627 13:26:26.497013       1 machine_util.go:786] Failed to add finalizers for machine "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br": Operation cannot be fulfilled on machines.machine.sapcloud.io "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br": the object has been modified; please apply your changes to the latest version and try again
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:26.501806       1 machine_util.go:789] Added finalizer to machine "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br" with providerID "" and backing node ""
immutable-machine-controller-manager-8677546c59-b74qz machine-controller W0627 13:26:26.507805       1 machine_bootstrap_token.go:76] no bootstrap token placeholder found in user-data, nothing to replace! Without bootstrap token , node won't join.
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:26.507823       1 core.go:359] Get request has been recieved for "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br"
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:26.574764       1 machine.go:354] Creating a VM for machine "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br", please wait!
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:26.574784       1 machine.go:355] The machine creation is triggered with timeout of 20m0s
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:26.574791       1 core.go:80] Machine creation request has been recieved for "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br"
immutable-machine-controller-manager-8677546c59-b74qz machine-controller-manager I0627 13:26:26.712394       1 event.go:282] Event(v1.ObjectReference{Kind:"MachineDeployment", Namespace:"machine-controller-manager-int", Name:"mcm-immutable-node-az-b-integrated-test", UID:"d3210af5-da77-420c-85e1-693364965216", APIVersion:"machine.sapcloud.io/v1alpha1", ResourceVersion:"3950408462", FieldPath:""}): type: 'Normal' reason: 'ScalingMachineSet' Scaled down machine set mcm-immutable-node-az-b-integrated-test-6b8bc to 0
immutable-machine-controller-manager-8677546c59-b74qz machine-controller-manager I0627 13:26:26.958977       1 machine.go:409] Creating machine "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br", please wait!
immutable-machine-controller-manager-8677546c59-b74qz machine-controller-manager I0627 13:26:27.057470       1 machineset.go:419] Too many replicas for  machine-controller-manager-int/mcm-immutable-node-az-b-integrated-test-6b8bc, need 0, deleting 1
immutable-machine-controller-manager-8677546c59-b74qz machine-controller-manager I0627 13:26:27.057523       1 controller_utils.go:619] Controller mcm-immutable-node-az-b-integrated-test-6b8bc deleting machine mcm-immutable-node-az-b-integrated-test-6b8bc-kprnn
immutable-machine-controller-manager-8677546c59-b74qz machine-controller-manager I0627 13:26:27.310853       1 event.go:282] Event(v1.ObjectReference{Kind:"MachineSet", Namespace:"machine-controller-manager-int", Name:"mcm-immutable-node-az-b-integrated-test-6b8bc", UID:"5c2cfee9-ee6c-45c7-a1b8-f76b6427894b", APIVersion:"machine.sapcloud.io/v1alpha1", ResourceVersion:"3950408499", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted machine: mcm-immutable-node-az-b-integrated-test-6b8bc-kprnn
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.319871       1 machine_util.go:871] Machine "mcm-immutable-node-az-b-integrated-test-6b8bc-kprnn" status updated to terminating
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.319921       1 core.go:359] Get request has been recieved for "mcm-immutable-node-az-b-integrated-test-6b8bc-kprnn"
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.456037       1 core.go:401] Machine get request has been processed successfully for "mcm-immutable-node-az-b-integrated-test-6b8bc-kprnn"
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.475377       1 machine_util.go:538] Machine/status UPDATE for "mcm-immutable-node-az-b-integrated-test-6b8bc-kprnn"
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.475748       1 machine_util.go:1047] Normal delete/drain has been triggerred for machine "mcm-immutable-node-az-b-integrated-test-6b8bc-kprnn" with providerID "aws:///eu-west-1/i-0d0f1b84e879cd812" and backing node "ip-10-50-199-120.eu-west-1.compute.internal" with drain-timeout:10m0s & maxEvictRetries:30
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.565012       1 drain.go:200] Machine drain ended on 2023-06-27 13:26:27.565007172 +0000 UTC m=+3579.036882365 and took 78.674388ms for "ip-10-50-199-120.eu-west-1.compute.internal"
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.565042       1 machine_util.go:1101] Drain successful for machine "mcm-immutable-node-az-b-integrated-test-6b8bc-kprnn" ,providerID "aws:///eu-west-1/i-0d0f1b84e879cd812", backing node "ip-10-50-199-120.eu-west-1.compute.internal".
immutable-machine-controller-manager-8677546c59-b74qz machine-controller Buf:
immutable-machine-controller-manager-8677546c59-b74qz machine-controller ErrBuf:WARNING: Ignoring DaemonSet-managed pods: cadvisor-sx7t4, crowdstrike-falcon-wcmtc, aws-node-xsvd4, no-iam-kgwgr
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.572123       1 machine_util.go:538] Machine/status UPDATE for "mcm-immutable-node-az-b-integrated-test-6b8bc-kprnn"
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.572186       1 machine_util.go:1047] Normal delete/drain has been triggerred for machine "mcm-immutable-node-az-b-integrated-test-6b8bc-kprnn" with providerID "aws:///eu-west-1/i-0d0f1b84e879cd812" and backing node "ip-10-50-199-120.eu-west-1.compute.internal" with drain-timeout:10m0s & maxEvictRetries:30
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.582677       1 drain.go:1146] Scheduling state for node "ip-10-50-199-120.eu-west-1.compute.internal" is already in desired state
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.648075       1 drain.go:200] Machine drain ended on 2023-06-27 13:26:27.648069731 +0000 UTC m=+3579.119944910 and took 66.976743ms for "ip-10-50-199-120.eu-west-1.compute.internal"
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.648109       1 machine_util.go:1101] Drain successful for machine "mcm-immutable-node-az-b-integrated-test-6b8bc-kprnn" ,providerID "aws:///eu-west-1/i-0d0f1b84e879cd812", backing node "ip-10-50-199-120.eu-west-1.compute.internal".
immutable-machine-controller-manager-8677546c59-b74qz machine-controller Buf:
immutable-machine-controller-manager-8677546c59-b74qz machine-controller ErrBuf:WARNING: Ignoring DaemonSet-managed pods: cadvisor-sx7t4, crowdstrike-falcon-wcmtc, aws-node-xsvd4, no-iam-kgwgr
immutable-machine-controller-manager-8677546c59-b74qz machine-controller W0627 13:26:27.652387       1 machine_util.go:536] Machine/status UPDATE failed for machine "mcm-immutable-node-az-b-integrated-test-6b8bc-kprnn". Retrying, error: Operation cannot be fulfilled on machines.machine.sapcloud.io "mcm-immutable-node-az-b-integrated-test-6b8bc-kprnn": the object has been modified; please apply your changes to the latest version and try again
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.652462       1 core.go:290] Machine deletion request has been recieved for "mcm-immutable-node-az-b-integrated-test-6b8bc-kprnn"
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.816973       1 core.go:316] VM "aws:///eu-west-1/i-0d0f1b84e879cd812" for Machine "mcm-immutable-node-az-b-integrated-test-6b8bc-kprnn" was terminated succesfully
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.816999       1 core.go:342] Machine deletion request has been processed for "mcm-immutable-node-az-b-integrated-test-6b8bc-kprnn"
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.823703       1 machine_util.go:538] Machine/status UPDATE for "mcm-immutable-node-az-b-integrated-test-6b8bc-kprnn"
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.825093       1 core.go:233] Waiting for VM with Provider-ID "aws:///eu-west-1/i-0c50d92b176e99c0d" to be visible to all AWS endpoints
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.838000       1 machine_util.go:538] Machine/status UPDATE for "mcm-immutable-node-az-b-integrated-test-6b8bc-kprnn"
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.845968       1 machine_util.go:812] Removed finalizer to machine "mcm-immutable-node-az-b-integrated-test-6b8bc-kprnn" with providerID "aws:///eu-west-1/i-0d0f1b84e879cd812" and backing node "ip-10-50-199-120.eu-west-1.compute.internal"
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.845988       1 machine.go:625] Machine "mcm-immutable-node-az-b-integrated-test-6b8bc-kprnn" with providerID "aws:///eu-west-1/i-0d0f1b84e879cd812" and nodeName "ip-10-50-199-120.eu-west-1.compute.internal" deleted successfully

Next we see MCM make the API call to get a new AWS instance created for the vj4br machine at the same time as the Orphan VM reconciler kicks in:

immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.846072       1 machine_safety.go:55] reconcileClusterMachineSafetyOrphanVMs: Start
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.846114       1 core.go:419] List machines request has been recieved for "mcm-immutable-m6i-4xlarge-az-a-integrated-test"
immutable-machine-controller-manager-8677546c59-b74qz machine-controller-manager E0627 13:26:27.857051       1 machineset.go:686] failed to update machine status for machine mcm-immutable-node-az-b-integrated-test-6b8bc-kprnn: machines.machine.sapcloud.io "mcm-immutable-node-az-b-integrated-test-6b8bc-kprnn" not found
immutable-machine-controller-manager-8677546c59-b74qz machine-controller-manager I0627 13:26:27.857068       1 machineset.go:688] Delete machine from machineset "mcm-immutable-node-az-b-integrated-test-6b8bc-kprnn"
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.905818       1 core.go:253] VM with Provider-ID: "aws:///eu-west-1/i-0c50d92b176e99c0d" created for Machine: "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br"
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.905844       1 machine.go:367] Created new VM for machine: "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br" with ProviderID: "aws:///eu-west-1/i-0c50d92b176e99c0d" and backing node: ""
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.912682       1 machine.go:492] Machine labels/annotations UPDATE for "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br"
immutable-machine-controller-manager-8677546c59-b74qz machine-controller W0627 13:26:27.914514       1 machine_bootstrap_token.go:76] no bootstrap token placeholder found in user-data, nothing to replace! Without bootstrap token , node won't join.
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.914530       1 core.go:359] Get request has been recieved for "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br"
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.971307       1 core.go:401] Machine get request has been processed successfully for "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br"
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.976595       1 machine.go:518] Machine/status UPDATE for "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br" during creation

The API call resulted in aws:///eu-west-1/i-0c50d92b176e99c0d being provisioned, but straight afterwards we see MCM adopt aws:///eu-west-1/i-0ee191ba68be62801, then immediately decide that it's orphaned and terminate it!

immutable-machine-controller-manager-8677546c59-b74qz machine-controller-manager I0627 13:26:28.327003       1 machine.go:487] Created/Adopted machine: "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br", MachineID: aws:///eu-west-1/i-0ee191ba68be62801
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:28.458570       1 core.go:489] List machines request has been processed successfully
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:28.558973       1 core.go:290] Machine deletion request has been recieved for "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br"
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:28.768106       1 core.go:316] VM "aws:///eu-west-1/i-0ee191ba68be62801" for Machine "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br" was terminated succesfully
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:28.768127       1 core.go:342] Machine deletion request has been processed for "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br"
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:28.768134       1 machine_safety.go:300] SafetyController: Orphan VM found and terminated VM: mcm-immutable-node-az-b-integrated-test-5bc67-vj4br, aws:///eu-west-1/i-0ee191ba68be62801

Next we see MCM (manager, not the provider) decide that the original instance that was provisioned, aws:///eu-west-1/i-0c50d92b176e99c0d, is also Orphaned so it deletes it as well:

immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:28.768148       1 core.go:419] List machines request has been recieved for "mcm-immutable-node-az-b-integrated-test"
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:28.941080       1 core.go:489] List machines request has been processed successfully
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:29.041900       1 core.go:290] Machine deletion request has been recieved for "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br"
immutable-machine-controller-manager-8677546c59-b74qz machine-controller-manager I0627 13:26:29.216360       1 machine_safety.go:659] SafetyController: Orphan VM found and terminated VM: mcm-immutable-node-az-b-integrated-test-5bc67-vj4br, aws:///eu-west-1/i-0c50d92b176e99c0d
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:29.229228       1 core.go:316] VM "aws:///eu-west-1/i-0c50d92b176e99c0d" for Machine "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br" was terminated succesfully
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:29.229251       1 core.go:342] Machine deletion request has been processed for "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br"
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:29.229259       1 machine_safety.go:300] SafetyController: Orphan VM found and terminated VM: mcm-immutable-node-az-b-integrated-test-5bc67-vj4br, aws:///eu-west-1/i-0c50d92b176e99c0d
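The symptom in the logs above is that one machine object ends up associated with two different instance IDs, both of which are then terminated as orphans. A rough Python sketch for spotting that pattern in a larger log dump follows. It assumes only the three message formats that appear in this report; the `summarize` and `double_created` helpers are hypothetical names, not part of MCM.

```python
import re
from collections import defaultdict

CREATE_PATTERNS = [
    # Message logged by the machine-controller (provider) container.
    re.compile(r'VM with Provider-ID: "(?P<vm>[^"]+)" created for Machine: "(?P<machine>[^"]+)"'),
    # Message logged by the machine-controller-manager container.
    re.compile(r'Created/Adopted machine: "(?P<machine>[^"]+)", MachineID: (?P<vm>\S+)'),
]
ORPHAN_PATTERN = re.compile(
    r"Orphan VM found and terminated VM: (?P<machine>[^,]+), (?P<vm>\S+)"
)


def summarize(lines):
    """Map each machine name to the VM IDs created for it and terminated as orphans."""
    created = defaultdict(set)
    orphaned = defaultdict(set)
    for line in lines:
        for pattern in CREATE_PATTERNS:
            m = pattern.search(line)
            if m:
                created[m["machine"]].add(m["vm"])
        m = ORPHAN_PATTERN.search(line)
        if m:
            orphaned[m["machine"]].add(m["vm"])
    return created, orphaned


def double_created(created):
    """Machines that were given more than one distinct VM ID."""
    return {name: sorted(vms) for name, vms in created.items() if len(vms) > 1}


# Message parts of four log lines from this report.
LOG_EXCERPT = [
    'VM with Provider-ID: "aws:///eu-west-1/i-0c50d92b176e99c0d" created for Machine: "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br"',
    'Created/Adopted machine: "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br", MachineID: aws:///eu-west-1/i-0ee191ba68be62801',
    "SafetyController: Orphan VM found and terminated VM: mcm-immutable-node-az-b-integrated-test-5bc67-vj4br, aws:///eu-west-1/i-0ee191ba68be62801",
    "SafetyController: Orphan VM found and terminated VM: mcm-immutable-node-az-b-integrated-test-5bc67-vj4br, aws:///eu-west-1/i-0c50d92b176e99c0d",
]

created, orphaned = summarize(LOG_EXCERPT)
print(double_created(created))
print({name: sorted(vms) for name, vms in orphaned.items()})
```

For the excerpt, the vj4br machine is reported with both instance IDs, and both IDs also show up in the orphan terminations.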

Environment:

k8s: 1.23
mcm: 0.48.3
mcm-provider-aws: 0.17.0

himanshu-kun commented 1 year ago

[Before rollout] mcm-immutable-node-az-b-integrated-test-6b8bc-kprnn

[updated mcd's reference to AWSMachineClass with some changes again]

[MCM updated new mcs' reference to AWSMachineClass]
immutable-machine-controller-manager-8677546c59-b74qz machine-controller-manager I0627 13:26:23.907365 1 deployment_util.go:522] Observed a change in classKind of Machine Deployment mcm-immutable-node-az-b-integrated-test. Changing classKind from MachineClass to AWSMachineClass.

[MC migration logic migrates from AWSMachineClass to MachineClass]
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:26.422698 1 migrate_machineclass.go:155] Create/Apply successful for MachineClass mcm-immutable-node-az-b-integrated-test
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:26.428448 1 migrate_machineclass.go:177] Updated class reference for machine machine-controller-manager-int/mcm-immutable-node-az-b-integrated-test-5bc67-vj4br
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:26.461724 1 migrate_machineclass.go:196] Updated class reference for machineset machine-controller-manager-int/mcm-immutable-node-az-b-integrated-test-5bc67
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:26.493563 1 migrate_machineclass.go:384] Set migrated annotation for ProviderSpecificMachineClass AWSMachineClass/mcm-immutable-node-az-b-integrated-test

[MC creates VM for new mc vj4br]
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.905844 1 machine.go:367] Created new VM for machine: "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br" with ProviderID: "aws:///eu-west-1/i-0c50d92b176e99c0d" and backing node: ""
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.912682 1 machine.go:492] Machine labels/annotations UPDATE for "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br"
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:27.976595 1 machine.go:518] Machine/status UPDATE for "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br" during creation

[MCM creates VM for new mc vj4br]
immutable-machine-controller-manager-8677546c59-b74qz machine-controller-manager I0627 13:26:28.327003 1 machine.go:487] Created/Adopted machine: "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br", MachineID: aws:///eu-west-1/i-0ee191ba68be62801

[MC safety controller deletes VM created by MCM]
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:28.768134 1 machine_safety.go:300] SafetyController: Orphan VM found and terminated VM: mcm-immutable-node-az-b-integrated-test-5bc67-vj4br, aws:///eu-west-1/i-0ee191ba68be62801

[MCM safety controller deletes VM created by MC]
immutable-machine-controller-manager-8677546c59-b74qz machine-controller I0627 13:26:29.229228 1 core.go:316] VM "aws:///eu-west-1/i-0c50d92b176e99c0d" for Machine "mcm-immutable-node-az-b-integrated-test-5bc67-vj4br" was terminated succesfully

himanshu-kun commented 1 year ago

Ideally MCM should not reconcile the machine object: in the ideal case the machine (mc) has already been updated to reference the generic MachineClass, and so, due to this code, MCM performs no operation on it. However, from the code of MCM v0.48.3, it seems that orphan collection could still be done by MCM. To deal with that you could use MCM v0.49.0, where the orphan collection code has been removed.

Here it seems the following happened (with some assumptions, due to missing logs):

MCM started acting here because the migration logic had not completed by the time MCM picked up the machine for reconciliation. (Unfortunately the relevant logs aren't available, mostly because they are higher V-level logs.)
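To make the suspected interleaving concrete, here is a small deterministic Python model of it. This is only an illustration of the hypothesis above, not MCM code: the `Machine` and `Cloud` types, both reconcile functions and the `rollout` helper are hypothetical, and the rule that each controller decides purely on the class kind it sees is an assumption.

```python
from dataclasses import dataclass, field, replace


@dataclass
class Machine:
    name: str
    class_kind: str  # "AWSMachineClass" or "MachineClass"


@dataclass
class Cloud:
    vms: list = field(default_factory=list)

    def create_vm(self, machine_name, creator):
        vm_id = f"i-{len(self.vms) + 1:04d}"
        self.vms.append((vm_id, machine_name, creator))
        return vm_id


def in_tree_reconcile(view, cloud):
    # Assumed behaviour: the in-tree controller skips machines that
    # already reference the generic MachineClass.
    if view.class_kind == "MachineClass":
        return None
    return cloud.create_vm(view.name, "machine-controller-manager")


def out_of_tree_reconcile(view, cloud):
    # Assumed behaviour: the provider only handles generic MachineClass machines.
    if view.class_kind != "MachineClass":
        return None
    return cloud.create_vm(view.name, "machine-controller")


def rollout(in_tree_reads_before_migration):
    cloud = Cloud()
    machine = Machine("vj4br", "AWSMachineClass")
    stale_view = replace(machine)        # copy taken before the migration finishes
    machine.class_kind = "MachineClass"  # migration updates the class reference
    out_of_tree_reconcile(machine, cloud)
    view = stale_view if in_tree_reads_before_migration else machine
    in_tree_reconcile(view, cloud)
    return cloud.vms


print(len(rollout(True)))   # 2: both controllers created a VM for one machine
print(len(rollout(False)))  # 1: only the provider created a VM
```

In this model the second VM appears only when the in-tree controller acts on a copy of the machine taken before the class reference was migrated, which matches the two instance IDs seen for vj4br.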

himanshu-kun commented 1 year ago

@mattburgess, I don't understand from your statement above the reasoning for not migrating:

We're trying to carry out a migration from AWSMachineClasses to MachineClasses but need to be in a position where we have a single deployment of MCM that can handle deployments of AWSMachineClasses until we can update the deployments to roll out MachineClasses instead.

Could you explain more? We don't support migration fixes now.

rishabh-11 commented 4 months ago

Closing due to inactivity.