Mirantis / hmc


Cluster can get stuck in Deleting state indefinitely #151

Open DinaBelova opened 1 month ago

DinaBelova commented 1 month ago

An intermittent issue that is most probably connected to the CAPI provider.

At some point the Cluster as well as the Machines get stuck in the Deleting state, even though the actual infrastructure in AWS has been cleared.

@Kshatrix noticed that when this happens, the AWSCluster object is absent, even though the Machines and AWSMachines are still present.

The AWS provider tries to patch the AWSCluster, fails because the object is no longer found, and the AWSMachine controller then keeps reporting it as not ready:
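The condition above (AWSMachines left behind without their AWSCluster) can be checked mechanically. Below is a minimal sketch, not part of hmc or CAPA. It assumes the objects have already been loaded as plain dictionaries (for example from `kubectl get awsclusters,awsmachines -o json`), that each AWSMachine carries the `cluster.x-k8s.io/cluster-name` label, and that the AWSCluster shares its name with the Cluster, as it does in the logs here. The function name `find_orphaned_machines` is hypothetical.

```python
# Label that Cluster API sets on objects belonging to a cluster.
CLUSTER_LABEL = "cluster.x-k8s.io/cluster-name"


def find_orphaned_machines(aws_clusters, aws_machines):
    """Return "namespace/name" of AWSMachines whose AWSCluster is gone.

    Both arguments are lists of Kubernetes objects as plain dicts.
    Assumption: the AWSCluster has the same name as the owning Cluster.
    """
    existing = {
        (c["metadata"].get("namespace", "default"), c["metadata"]["name"])
        for c in aws_clusters
    }
    orphans = []
    for machine in aws_machines:
        meta = machine["metadata"]
        namespace = meta.get("namespace", "default")
        cluster = meta.get("labels", {}).get(CLUSTER_LABEL)
        # Machines without the label are skipped: ownership is unknown.
        if cluster is not None and (namespace, cluster) not in existing:
            orphans.append(f"{namespace}/{meta['name']}")
    return sorted(orphans)
```

With an empty AWSCluster list and the two AWSMachines from this report as input, both machines would be reported as orphaned.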

I0726 16:19:31.225463       1 awscluster_controller.go:208] "Reconciling AWSCluster delete" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="default/aws-cl-1" namespace="default" name="aws-cl-1" reconcileID="da967d9f-4c3d-47a1-953a-cedf44e4d8d0" cluster="default/aws-cl-1"
I0726 16:19:33.955431       1 securitygroups.go:320] "Deleted security group" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="default/aws-cl-1" namespace="default" name="aws-cl-1" reconcileID="da967d9f-4c3d-47a1-953a-cedf44e4d8d0" cluster="default/aws-cl-1" security-group-id="sg-068b633aae83d2e19" kind="cluster managed"
I0726 16:19:34.432437       1 securitygroups.go:320] "Deleted security group" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="default/aws-cl-1" namespace="default" name="aws-cl-1" reconcileID="da967d9f-4c3d-47a1-953a-cedf44e4d8d0" cluster="default/aws-cl-1" security-group-id="sg-05fe37ab8f0a3ab15" kind="cluster managed"
I0726 16:19:36.516438       1 vpc.go:550] "Deleted VPC" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="default/aws-cl-1" namespace="default" name="aws-cl-1" reconcileID="da967d9f-4c3d-47a1-953a-cedf44e4d8d0" cluster="default/aws-cl-1" vpc-id="vpc-03b7241ad6eae9ab1"
E0726 16:19:36.632931       1 controller.go:329] "Reconciler error" err="failed to patch AWSCluster default/aws-cl-1: awsclusters.infrastructure.cluster.x-k8s.io \"aws-cl-1\" not found" controller="awscluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSCluster" AWSCluster="default/aws-cl-1" namespace="default" name="aws-cl-1" reconcileID="da967d9f-4c3d-47a1-953a-cedf44e4d8d0"
I0726 16:19:51.603067       1 awsmachine_controller.go:198] "AWSCluster or AWSManagedControlPlane is not ready yet" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="default/aws-cl-1-md-xczh6-bfjvv" namespace="default" name="aws-cl-1-md-xczh6-bfjvv" reconcileID="c0acf9c4-8be9-413f-a906-483b59563d9f" machine="default/aws-cl-1-md-xczh6-bfjvv" cluster="default/aws-cl-1"
I0726 16:19:52.434829       1 awsmachine_controller.go:198] "AWSCluster or AWSManagedControlPlane is not ready yet" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="default/aws-cl-1-cp-0" namespace="default" name="aws-cl-1-cp-0" reconcileID="dac57c28-e165-472b-b20f-fa0521e4b2f1" machine="default/aws-cl-1-cp-0" cluster="default/aws-cl-1"
I0726 16:19:59.970099       1 awsmachine_controller.go:198] "AWSCluster or AWSManagedControlPlane is not ready yet" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="default/aws-cl-1-cp-0" namespace="default" name="aws-cl-1-cp-0" reconcileID="df4dc07d-5e1b-4d28-88a3-0f30fa7a76f8" machine="default/aws-cl-1-cp-0" cluster="default/aws-cl-1"
I0726 16:19:59.970270       1 awsmachine_controller.go:198] "AWSCluster or AWSManagedControlPlane is not ready yet" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="default/aws-cl-1-md-xczh6-bfjvv" namespace="default" name="aws-cl-1-md-xczh6-bfjvv" reconcileID="120b3a56-1855-4d44-8333-8502e8d04981" machine="default/aws-cl-1-md-xczh6-bfjvv" cluster="default/aws-cl-1"
I0726 16:22:36.109923       1 awsmachine_controller.go:198] "AWSCluster or AWSManagedControlPlane is not ready yet" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="default/aws-cl-1-cp-0" namespace="default" name="aws-cl-1-cp-0" reconcileID="4c0f6242-2f77-4194-b9dc-c1aed2034184" machine="default/aws-cl-1-cp-0" cluster="default/aws-cl-1"
I0726 16:22:36.110149       1 awsmachine_controller.go:198] "AWSCluster or AWSManagedControlPlane is not ready yet" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="default/aws-cl-1-md-xczh6-bfjvv" namespace="default" name="aws-cl-1-md-xczh6-bfjvv" reconcileID="90ca119f-d6c2-457f-9d79-be35c1ae70a8" machine="default/aws-cl-1-md-xczh6-bfjvv" cluster="default/aws-cl-1"
I0726 16:29:30.719506       1 awsmachine_controller.go:198] "AWSCluster or AWSManagedControlPlane is not ready yet" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="default/aws-cl-1-cp-0" namespace="default" name="aws-cl-1-cp-0" reconcileID="d9d8a14a-8e04-43ab-b596-b3bc9083af81" machine="default/aws-cl-1-cp-0" cluster="default/aws-cl-1"
I0726 16:29:30.719543       1 awsmachine_controller.go:198] "AWSCluster or AWSManagedControlPlane is not ready yet" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="default/aws-cl-1-md-xczh6-bfjvv" namespace="default" name="aws-cl-1-md-xczh6-bfjvv" reconcileID="7bba88c3-0555-4536-bb12-d366df08e338" machine="default/aws-cl-1-md-xczh6-bfjvv" cluster="default/aws-cl-1"
I0726 16:32:52.252957       1 awsmachine_controller.go:198] "AWSCluster or AWSManagedControlPlane is not ready yet" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="default/aws-cl-1-md-xczh6-bfjvv" namespace="default" name="aws-cl-1-md-xczh6-bfjvv" reconcileID="2ba19b22-ddb1-4323-aa07-4bd170f5e49b" machine="default/aws-cl-1-md-xczh6-bfjvv" cluster="default/aws-cl-1"
I0726 16:32:52.253183       1 awsmachine_controller.go:198] "AWSCluster or AWSManagedControlPlane is not ready yet" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="default/aws-cl-1-cp-0" namespace="default" name="aws-cl-1-cp-0" reconcileID="36079ee5-a4ec-407a-a181-1d7a3ca1058f" machine="default/aws-cl-1-cp-0" cluster="default/aws-cl-1"
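To spot this pattern in a longer controller log, the entries can be summarized with a small script. This is a rough sketch, not an official tool. It assumes each log entry sits on a single line (terminal-wrapped entries must be joined first) and follows the klog-style layout seen above: a level letter, date, time, PID, `file:line]`, a quoted message, then `key="value"` pairs. The names `parse_entry` and `summarize` are made up for illustration.

```python
import re
from collections import Counter

# Header of one entry: level, MMDD, time, PID, source file:line, quoted message.
HEADER = re.compile(
    r'^(?P<level>[IWEF])(?P<date>\d{4}) (?P<time>[\d:.]+)\s+\d+ '
    r'(?P<file>[\w.]+):(?P<line>\d+)\] "(?P<msg>(?:[^"\\]|\\.)*)"(?P<rest>.*)$'
)
# key="value" pairs; values may contain backslash-escaped quotes.
PAIR = re.compile(r'([\w-]+)="((?:[^"\\]|\\.)*)"')


def parse_entry(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = HEADER.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["fields"] = dict(PAIR.findall(entry.pop("rest")))
    return entry


def summarize(lines):
    """Collect "not found" reconciler errors and count "not ready yet"
    messages per AWSMachine."""
    patch_errors = []
    waiting = Counter()
    for line in lines:
        entry = parse_entry(line)
        if entry is None:
            continue
        fields = entry["fields"]
        if entry["level"] == "E" and "not found" in fields.get("err", ""):
            patch_errors.append((entry["time"], fields.get("AWSCluster")))
        elif "is not ready yet" in entry["msg"]:
            waiting[fields.get("AWSMachine")] += 1
    return patch_errors, dict(waiting)
```

A "not found" patch error followed by a steadily growing "not ready yet" count for the same cluster's machines matches the stuck state described in this issue.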

After that, the process is pretty much stuck.

We should keep this issue in mind.

Restarting the controller does not help.

a13x5 commented 2 weeks ago

Created an upstream issue: kubernetes-sigs/cluster-api-provider-aws#5107. @Kshatrix FYI

slysunkin commented 1 week ago

Our fix for #217 should provide a workaround for this, but it is only a temporary solution.