aws / eks-anywhere

Run Amazon EKS on your own infrastructure 🚀
https://anywhere.eks.amazonaws.com
Apache License 2.0

CloudStack clusterctl move occasionally fails to bring over cloudstackmachinetemplate for machinedeployments #2183

Open maxdrib opened 2 years ago

maxdrib commented 2 years ago

What happened: In the e2e tests, we run clusterctl move soon after a cluster is first created. It seems like the cloudstackmachinetemplates occasionally do not get picked up by the CAPI cluster in time, so the move does not bring them over to the destination cluster, and the operation ultimately fails with error messages like

2022-05-16T16:53:25.110-0400    V3  Waiting for workload cluster machine deployment replicas to be ready after move
2022-05-16T16:53:25.110-0400    V6  waiting for nodes   {"cluster": "eksa-drib-a9ee116"}
2022-05-16T16:53:25.110-0400    V6  Executing command   {"cmd": "/usr/local/bin/docker exec -i eksa_1652734226607009000 kubectl get machinedeployments.cluster.x-k8s.io -o json --kubeconfig eksa-drib-a9ee116/generated/eksa-drib-a9ee116.kind.kubeconfig --namespace eksa-system"}
2022-05-16T16:53:25.777-0400    V6  waiting for nodes   {"cluster": "eksa-drib-a9ee116"}
2022-05-16T16:53:25.777-0400    V6  Executing command   {"cmd": "/usr/local/bin/docker exec -i eksa_1652734226607009000 kubectl get machinedeployments.cluster.x-k8s.io -o json --kubeconfig eksa-drib-a9ee116/generated/eksa-drib-a9ee116.kind.kubeconfig --namespace eksa-system"}
2022-05-16T16:53:26.304-0400    V5  Error happened during retry {"error": "machine deployment is in  phase", "retries": 1}
2022-05-16T16:53:26.304-0400    V5  Sleeping before next retry  {"time": "0s"}
...
2022-05-16T15:21:11.540-0400    V6  Executing command   {"cmd": "/usr/local/bin/docker exec -i eksa_1652726909294702000 kubectl get machinedeployments.cluster.x-k8s.io -o json --kubeconfig eksa-drib-2ffeb29/generated/eksa-drib-2ffeb29.kind.kubeconfig --namespace eksa-system"}
2022-05-16T15:21:12.480-0400    V5  Error happened during retry {"error": "machine deployment is in  phase", "retries": 2292}
2022-05-16T15:21:12.481-0400    V5  Sleeping before next retry  {"time": "0s"}
2022-05-16T15:21:12.481-0400    V5  Timeout reached. Returning error    {"retries": 2292, "duration": "30m0.809020736s", "error": "machine deployment is in  phase"}
2022-05-16T15:21:12.482-0400    V4  Task finished   {"task_name": "cluster-management-move", "duration": "30m37.551740161s"}
...
Error: failed to delete cluster: waiting for workload cluster machinedeployment replicas to be ready: retries exhausted waiting for machinedeployment replicas to be ready: machine deployment is in  phase

What you expected to happen: I expected the move to succeed and bring over the cloudstackmachinetemplate for the machinedeployment.

How to reproduce it (as minimally and precisely as possible): This is a nondeterministic bug. It appears on any e2e test where we first create a cluster, and then proceed to either upgrade or delete it.
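When reproducing this outside the e2e harness, running the move step by hand with higher verbosity can show which objects clusterctl includes in the move. A rough sketch (kubeconfig paths are illustrative placeholders, not the exact files from this run):

```sh
# Run the move manually with verbose logging to see which objects are
# discovered and transferred; a cloudstackmachinetemplate missing from the
# output would line up with the failure described above.
clusterctl move --kubeconfig bootstrap.kubeconfig \
  --to-kubeconfig workload.kubeconfig \
  --namespace eksa-system -v 5
```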

Anything else we need to know?: I discovered a workaround: manually move the cloudstackmachinetemplate to the destination cluster, and then force the associated CAPI machinedeployment to reconcile by editing some field in its spec.
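Roughly, that workaround looks like this with kubectl (resource names and kubeconfig paths are placeholders, not the exact objects from this run):

```sh
# Export the cloudstackmachinetemplate that was left behind on the source cluster.
kubectl --kubeconfig source.kubeconfig -n eksa-system \
  get cloudstackmachinetemplate <template-name> -o yaml > csmt.yaml

# Strip server-populated fields (uid, resourceVersion, ownerReferences, status)
# from csmt.yaml, then recreate it on the destination cluster.
kubectl --kubeconfig dest.kubeconfig -n eksa-system create -f csmt.yaml

# Nudge the associated CAPI machinedeployment so it reconciles again; any
# harmless edit works, an annotation is just one way to trigger it.
kubectl --kubeconfig dest.kubeconfig -n eksa-system \
  annotate machinedeployment <md-name> reconcile-nudge="$(date +%s)" --overwrite
```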

Environment:

maxdrib commented 2 years ago

We have observed the ownerRef on the machinedeployment's cloudstackmachinetemplate to be present while the cluster is coming up, but then some process causes the ownerRef to disappear; we have observed it come back eventually. This is likely either a CAPC issue or an eks-a controller issue.
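One way to spot-check this while the cluster comes up (a generic sketch, not tied to this particular test run) is to print the ownerReferences column on the templates:

```sh
# An empty OWNERS column means the template is not attached to the CAPI object
# graph and would be skipped by clusterctl move.
kubectl -n eksa-system get cloudstackmachinetemplates \
  -o custom-columns='NAME:.metadata.name,OWNERS:.metadata.ownerReferences[*].kind'
```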

maxdrib commented 2 years ago

I am seeing pretty consistent failures in the stacked etcd upgrade flow (TestCloudStackKubernetes120RedhatTo121Upgrade) related to move. When the worker node machinedeployments are transferred from the workload cluster to the bootstrap cluster, they never get a status. EKS-A waits for them to become ready, but they never report a ready status. I also see errors in the CAPI controller saying

E0512 13:30:22.642854       1 machinedeployment_controller.go:158] controller/machinedeployment "msg"="Failed to reconcile MachineDeployment" "error"="failed to retrieve CloudStackMachineTemplate external object \"eksa-system\"/\"eksa-drib-2ffeb29-md-0-1652357005770\": cloudstackmachinetemplates.infrastructure.cluster.x-k8s.io \"eksa-drib-2ffeb29-md-0-1652357005770\" not found" "name"="eksa-drib-2ffeb29-md-0" "namespace"="eksa-system" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="MachineDeployment"

so this may be a bigger issue. It seems the worker node cloudstackmachinetemplate did not get moved again, and it is missing an ownerRef, which would explain why it was skipped, since clusterctl move discovers objects largely by following ownerReference links from the Cluster object.

maxdrib commented 2 years ago

Timeline of observed failure

2:52 pm - bootstrap cluster creation starts
2:54 pm - bootstrap cluster created, initializing workload cluster
2:57 pm - run clusterctl init on workload cluster to install capi components, and wait for capi deployments to be Available, including capc
2:57:55 pm - run clusterctl move to move capi components/cluster from bootstrap to workload cluster
2:58:14 pm - eks-a components are installed
2:58:37 pm - eks-a manifest is applied to cluster
2:58:42 pm - bootstrap cluster is deleted, cluster creation is considered successful
2:58:46 pm - eks-a upgrade starts running preflight cmk checks
2:59:12 pm - ownerRef disappears on the md-0 cloudstackmachinetemplate on the workload cluster
2:59:13 pm - eks-a upgrade starts spinning up bootstrap cluster
2:59:47 pm - bootstrap cluster is up, run clusterctl init on it for upgrade

maxdrib commented 2 years ago

Looking at the creation timestamp of the rogue post-move cloudstackmachinetemplate (from the associated support bundle support-bundle-2022-05-12T15_38_22.zip), it was created at 3:02 pm. It seems that for some reason the original cloudstackmachinetemplate is being removed, presumably by garbage collection. This would occur if the ownerRef is present and then removed from the cloudstackmachinetemplate for whatever reason. So the question now is why the ownerRef would get removed.
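If that theory holds, the recreated template should show a fresh uid and creationTimestamp compared with the object captured before the move. A generic check along these lines (not taken from the support bundle) makes that visible:

```sh
# A changed uid/creationTimestamp relative to the pre-move object means the
# original template was deleted and a new one was created in its place.
kubectl -n eksa-system get cloudstackmachinetemplates \
  -o custom-columns='NAME:.metadata.name,UID:.metadata.uid,CREATED:.metadata.creationTimestamp'
```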

maxdrib commented 2 years ago

I was able to reproduce this issue while logging the cloudstackmachinetemplate's metadata field before and after the ownerRef disappeared. Logs attached. TL;DR: the cloudstackmachinetemplate is created with ownerReferences immediately present, and at some point the ownerReferences are removed. The resource's uid, creationTimestamp, and name are unchanged, so something actively removes the ownerReferences field, but it is unclear what that is. The object's resourceVersion is bumped at that time, but nothing else appears to change.

ownerRefs.log

After about 18 minutes, ownerRefs are re-added to the cloudstackmachinetemplates, so subsequent move operations should succeed.
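For anyone trying to capture the same trace, a simple watch on the template should show the exact update that drops the field (a sketch; the template name is a placeholder):

```sh
# Log resourceVersion and ownerReferences on every update event so the change
# that removes ownerReferences can be tied to a point in time.
kubectl -n eksa-system get cloudstackmachinetemplate <template-name> -w \
  -o jsonpath='{.metadata.resourceVersion}{"  "}{.metadata.ownerReferences}{"\n"}'
```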

From a conversation with @jiayiwang7, it sounds like a good next step would be to disable the EKS-A controller and see whether the issue still occurs. That is done by skipping the InstallEksaComponentsTask in create.go.
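An alternative quick check that avoids rebuilding the CLI could be to scale the EKS-A controller to zero after the cluster is created; the deployment and namespace names below are assumptions and may differ between releases:

```sh
# With the EKS-A controller stopped, watch whether the ownerRef on the
# cloudstackmachinetemplate still disappears; if it does not, the controller
# is the likely culprit.
kubectl -n eksa-system scale deployment eksa-controller-manager --replicas=0
```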