maxdrib opened 2 years ago
We have observed that the ownerRef on the md (worker MachineDeployment) cloudstackmachinetemplate is present while the cluster is coming up, but then some process causes the ownerRef to disappear. We have observed it come back eventually. This is likely either a CAPC issue or an eks-a controller issue.
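For anyone trying to spot this, a quick way to check whether the ownerRef is currently present on the templates (a kubectl sketch; the `eksa-system` namespace matches the logs later in this issue):

```sh
# Show each CloudStackMachineTemplate and the kinds of its owners (empty OWNERS = missing ownerRef).
kubectl get cloudstackmachinetemplates -n eksa-system \
  -o custom-columns='NAME:.metadata.name,OWNERS:.metadata.ownerReferences[*].kind'
```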
I am seeing pretty consistent failures in the stacked etcd upgrade flow (TestCloudStackKubernetes120RedhatTo121Upgrade) related to move. When transferring the worker node machinedeployments from the workload cluster to the bootstrap cluster, they never get a status. Eks-a is waiting for them to become ready, but they never report a ready status. I also see errors in the capi controller saying:
```
E0512 13:30:22.642854 1 machinedeployment_controller.go:158] controller/machinedeployment "msg"="Failed to reconcile MachineDeployment" "error"="failed to retrieve CloudStackMachineTemplate external object \"eksa-system\"/\"eksa-drib-2ffeb29-md-0-1652357005770\": cloudstackmachinetemplates.infrastructure.cluster.x-k8s.io \"eksa-drib-2ffeb29-md-0-1652357005770\" not found" "name"="eksa-drib-2ffeb29-md-0" "namespace"="eksa-system" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="MachineDeployment"
```
so this may be a bigger issue. It seems the worker node cloudstackmachinetemplate didn't get moved again, and it's missing an ownerRef.
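One way to confirm the mismatch is to compare the template each MachineDeployment references against the templates that actually exist (a kubectl sketch, using the same namespace as the error above):

```sh
# Print "machinedeployment -> referenced infrastructure template", then list the templates that exist.
kubectl get machinedeployments -n eksa-system \
  -o jsonpath='{range .items[*]}{.metadata.name}{" -> "}{.spec.template.spec.infrastructureRef.name}{"\n"}{end}'
kubectl get cloudstackmachinetemplates -n eksa-system
```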
Timeline of observed failure
- 2:52 pm - bootstrap cluster creation starts
- 2:54 pm - bootstrap cluster created, initializing workload cluster
- 2:57 pm - run clusterctl init on workload cluster to install capi components, and wait for capi deployments to be Available, including capc
- 2:57:55 pm - run clusterctl move to move capi components/cluster from bootstrap to workload cluster
- 2:58:14 pm - eks-a components are installed
- 2:58:37 pm - eks-a manifest is applied to cluster
- 2:58:42 pm - bootstrap cluster is deleted, cluster creation is considered successful
- 2:58:46 pm - eks-a upgrade starts running preflight cmk checks
- 2:59:12 pm - ownerRef disappears on the md-0 cloudstackmachinetemplate on the workload cluster
- 2:59:13 pm - eks-a upgrade starts spinning up bootstrap cluster
- 2:59:47 pm - bootstrap cluster is up, run clusterctl init on it for upgrade
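The rough shape of that create flow, for anyone reproducing it by hand (illustrative only, not the literal e2e code; eks-a drives these steps through its own tooling, and provider names/flags may differ):

```sh
# Bootstrap cluster comes up, workload cluster objects are created from it,
# then everything is pivoted to the workload cluster and the bootstrap cluster is torn down.
kind create cluster --name eksa-bootstrap                        # ~2:52-2:54
# ... CAPI/CAPC objects for the workload cluster are created from the bootstrap cluster ...
clusterctl init --kubeconfig "$WORKLOAD_KUBECONFIG" \
  --infrastructure cloudstack                                    # ~2:57
clusterctl move --to-kubeconfig "$WORKLOAD_KUBECONFIG"           # ~2:57:55
kind delete cluster --name eksa-bootstrap                        # ~2:58:42
```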
Looking at the creation timestamp of the rogue post-move cloudstackmachinetemplate (from the associated support bundle support-bundle-2022-05-12T15_38_22.zip), it was created at 3:02 pm. It seems that for some reason the original cloudstackmachinetemplate is being removed, presumably by garbage collection. This would occur if the ownerRef is present and then removed from the cloudstackmachinetemplate for whatever reason. So now the question might be: why would the ownerRef get removed?
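A quick way to tell the two cases apart on a live cluster (sketch; `<name>` is the template name): if the template was deleted and recreated, uid and creationTimestamp change, whereas if only the ownerRef was stripped in place they stay the same:

```sh
# Print uid, creationTimestamp, and resourceVersion for the template in question.
kubectl get cloudstackmachinetemplate <name> -n eksa-system \
  -o jsonpath='{.metadata.uid}{"  "}{.metadata.creationTimestamp}{"  "}{.metadata.resourceVersion}{"\n"}'
```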
I was able to reproduce this issue while logging the cloudstackmachinetemplate's metadata field before and after the ownerRef disappeared. Logs attached. TL;DR: the cloudstackmachinetemplate is created and ownerRefs is immediately present. At some point, ownerRefs is removed. The resource's uid, creationTimestamp, and name are unchanged, so something is actually removing the ownerRef attributes, but it's unclear what that is. The object's resourceVersion is updated at that time, but nothing else seems to change.
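For anyone who wants to capture the same view without modifying any code, a kubectl-only sketch (assumes jq is available; `<name>` is the template's name):

```sh
# Snapshot the template's metadata every few seconds; diffing consecutive snapshots
# shows exactly when ownerReferences disappears and what else changes with it.
while true; do
  date -u +%FT%TZ >> /tmp/cstemplate-metadata.log
  kubectl get cloudstackmachinetemplate <name> -n eksa-system -o json \
    | jq '.metadata | {name, uid, creationTimestamp, resourceVersion, ownerReferences}' \
    >> /tmp/cstemplate-metadata.log
  sleep 5
done
```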
After about 18 minutes, ownerRefs are re-added to the cloudstackmachinetemplates, so subsequent move operations should succeed
From conversation with @jiayiwang7, it sounds like a good next step would be to try disabling the EKS-A controller to see if the issue still occurs. That’s done by skipping the InstallEksaComponentsTask in create.go
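A quicker way to test the same hypothesis on a cluster that already has the eks-a components installed might be to scale the controller down instead of skipping the task; the Deployment and namespace names below are assumptions on my part, adjust if they differ:

```sh
# Stop the eks-a controller so it cannot touch the CAPI objects (assumed resource names).
kubectl scale deployment eksa-controller-manager -n eksa-system --replicas=0
```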
What happened: In the e2e tests, we run clusterctl move soon after a cluster is first created. It seems the cloudstackmachinetemplates occasionally have not been adopted into the CAPI cluster's ownership chain in time, so the move does not bring them over to the destination cluster, and the move ultimately fails with error messages like the one in the controller logs above.
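A sketch of a guard that would expose the race before the move runs (not something the e2e code currently does; kubectl and jq only):

```sh
# Wait (with a timeout) until every CloudStackMachineTemplate has at least one ownerReference
# before kicking off clusterctl move.
for i in $(seq 1 60); do
  unowned=$(kubectl get cloudstackmachinetemplates -n eksa-system -o json \
    | jq '[.items[] | select((.metadata.ownerReferences // []) | length == 0)] | length')
  [ "$unowned" -eq 0 ] && break
  sleep 5
done
```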
What you expected to happen: I expected the move to succeed and bring the cloudstackmachinetemplate for the machinedeployment over
How to reproduce it (as minimally and precisely as possible): This is a nondeterministic bug. It appears on any e2e test where we first create a cluster, and then proceed to either upgrade or delete it.
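For reference, the repro amounts to something like the following (exact flags depend on the eksctl-anywhere version and the cluster spec file):

```sh
# Create a CloudStack cluster, then immediately upgrade (or delete) it, and watch the worker
# cloudstackmachinetemplate's ownerReferences around the second clusterctl move.
eksctl anywhere create cluster -f cluster.yaml
eksctl anywhere upgrade cluster -f cluster.yaml   # or: eksctl anywhere delete cluster -f cluster.yaml
```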
Anything else we need to know?: I discovered a workaround: manually move the cloudstackmachinetemplate to the destination cluster, and then force the associated capi machinedeployment to reconcile by editing some field in its spec (sketched below).
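Roughly what that workaround looks like with kubectl (a sketch; `$SRC_KUBECONFIG`/`$DST_KUBECONFIG`, `<template-name>`, and `<md-name>` are placeholders):

```sh
# 1. Copy the missing template from the source cluster to the destination cluster,
#    dropping server-populated fields (and the stale ownerReferences) so the create succeeds.
kubectl --kubeconfig "$SRC_KUBECONFIG" get cloudstackmachinetemplate <template-name> -n eksa-system -o json \
  | jq 'del(.metadata.uid, .metadata.resourceVersion, .metadata.creationTimestamp,
            .metadata.managedFields, .metadata.ownerReferences, .status)' \
  | kubectl --kubeconfig "$DST_KUBECONFIG" create -f -

# 2. Force the associated MachineDeployment to reconcile by editing some field in its spec.
kubectl --kubeconfig "$DST_KUBECONFIG" edit machinedeployment <md-name> -n eksa-system
```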
Environment: