I'm not sure why you would get kubelet v1.21.6 on the BR image, but it should work anyway.
Just to make sure the image is "fresh", I removed all the older images I had, including the vSphere content library, and let eksctl anywhere create it from scratch. I suppose the template name should then say 1.21.6. But yes, clearly it should work anyway; that's just a discrepancy I've noticed.
I'll investigate what is up with that image. I have the same thing on my little cluster with BR:
% k get nodes
NAME            STATUS   ROLES                  AGE   VERSION
198.18.40.196   Ready    control-plane,master   46h   v1.21.6
198.18.81.135   Ready    <none>                 46h   v1.21.6
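For reference, one way to dump exactly what each kubelet reports, independent of the template name, is a jsonpath query (plain kubectl, shown as a sketch):
$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kubeletVersion}{"\n"}{end}'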
Any chance the cluster is just very slow? I have not looked at the support bundles yet, but from your post everything looks fine.
Actually, this environment has plenty of resources in vSphere. I'm also deploying to an all-flash vSAN datastore, and the vSphere port group is backed by a 10Gb network. The time it takes to spin up the cluster seems fine; it just gets stuck for about 5 minutes on the Moving cluster management from bootstrap to workload cluster step.
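Next time it hangs there, it may be worth tailing the core CAPI controller on the bootstrap (kind) cluster, since that step is essentially a clusterctl move. The kubeconfig path below is a placeholder for wherever eksctl anywhere wrote the bootstrap kubeconfig:
$ kubectl logs -n capi-system deployment/capi-controller-manager -f \
    --kubeconfig <bootstrap-cluster-kubeconfig>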
Also, as I mentioned, it works with 0.6.1.
The exact same spec works with Ubuntu...
$ eksctl anywhere create cluster -f eksa-mgmt-cluster-ubuntu.yaml
Performing setup and validations
✅ Connected to server
✅ Authenticated to vSphere
✅ Datacenter validated
✅ Network validated
✅ Datastore validated
✅ Folder validated
✅ Resource pool validated
✅ Datastore validated
✅ Folder validated
✅ Resource pool validated
✅ Datastore validated
✅ Folder validated
✅ Resource pool validated
✅ Control plane and Workload templates validated
✅ Vsphere Provider setup is valid
✅ Create preflight validations pass
Creating new bootstrap cluster
Installing cluster-api providers on bootstrap cluster
Provider specific setup
Creating new workload cluster
Installing networking on workload cluster
Installing storage class on workload cluster
Installing cluster-api providers on workload cluster
Installing EKS-A secrets on workload cluster
Moving cluster management from bootstrap to workload cluster
Installing EKS-A custom components (CRD and controller) on workload cluster
Creating EKS-A CRDs instances on workload cluster
Installing AddonManager and GitOps Toolkit on workload cluster
GitOps field not specified, bootstrap flux skipped
Writing cluster config file
Deleting bootstrap cluster
🎉 Cluster created!
By the way, unlike the Bottlerocket image, the ubuntu-v1.21.5-kubernetes-1-21-eks-8-amd64 image actually contains 1.21.5...
$ kubectl get nodes
NAME                                 STATUS   ROLES                  AGE     VERSION
it-eksa-mgmt-426xv                   Ready    control-plane,master   7m14s   v1.21.5-eks-1-21
it-eksa-mgmt-67smp                   Ready    control-plane,master   5m55s   v1.21.5-eks-1-21
it-eksa-mgmt-md-0-5894445f56-42ttt   Ready    <none>                 6m      v1.21.5-eks-1-21
it-eksa-mgmt-md-0-5894445f56-fwjj8   Ready    <none>                 6m1s    v1.21.5-eks-1-21
Okay, sounds like there is a problem with that Bottlerocket build. I'm not entirely sure why it is not working for you, since I was able to create a cluster with it, but it shouldn't report 1.21.6 either way.
Attempting to deploy using Bottlerocket once again. This time the Installing cluster-api providers on workload cluster step is just stuck and doesn't proceed at all. This is another behavior I mentioned in the original post; it is all very random, and Bottlerocket deployments keep failing at different stages...
I can see that during this stage, it doesn't even attempt to install the CAPI providers on the cluster.
$ eksctl anywhere create cluster -f eksa-mgmt-cluster.yaml
Performing setup and validations
✅ Connected to server
✅ Authenticated to vSphere
✅ Datacenter validated
✅ Network validated
✅ Datastore validated
✅ Folder validated
✅ Resource pool validated
✅ Datastore validated
✅ Folder validated
✅ Resource pool validated
✅ Datastore validated
✅ Folder validated
✅ Resource pool validated
✅ Control plane and Workload templates validated
✅ Vsphere Provider setup is valid
✅ Create preflight validations pass
Creating new bootstrap cluster
Installing cluster-api providers on bootstrap cluster
Provider specific setup
Creating new workload cluster
Installing networking on workload cluster
Installing storage class on workload cluster
Installing cluster-api providers on workload cluster
$ kubectl get pods -A
NAMESPACE      NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager   cert-manager-7988d4fb6c-mrx82              1/1     Running   0          7m4s
cert-manager   cert-manager-cainjector-6bc8dcdb64-8cxt5   1/1     Running   0          7m4s
cert-manager   cert-manager-webhook-68979bfb95-ptkq4      1/1     Running   0          7m4s
kube-system    cilium-2tfhn                               1/1     Running   0          7m23s
kube-system    cilium-c2l6s                               1/1     Running   0          7m23s
kube-system    cilium-m42l8                               1/1     Running   0          7m23s
kube-system    cilium-operator-86d59d5c88-gvq58           1/1     Running   0          7m23s
kube-system    cilium-operator-86d59d5c88-qm9sn           1/1     Running   0          7m23s
kube-system    cilium-tx5xd                               1/1     Running   0          7m23s
kube-system    coredns-745c7986c7-9tv4s                   1/1     Running   0          9m24s
kube-system    coredns-745c7986c7-j66f5                   1/1     Running   0          9m24s
kube-system    kube-apiserver-10.100.137.115              1/1     Running   0          9m31s
kube-system    kube-apiserver-10.100.137.117              1/1     Running   0          7m27s
kube-system    kube-controller-manager-10.100.137.115     1/1     Running   0          9m31s
kube-system    kube-controller-manager-10.100.137.117     1/1     Running   0          7m27s
kube-system    kube-proxy-7s4td                           1/1     Running   0          7m51s
kube-system    kube-proxy-drtr6                           1/1     Running   0          7m33s
kube-system    kube-proxy-j4tgp                           1/1     Running   0          9m24s
kube-system    kube-proxy-jtqws                           1/1     Running   0          7m26s
kube-system    kube-scheduler-10.100.137.115              1/1     Running   0          9m31s
kube-system    kube-scheduler-10.100.137.117              1/1     Running   0          7m27s
kube-system    kube-vip-10.100.137.115                    1/1     Running   0          9m31s
kube-system    kube-vip-10.100.137.117                    1/1     Running   0          7m27s
kube-system    vsphere-cloud-controller-manager-7p4fc     1/1     Running   4          7m26s
kube-system    vsphere-cloud-controller-manager-m8rd5     1/1     Running   1          9m25s
kube-system    vsphere-cloud-controller-manager-vqh2r     1/1     Running   2          7m51s
kube-system    vsphere-cloud-controller-manager-z28s9     1/1     Running   3          7m33s
kube-system    vsphere-csi-controller-576c9c8dc8-6xbpx    5/5     Running   0          9m25s
kube-system    vsphere-csi-node-hxfdp                     3/3     Running   0          7m51s
kube-system    vsphere-csi-node-psxx7                     3/3     Running   0          7m33s
kube-system    vsphere-csi-node-r4hjc                     3/3     Running   0          9m25s
kube-system    vsphere-csi-node-ww2fg                     3/3     Running   0          7m26s
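A quick way to confirm that clusterctl never even created the provider namespaces (the names grepped for are the standard clusterctl install targets, listed here as an assumption):
$ kubectl get ns --kubeconfig it-eksa-mgmt/it-eksa-mgmt-eks-a-cluster.kubeconfig | grep -E 'cert-manager|capi|etcdadm'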
It also doesn't time out, so I'm not sure how I can generate support bundles at this point.
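(In case it helps anyone stuck here: assuming this release's generate support-bundle subcommand accepts the cluster spec via -f, the bundles can be produced out-of-band while the create command hangs:)
$ eksctl anywhere generate support-bundle -f eksa-mgmt-cluster.yaml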
I believe this is related to some race condition. I've noticed that Installing cluster-api providers on workload cluster gets stuck only when the nodes do not become "Ready" within a certain (though very short) period of time.
Just ran the deployment again with verbose logging (-v6). It actually gets stuck waiting for cert-manager and never retries at all. Even the logging just stops, so I can clearly see it doesn't retry.
I've noticed cert-manager takes about 2-3 minutes to come up, but the process is just unaware of that...
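For what it's worth, this is roughly the wait that appears to be missing; running it by hand against the workload kubeconfig shows whether cert-manager becomes Available on its own (plain kubectl, shown as a sketch):
$ kubectl wait --for=condition=Available deployment --all -n cert-manager \
    --timeout=10m --kubeconfig it-eksa-mgmt/it-eksa-mgmt-eks-a-cluster.kubeconfig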
....
2022-02-05T17:25:32.466Z V4 Nodes ready {"total": 4}
2022-02-05T17:25:32.467Z V5 Retry execution successful {"retries": 16, "duration": "27.913587775s"}
2022-02-05T17:25:32.467Z V0 Installing networking on workload cluster
2022-02-05T17:25:34.055Z V6 Executing command {"cmd": "/usr/bin/docker exec -i eksa_1644081363153908882 kubectl apply -f - --kubeconfig it-eksa-mgmt/it-eksa-mgmt-eks-a-cluster.kubeconfig"}
2022-02-05T17:25:34.695Z V5 Retry execution successful {"retries": 1, "duration": "640.274586ms"}
2022-02-05T17:25:34.696Z V0 Installing storage class on workload cluster
2022-02-05T17:25:34.696Z V6 Executing command {"cmd": "/usr/bin/docker exec -i eksa_1644081363153908882 kubectl apply -f - --kubeconfig it-eksa-mgmt/it-eksa-mgmt-eks-a-cluster.kubeconfig"}
2022-02-05T17:25:35.008Z V5 Retry execution successful {"retries": 1, "duration": "312.261637ms"}
2022-02-05T17:25:35.008Z V0 Installing cluster-api providers on workload cluster
2022-02-05T17:25:36.465Z V6 Executing command {"cmd": "/usr/bin/docker exec -i -e VSPHERE_USERNAME=***** -e VSPHERE_PASSWORD=***** -e EXP_CLUSTER_RESOURCE_SET=true eksa_1644081363153908882 clusterctl init --core cluster-api:v1.0.2+c1f8c13 --bootstrap kubeadm:v1.0.2+bb954e1 --control-plane kubeadm:v1.0.2+2b94d76 --infrastructure vsphere:v1.0.1+2435934 --config it-eksa-mgmt/generated/clusterctl_tmp.yaml --bootstrap etcdadm-bootstrap:v1.0.0-rc4+b9ee67d --bootstrap etcdadm-controller:v1.0.0-rc4+f8caa60 --kubeconfig it-eksa-mgmt/it-eksa-mgmt-eks-a-cluster.kubeconfig"}
Fetching providers
Using Override="core-components.yaml" Provider="cluster-api" Version="v1.0.2+c1f8c13"
Using Override="bootstrap-components.yaml" Provider="bootstrap-kubeadm" Version="v1.0.2+bb954e1"
Using Override="bootstrap-components.yaml" Provider="bootstrap-etcdadm-bootstrap" Version="v1.0.0-rc4+b9ee67d"
Using Override="bootstrap-components.yaml" Provider="bootstrap-etcdadm-controller" Version="v1.0.0-rc4+f8caa60"
Using Override="control-plane-components.yaml" Provider="control-plane-kubeadm" Version="v1.0.2+2b94d76"
Using Override="infrastructure-components.yaml" Provider="infrastructure-vsphere" Version="v1.0.1+2435934"
Installing cert-manager Version="v1.5.3+66e1acc"
Using Override="cert-manager.yaml" Provider="cert-manager" Version="v1.5.3+66e1acc"
Waiting for cert-manager to be available...
That is basically the end of the log in this state:
NAMESPACE      NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager   cert-manager-7988d4fb6c-w2xkr              1/1     Running   0          3m2s
cert-manager   cert-manager-cainjector-6bc8dcdb64-dbjj8   1/1     Running   0          3m2s
cert-manager   cert-manager-webhook-68979bfb95-gqntp      1/1     Running   0          3m2s
kube-system    cilium-4krpg                               1/1     Running   0          3m8s
kube-system    cilium-cmlxz                               1/1     Running   0          3m8s
kube-system    cilium-f48ml                               1/1     Running   0          3m8s
kube-system    cilium-operator-86d59d5c88-2jsnq           1/1     Running   0          3m8s
kube-system    cilium-operator-86d59d5c88-zzb8m           1/1     Running   0          3m8s
kube-system    cilium-sl8xk                               1/1     Running   0          3m8s
kube-system    coredns-745c7986c7-cqrqf                   1/1     Running   0          5m16s
kube-system    coredns-745c7986c7-l976j                   1/1     Running   0          5m16s
kube-system    kube-apiserver-10.100.137.124              1/1     Running   0          5m23s
kube-system    kube-apiserver-10.100.137.64               1/1     Running   0          3m10s
kube-system    kube-controller-manager-10.100.137.124     1/1     Running   0          5m23s
kube-system    kube-controller-manager-10.100.137.64      1/1     Running   0          3m10s
kube-system    kube-proxy-5qc2p                           1/1     Running   0          3m44s
kube-system    kube-proxy-ds8fm                           1/1     Running   0          3m10s
kube-system    kube-proxy-k5nh6                           1/1     Running   0          5m16s
kube-system    kube-proxy-lcv58                           1/1     Running   0          3m16s
kube-system    kube-scheduler-10.100.137.124              1/1     Running   0          5m23s
kube-system    kube-scheduler-10.100.137.64               1/1     Running   0          3m10s
kube-system    kube-vip-10.100.137.124                    1/1     Running   0          5m23s
kube-system    kube-vip-10.100.137.64                     1/1     Running   0          3m10s
kube-system    vsphere-cloud-controller-manager-r2mrl     1/1     Running   4          3m10s
kube-system    vsphere-cloud-controller-manager-vcm8d     1/1     Running   2          3m44s
kube-system    vsphere-cloud-controller-manager-wlvkk     1/1     Running   2          5m12s
kube-system    vsphere-cloud-controller-manager-wn2rr     1/1     Running   2          3m16s
kube-system    vsphere-csi-controller-576c9c8dc8-7nmhd    5/5     Running   0          5m12s
kube-system    vsphere-csi-node-6spcz                     3/3     Running   0          3m10s
kube-system    vsphere-csi-node-gd8h5                     3/3     Running   0          5m12s
kube-system    vsphere-csi-node-j5rj6                     3/3     Running   0          3m44s
kube-system    vsphere-csi-node-k4lgf                     3/3     Running   0          3m16s
That is why I think it is a race condition. Sometimes it gets stuck at Installing cluster-api providers on workload cluster and sometimes at Moving cluster management from bootstrap to workload cluster.
I cannot explain why this only occurs using Bottlerocket.
It eventually timed out after about an hour, so I've uploaded log bundles for this issue as well:
https://ts-k8s-pub.s3.eu-central-1.amazonaws.com/aws-eks-anywhere/logs/bootstrap-cluster-support-bundle-2022-02-05T17_55_58.tar.gz
https://ts-k8s-pub.s3.eu-central-1.amazonaws.com/aws-eks-anywhere/logs/mgmt-cluster-support-bundle-2022-02-05T17_58_03.tar.gz
Investigating this issue on our side. Thank you for uploading the support bundles.
Works using v0.7.1.
What happened:
In the same environment I mentioned in #1155, the Moving cluster management from bootstrap to workload cluster step fails in 0.7.0 with the following message:
kubectl get nodes returns:
kubectl get pods -A returns:
In one of the attempts I didn't even get that far, and the Installing cluster-api providers on workload cluster step was completely stuck. In another attempt, the same Moving cluster management from bootstrap to workload cluster step failed as well, but only after terminating the capi-kubeadm-bootstrap-controller-manager, capi-kubeadm-control-plane-controller-manager, etcdadm-bootstrap-provider-controller-manager, and etcdadm-controller-controller-manager pods. This time, that didn't happen for some reason... I cannot explain the different behaviors.
The cluster creation cannot complete due to the above errors.
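For reference, pod terminations like those described above can be done with plain kubectl deletes along these lines; the namespaces and label shown are the standard clusterctl install defaults, assumed here, so adjust if your install differs:
$ kubectl delete pod -n capi-kubeadm-bootstrap-system -l control-plane=controller-manager
$ kubectl delete pod -n capi-kubeadm-control-plane-system -l control-plane=controller-manager
$ kubectl delete pod -n etcdadm-bootstrap-provider-system -l control-plane=controller-manager
$ kubectl delete pod -n etcdadm-controller-system -l control-plane=controller-manager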
How to reproduce it (as minimally and precisely as possible):
Simply create a standard cluster.
My cluster spec:
Anything else we need to know?:
Environment:
The same spec works using the 0.6.1 binary.
I have uploaded both the bootstrap-cluster and the mgmt-cluster support bundles to my S3 bucket:
https://ts-k8s-pub.s3.eu-central-1.amazonaws.com/aws-eks-anywhere/logs/bootstrap-cluster-support-bundle-2022-02-05T13_58_29.tar.gz
https://ts-k8s-pub.s3.eu-central-1.amazonaws.com/aws-eks-anywhere/logs/mgmt-cluster-support-bundle-2022-02-05T14_00_24.tar.gz
P.S. - something strange I've noticed is that kubectl get nodes displays Kubernetes version v1.21.6, while the template name is bottlerocket-v1.21.5-kubernetes-1-21-eks-8-amd64.
Assistance will be appreciated.