aws / eks-anywhere

Run Amazon EKS on your own infrastructure 🚀
https://anywhere.eks.amazonaws.com
Apache License 2.0

Cluster create fails on kind stage #5475

Open AndreasDavour opened 1 year ago

AndreasDavour commented 1 year ago

What happened:

I tried to create a new cluster and, running with level 9 verbosity, saw that it appears to fail when it tries to run kind. What's going on? I ran the exact same procedure a week ago without this error!

eksctl anywhere create cluster --verbosity=9 --bundles-override eks-anywhere-downloads/bundle-release.yaml -f mgmt_cluster.yaml --force-cleanup 2>&1 | tee /tmp/eksa_create.log

2023-03-31T11:04:18.697+0200 V0 ✅ svc-eksa@vsphere.local user vSphere privileges validated
2023-03-31T11:04:18.697+0200 V0 ✅ Vsphere Provider setup is valid
2023-03-31T11:04:18.760+0200 V0 ✅ Validate certificate for registry mirror
2023-03-31T11:04:18.760+0200 V0 ✅ Validate authentication for git provider
2023-03-31T11:04:18.760+0200 V0 ✅ Create preflight validations pass
2023-03-31T11:04:18.760+0200 V4 Task finished {"task_name": "setup-validate", "duration": "6.142577339s"}
2023-03-31T11:04:18.760+0200 V4 ----------------------------------
2023-03-31T11:04:18.760+0200 V4 Task start {"task_name": "bootstrap-cluster-init"}
2023-03-31T11:04:18.760+0200 V0 Creating new bootstrap cluster
2023-03-31T11:04:18.761+0200 V4 Creating kind cluster {"name": "mgmt-eks-a-cluster", "kubeconfig": "mgmt/generated/mgmt.kind.kubeconfig"}
2023-03-31T11:04:18.761+0200 V6 Executing command {"cmd": "/usr/bin/docker exec -i eksa_1680253451933621095 kind create cluster --name mgmt-eks-a-cluster --kubeconfig mgmt/generated/mgmt.kind.kubeconfig --image harbor01.dc.tnse.se/eks-anywhere/kubernetes-sigs/kind/node:v1.23.16-eks-d-1-23-16-eks-a-29 --config mgmt/generated/kind_tmp.yaml"}

2023-03-31T11:08:25.119+0200 V9 docker {"stderr": "Creating cluster \"mgmt-eks-a-cluster\" ...\n • Ensuring node image (harbor01.dc.tnse.se/eks-anywhere/kubernetes-sigs/kind/node:v1.23.16-eks-d-1-23-16-eks-a-29) 🖼 ...\n ✓ Ensuring node image (harbor01.dc.tnse.se/eks-anywhere/kubernetes-sigs/kind/node:v1.23.16-eks-d-1-23-16-eks-a-29) 🖼\n • Preparing nodes 📦 ...\n ✓ Preparing nodes 📦 \n • Writing configuration 📜 ...\n ✓ Writing configuration 📜\n • Starting control-plane 🕹️ ...\n ✗ Starting control-plane 🕹️\nERROR: failed to create cluster: failed to init node with kubeadm: command \"docker exec --privileged mgmt-eks-a-cluster-control-plane kubeadm init --skip-phases=preflight --config=/kind/kubeadm.conf --skip-token-print --v=6\" failed with error: exit status 1\n\nCommand Output: I0331 09:04:22.169296 305 initconfiguration.go:248] loading configuration from \"/kind/kubeadm.conf\"\n[config] WARNING: Ignored YAML document with GroupVersionKind kubeadm.k8s.io/v1beta3, Kind=JoinConfiguration\nW0331 09:04:22.171261 305 strict.go:55] error unmarshaling configuration schema.GroupVersionKind{Group:\"kubeadm.k8s.io\", Version:\"v1beta3\", Kind:\"ClusterConfiguration\"}: error unmarshaling JSON: while decoding JSON: json: unknown field \"type\"\n[init] Using Kubernetes version: v1.23.16-eks-1-23-16\n[certs] Using certificateDir folder \"/etc/kubernetes/pki\"\nI0331 09:04:22.177230 305 common.go:126] WARNING: tolerating control plane version v1.23.16-eks-1-23-16, assuming that k8s version 1.22.0 is not released yet\nI0331 09:04:22.177418 305 certs.go:112] creating a new certificate authority for ca\n[certs] Generating \"ca\" certificate and key\nI0331 09:04:22.299453 305 certs.go:522] validating certificate period for ca certificate\n[certs] Generating \"apiserver\" certificate and key\n[certs] apiserver serving cert is signed for DNS names [kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local localhost mgmt-eks-a-cluster-control-plane] and IPs [10.96.0.1 172.18.0.2 127.0.0.1]\n[certs] Generating \"apiserver-kubelet-client\" certificate and key\nI0331 09:04:22.756930 305 certs.go:112] creating a new certificate authority for front-proxy-ca\n[certs] Generating \"front-proxy-ca\" certificate and key\nI0331 09:04:22.876813 305 certs.go:522] [...]
2023-03-31T11:08:25.122+0200 V4 Task finished {"task_name": "bootstrap-cluster-init", "duration": "4m6.362692073s"}
2023-03-31T11:08:25.123+0200 V4 ----------------------------------
2023-03-31T11:08:25.123+0200 V4 Saving checkpoint {"file": "mgmt-checkpoint.yaml"}
2023-03-31T11:08:25.123+0200 V4 Tasks completed {"duration": "4m12.506318354s"}
2023-03-31T11:08:25.123+0200 V3 Logging out from current govc session
2023-03-31T11:08:25.124+0200 V6 Executing command {"cmd": "/usr/bin/docker exec -i -e GOVC_USERNAME= -e GOVC_PASSWORD= -e GOVC_URL=nwlabvc01-102935-nwlab.se.telenor.net -e GOVC_INSECURE=true -e GOVC_DATACENTER=Rasunda_lab eksa_1680253451933621095 govc session.logout"}
2023-03-31T11:08:25.279+0200 V6 Executing command {"cmd": "/usr/bin/docker exec -i -e GOVC_DATACENTER=Rasunda_lab -e GOVC_USERNAME= -e GOVC_PASSWORD= -e GOVC_URL=nwlabvc01-102935-nwlab.se.telenor.net -e GOVC_INSECURE=true eksa_1680253451933621095 govc session.logout -k"}
2023-03-31T11:08:25.486+0200 V3 Cleaning up long running container {"name": "eksa_1680253451933621095"}
2023-03-31T11:08:25.486+0200 V6 Executing command {"cmd": "/usr/bin/docker rm -f -v eksa_1680253451933621095"}
Error: creating bootstrap cluster: executing create cluster: Creating cluster "mgmt-eks-a-cluster" ...
 • Ensuring node image (harbor01.dc.tnse.se/eks-anywhere/kubernetes-sigs/kind/node:v1.23.16-eks-d-1-23-16-eks-a-29) 🖼 ...
 ✓ Ensuring node image (harbor01.dc.tnse.se/eks-anywhere/kubernetes-sigs/kind/node:v1.23.16-eks-d-1-23-16-eks-a-29) 🖼
 • Preparing nodes 📦 ...
 ✓ Preparing nodes 📦
 • Writing configuration 📜 ...
 ✓ Writing configuration 📜
 • Starting control-plane 🕹️ ...
 ✗ Starting control-plane 🕹️
ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged mgmt-eks-a-cluster-control-plane kubeadm init --skip-phases=preflight --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 1

Command Output:
I0331 09:04:22.169296 305 initconfiguration.go:248] loading configuration from "/kind/kubeadm.conf"
[config] WARNING: Ignored YAML document with GroupVersionKind kubeadm.k8s.io/v1beta3, Kind=JoinConfiguration
W0331 09:04:22.171261 305 strict.go:55] error unmarshaling configuration schema.GroupVersionKind{Group:"kubeadm.k8s.io", Version:"v1beta3", Kind:"ClusterConfiguration"}: error unmarshaling JSON: while decoding JSON: json: unknown field "type"
[init] Using Kubernetes version: v1.23.16-eks-1-23-16
[certs] Using certificateDir folder "/etc/kubernetes/pki"
I0331 09:04:22.177230 305 common.go:126] WARNING: tolerating control plane version v1.23.16-eks-1-23-16, assuming that k8s version 1.22.0 is not released yet
I0331 09:04:22.177418 305 certs.go:112] creating a new certificate authority for ca
[certs] Generating "ca" certificate and key
I0331 09:04:22.299453 305 certs.go:522] validating certificate period for ca certificate
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local localhost mgmt-eks-a-cluster-control-plane] and IPs [10.96.0.1 172.18.0.2 127.0.0.1]
[certs] Generating "apiserver-kubelet-client" certificate and key
I0331 09:04:22.756930 305 certs.go:112] creating a new certificate authority for front-proxy-ca
[certs] Generating "front-proxy-ca" certificate and key
I0331 09:04:22.876813 305 certs.go:522] validating certificate period for front-proxy-ca certificate
[certs] Generating "front-proxy-client" certificate and key
I0331 09:04:22.978872 305 certs.go:112] creating a new certificate authority for etcd-ca
[certs] Generating "etcd/ca" certificate and key
I0331 09:04:23.125195 305 certs.go:522] validating certificate period for etcd/ca certificate
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [localhost mgmt-eks-a-cluster-control-plane] and IPs [172.18.0.2 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key

What you expected to happen: The cluster "mgmt" to be created.

How to reproduce it (as minimally and precisely as possible):

/usr/bin/docker exec -i eksa_1680253451933621095 kind create cluster --name mgmt-eks-a-cluster --kubeconfig mgmt/generated/mgmt.kind.kubeconfig --image harbor01.dc.tnse.se/eks-anywhere/kubernetes-sigs/kind/node:v1.23.16-eks-d-1-23-16-eks-a-29 --config mgmt/generated/kind_tmp.yaml

Creating cluster "mgmt-eks-a-cluster" ...
 • Ensuring node image (harbor01.dc.tnse.se/eks-anywhere/kubernetes-sigs/kind/node:v1.23.16-eks-d-1-23-16-eks-a-29) 🖼 ...
 ✓ Ensuring node image (harbor01.dc.tnse.se/eks-anywhere/kubernetes-sigs/kind/node:v1.23.16-eks-d-1-23-16-eks-a-29) 🖼
 • Preparing nodes 📦 ...
 ✓ Preparing nodes 📦
 • Writing configuration 📜 ...
 ✓ Writing configuration 📜
 • Starting control-plane 🕹️ ...
 ✗ Starting control-plane 🕹️
ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged mgmt-eks-a-cluster-control-plane kubeadm init --skip-phases=preflight --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 1

Anything else we need to know?:

Environment:
eksctl anywhere version -> v0.14.3
eksctl version -> 0.132.0

jiayiwang7 commented 1 year ago

Try restarting your Docker daemon and running docker system prune to clean up unused data. Remove any unused kind clusters and try the create command again.
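For reference, something along these lines should do it. The cluster name is taken from your log above, and this assumes kind is available on the admin machine itself; if it only exists inside the EKS-A tools container, run the kind commands through docker exec into that container instead:

sudo systemctl restart docker                     # restart the daemon (assumes a systemd-based host)
docker system prune                               # remove unused containers, networks and images
kind get clusters                                 # list any leftover kind clusters
kind delete cluster --name mgmt-eks-a-cluster     # delete the stale bootstrap cluster if it shows up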

AndreasDavour commented 1 year ago

Sadly, not much of a change.

The error message on maximum verbosity is too long to paste in full (I tried), but here is the relevant part:

I0403 07:28:06.788415 306 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-scheduler" to "/etc/kubernetes/manifests/kube-scheduler.yaml"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
I0403 07:28:06.789004 306 local.go:65] [etcd] wrote Static Pod manifest for a local etcd member to "/etc/kubernetes/manifests/etcd.yaml"
I0403 07:28:06.789021 306 waitcontrolplane.go:91] [wait-control-plane] Waiting for the API server to be healthy
I0403 07:28:06.789468 306 loader.go:374] Config loaded from file: /etc/kubernetes/admin.conf
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s

To me it looks like the kind container is running, as it has created those static manifests. Why they don't start now, when they did just fine before, is something I really don't understand.

tatlat commented 1 year ago

Since you are doing a bundle override, could you confirm that the bundle supports the release version you are using? If possible, maybe you could try to create the cluster with v0.15.0. Since the create fails on the kind stage, it might also be helpful to try kind export logs to get more information.
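For the bundle check, one quick sanity check could be to compare the CLI version fields in the override bundle against the CLI you have installed, roughly like this (the bundle path is taken from your create command; any YAML-aware tool works just as well as grep):

eksctl anywhere version
grep -E 'cliMinVersion|cliMaxVersion' eks-anywhere-downloads/bundle-release.yaml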

AndreasDavour commented 1 year ago

How do you confirm that the bundle supports the release version you are using? In the bundle-release.yaml it says cliMaxVersion: v0.14.3 and cliMinVersion: v0.14.3 which looks like it matches what I'm using, the way I read it?

"kind export logs" returns: ERROR: unknown cluster "kind"

...which looks totally nonsensical to me...

As kind never seems to start, is there anything to get logs from?

I could in theory try to upgrade to v0.15.0, but then I'd have to download a new bundle, with new possibilities for mismatches, so I'd rather avoid that. It would also feel better to know what the error means, in case it shows up again after I bump the versions.

tatlat commented 1 year ago

Sorry for the late reply. You might need to add --name mgmt-eks-a-cluster to "kind export logs". If that still doesn't work, you might need to manually run the kind create cluster command but with the retain flag: "/usr/bin/docker exec -i eksa_1680253451933621095 kind create cluster --retain --name mgmt-eks-a-cluster --kubeconfig mgmt/generated/mgmt.kind.kubeconfig --image harbor01.dc.tnse.se/eks-anywhere/kubernetes-sigs/kind/node:v1.23.16-eks-d-1-23-16-eks-a-29 --config mgmt/generated/kind_tmp.yaml"
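With the node retained, you could then collect the logs and look at the kubelet directly, along these lines. This is only a sketch; the container and cluster names are taken from your earlier log, so adjust them to whatever a fresh run uses:

/usr/bin/docker exec -i eksa_1680253451933621095 kind export logs /tmp/kind-logs --name mgmt-eks-a-cluster    # logs land inside the tools container; docker cp them out if needed
docker exec -it mgmt-eks-a-cluster-control-plane journalctl -u kubelet --no-pager | tail -n 100               # kubelet errors on the retained control-plane node
docker exec -it mgmt-eks-a-cluster-control-plane crictl ps -a                                                 # see which static pods actually started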

AndreasDavour commented 1 year ago

I gave up on this project as it seems way too brittle, with things failing left, right and centre. Thanks a lot for the feedback, though.

If you want to purge old issues, I guess you could close this one.