aws / eks-anywhere

Run Amazon EKS on your own infrastructure 🚀
https://anywhere.eks.amazonaws.com
Apache License 2.0

"Installing resources on management cluster" step fails during baremetal cluster provisioning #3140

Closed. etungsten closed this issue 2 years ago

etungsten commented 2 years ago

What happened: K8s 1.23 cluster creation failed on the "Installing resources on management cluster" step, after the workload cluster had been created.

Warning: The recommended number of control plane nodes is 3 or 5
Warning: The recommended number of control plane nodes is 3 or 5
Performing setup and validations
Private key saved to br-123/eks-a-id_rsa. Use 'ssh -i br-123/eks-a-id_rsa <username>@<Node-IP-Address>' to login to your cluster node
✅ Tinkerbell Provider setup is valid
✅ Validate certificate for registry mirror
✅ Validate authentication for git provider
✅ Create preflight validations pass
Creating new bootstrap cluster
Provider specific pre-capi-install-setup on bootstrap cluster
Installing cluster-api providers on bootstrap cluster
Provider specific post-setup
Creating new workload cluster
Installing networking on workload cluster
Creating EKS-A namespace
Installing cluster-api providers on workload cluster
Installing EKS-A secrets on workload cluster
Installing resources on management cluster
collecting cluster diagnostics
collecting management cluster diagnostics
⏳ Collecting support bundle from cluster, this can take a while        {"cluster": "bootstrap-cluster", "bundle": "br-123/generated/bootstrap-cluster-2022-08-24T00:33:56Z-bundle.yaml", "since": 1661290436066552625, "kubeconfig": "br-123/generated/br-123.kind.kubeconfig"}
Support bundle archive created  {"path": "support-bundle-2022-08-24T00_33_56.tar.gz"}
Analyzing support bundle        {"bundle": "br-123/generated/bootstrap-cluster-2022-08-24T00:33:56Z-bundle.yaml", "archive": "support-bundle-2022-08-24T00_33_56.tar.gz"}
Analysis output generated       {"path": "br-123/generated/bootstrap-cluster-2022-08-24T00:34:52Z-analysis.yaml"}
collecting workload cluster diagnostics
⏳ Collecting support bundle from cluster, this can take a while        {"cluster": "br-123", "bundle": "br-123/generated/br-123-2022-08-24T00:34:58Z-bundle.yaml", "since": 1661290498045272916, "kubeconfig": "br-123/br-123-eks-a-cluster.kubeconfig"}
Support bundle archive created  {"path": "support-bundle-2022-08-24T00_34_59.tar.gz"}
Analyzing support bundle        {"bundle": "br-123/generated/br-123-2022-08-24T00:34:58Z-bundle.yaml", "archive": "support-bundle-2022-08-24T00_34_59.tar.gz"}
Analysis output generated       {"path": "br-123/generated/br-123-2022-08-24T00:36:02Z-analysis.yaml"}
Error: installing stack on workload cluster: installing Tinkerbell helm chart: Error: INSTALLATION FAILED: Deployment.apps "envoy" is invalid: spec.template.spec.containers[0].image: Required value
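
In other words, the Tinkerbell stack chart appears to render the envoy Deployment with an empty container image, which the API server rejects. As a sketch of how to dig further, re-running the create command with higher verbosity (same flags as the repro below, plus the CLI's standard -v flag) should show more detail about the helm install step:

$ eksctl anywhere create cluster --hardware-csv "my-hardware.csv" -f "${CLUSTER_CONFIG}" --force-cleanup --skip-ip-check -v 9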

What you expected to happen: K8s 1.23 cluster creation to finish and succeed.

How to reproduce it (as minimally and precisely as possible): Ran

eksctl anywhere create cluster \
  --hardware-csv "my-hardware.csv" \
  -f "${CLUSTER_CONFIG}" \
  --force-cleanup \
  --skip-ip-check

where the cluster config looks like the following, with OS_IMAGE_URL set to the Bottlerocket v1.9.1 metal-k8s-1.23 image:

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: "br-123"
spec:
  clusterNetwork:
    cniConfig:
      cilium: {}
    pods:
      cidrBlocks:
        - 192.168.0.0/16
    services:
      cidrBlocks:
        - 10.96.0.0/12
  controlPlaneConfiguration:
    count: 1
    endpoint:
      host: "10.80.50.123"
    machineGroupRef:
      kind: TinkerbellMachineConfig
      name: "br-123-cp"
  datacenterRef:
    kind: TinkerbellDatacenterConfig
    name: "br-123"
  kubernetesVersion: "1.23"
  managementCluster:
    name: "br-123"
  workerNodeGroupConfigurations:
    - count: 2
      machineGroupRef:
        kind: TinkerbellMachineConfig
        name: "br-123"
      name: md-0

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellDatacenterConfig
metadata:
  name: "br-123"
spec:
  tinkerbellIP: "10.80.50.223"
  osImageURL: "${OS_IMAGE_URL}"

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellMachineConfig
metadata:
  name: "br-123-cp"
spec:
  hardwareSelector:
    type: "cp"
  osFamily: bottlerocket
  templateRef: {}
  users:
    - name: ec2-user

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellMachineConfig
metadata:
  name: "br-123"
spec:
  hardwareSelector:
    type: "worker"
  osFamily: bottlerocket
  templateRef:
    kind: TinkerbellTemplateConfig
    name: br-hp-template
  users:
    - name: ec2-user

---
# left out br-hp-template
....
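
For completeness, the variables referenced above were set before running the create command; a minimal sketch, assuming envsubst is used for the substitution (the URL, file names, and paths below are placeholders, not the actual values):

$ export OS_IMAGE_URL="https://my-image-server/bottlerocket-metal-k8s-1.23-x86_64.img.gz"   # placeholder URL
$ export CLUSTER_CONFIG="br-123-cluster.yaml"                                               # placeholder path
$ envsubst < cluster-config.template.yaml > "${CLUSTER_CONFIG}"                             # substitution method assumed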

Anything else we need to know?: The provisioned baremetal cluster seems to be functional:

$ kubectl --kubeconfig br-123-eks-a-cluster.kubeconfig get nodes -o wide
NAME          STATUS   ROLES                  AGE   VERSION               INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                 KERNEL-VERSION   CONTAINER-RUNTIME
10.80.50.26   Ready    control-plane,master   27m   v1.23.7-eks-7709a84   10.80.50.26   <none>        Bottlerocket OS 1.9.1 (metal-k8s-1.23)   5.10.130         containerd://1.6.6+bottlerocket
10.80.50.30   Ready    <none>                 17m   v1.23.7-eks-7709a84   10.80.50.30   <none>        Bottlerocket OS 1.9.1 (metal-k8s-1.23)   5.10.130         containerd://1.6.6+bottlerocket
10.80.50.32   Ready    <none>                 17m   v1.23.7-eks-7709a84   10.80.50.32   <none>        Bottlerocket OS 1.9.1 (metal-k8s-1.23)   5.10.130         containerd://1.6.6+bottlerocket
$ kubectl --kubeconfig br-123-eks-a-cluster.kubeconfig get pods -o wide -A
NAMESPACE                           NAME                                                             READY   STATUS    RESTARTS   AGE   IP              NODE          NOMINATED NODE   READINESS GATES
capi-kubeadm-bootstrap-system       capi-kubeadm-bootstrap-controller-manager-5f4599b5c9-zd65r       1/1     Running   0          17m   192.168.2.30    10.80.50.30   <none>           <none>
capi-kubeadm-control-plane-system   capi-kubeadm-control-plane-controller-manager-574cbcd9d7-b8pg2   1/1     Running   0          17m   192.168.0.154   10.80.50.26   <none>           <none>
capi-system                         capi-controller-manager-7b7f574f86-pz8v4                         1/1     Running   0          17m   192.168.2.66    10.80.50.30   <none>           <none>
capt-system                         capt-controller-manager-5dcfdb8dd5-bbgxh                         1/1     Running   0          17m   192.168.0.223   10.80.50.26   <none>           <none>
cert-manager                        cert-manager-7568b959dc-8l7w6                                    1/1     Running   0          17m   192.168.1.123   10.80.50.32   <none>           <none>
cert-manager                        cert-manager-cainjector-9c8db6d5b-p7wfj                          1/1     Running   0          17m   192.168.1.251   10.80.50.32   <none>           <none>
cert-manager                        cert-manager-webhook-559f9f4f7-w9rpb                             1/1     Running   0          17m   192.168.1.50    10.80.50.32   <none>           <none>
eksa-system                         boots-5f95d74cb5-gm8l9                                           1/1     Running   0          16m   10.80.50.30     10.80.50.30   <none>           <none>
eksa-system                         hegel-7dc744cbb8-jkfrs                                           1/1     Running   0          16m   192.168.2.31    10.80.50.30   <none>           <none>
eksa-system                         kube-vip-lcgqk                                                   1/1     Running   0          16m   10.80.50.30     10.80.50.30   <none>           <none>
eksa-system                         kube-vip-q2m7b                                                   1/1     Running   0          16m   10.80.50.32     10.80.50.32   <none>           <none>
eksa-system                         rufio-controller-manager-8df4974bb-v284n                         1/1     Running   0          16m   192.168.2.15    10.80.50.30   <none>           <none>
eksa-system                         tink-controller-manager-568f74ff8c-dpn4v                         1/1     Running   0          16m   192.168.2.107   10.80.50.30   <none>           <none>
eksa-system                         tink-server-68d945bcb-vmcx2                                      1/1     Running   0          16m   192.168.2.185   10.80.50.30   <none>           <none>
etcdadm-bootstrap-provider-system   etcdadm-bootstrap-provider-controller-manager-6bf887d794-8pmp4   1/1     Running   0          17m   192.168.2.52    10.80.50.30   <none>           <none>
etcdadm-controller-system           etcdadm-controller-controller-manager-8f76cd998-cg9bj            1/1     Running   0          17m   192.168.2.104   10.80.50.30   <none>           <none>
kube-system                         cilium-fd8vv                                                     1/1     Running   0          17m   10.80.50.26     10.80.50.26   <none>           <none>
kube-system                         cilium-gzqbl                                                     1/1     Running   0          17m   10.80.50.32     10.80.50.32   <none>           <none>
kube-system                         cilium-n28h7                                                     1/1     Running   0          17m   10.80.50.30     10.80.50.30   <none>           <none>
kube-system                         cilium-operator-5799bc594c-5k2m2                                 1/1     Running   0          17m   10.80.50.32     10.80.50.32   <none>           <none>
kube-system                         cilium-operator-5799bc594c-hj5k4                                 1/1     Running   0          17m   10.80.50.30     10.80.50.30   <none>           <none>
kube-system                         coredns-65cfcb59c8-pvbp9                                         1/1     Running   0          27m   192.168.1.160   10.80.50.32   <none>           <none>
kube-system                         coredns-65cfcb59c8-s9xlt                                         1/1     Running   0          27m   192.168.1.219   10.80.50.32   <none>           <none>
kube-system                         etcd-10.80.50.26                                                 1/1     Running   0          27m   10.80.50.26     10.80.50.26   <none>           <none>
kube-system                         kube-apiserver-10.80.50.26                                       1/1     Running   0          27m   10.80.50.26     10.80.50.26   <none>           <none>
kube-system                         kube-controller-manager-10.80.50.26                              1/1     Running   0          27m   10.80.50.26     10.80.50.26   <none>           <none>
kube-system                         kube-proxy-h9b5p                                                 1/1     Running   0          17m   10.80.50.32     10.80.50.32   <none>           <none>
kube-system                         kube-proxy-l49t6                                                 1/1     Running   0          27m   10.80.50.26     10.80.50.26   <none>           <none>
kube-system                         kube-proxy-m9snb                                                 1/1     Running   0          17m   10.80.50.30     10.80.50.30   <none>           <none>
kube-system                         kube-scheduler-10.80.50.26                                       1/1     Running   0          27m   10.80.50.26     10.80.50.26   <none>           <none>
kube-system                         kube-vip-10.80.50.26                                             1/1     Running   0          27m   10.80.50.26     10.80.50.26   <none>           <none>
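
Also worth noting: there is no envoy pod in the eksa-system listing above, consistent with the failed chart install. A quick way to confirm on the workload cluster (the namespace is an assumption, based on where the other Tinkerbell stack components run):

$ kubectl --kubeconfig br-123-eks-a-cluster.kubeconfig -n eksa-system get deployment envoy
$ helm --kubeconfig br-123-eks-a-cluster.kubeconfig list -A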

Environment:

abhinavmpandey08 commented 2 years ago

Hi @etungsten! Thanks for reporting this issue. We are aware of it and have fixed it in the latest v0.11.1 release. You can download the release from https://github.com/aws/eks-anywhere/releases/tag/v0.11.1 and try the create command again. I'm going to close this issue; feel free to reopen if you run into any other problems.
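
For reference, upgrading the CLI and retrying looks roughly like this; the asset name below is an assumption, so take the exact download steps from the release page:

$ curl -LO https://github.com/aws/eks-anywhere/releases/download/v0.11.1/eksctl-anywhere-v0.11.1-linux-amd64.tar.gz   # asset name assumed; check the release page
$ tar xzf eksctl-anywhere-v0.11.1-linux-amd64.tar.gz
$ sudo install -m 0755 ./eksctl-anywhere /usr/local/bin/eksctl-anywhere
$ eksctl anywhere version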