I'm not sure why you would get kubelet v1.21.6 on the BR image, but it should work anyway.
Just to make sure the image is "fresh", I removed all the older images I had, including the vSphere content library, and let eksctl anywhere create it from scratch. I suppose the template name should then say 1.21.6. But yes, clearly it should work anyway; that's just a discrepancy I've noticed.
I'll investigate what is up with that image. I have the same thing on my little cluster with BR:
% k get nodes
NAME            STATUS   ROLES                  AGE   VERSION
198.18.40.196   Ready    control-plane,master   46h   v1.21.6
198.18.81.135   Ready    <none>                 46h   v1.21.6
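For reference, one way to dump exactly what each kubelet reports, independent of the template name, is a jsonpath query (plain kubectl, shown as a sketch):
$ kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kubeletVersion}{"\n"}{end}'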
Any chance the cluster is just very slow? I have not looked at the support bundles yet, but from your post everything looks fine.
Actually, this environment has plenty of resources in vSphere. I'm also deploying to an all-flash vSAN datastore, and the vSphere port group is backed by a 10Gb network. The time it takes to spin up the cluster seems fine; it just gets stuck for about 5 minutes on the Moving cluster management from bootstrap to workload cluster step.
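Next time it hangs there, it may be worth tailing the core CAPI controller on the bootstrap (kind) cluster, since that step is essentially a clusterctl move. The kubeconfig path below is a placeholder for wherever eksctl anywhere wrote the bootstrap kubeconfig:
$ kubectl logs -n capi-system deployment/capi-controller-manager -f \
    --kubeconfig <bootstrap-cluster-kubeconfig>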
Also, as I mentioned, it works with 0.6.1.
The exact same spec works with Ubuntu...
$ eksctl anywhere create cluster -f eksa-mgmt-cluster-ubuntu.yaml
Performing setup and validations
✅ Connected to server
✅ Authenticated to vSphere
✅ Datacenter validated
✅ Network validated
✅ Datastore validated
✅ Folder validated
✅ Resource pool validated
✅ Datastore validated
✅ Folder validated
✅ Resource pool validated
✅ Datastore validated
✅ Folder validated
✅ Resource pool validated
✅ Control plane and Workload templates validated
✅ Vsphere Provider setup is valid
✅ Create preflight validations pass
Creating new bootstrap cluster
Installing cluster-api providers on bootstrap cluster
Provider specific setup
Creating new workload cluster
Installing networking on workload cluster
Installing storage class on workload cluster
Installing cluster-api providers on workload cluster
Installing EKS-A secrets on workload cluster
Moving cluster management from bootstrap to workload cluster
Installing EKS-A custom components (CRD and controller) on workload cluster
Creating EKS-A CRDs instances on workload cluster
Installing AddonManager and GitOps Toolkit on workload cluster
GitOps field not specified, bootstrap flux skipped
Writing cluster config file
Deleting bootstrap cluster
🎉 Cluster created!
By the way, unlike the Bottlerocket image, the ubuntu-v1.21.5-kubernetes-1-21-eks-8-amd64 image actually contains 1.21.5...
$ kubectl get nodes
NAME                                 STATUS   ROLES                  AGE     VERSION
it-eksa-mgmt-426xv                   Ready    control-plane,master   7m14s   v1.21.5-eks-1-21
it-eksa-mgmt-67smp                   Ready    control-plane,master   5m55s   v1.21.5-eks-1-21
it-eksa-mgmt-md-0-5894445f56-42ttt   Ready    <none>                 6m      v1.21.5-eks-1-21
it-eksa-mgmt-md-0-5894445f56-fwjj8   Ready    <none>                 6m1s    v1.21.5-eks-1-21
Okay, sounds like there is a problem with that Bottlerocket build. I'm not entirely sure why it is not working for you, since I was able to create a cluster with it, but it shouldn't report 1.21.6 either way.
Attempting to deploy using Bottlerocket once again. This time the Installing cluster-api providers on workload cluster step is just stuck and doesn't proceed at all. This is another behavior I mentioned in the original post; it is all very random, and Bottlerocket deployments keep failing at different stages...
I can see that during this stage, it doesn't even attempt to install the CAPI providers on the cluster.
$ eksctl anywhere create cluster -f eksa-mgmt-cluster.yaml
Performing setup and validations
✅ Connected to server
✅ Authenticated to vSphere
✅ Datacenter validated
✅ Network validated
✅ Datastore validated
✅ Folder validated
✅ Resource pool validated
✅ Datastore validated
✅ Folder validated
✅ Resource pool validated
✅ Datastore validated
✅ Folder validated
✅ Resource pool validated
✅ Control plane and Workload templates validated
✅ Vsphere Provider setup is valid
✅ Create preflight validations pass
Creating new bootstrap cluster
Installing cluster-api providers on bootstrap cluster
Provider specific setup
Creating new workload cluster
Installing networking on workload cluster
Installing storage class on workload cluster
Installing cluster-api providers on workload cluster
$ kubectl get pods -A
NAMESPACE      NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager   cert-manager-7988d4fb6c-mrx82              1/1     Running   0          7m4s
cert-manager   cert-manager-cainjector-6bc8dcdb64-8cxt5   1/1     Running   0          7m4s
cert-manager   cert-manager-webhook-68979bfb95-ptkq4      1/1     Running   0          7m4s
kube-system    cilium-2tfhn                               1/1     Running   0          7m23s
kube-system    cilium-c2l6s                               1/1     Running   0          7m23s
kube-system    cilium-m42l8                               1/1     Running   0          7m23s
kube-system    cilium-operator-86d59d5c88-gvq58           1/1     Running   0          7m23s
kube-system    cilium-operator-86d59d5c88-qm9sn           1/1     Running   0          7m23s
kube-system    cilium-tx5xd                               1/1     Running   0          7m23s
kube-system    coredns-745c7986c7-9tv4s                   1/1     Running   0          9m24s
kube-system    coredns-745c7986c7-j66f5                   1/1     Running   0          9m24s
kube-system    kube-apiserver-10.100.137.115              1/1     Running   0          9m31s
kube-system    kube-apiserver-10.100.137.117              1/1     Running   0          7m27s
kube-system    kube-controller-manager-10.100.137.115     1/1     Running   0          9m31s
kube-system    kube-controller-manager-10.100.137.117     1/1     Running   0          7m27s
kube-system    kube-proxy-7s4td                           1/1     Running   0          7m51s
kube-system    kube-proxy-drtr6                           1/1     Running   0          7m33s
kube-system    kube-proxy-j4tgp                           1/1     Running   0          9m24s
kube-system    kube-proxy-jtqws                           1/1     Running   0          7m26s
kube-system    kube-scheduler-10.100.137.115              1/1     Running   0          9m31s
kube-system    kube-scheduler-10.100.137.117              1/1     Running   0          7m27s
kube-system    kube-vip-10.100.137.115                    1/1     Running   0          9m31s
kube-system    kube-vip-10.100.137.117                    1/1     Running   0          7m27s
kube-system    vsphere-cloud-controller-manager-7p4fc     1/1     Running   4          7m26s
kube-system    vsphere-cloud-controller-manager-m8rd5     1/1     Running   1          9m25s
kube-system    vsphere-cloud-controller-manager-vqh2r     1/1     Running   2          7m51s
kube-system    vsphere-cloud-controller-manager-z28s9     1/1     Running   3          7m33s
kube-system    vsphere-csi-controller-576c9c8dc8-6xbpx    5/5     Running   0          9m25s
kube-system    vsphere-csi-node-hxfdp                     3/3     Running   0          7m51s
kube-system    vsphere-csi-node-psxx7                     3/3     Running   0          7m33s
kube-system    vsphere-csi-node-r4hjc                     3/3     Running   0          9m25s
kube-system    vsphere-csi-node-ww2fg                     3/3     Running   0          7m26s
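A quick way to confirm that clusterctl never even created the provider namespaces (the names grepped for are the standard clusterctl install targets, listed here as an assumption):
$ kubectl get ns --kubeconfig it-eksa-mgmt/it-eksa-mgmt-eks-a-cluster.kubeconfig | grep -E 'cert-manager|capi|etcdadm'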
It also doesn't time out, so I'm not sure how I can generate support bundles at this point.
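(In case it helps anyone stuck here: assuming this release's generate support-bundle subcommand accepts the cluster spec via -f, the bundles can be produced out-of-band while the create command hangs:)
$ eksctl anywhere generate support-bundle -f eksa-mgmt-cluster.yaml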
I believe this is related to some race condition. I've noticed that Installing cluster-api providers on workload cluster gets stuck only when the nodes do not become "Ready" within a certain (though very short) period of time.
Just ran the deployment again with verbose logging (-v6). It actually gets stuck waiting for cert-manager and never retries at all. Even the logging just stops, so I can clearly see it doesn't retry.
I've noticed cert-manager takes about 2-3 minutes to come up, but the process is just unaware of that...
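For what it's worth, this is roughly the wait that appears to be missing; running it by hand against the workload kubeconfig shows whether cert-manager becomes Available on its own (plain kubectl, shown as a sketch):
$ kubectl wait --for=condition=Available deployment --all -n cert-manager \
    --timeout=10m --kubeconfig it-eksa-mgmt/it-eksa-mgmt-eks-a-cluster.kubeconfig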
....
2022-02-05T17:25:32.466Z V4 Nodes ready {"total": 4}
2022-02-05T17:25:32.467Z V5 Retry execution successful {"retries": 16, "duration": "27.913587775s"}
2022-02-05T17:25:32.467Z V0 Installing networking on workload cluster
2022-02-05T17:25:34.055Z V6 Executing command {"cmd": "/usr/bin/docker exec -i eksa_1644081363153908882 kubectl apply -f - --kubeconfig it-eksa-mgmt/it-eksa-mgmt-eks-a-cluster.kubeconfig"}
2022-02-05T17:25:34.695Z V5 Retry execution successful {"retries": 1, "duration": "640.274586ms"}
2022-02-05T17:25:34.696Z V0 Installing storage class on workload cluster
2022-02-05T17:25:34.696Z V6 Executing command {"cmd": "/usr/bin/docker exec -i eksa_1644081363153908882 kubectl apply -f - --kubeconfig it-eksa-mgmt/it-eksa-mgmt-eks-a-cluster.kubeconfig"}
2022-02-05T17:25:35.008Z V5 Retry execution successful {"retries": 1, "duration": "312.261637ms"}
2022-02-05T17:25:35.008Z V0 Installing cluster-api providers on workload cluster
2022-02-05T17:25:36.465Z V6 Executing command {"cmd": "/usr/bin/docker exec -i -e VSPHERE_USERNAME=***** -e VSPHERE_PASSWORD=***** -e EXP_CLUSTER_RESOURCE_SET=true eksa_1644081363153908882 clusterctl init --core cluster-api:v1.0.2+c1f8c13 --bootstrap kubeadm:v1.0.2+bb954e1 --control-plane kubeadm:v1.0.2+2b94d76 --infrastructure vsphere:v1.0.1+2435934 --config it-eksa-mgmt/generated/clusterctl_tmp.yaml --bootstrap etcdadm-bootstrap:v1.0.0-rc4+b9ee67d --bootstrap etcdadm-controller:v1.0.0-rc4+f8caa60 --kubeconfig it-eksa-mgmt/it-eksa-mgmt-eks-a-cluster.kubeconfig"}
Fetching providers
Using Override="core-components.yaml" Provider="cluster-api" Version="v1.0.2+c1f8c13"
Using Override="bootstrap-components.yaml" Provider="bootstrap-kubeadm" Version="v1.0.2+bb954e1"
Using Override="bootstrap-components.yaml" Provider="bootstrap-etcdadm-bootstrap" Version="v1.0.0-rc4+b9ee67d"
Using Override="bootstrap-components.yaml" Provider="bootstrap-etcdadm-controller" Version="v1.0.0-rc4+f8caa60"
Using Override="control-plane-components.yaml" Provider="control-plane-kubeadm" Version="v1.0.2+2b94d76"
Using Override="infrastructure-components.yaml" Provider="infrastructure-vsphere" Version="v1.0.1+2435934"
Installing cert-manager Version="v1.5.3+66e1acc"
Using Override="cert-manager.yaml" Provider="cert-manager" Version="v1.5.3+66e1acc"
Waiting for cert-manager to be available...
That is basically the end of the log in this state:
NAMESPACE      NAME                                       READY   STATUS    RESTARTS   AGE
cert-manager   cert-manager-7988d4fb6c-w2xkr              1/1     Running   0          3m2s
cert-manager   cert-manager-cainjector-6bc8dcdb64-dbjj8   1/1     Running   0          3m2s
cert-manager   cert-manager-webhook-68979bfb95-gqntp      1/1     Running   0          3m2s
kube-system    cilium-4krpg                               1/1     Running   0          3m8s
kube-system    cilium-cmlxz                               1/1     Running   0          3m8s
kube-system    cilium-f48ml                               1/1     Running   0          3m8s
kube-system    cilium-operator-86d59d5c88-2jsnq           1/1     Running   0          3m8s
kube-system    cilium-operator-86d59d5c88-zzb8m           1/1     Running   0          3m8s
kube-system    cilium-sl8xk                               1/1     Running   0          3m8s
kube-system    coredns-745c7986c7-cqrqf                   1/1     Running   0          5m16s
kube-system    coredns-745c7986c7-l976j                   1/1     Running   0          5m16s
kube-system    kube-apiserver-10.100.137.124              1/1     Running   0          5m23s
kube-system    kube-apiserver-10.100.137.64               1/1     Running   0          3m10s
kube-system    kube-controller-manager-10.100.137.124     1/1     Running   0          5m23s
kube-system    kube-controller-manager-10.100.137.64      1/1     Running   0          3m10s
kube-system    kube-proxy-5qc2p                           1/1     Running   0          3m44s
kube-system    kube-proxy-ds8fm                           1/1     Running   0          3m10s
kube-system    kube-proxy-k5nh6                           1/1     Running   0          5m16s
kube-system    kube-proxy-lcv58                           1/1     Running   0          3m16s
kube-system    kube-scheduler-10.100.137.124              1/1     Running   0          5m23s
kube-system    kube-scheduler-10.100.137.64               1/1     Running   0          3m10s
kube-system    kube-vip-10.100.137.124                    1/1     Running   0          5m23s
kube-system    kube-vip-10.100.137.64                     1/1     Running   0          3m10s
kube-system    vsphere-cloud-controller-manager-r2mrl     1/1     Running   4          3m10s
kube-system    vsphere-cloud-controller-manager-vcm8d     1/1     Running   2          3m44s
kube-system    vsphere-cloud-controller-manager-wlvkk     1/1     Running   2          5m12s
kube-system    vsphere-cloud-controller-manager-wn2rr     1/1     Running   2          3m16s
kube-system    vsphere-csi-controller-576c9c8dc8-7nmhd    5/5     Running   0          5m12s
kube-system    vsphere-csi-node-6spcz                     3/3     Running   0          3m10s
kube-system    vsphere-csi-node-gd8h5                     3/3     Running   0          5m12s
kube-system    vsphere-csi-node-j5rj6                     3/3     Running   0          3m44s
kube-system    vsphere-csi-node-k4lgf                     3/3     Running   0          3m16s
That is why I think it is a race condition. Sometimes it gets stuck at Installing cluster-api providers on workload cluster and sometimes at Moving cluster management from bootstrap to workload cluster.
I cannot explain why this only occurs using Bottlerocket.
It eventually timed out after about an hour, so I've uploaded log bundles for this issue as well:
https://ts-k8s-pub.s3.eu-central-1.amazonaws.com/aws-eks-anywhere/logs/bootstrap-cluster-support-bundle-2022-02-05T17_55_58.tar.gz
https://ts-k8s-pub.s3.eu-central-1.amazonaws.com/aws-eks-anywhere/logs/mgmt-cluster-support-bundle-2022-02-05T17_58_03.tar.gz
Investigating this issue on our side. Thank you for uploading the support bundles.
Works using v0.7.1.
What happened:
In the same environment I mentioned in #1155, the Moving cluster management from bootstrap to workload cluster step fails in 0.7.0 with the following message:
kubectl get nodes returns:
kubectl get pods -A returns:
In one of the attempts I didn't even get that far, and the Installing cluster-api providers on workload cluster step was completely stuck. In another attempt, the same Moving cluster management from bootstrap to workload cluster step failed as well, but only after terminating the capi-kubeadm-bootstrap-controller-manager, capi-kubeadm-control-plane-controller-manager, etcdadm-bootstrap-provider-controller-manager, and etcdadm-controller-controller-manager pods. This time, that didn't happen for some reason... I cannot explain the different behaviors.
The cluster creation cannot complete due to the above errors.
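For reference, pod terminations like those described above can be done with plain kubectl deletes along these lines; the namespaces and label shown are the standard clusterctl install defaults, assumed here, so adjust if your install differs:
$ kubectl delete pod -n capi-kubeadm-bootstrap-system -l control-plane=controller-manager
$ kubectl delete pod -n capi-kubeadm-control-plane-system -l control-plane=controller-manager
$ kubectl delete pod -n etcdadm-bootstrap-provider-system -l control-plane=controller-manager
$ kubectl delete pod -n etcdadm-controller-system -l control-plane=controller-manager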
How to reproduce it (as minimally and precisely as possible):
Simply create a standard cluster.
My cluster spec:
Anything else we need to know?:
Environment:
The same spec works using the 0.6.1 binary.
I have uploaded both the bootstrap-cluster and the mgmt-cluster support bundles to my S3 bucket:
https://ts-k8s-pub.s3.eu-central-1.amazonaws.com/aws-eks-anywhere/logs/bootstrap-cluster-support-bundle-2022-02-05T13_58_29.tar.gz
https://ts-k8s-pub.s3.eu-central-1.amazonaws.com/aws-eks-anywhere/logs/mgmt-cluster-support-bundle-2022-02-05T14_00_24.tar.gz
P.S. - something strange I've noticed is that kubectl get nodes displays Kubernetes version v1.21.6, while the template name is bottlerocket-v1.21.5-kubernetes-1-21-eks-8-amd64.
Assistance will be appreciated.