aws / eks-anywhere

Run Amazon EKS on your own infrastructure πŸš€
https://anywhere.eks.amazonaws.com
Apache License 2.0

EKS Anywhere Bare Metal Scaling Directions Don't Work #5316

Open cprivitere opened 1 year ago

cprivitere commented 1 year ago

What happened: Trying to scale up cluster following directions from: https://anywhere.eks.amazonaws.com/docs/tasks/cluster/cluster-scale/baremetal-scale/

After I add new hardware to the cluster with eksctl anywhere upgrade cluster -f cluster.yaml --hardware-csv <hardware.csv>, I cannot see it when I list hardware with kubectl get hardware -n eksa-system --show-labels.

Attempting to boot the new machine gets the following error in boots:

{"level":"error","ts":1678917443.4466734,"caller":"boots/dhcp.go:101","msg":"retrieved job is empty","service":"github.com/tinkerbell/boots","pkg":"main","type":"DHCPDISCOVER","mac":"e8:eb:d3:10:0b:5a","error":"discover from dhcp message: no hardware found","errorVerbose":"no hardware found\ngithub.com/tinkerbell/boots/client/kubernetes.(*Finder).ByMAC\n\tgithub.com/tinkerbell/boots/client/kubernetes/hardware_finder.go:100\ngithub.com/tinkerbell/boots/job.(*Creator).CreateFromDHCP\n\tgithub.com/tinkerbell/boots/job/job.go:112\nmain.dhcpHandler.serve\n\tgithub.com/tinkerbell/boots/cmd/boots/dhcp.go:99\nmain.dhcpHandler.ServeDHCP.func1\n\tgithub.com/tinkerbell/boots/cmd/boots/dhcp.go:60\ngithub.com/gammazero/workerpool.(*WorkerPool).dispatch.func1\n\tgithub.com/gammazero/workerpool@v0.0.0-20200311205957-7b00833861c6/workerpool.go:169\nruntime.goexit\n\truntime/asm_amd64.s:1571\ndiscover from dhcp message"}
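A quick way to confirm that no hardware object exists for the new machine (a sketch; it greps the rendered objects for the MAC rather than relying on a specific field path):

# No output here means no Hardware object references this MAC.
kubectl get hardware -n eksa-system -o yaml | grep -i "e8:eb:d3:10:0b:5a"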

What you expected to happen: The hardware defined in the new hardware.csv file should be added as new hardware objects in the cluster.

How to reproduce it (as minimally and precisely as possible):

1. Create a new node on Equinix Metal.
2. Figure out its MAC address and add it to the hardware.csv.
3. Run the documented command: eksctl anywhere upgrade cluster -f cluster.yaml --hardware-csv <hardware.csv>
4. Observe that no new hardware was added.

Anything else we need to know?: Doing the hardware generation manually with eksctl anywhere generate hardware and then applying the generated yaml files works fine.
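For reference, the manual workaround looks roughly like this (a sketch; the exact flags and default output file of generate hardware may vary by CLI version, so check eksctl anywhere generate hardware --help):

# Regenerate Hardware manifests from the CSV; this assumes generate hardware
# accepts the same --hardware-csv flag as the upgrade command and writes
# hardware.yaml into the working directory.
eksctl anywhere generate hardware --hardware-csv hardware.csv

# Apply the generated objects into the namespace the existing hardware objects
# use (the -n flag is a no-op if the manifests already set eksa-system).
kubectl apply -f hardware.yaml -n eksa-system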

Environment:

pokearu commented 1 year ago

@cprivitere Hello. Thank you for creating the issue!

Could you please run the upgrade command again with -v9 and share the logs? I would like to check whether the scale-up operation failed; if the newly added hardware is not showing up in get hardware, then the scale-up operation itself must have failed, reporting that there was not enough hardware to scale.

cprivitere commented 1 year ago

Sure thing, here's the output with -v9:

eksctl anywhere upgrade cluster -f my-eksa-cluster.yaml --hardware-csv hardware2.csv -v 9

root@eksa-2wgkhd-admin:~# eksctl anywhere upgrade cluster -f my-eksa-cluster.yaml --hardware-csv hardware2.csv  -v 9
2023-03-23T19:19:20.512Z        V4      Logger init completed   {"vlevel": 9}
2023-03-23T19:19:20.514Z        V0      Warning: The recommended number of control plane nodes is 3 or 5
2023-03-23T19:19:20.514Z        V6      Executing command       {"cmd": "/usr/bin/docker version --format {{.Client.Version}}"}
2023-03-23T19:19:20.608Z        V6      Executing command       {"cmd": "/usr/bin/docker info --format '{{json .MemTotal}}'"}
2023-03-23T19:19:20.754Z        V0      Warning: The recommended number of control plane nodes is 3 or 5
2023-03-23T19:19:20.818Z        V4      Reading bundles manifest        {"url": "https://anywhere-assets.eks.amazonaws.com/releases/bundles/29/manifest.yaml"}
2023-03-23T19:19:20.900Z        V0      Warning: The recommended number of control plane nodes is 3 or 5
2023-03-23T19:19:20.900Z        V5      Retrier:        {"timeout": "2562047h47m16.854775807s", "backoffFactor": null}
2023-03-23T19:19:20.900Z        V2      Pulling docker image    {"image": "public.ecr.aws/eks-anywhere/cli-tools:v0.14.3-eks-a-29"}
2023-03-23T19:19:20.900Z        V6      Executing command       {"cmd": "/usr/bin/docker pull public.ecr.aws/eks-anywhere/cli-tools:v0.14.3-eks-a-29"}
2023-03-23T19:19:21.319Z        V5      Retry execution successful      {"retries": 1, "duration": "418.98818ms"}
2023-03-23T19:19:21.319Z        V3      Initializing long running container     {"name": "eksa_1679599160900797177", "image": "public.ecr.aws/eks-anywhere/cli-tools:v0.14.3-eks-a-29"}
2023-03-23T19:19:21.319Z        V6      Executing command       {"cmd": "/usr/bin/docker run -d --name eksa_1679599160900797177 --network host -w /root -v /var/run/docker.sock:/var/run/docker.sock -v /root:/root --entrypoint sleep public.ecr.aws/eks-anywhere/cli-tools:v0.14.3-eks-a-29 infinity"}
2023-03-23T19:19:21.565Z        V4      Inferring local Tinkerbell Bootstrap IP from environment
2023-03-23T19:19:21.566Z        V4      Tinkerbell IP   {"tinkerbell-ip": "86.109.11.217"}
2023-03-23T19:19:21.566Z        V4      Task start      {"task_name": "setup-and-validate"}
2023-03-23T19:19:21.566Z        V0      Performing setup and validations
2023-03-23T19:19:21.566Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl get clusters.anywhere.eks.amazonaws.com -A -o jsonpath={.items[0]} --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig --field-selector=metadata.name=my-eksa-cluster"}
2023-03-23T19:19:21.836Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl get bundles.anywhere.eks.amazonaws.com bundles-29 -o json --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig --namespace eksa-system"}
2023-03-23T19:19:22.167Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl get --ignore-not-found -o json --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig releases.distro.eks.amazonaws.com --namespace eksa-system kubernetes-1-25-eks-7"}
2023-03-23T19:19:22.443Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl get hardware.tinkerbell.org -l !v1alpha1.tinkerbell.org/ownerName --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig -o json --namespace eksa-system"}
2023-03-23T19:19:22.699Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl get hardware.tinkerbell.org -l v1alpha1.tinkerbell.org/ownerName --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig -o json --namespace eksa-system"}
2023-03-23T19:19:22.972Z        V0      βœ… Tinkerbell provider validation
2023-03-23T19:19:22.972Z        V6      Getting KubeadmControlPlane CRDs        {"cluster": "my-eksa-cluster"}
2023-03-23T19:19:22.972Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl get kubeadmcontrolplanes.controlplane.cluster.x-k8s.io my-eksa-cluster -o json --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig --namespace eksa-system"}
2023-03-23T19:19:23.245Z        V6      waiting for nodes       {"cluster": "my-eksa-cluster"}
2023-03-23T19:19:23.245Z        V6      counting ready machine deployment replicas      {"cluster": "my-eksa-cluster"}
2023-03-23T19:19:23.245Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl get machinedeployments.cluster.x-k8s.io -o json --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig --namespace eksa-system --selector=cluster.x-k8s.io/cluster-name=my-eksa-cluster"}
2023-03-23T19:19:23.530Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl get nodes -o go-template --template {{range .items}}{{.metadata.name}}\n{{end}} --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig"}
2023-03-23T19:19:23.801Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl get node 147.75.88.51 -o go-template --template {{range .status.conditions}}{{if eq .type \"Ready\"}}{{.reason}}{{end}}{{end}} --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig"}
2023-03-23T19:19:24.045Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl get node 147.75.88.52 -o go-template --template {{range .status.conditions}}{{if eq .type \"Ready\"}}{{.reason}}{{end}}{{end}} --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig"}
2023-03-23T19:19:24.288Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl get crd clusters.cluster.x-k8s.io --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig"}
2023-03-23T19:19:24.557Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl get clusters.cluster.x-k8s.io -o json --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig --namespace eksa-system"}
2023-03-23T19:19:24.831Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl version -o json --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig"}
2023-03-23T19:19:25.096Z        V3      calculating version differences {"inputVersion": "1.25", "clusterVersion": "1.25.6-eks-232056e"}
2023-03-23T19:19:25.096Z        V3      calculated version differences  {"majorVersionDifference": 0, "minorVersionDifference": 0}
2023-03-23T19:19:25.096Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl get clusters.anywhere.eks.amazonaws.com -A -o jsonpath={.items[0]} --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig --field-selector=metadata.name=my-eksa-cluster"}
2023-03-23T19:19:25.376Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl get clusters.anywhere.eks.amazonaws.com -A -o jsonpath={.items[0]} --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig --field-selector=metadata.name=my-eksa-cluster"}
2023-03-23T19:19:25.644Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl get tinkerbelldatacenterconfigs.anywhere.eks.amazonaws.com my-eksa-cluster -o json --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig --namespace default"}
2023-03-23T19:19:25.925Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl get tinkerbellmachineconfigs.anywhere.eks.amazonaws.com my-eksa-cluster-cp -o json --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig --namespace "}
2023-03-23T19:19:26.217Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl get tinkerbellmachineconfigs.anywhere.eks.amazonaws.com my-eksa-cluster -o json --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig --namespace "}
2023-03-23T19:19:26.487Z        V0      βœ… Validate certificate for registry mirror
2023-03-23T19:19:26.487Z        V0      βœ… Control plane ready
2023-03-23T19:19:26.487Z        V0      βœ… Worker nodes ready
2023-03-23T19:19:26.487Z        V0      βœ… Nodes ready
2023-03-23T19:19:26.487Z        V0      βœ… Cluster CRDs ready
2023-03-23T19:19:26.487Z        V0      βœ… Cluster object present on workload cluster
2023-03-23T19:19:26.487Z        V0      βœ… Upgrade cluster kubernetes version increment
2023-03-23T19:19:26.487Z        V0      βœ… Validate authentication for git provider
2023-03-23T19:19:26.487Z        V0      βœ… Validate immutable fields
2023-03-23T19:19:26.487Z        V0      βœ… Upgrade preflight validations pass
2023-03-23T19:19:26.487Z        V4      Task finished   {"task_name": "setup-and-validate", "duration": "4.921349981s"}
2023-03-23T19:19:26.487Z        V4      ----------------------------------
2023-03-23T19:19:26.487Z        V4      Task start      {"task_name": "update-secrets"}
2023-03-23T19:19:26.487Z        V4      Task finished   {"task_name": "update-secrets", "duration": "1.357Β΅s"}
2023-03-23T19:19:26.487Z        V4      ----------------------------------
2023-03-23T19:19:26.487Z        V4      Task start      {"task_name": "ensure-etcd-capi-components-exist"}
2023-03-23T19:19:26.487Z        V0      Ensuring etcd CAPI providers exist on management cluster before upgrade
2023-03-23T19:19:26.487Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl get namespace --field-selector=metadata.name=etcdadm-bootstrap-provider-system --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig"}
2023-03-23T19:19:26.756Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl get provider --namespace etcdadm-bootstrap-provider-system --field-selector=metadata.name=bootstrap-etcdadm-bootstrap --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig"}
2023-03-23T19:19:27.005Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl get namespace --field-selector=metadata.name=etcdadm-controller-system --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig"}
2023-03-23T19:19:27.235Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl get provider --namespace etcdadm-controller-system --field-selector=metadata.name=bootstrap-etcdadm-controller --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig"}
2023-03-23T19:19:27.485Z        V4      Task finished   {"task_name": "ensure-etcd-capi-components-exist", "duration": "997.360905ms"}
2023-03-23T19:19:27.485Z        V4      ----------------------------------
2023-03-23T19:19:27.485Z        V4      Task start      {"task_name": "pause-controllers-reconcile"}
2023-03-23T19:19:27.485Z        V0      Pausing EKS-A cluster controller reconcile
2023-03-23T19:19:27.485Z        V5      Retrier:        {"timeout": "2562047h47m16.854775807s", "backoffFactor": null}
2023-03-23T19:19:27.485Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl get --ignore-not-found -o json --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig clusters.anywhere.eks.amazonaws.com --namespace default"}
2023-03-23T19:19:27.748Z        V5      Retry execution successful      {"retries": 1, "duration": "263.165421ms"}
2023-03-23T19:19:27.748Z        V5      Retrier:        {"timeout": "2562047h47m16.854775807s", "backoffFactor": null}
2023-03-23T19:19:27.748Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl annotate tinkerbelldatacenterconfigs.anywhere.eks.amazonaws.com my-eksa-cluster anywhere.eks.amazonaws.com/paused=true --overwrite --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig --namespace default"}
2023-03-23T19:19:28.030Z        V5      Retry execution successful      {"retries": 1, "duration": "282.02966ms"}
2023-03-23T19:19:28.030Z        V5      Retrier:        {"timeout": "2562047h47m16.854775807s", "backoffFactor": null}
2023-03-23T19:19:28.030Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl annotate tinkerbellmachineconfigs.anywhere.eks.amazonaws.com my-eksa-cluster-cp anywhere.eks.amazonaws.com/paused=true --overwrite --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig --namespace default"}
2023-03-23T19:19:28.325Z        V5      Retry execution successful      {"retries": 1, "duration": "294.720919ms"}
2023-03-23T19:19:28.325Z        V5      Retrier:        {"timeout": "2562047h47m16.854775807s", "backoffFactor": null}
2023-03-23T19:19:28.325Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl annotate tinkerbellmachineconfigs.anywhere.eks.amazonaws.com my-eksa-cluster anywhere.eks.amazonaws.com/paused=true --overwrite --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig --namespace default"}
2023-03-23T19:19:28.611Z        V5      Retry execution successful      {"retries": 1, "duration": "286.541025ms"}
2023-03-23T19:19:28.611Z        V5      Retrier:        {"timeout": "2562047h47m16.854775807s", "backoffFactor": null}
2023-03-23T19:19:28.612Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl annotate clusters.anywhere.eks.amazonaws.com my-eksa-cluster anywhere.eks.amazonaws.com/paused=true --overwrite --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig --namespace default"}
2023-03-23T19:19:28.911Z        V5      Retry execution successful      {"retries": 1, "duration": "299.526137ms"}
2023-03-23T19:19:28.911Z        V5      Retrier:        {"timeout": "2562047h47m16.854775807s", "backoffFactor": null}
2023-03-23T19:19:28.911Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl annotate clusters.anywhere.eks.amazonaws.com my-eksa-cluster anywhere.eks.amazonaws.com/managed-by-cli=true --overwrite --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig --namespace default"}
2023-03-23T19:19:29.202Z        V5      Retry execution successful      {"retries": 1, "duration": "290.474344ms"}
2023-03-23T19:19:29.202Z        V0      Pausing GitOps cluster resources reconcile
2023-03-23T19:19:29.202Z        V4      GitOps field not specified, pause cluster resources reconcile skipped
2023-03-23T19:19:29.202Z        V4      Task finished   {"task_name": "pause-controllers-reconcile", "duration": "1.717128141s"}
2023-03-23T19:19:29.202Z        V4      ----------------------------------
2023-03-23T19:19:29.202Z        V4      Task start      {"task_name": "upgrade-core-components"}
2023-03-23T19:19:29.202Z        V0      Upgrading core components
2023-03-23T19:19:29.202Z        V1      Nothing to upgrade for Cilium, skipping
2023-03-23T19:19:29.202Z        V1      Checking for CAPI upgrades
2023-03-23T19:19:29.202Z        V1      Nothing to upgrade for CAPI
2023-03-23T19:19:29.202Z        V1      Checking for Flux upgrades
2023-03-23T19:19:29.202Z        V1      Skipping Flux upgrades, GitOps not enabled
2023-03-23T19:19:29.202Z        V1      Nothing to upgrade for Flux
2023-03-23T19:19:29.202Z        V1      Checking for EKS-A components upgrade
2023-03-23T19:19:29.202Z        V1      Nothing to upgrade for controller and CRDs
2023-03-23T19:19:29.202Z        V1      Checking for EKS-D components upgrade
2023-03-23T19:19:29.202Z        V1      Nothing to upgrade for EKS-D components
2023-03-23T19:19:29.202Z        V4      Task finished   {"task_name": "upgrade-core-components", "duration": "225.131Β΅s"}
2023-03-23T19:19:29.202Z        V4      ----------------------------------
2023-03-23T19:19:29.202Z        V4      Task start      {"task_name": "upgrade-needed"}
2023-03-23T19:19:29.202Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl get clusters.anywhere.eks.amazonaws.com -A -o jsonpath={.items[0]} --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig --field-selector=metadata.name=my-eksa-cluster"}
2023-03-23T19:19:29.488Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl get bundles.anywhere.eks.amazonaws.com bundles-29 -o json --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig --namespace eksa-system"}
2023-03-23T19:19:29.812Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl get --ignore-not-found -o json --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig releases.distro.eks.amazonaws.com --namespace eksa-system kubernetes-1-25-eks-7"}
2023-03-23T19:19:30.110Z        V3      Clusters are the same
2023-03-23T19:19:30.110Z        V0      No upgrades needed from cluster spec
2023-03-23T19:19:30.110Z        V4      Task finished   {"task_name": "upgrade-needed", "duration": "908.019697ms"}
2023-03-23T19:19:30.110Z        V4      ----------------------------------
2023-03-23T19:19:30.110Z        V4      Task start      {"task_name": "resume-eksa-and-gitops-kustomization"}
2023-03-23T19:19:30.110Z        V0      Resuming EKS-A controller reconcile
2023-03-23T19:19:30.110Z        V5      Retrier:        {"timeout": "2562047h47m16.854775807s", "backoffFactor": null}
2023-03-23T19:19:30.110Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl get --ignore-not-found -o json --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig clusters.anywhere.eks.amazonaws.com"}
2023-03-23T19:19:30.386Z        V5      Retry execution successful      {"retries": 1, "duration": "275.815573ms"}
2023-03-23T19:19:30.386Z        V5      Retrier:        {"timeout": "2562047h47m16.854775807s", "backoffFactor": null}
2023-03-23T19:19:30.386Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl annotate tinkerbelldatacenterconfigs.anywhere.eks.amazonaws.com my-eksa-cluster anywhere.eks.amazonaws.com/paused- --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig --namespace default"}
2023-03-23T19:19:30.657Z        V5      Retry execution successful      {"retries": 1, "duration": "271.122665ms"}
2023-03-23T19:19:30.657Z        V5      Retrier:        {"timeout": "2562047h47m16.854775807s", "backoffFactor": null}
2023-03-23T19:19:30.657Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl annotate tinkerbellmachineconfigs.anywhere.eks.amazonaws.com my-eksa-cluster-cp anywhere.eks.amazonaws.com/paused- --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig --namespace default"}
2023-03-23T19:19:30.945Z        V5      Retry execution successful      {"retries": 1, "duration": "287.85848ms"}
2023-03-23T19:19:30.945Z        V5      Retrier:        {"timeout": "2562047h47m16.854775807s", "backoffFactor": null}
2023-03-23T19:19:30.945Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl annotate tinkerbellmachineconfigs.anywhere.eks.amazonaws.com my-eksa-cluster anywhere.eks.amazonaws.com/paused- --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig --namespace default"}
2023-03-23T19:19:31.222Z        V5      Retry execution successful      {"retries": 1, "duration": "277.061899ms"}
2023-03-23T19:19:31.222Z        V5      Retrier:        {"timeout": "2562047h47m16.854775807s", "backoffFactor": null}
2023-03-23T19:19:31.223Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl annotate clusters.anywhere.eks.amazonaws.com my-eksa-cluster anywhere.eks.amazonaws.com/paused- --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig --namespace default"}
2023-03-23T19:19:31.522Z        V5      Retry execution successful      {"retries": 1, "duration": "299.636052ms"}
2023-03-23T19:19:31.522Z        V5      Retrier:        {"timeout": "2562047h47m16.854775807s", "backoffFactor": null}
2023-03-23T19:19:31.522Z        V6      Executing command       {"cmd": "/usr/bin/docker exec -i eksa_1679599160900797177 kubectl annotate clusters.anywhere.eks.amazonaws.com my-eksa-cluster anywhere.eks.amazonaws.com/managed-by-cli- --kubeconfig my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig --namespace default"}
2023-03-23T19:19:31.804Z        V5      Retry execution successful      {"retries": 1, "duration": "281.555572ms"}
2023-03-23T19:19:31.804Z        V0      Updating Git Repo with new EKS-A cluster spec
2023-03-23T19:19:31.804Z        V0      GitOps field not specified, update git repo skipped
2023-03-23T19:19:31.804Z        V0      Forcing reconcile Git repo with latest commit
2023-03-23T19:19:31.804Z        V0      GitOps not configured, force reconcile flux git repo skipped
2023-03-23T19:19:31.804Z        V0      Resuming GitOps cluster resources kustomization
2023-03-23T19:19:31.804Z        V4      GitOps field not specified, resume cluster resources reconcile skipped
2023-03-23T19:19:31.804Z        V4      Task finished   {"task_name": "resume-eksa-and-gitops-kustomization", "duration": "1.693660479s"}
2023-03-23T19:19:31.804Z        V4      ----------------------------------
2023-03-23T19:19:31.804Z        V4      Tasks completed {"duration": "10.238254304s"}
2023-03-23T19:19:31.804Z        V3      Cleaning up long running container      {"name": "eksa_1679599160900797177"}
2023-03-23T19:19:31.804Z        V6      Executing command       {"cmd": "/usr/bin/docker rm -f -v eksa_1679599160900797177"}
root@eksa-2wgkhd-admin:~# kubectl get hardware -A
NAMESPACE     NAME                          STATE
eksa-system   eksa-2wgkhd-node-cp-001
eksa-system   eksa-2wgkhd-node-worker-001

And here's the hardware2.csv file:

root@eksa-2wgkhd-admin:~# cat hardware2.csv
hostname,vendor,mac,ip_address,gateway,netmask,nameservers,disk,labels
eksa-2wgkhd-node-worker-002,Equinix,10:70:fd:86:ec:3e,147.75.88.60,147.75.88.49,255.255.255.240,8.8.8.8|8.8.4.4,/dev/sda,type=worker
pokearu commented 1 year ago

@cprivitere hey just catching up with the issue.

Looking at the logs, it seems like there were no changes detected in the cluster config and hence there was nothing to upgrade?

2023-03-23T19:19:30.110Z        V3      Clusters are the same
2023-03-23T19:19:30.110Z        V0      No upgrades needed from cluster spec
2023-03-23T19:19:30.110Z        V4      Task finished   {"task_name": "upgrade-needed", "duration": "908.019697ms"}

We don't apply the newly provided hardware to the cluster unless there is something to upgrade. The apply command for new hardware is run on the local KinD cluster that gets created during the upgrade process for management clusters.

Could we confirm that there was a change in the cluster.yaml file? Specifically, for scaling a worker node group we need to increase the count in:

  workerNodeGroupConfigurations:
  - count: 1
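For illustration, a minimal sketch of the change plus the re-run (assuming a single worker node group in my-eksa-cluster.yaml; surrounding fields omitted):

# After bumping the worker count in my-eksa-cluster.yaml, e.g.
#
#   workerNodeGroupConfigurations:
#   - count: 2        # was 1; this delta is what makes the upgrade consume the new hardware
#
# re-run the documented upgrade so the new CSV is applied along with the scale-up:
eksctl anywhere upgrade cluster -f my-eksa-cluster.yaml --hardware-csv hardware2.csv
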
cprivitere commented 1 year ago

Yep, looks like that's the piece I'm missing. I either missed it in the documentation or it was added after I last tried, but I'll give this a go.

cprivitere commented 1 year ago

I was able to give this a try and it looks like it doesn't work. I created a new hardware.csv with the new machine's name, MAC address, and other relevant info, passed it to the eksctl upgrade command, and no new hardware objects were created. What's worse, the program just leaves me hanging, waiting forever on the upgrade. I had to do a separate troubleshooting session just to finally confirm from boots that there was no hardware entry for the MAC address, as well as confirming there was indeed no new hardware object.

Here are the artifacts I can give you. First, the output from eksctl:

root@eksa-ecqzeq-admin:~# eksctl anywhere upgrade cluster -f my-eksa-cluster.yaml --hardware-csv hardware.csv --kubeconfig $KUBECONFIG
Warning: The recommended number of control plane nodes is 3 or 5
Warning: The recommended number of control plane nodes is 3 or 5
Warning: The recommended number of control plane nodes is 3 or 5
Performing setup and validations
βœ… Tinkerbell provider validation
βœ… Validate certificate for registry mirror
βœ… Control plane ready
βœ… Worker nodes ready
βœ… Nodes ready
βœ… Cluster CRDs ready
βœ… Cluster object present on workload cluster
βœ… Upgrade cluster kubernetes version increment
βœ… Validate authentication for git provider
βœ… Validate immutable fields
βœ… Upgrade preflight validations pass
Ensuring etcd CAPI providers exist on management cluster before upgrade
Pausing EKS-A cluster controller reconcile
Pausing GitOps cluster resources reconcile
Upgrading core components
Upgrading workload cluster

Then, the hardware.csv I used:

hostname,vendor,mac,ip_address,gateway,netmask,nameservers,disk,labels
eksa-ecqzeq-node-worker-002,Equinix,e8:eb:d3:10:0b:ae,147.75.88.230,147.75.88.225,255.255.255.240,8.8.8.8|8.8.4.4,/dev/sda,type=worker

Then the output from kubectl get hardware:

root@eksa-ecqzeq-admin:~# kubectl get hardware -n eksa-system --show-labels
NAME                          STATE   LABELS
eksa-ecqzeq-node-cp-001               type=cp,v1alpha1.tinkerbell.org/ownerName=my-eksa-cluster-control-plane-template-1682360077903-6jz8j,v1alpha1.tinkerbell.org/ownerNamespace=eksa-system
eksa-ecqzeq-node-worker-001           type=worker,v1alpha1.tinkerbell.org/ownerName=my-eksa-cluster-md-0-1682360077905-hg82j,v1alpha1.tinkerbell.org/ownerNamespace=eksa-system

Then finally the error message in boots:

{"level":"error","ts":1682373820.0116758,"caller":"boots/dhcp.go:101","msg":"retrieved job is empty","service":"github.com/tinkerbell/boots","pkg":"main","type":"DHCPDISCOVER","mac":"e8:eb:d3:10:0b:ae","error":"discover from dhcp message: no hardware found","errorVerbose":"no hardware found\ngithub.com/tinkerbell/boots/client/kubernetes.(*Finder).ByMAC\n\tgithub.com/tinkerbell/boots/client/kubernetes/hardware_finder.go:100\ngithub.com/tinkerbell/boots/job.(*Creator).CreateFromDHCP\n\tgithub.com/tinkerbell/boots/job/job.go:112\nmain.dhcpHandler.serve\n\tgithub.com/tinkerbell/boots/cmd/boots/dhcp.go:99\nmain.dhcpHandler.ServeDHCP.func1\n\tgithub.com/tinkerbell/boots/cmd/boots/dhcp.go:60\ngithub.com/gammazero/workerpool.(*WorkerPool).dispatch.func1\n\tgithub.com/gammazero/workerpool@v0.0.0-20200311205957-7b00833861c6/workerpool.go:169\nruntime.goexit\n\truntime/asm_amd64.s:1571\ndiscover from dhcp message"}
cprivitere commented 1 year ago

@pokearu