aws / eks-anywhere

Run Amazon EKS on your own infrastructure 🚀
https://anywhere.eks.amazonaws.com
Apache License 2.0

fail to scale EKS anywhere cluster for bare metal #7614

Open ygao-armada opened 8 months ago

ygao-armada commented 8 months ago

What happened: I am trying to scale my cluster from 1 control plane node to 3 (hardware.csv contains 4 nodes, and I want to use 3 of them as control plane nodes) with this command:

eksctl anywhere upgrade cluster \
  -f cluster.yaml \
  --hardware-csv hardware.csv \
  --kubeconfig mgmt/mgmt-eks-a-cluster.kubeconfig

If I keep the control plane count at 1 in cluster.yaml, the command completes in about 2 minutes, but nothing changes; I don't even see the new hardware with this command: kubectl get hardware -n eksa-system --show-labels

If I change the control plane count to 3 in cluster.yaml, I keep seeing this message:

2024-02-15T17:26:49.783Z    V6   Executing command    {"cmd": "/usr/bin/docker exec -i eksa_1708017136282361003 kubectl get --ignore-not-found -o json --kubeconfig mgmt/mgmt-eks-a-cluster.kubeconfig Cluster.v1alpha1.anywhere.eks.amazonaws.com --namespace default mgmt02"}
2024-02-15T17:26:49.884Z    V9   Cluster generation and observedGeneration    {"Generation": 2, "ObservedGeneration": 1}
2024-02-15T17:26:49.884Z    V5   Error happened during retry   {"error": "cluster generation (2) and observedGeneration (1) differ", "retries": 782}
2024-02-15T17:26:49.884Z    V5   Sleeping before next retry   {"time": "1s"}
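The retry loop above is the CLI waiting for the cluster controller's status to catch up with the spec: metadata.generation is bumped on every spec change, while status.observedGeneration records the last generation the controller actually reconciled. The two values can be inspected directly; this is a diagnostic sketch using the resource kind, name, namespace, and kubeconfig path shown in the log line above:

```shell
# Compare the spec generation with the last generation the controller reconciled.
# A persistent mismatch means the controller never processes the new spec.
kubectl get clusters.anywhere.eks.amazonaws.com mgmt02 \
  --namespace default \
  --kubeconfig mgmt/mgmt-eks-a-cluster.kubeconfig \
  -o jsonpath='{.metadata.generation}{" "}{.status.observedGeneration}{"\n"}'
```

If the first number stays ahead of the second indefinitely, the controller logs (in the eksa-system namespace) are the next place to look for why reconciliation is stuck.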

What you expected to happen: I expected the command to succeed, with the new hardware added.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

ph-armada commented 7 months ago

Hi admin - any update? Thanks.

ph-armada commented 7 months ago

Hi EKS Anywhere team - may I know if you have a Slack channel, or something more live for Q&A than GitHub issue triage? Thanks!

thecloudgarage commented 3 months ago

I am having the exact same issue:

2024-07-16T20:59:47.604Z        V5      Error happened during retry     {"error": "cluster generation (6) and observedGeneration (5) differ", "retries": 188}
2024-07-16T20:59:47.604Z        V5      Sleeping before next retry      {"time": "1s"}

What's the solution/workaround?

drewhemm commented 4 weeks ago

This seems to be caused by manually editing the cluster object and then trying to upgrade it using the CLI. The docs do not make it clear whether the cluster may be modified by hand:

https://anywhere.eks.amazonaws.com/docs/clustermgmt/cluster-scale/baremetal-scale/#scaling-nodes-on-bare-metal-clusters

How to fix or work around the issue, though? I don't know yet...
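One way to check the manual-edit hypothesis is to diff the spec the CLI will apply (cluster.yaml) against the live Cluster object. This is a diagnostic sketch, not a confirmed fix; the object name, namespace, and kubeconfig path are taken from the log output earlier in this thread, and the diff will include expected noise from server-populated fields (status, resourceVersion, etc.):

```shell
# Dump the live Cluster spec as the controller currently sees it.
kubectl get clusters.anywhere.eks.amazonaws.com mgmt02 \
  --namespace default \
  --kubeconfig mgmt/mgmt-eks-a-cluster.kubeconfig \
  -o yaml > live-cluster.yaml

# Compare against the file passed to `eksctl anywhere upgrade cluster -f`.
# Focus on differences inside the spec: sections; drift there would explain
# the spec generation being bumped outside the CLI's control.
diff live-cluster.yaml cluster.yaml
```

If the spec sections differ, reconciling cluster.yaml with the live spec before running the upgrade again would at least rule out manual edits as the cause.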