aws / eks-anywhere

Run Amazon EKS on your own infrastructure 🚀
https://anywhere.eks.amazonaws.com
Apache License 2.0

[Bare Metal] Unable to scale down cluster (via CLI) with a failed node #6596

Open jacobweinstock opened 1 year ago

jacobweinstock commented 1 year ago

What happened: Scenario: I have a bare metal cluster with 1 control plane node and 3 worker nodes. One worker node fails, becomes "NotReady", and triggers CAPI to attempt to scale up the cluster to replace the failed node. I do not have any available hardware.

Context: As a customer, I accept the failed node and decide to scale the cluster down. To do this, I update my cluster.yaml to reduce the worker node count from 3 to 2, then run eksctl anywhere upgrade cluster. The command fails with the following output:

Performing setup and validations
✅ Tinkerbell provider validation
✅ Validate OS is compatible with registry mirror configuration
✅ Validate certificate for registry mirror
✅ Control plane ready
❌ Validation failed {"validation": "worker nodes ready", "error": "machine deployment is in ScalingUp phase", "remediation": "ensure machine deployments for cluster bottlerocket are Ready"}
❌ Validation failed {"validation": "nodes ready", "error": "node bottlerocket-md-0-dc6c59864xk66p4-fj2v2 is not ready, currently in NodeStatusUnknown state", "remediation": "check the Status of the control plane and worker nodes in cluster bottlerocket and verify they are Ready"}
✅ Cluster CRDs ready
✅ Cluster object present on workload cluster
✅ Upgrade cluster kubernetes version increment
✅ Upgrade cluster worker node group kubernetes version increment
✅ Validate authentication for git provider
✅ Validate immutable fields
✅ Validate cluster's eksaVersion matches EKS-A Version
✅ Validate pod disruption budgets
✅ Validate eksaVersion skew is one minor version
Error: failed to upgrade cluster: validations failed
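For reference, the only change in cluster.yaml is the worker node group count; a minimal sketch, assuming the standard EKS Anywhere Cluster spec layout and this setup's names (all other fields omitted):

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: bottlerocket
spec:
  workerNodeGroupConfigurations:
    - name: md-0
      count: 2   # was 3; drop the failed worker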

Additional info

user@machine$ kubectl get machinedeployment -A
NAMESPACE     NAME                CLUSTER        REPLICAS   READY   UPDATED   UNAVAILABLE   PHASE       AGE   VERSION
eksa-system   bottlerocket-md-0   bottlerocket   3          2       3         1             ScalingUp   5d    v1.27.4-eks-1-27-9

user@machine$ kubectl get nodes
NAME                                       STATUS        ROLES           AGE     VERSION
bottlerocket-md-0-dc6c59864xk66p4-fj2v2    NotReady      <none>          5d1h    v1.27.1-eks-75a5dcc
bottlerocket-md-0-dc6c59864xk66p4-lzprn    Ready         <none>          17h     v1.27.1-eks-75a5dcc
bottlerocket-md-0-dc6c59864xk66p4-w2vjp    Ready         <none>          5d      v1.27.1-eks-75a5dcc
bottlerocket-rxxx8                         Ready         control-plane   5d14h   v1.27.1-eks-75a5dcc
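The MachineDeployment stays in ScalingUp because CAPI has created a replacement Machine with no free hardware to land on; listing the CAPI machines should show it stuck (a sketch, not verified output):

user@machine$ kubectl get machines -n eksa-system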

What you expected to happen: In this scenario, if I modify the in-cluster config with kubectl edit cluster mycluster, I am able to scale down the cluster and bring it back to a normal, healthy, running state. I expect the CLI to behave similarly. A sketch of that workaround is below.
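Spelling out the workaround (a sketch; the fully qualified resource name is used only to disambiguate the EKS-A Cluster object from the CAPI Cluster object, and the names follow this setup):

user@machine$ kubectl edit clusters.anywhere.eks.amazonaws.com bottlerocket
# set spec.workerNodeGroupConfigurations[0].count to 2, save, then watch the scale-down
user@machine$ kubectl get machinedeployment -n eksa-system bottlerocket-md-0 -w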

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

czomo commented 1 year ago

@jacobweinstock did you manage to fix it?

jacobweinstock commented 1 year ago

@jacobweinstock did you manage to fix it?

Hey @czomo. Apologies, this has not been resolved yet.