aws / eks-anywhere

Run Amazon EKS on your own infrastructure 🚀
https://anywhere.eks.amazonaws.com
Apache License 2.0

EKSA Bare Metal: K8s upgrade failing on self-managed 4-node cluster as one of the machines.cluster.x-k8s.io objects is stuck in phase Provisioning despite the node completing the Tinkerbell workflow successfully and joining the cluster #6131

Open robbierafay opened 1 year ago

robbierafay commented 1 year ago

What happened: The K8s upgrade failed on a self-managed 4-node cluster (1 CP, 3 workers) because one of the workers' machines.cluster.x-k8s.io objects got stuck in phase Provisioning, despite the node completing the Tinkerbell workflow successfully and joining the cluster.

What you expected to happen: K8s upgrade to go through successfully

How to reproduce it (as minimally and precisely as possible):

Provision a 4-node cluster (1 CP, 3 workers) using K8s 1.23 and eksctl anywhere version 0.16.1. Add two spare hardware entries: 1 for CP and 1 for a worker.
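
For reference, the initial provisioning step looks roughly like this (a sketch; the file names assume the cluster.yaml and hardware.csv attached below, and exact flags can vary by eksctl anywhere version):

eksctl anywhere create cluster --filename cluster.yaml --hardware-csv hardware.csv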

Upgrade to k8s 1.24 using

eksctl anywhere upgrade cluster --no-timeouts --filename cluster-upgrade.yaml --force-cleanup

Nodes go through the rolling upgrade. An existing worker node (eksabm1-dp-n-2) is stuck in the Provisioning phase even though it joined the cluster OK. The eksctl anywhere upgrade cluster command never finishes. Cluster CRDs are stuck on the bootstrap cluster, and the workload cluster is unmanageable until the upgrade completes.

From bootstrap Kind cluster :

# kubectl get machines.c -A

NAMESPACE     NAME                                 CLUSTER   NODENAME                             PROVIDERID                                PHASE          AGE   VERSION
eksa-system   eksupr2-mz2gc                        eksupr2   eksupr2-mz2gc                        tinkerbell://eksa-system/eksabm1-cp-n-2   Running        51m   v1.24.13-eks-1-24-18
eksa-system   eksupr2-ng1-6956984fb8x5dnd9-hmxsr   eksupr2   eksupr2-ng1-6956984fb8x5dnd9-hmxsr   tinkerbell://eksa-system/eksabm1-dp-n-3   Running        51m   v1.23.17-eks-1-23-23
eksa-system   eksupr2-ng1-865d944788xkpz9x-7wdjt   eksupr2   eksupr2-ng1-865d944788xkpz9x-7wdjt   tinkerbell://eksa-system/eksabm1-dp-n-4   Running        41m   v1.24.13-eks-1-24-18
eksa-system   eksupr2-ng1-865d944788xkpz9x-msxrr   eksupr2                                                                                  Provisioning   18m   v1.24.13-eks-1-24-18
eksa-system   eksupr2-ng1-865d944788xkpz9x-vtgck   eksupr2   eksupr2-ng1-865d944788xkpz9x-vtgck   tinkerbell://eksa-system/eksabm1-dp-n-1   Running        32m   v1.24.13-eks-1-24-18
# kubectl get tinkerbellmachines -A
NAMESPACE     NAME                                                 CLUSTER   STATE   READY   INSTANCEID                                MACHINE
eksa-system   eksupr2-control-plane-template-1688724473017-rxzg4   eksupr2           true    tinkerbell://eksa-system/eksabm1-cp-n-2   eksupr2-mz2gc
eksa-system   eksupr2-ng1-1688720116994-c2sp5                      eksupr2           true    tinkerbell://eksa-system/eksabm1-dp-n-3   eksupr2-ng1-6956984fb8x5dnd9-hmxsr
eksa-system   eksupr2-ng1-1688724473433-ttdf7                      eksupr2           true    tinkerbell://eksa-system/eksabm1-dp-n-1   eksupr2-ng1-865d944788xkpz9x-vtgck
eksa-system   eksupr2-ng1-1688724473433-txchw                      eksupr2                   tinkerbell://eksa-system/eksabm1-dp-n-2   eksupr2-ng1-865d944788xkpz9x-msxrr
eksa-system   eksupr2-ng1-1688724473433-zm8gm                      eksupr2           true    tinkerbell://eksa-system/eksabm1-dp-n-4   eksupr2-ng1-865d944788xkpz9x-7wdjt
# kubectl get hardware -A
NAMESPACE     NAME             STATE
eksa-system   eksabm1-cp-n-1       <<<--- Original CP node 
eksa-system   eksabm1-cp-n-2       <<<--- Spare CP node added
eksa-system   eksabm1-dp-n-1       <<<--- Original worker node
eksa-system   eksabm1-dp-n-2       <<<--- Original worker node
eksa-system   eksabm1-dp-n-3       <<<--- Original worker node
eksa-system   eksabm1-dp-n-4       <<<--- Spare worker node added
# kubectl get kcp -A
NAMESPACE     NAME      CLUSTER   INITIALIZED   API SERVER AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE   VERSION
eksa-system   eksupr2   eksupr2   true          true                   1          1       1         0             51m   v1.24.13-eks-1-24-18
# kubectl get md -A
NAMESPACE     NAME          CLUSTER   REPLICAS   READY   UPDATED   UNAVAILABLE   PHASE     AGE   VERSION
eksa-system   eksupr2-ng1   eksupr2   4          3       3         1             Running   51m   v1.24.13-eks-1-24-18
# kubectl get workflows -A
NAMESPACE     NAME                                                 TEMPLATE                                             STATE
eksa-system   eksupr2-control-plane-template-1688724473017-rxzg4   eksupr2-control-plane-template-1688724473017-rxzg4   STATE_SUCCESS
eksa-system   eksupr2-ng1-1688724473433-c2awd                      eksupr2-ng1-1688724473433-c2awd                      STATE_SUCCESS
eksa-system   eksupr2-ng1-1688724473433-txchw                      eksupr2-ng1-1688724473433-txchw                      STATE_SUCCESS
eksa-system   eksupr2-ng1-1688724473433-zm8gm                      eksupr2-ng1-1688724473433-zm8gm                      STATE_SUCCESS
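
To dig further into why the machine never left Provisioning, the stuck objects can also be described from the bootstrap cluster (names taken from the output above; describe output not captured here):

# kubectl describe machines.cluster.x-k8s.io eksupr2-ng1-865d944788xkpz9x-msxrr -n eksa-system
# kubectl describe tinkerbellmachines eksupr2-ng1-1688724473433-txchw -n eksa-system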

From workload cluster :

# kubectl get nodes
NAME                                 STATUS   ROLES           AGE     VERSION
eksupr2-mz2gc                        Ready    control-plane   45m     v1.24.12-eks-76dc719
eksupr2-ng1-6956984fb8x5dnd9-hmxsr   Ready    <none>          81m     v1.23.17-eks-712c388
eksupr2-ng1-865d944788xkpz9x-7wdjt   Ready    <none>          37m     v1.24.12-eks-76dc719
eksupr2-ng1-865d944788xkpz9x-msxrr   Ready    <none>          2m57s   v1.24.12-eks-76dc719   <<<<<------ Problem node; joined cluster OK
eksupr2-ng1-865d944788xkpz9x-vtgck   Ready    <none>          24m     v1.24.12-eks-76dc719

Anything else we need to know?:

cluster-upgrade.yaml.txt hardware.csv.txt hardware-upgrade.csv.txt cluster.yaml.txt

cli.log.txt

Environment:

robbierafay commented 1 year ago

Any ideas on how to force machines.cluster.x-k8s.io to the Running phase?

robbierafay commented 1 year ago

To unblock myself, I was able to find a hack to force the upgrade to complete. I manually edited the Hardware CRD for the problem node and set the states to trigger the following condition in the code, which marked the TinkerbellMachine as Ready and the machines.cluster.x-k8s.io object as Running:

hw.Spec.Metadata.State == inUse && hw.Spec.Metadata.Instance.State == provisioned

Cluster upgrade completes after this change.

Change in the Hardware CRD:

......
.....
  metadata:
    facility:
      facility_code: onprem
      plan_slug: c2.medium.x86
    instance:
      allow_pxe: true
      always_pxe: true
      hostname: eksabm1-dp-n-1
      id: 08:00:27:73:28:85
      ips:
      - address: 192.168.10.229
        family: 4
        gateway: 192.168.10.1
        netmask: 255.255.255.0
        public: true
      operating_system: {}
      state: provisioned <<<<<<<<----------------------
    state: in_use <<<<<<<<----------------------
  userData: |2-
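
For reference, an equivalent non-interactive way to apply the same state change is a merge patch against the stuck node's Hardware object (a sketch; the hardware name is a placeholder and should be replaced with the Hardware backing the stuck machine):

# kubectl patch hardware <stuck-hardware-name> -n eksa-system --type merge -p '{"spec":{"metadata":{"state":"in_use","instance":{"state":"provisioned"}}}}'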