aws / eks-anywhere

Run Amazon EKS on your own infrastructure 🚀
https://anywhere.eks.amazonaws.com
Apache License 2.0

Upgrading the Kubernetes version of an EKS-A bare metal cluster with 2 CP nodes (1 used + 1 idle) doesn't work #7820

Open ygao-armada opened 4 months ago

ygao-armada commented 4 months ago

What happened: It's ok for me to upgrade a cluster with 4 CP nodes (3 used + 1 idle).

However, when I try to upgrade a cluster with 2 CP nodes (1 used + 1 idle), the upgrade gets stuck after the idle node reaches "Provisioned":

armada@admin-machine2:~/eksa/mgmt02$ kubectl get workflow -A -o wide
NAMESPACE     NAME                                                            TEMPLATE                                                        STATE
eksa-system   mgmt02-standalone2-control-plane-template-1710122858425-44xpk   mgmt02-standalone2-control-plane-template-1710122858425-44xpk   STATE_SUCCESS

armada@admin-machine2:~/eksa/mgmt02$ kubectl get node -o wide
NAME              STATUS   ROLES           AGE     VERSION    INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
eksa-control-02   Ready    control-plane   5h16m   v1.26.14   10.20.22.224   <none>        Ubuntu 20.04.6 LTS   5.4.0-172-generic   containerd://1.7.10
eksa-wk-650-01    Ready    <none>          4h56m   v1.26.14   10.20.22.227   <none>        Ubuntu 20.04.6 LTS   5.4.0-172-generic   containerd://1.7.10

armada@admin-machine2:~/eksa/mgmt02$ kubectl get machines.cluster.x-k8s.io -A -o wide
NAMESPACE     NAME                                  CLUSTER              NODENAME          PROVIDERID                                      PHASE         AGE     VERSION
eksa-system   mgmt02-standalone2-b94wm              mgmt02-standalone2   eksa-control-02   tinkerbell://eksa-system/eksa-control-02        Running       4h54m   v1.26.10-eks-1-26-21
eksa-system   mgmt02-standalone2-md-0-dvxsx-ntvph   mgmt02-standalone2   eksa-wk-650-01    tinkerbell://eksa-system/eksa-wk-650-01         Running       4h54m   v1.26.10-eks-1-26-21
eksa-system   mgmt02-standalone2-p8fwl              mgmt02-standalone2                     tinkerbell://eksa-system/eksa-main-control-01   Provisioned   45m     v1.27.7-eks-1-27-15
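For context, this kind of upgrade is driven by bumping the Kubernetes version in the EKS-A cluster spec and re-running the upgrade. A minimal sketch of the relevant fields, assuming a standard Tinkerbell layout (values other than the cluster name are illustrative, not taken from the actual spec):

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: mgmt02-standalone2
spec:
  kubernetesVersion: "1.27"       # bumped from "1.26" to trigger the rolling upgrade
  controlPlaneConfiguration:
    count: 1                      # one in-use CP node; the second CP machine is left as spare hardware
    endpoint:
      host: 10.20.22.222          # assumed control plane VIP
  datacenterRef:
    kind: TinkerbellDatacenterConfig
    name: mgmt02-standalone2

The upgrade itself would then be applied with something like eksctl anywhere upgrade cluster -f mgmt02.yaml (exact flags assumed).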

A telling symptom after the idle node reaches "Provisioned" is that, when I run the above "kubectl get ..." commands, I sometimes see this error message:

...
E0311 07:45:31.895702  953673 memcache.go:265] couldn't get current server API group list: Get "https://10.20.22.222:6443/api?timeout=32s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")
Unable to connect to the server: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "kubernetes")
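Not part of the original report, but one way to check whether the endpoint is already serving a certificate that the admin kubeconfig's CA cannot verify (10.20.22.222 is assumed to be the control plane VIP from the error above, and the first cluster entry in the kubeconfig is assumed):

# Inspect the certificate currently served on the API endpoint
openssl s_client -connect 10.20.22.222:6443 -showcerts </dev/null 2>/dev/null | openssl x509 -noout -issuer -subject -dates

# Compare against the CA embedded in the admin kubeconfig
kubectl config view --raw -o jsonpath="{.clusters[0].cluster['certificate-authority-data']}" | base64 -d | openssl x509 -noout -subject -dates

If the issuer of the served certificate does not match the kubeconfig CA, that would explain the intermittent x509 errors above.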

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

ndeksa commented 3 months ago

@ygao-armada, ideally there shouldn't be a difference based on the number of CP nodes; I'm curious whether the node is in the idle state because of a lack of resources, some event, or something else?

ygao-armada commented 2 months ago

@ndeksa sorry, I should have used "spare" instead of "idle".