aws / eks-anywhere

Run Amazon EKS on your own infrastructure 🚀
https://anywhere.eks.amazonaws.com
Apache License 2.0

Upgrading eks anywhere baremetal cluster failed due to ipmi related issue #4540

Open or9327 opened 1 year ago

or9327 commented 1 year ago

What happened: Upgrading an EKS Anywhere bare-metal cluster failed due to an IPMI-related issue.

What you expected to happen: The upgrade completes normally.

How to reproduce it (as minimally and precisely as possible):

  1. Create a k8s cluster with 2 machines and 5 hardware entries -> pass

    • hardware.csv containing 5 available hardware entries
    • cluster-mgmt-cluster.yaml with a 1 cp, 1 dp specification
  2. Upgrade the cluster -> fail

    • cluster-mgmt-cluster-upgrade.yaml, which differs from the file used to create the cluster only in the k8s version
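To confirm that the two spec files really differ only in the Kubernetes version, a quick text diff can help. The following is a minimal sketch, assuming the version is set through the `kubernetesVersion` field of the cluster spec; `changed_lines` and `only_version_changed` are hypothetical helper names, not part of EKS Anywhere.

```python
import difflib


def changed_lines(create_spec, upgrade_spec):
    """Return the lines that were removed from or added to the spec text."""
    diff = difflib.ndiff(create_spec.splitlines(), upgrade_spec.splitlines())
    # ndiff prefixes removed lines with "- " and added lines with "+ ".
    return [line for line in diff if line.startswith(("- ", "+ "))]


def only_version_changed(create_spec, upgrade_spec):
    """True if there is at least one change and every change touches kubernetesVersion."""
    changes = changed_lines(create_spec, upgrade_spec)
    return bool(changes) and all("kubernetesVersion" in line for line in changes)
```

Reading both files and passing their contents to `only_version_changed` returns `False` if anything other than a `kubernetesVersion` line was edited, which would be worth mentioning in the report.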

Anything else we need to know?:

| model | idrac version |
| -- | -- |
| Dell PowerEdge R720 | 7 |
| Dell PowerEdge R730 | 8 |
| Dell PowerEdge R730 | 8 |
| Dell PowerEdge R720 | 7 |
| Dell PowerEdge R720 | 7 |

Environment:

abhinavmpandey08 commented 1 year ago

Hi @or9327, thanks for opening the issue. Can you verify IPMI over LAN is enabled on all the servers? Also, can you tell us which hardware was used in the first cluster creation?
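One way to check this from the admin machine is to query each BMC with `ipmitool` over the `lanplus` interface. The sketch below is only an illustration: it assumes the hardware CSV has `hostname`, `bmc_ip`, `bmc_username`, and `bmc_password` columns (adjust the names to match your file), and `ipmi_status_commands` and `check_all` are hypothetical helpers, not part of EKS Anywhere.

```python
import csv
import io
import subprocess


def ipmi_status_commands(csv_text):
    """Build one 'chassis power status' ipmitool command per hardware row.

    The column names used here are an assumption about the CSV layout.
    """
    commands = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        commands[row["hostname"]] = [
            "ipmitool", "-I", "lanplus",
            "-H", row["bmc_ip"],
            "-U", row["bmc_username"],
            "-P", row["bmc_password"],
            "chassis", "power", "status",
        ]
    return commands


def check_all(csv_path):
    """Run each command and print whether the BMC answered over the LAN."""
    with open(csv_path, newline="") as handle:
        commands = ipmi_status_commands(handle.read())
    for hostname, command in commands.items():
        try:
            result = subprocess.run(
                command, capture_output=True, text=True, timeout=30
            )
        except subprocess.TimeoutExpired:
            print(f"{hostname}: TIMEOUT")
            continue
        state = "ok" if result.returncode == 0 else "FAILED"
        detail = result.stdout.strip() or result.stderr.strip()
        print(f"{hostname}: {state} {detail}")
```

Running `check_all("hardware.csv")` requires `ipmitool` to be installed on the machine. A BMC that times out or fails here is worth investigating before retrying the upgrade.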

or9327 commented 1 year ago

Hi @abhinavmpandey08, thanks for the reply. IPMI over LAN was enabled on all the servers, and the hardware used for the first cluster creation was the two Dell PowerEdge R720 machines with iDRAC 7. Creating eksa cluster with above 5 hardwares worked fine.

abhinavmpandey08 commented 1 year ago

Thanks for that information! One clarification on

> Creating eksa cluster with above 5 hardwares worked fine.

Did you create a 5 node cluster or just a 2 node cluster with 5 hardwares in the CSV?

or9327 commented 1 year ago

I created a 2-node cluster with 5 hardware entries so that spare hardware would be available for the upgrade. FYI, creating a 5-node cluster worked fine.

abhinavmpandey08 commented 1 year ago

Okay, so it sounds like IPMI worked during create but not during upgrade? It's possible then that there was a random network error when EKS-A tried to power on the node during upgrade. If you get a chance, can you retry the upgrade process to figure out whether it's a consistent issue?

or9327 commented 1 year ago

It seems to be a consistent issue for me: I've tried the upgrade process more than 5 times, and it failed every time.