Guillermo's thoughts on the topic:
There are a few strategies we can try. For example, only update CAPI objects upon a change in an EKS-A object. That has the drawback of not being able to fight drift. There are other options as well, but to be honest, we might not be able to avoid it every single time. I would be happy if we avoid it 99% of the time and just document the other 1%. I believe what you have started doing is the right first step: have tests that allow us to detect when this happens. I anticipate we will be able to find individual solutions for most of these instances (like the auth API version), and I hope the more we do it, the clearer the patterns will become. And with patterns comes the ability to automate.
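A minimal sketch of the first strategy above (only regenerating CAPI objects when the EKS-A object itself has changed), assuming a hypothetical spec-hash annotation recorded on the CAPI object; `specHashAnnotation`, `computeSpecHash`, and `needsCAPIUpdate` are illustrative names, not existing code in the controller:

```go
package reconciler

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"

	anywherev1 "github.com/aws/eks-anywhere/pkg/api/v1alpha1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// specHashAnnotation is a hypothetical annotation holding the hash of the
// EKS-A cluster spec that produced the current CAPI objects.
const specHashAnnotation = "anywhere.eks.amazonaws.com/spec-hash"

// computeSpecHash returns a stable hash of the EKS-A cluster spec.
func computeSpecHash(c *anywherev1.Cluster) (string, error) {
	b, err := json.Marshal(c.Spec)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

// needsCAPIUpdate reports whether the CAPI objects should be regenerated:
// only when the EKS-A spec hash differs from the hash recorded on the CAPI
// Cluster the last time it was written.
func needsCAPIUpdate(eksaCluster *anywherev1.Cluster, capiCluster *clusterv1.Cluster) (bool, error) {
	want, err := computeSpecHash(eksaCluster)
	if err != nil {
		return false, err
	}
	got := capiCluster.Annotations[specHashAnnotation]
	return want != got, nil
}
```

The cost, as noted above, is that drift introduced directly in the CAPI objects would no longer be corrected until the EKS-A spec changes again.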
One other consideration is that there may be times when a rolling upgrade on workload clusters is appropriate. One example is https://github.com/aws/eks-anywhere/pull/3566, which adds labels to the nodes of all clusters managed by a single management cluster. For this change to have the intended outcome, the VMs must re-run cloud-init. Once that happens, users will be able to better distribute workloads across their workload clusters.
What happened: Occasionally, changes such as the one in https://github.com/aws/eks-anywhere/pull/3402 cause minor changes in the CAPI resources that the EKS-A controller generates. When a management cluster is upgraded and a new EKS-A controller is installed on it, the controller applies these new resources while reconciling not just the management cluster but the workload clusters as well, causing all of them to perform an unexpected rolling upgrade. This rolling upgrade of workload clusters is undesirable behavior for users and should be avoided or mitigated.
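A rough sketch of a test that could catch this class of regression: decode the CAPI manifests rendered by the new controller code and compare their specs against golden manifests committed with the previous release, failing on any diff that would trigger a rollout. The test name and the `testdata/golden.yaml` / `testdata/new.yaml` paths are placeholders, and producing `new.yaml` from a fixed EKS-A cluster spec is assumed to happen in a prior build step:

```go
package capidiff_test

import (
	"bytes"
	"errors"
	"io"
	"os"
	"testing"

	"k8s.io/apimachinery/pkg/api/equality"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/util/yaml"
)

// loadObjects decodes a multi-document YAML file into unstructured objects,
// keyed by kind/namespace/name.
func loadObjects(t *testing.T, path string) map[string]*unstructured.Unstructured {
	t.Helper()
	data, err := os.ReadFile(path)
	if err != nil {
		t.Fatalf("reading %s: %v", path, err)
	}
	objs := map[string]*unstructured.Unstructured{}
	dec := yaml.NewYAMLOrJSONDecoder(bytes.NewReader(data), 4096)
	for {
		u := &unstructured.Unstructured{}
		err := dec.Decode(u)
		if errors.Is(err, io.EOF) {
			break
		}
		if err != nil {
			t.Fatalf("decoding %s: %v", path, err)
		}
		if len(u.Object) == 0 {
			continue // skip empty documents
		}
		objs[u.GetKind()+"/"+u.GetNamespace()+"/"+u.GetName()] = u
	}
	return objs
}

// TestCAPIObjectsUnchanged fails when any CAPI object spec rendered by the
// new controller code differs from the previous release's golden output;
// such a diff is exactly what would trigger a rolling upgrade on existing
// workload clusters.
func TestCAPIObjectsUnchanged(t *testing.T) {
	golden := loadObjects(t, "testdata/golden.yaml")
	current := loadObjects(t, "testdata/new.yaml")

	for key, want := range golden {
		got, ok := current[key]
		if !ok {
			t.Errorf("object %s is no longer rendered", key)
			continue
		}
		if !equality.Semantic.DeepEqual(want.Object["spec"], got.Object["spec"]) {
			t.Errorf("spec of %s changed; this would trigger a rolling upgrade", key)
		}
	}
}
```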
While running large-scale tests with many workload clusters, I observed that every workload cluster began its rolling upgrade simultaneously. With ten workload clusters whose VMs each have 16 GB of memory and 8 CPUs, starting all the rolling upgrades at once requires a huge spike in resources that I may not be prepared to handle. The problem becomes even more pronounced with larger VMs and more workload clusters.
What you expected to happen: Ideally, a management cluster upgrade would never trigger rolling upgrades on workload clusters as a side effect. Given that they share some resources, however, any such upgrades should at least be staggered, and the CLI should wait for them all to complete and for the workload clusters to stabilize. Workload clusters should be paused for the duration of the management cluster's upgrade and then unpaused gradually, waiting for each cluster to become ready.
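A rough sketch of what that staggering could look like, using Cluster API's `spec.paused` field to suspend reconciliation of workload clusters and then unpausing them one at a time; the function names, timeouts, and readiness check are illustrative, and this assumes a controller-runtime client pointed at the management cluster:

```go
package upgrade

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// pauseWorkloadClusters sets spec.paused on every CAPI Cluster except the
// management cluster, so CAPI controllers stop reconciling them while the
// management cluster is being upgraded.
func pauseWorkloadClusters(ctx context.Context, c client.Client, mgmtClusterName string) error {
	clusters := &clusterv1.ClusterList{}
	if err := c.List(ctx, clusters); err != nil {
		return err
	}
	for i := range clusters.Items {
		cl := &clusters.Items[i]
		if cl.Name == mgmtClusterName {
			continue
		}
		cl.Spec.Paused = true
		if err := c.Update(ctx, cl); err != nil {
			return err
		}
	}
	return nil
}

// unpauseGradually unpauses workload clusters one at a time, waiting for each
// cluster's control plane to report ready before moving on, so any resulting
// rolling upgrades are staggered instead of simultaneous.
func unpauseGradually(ctx context.Context, c client.Client, keys []client.ObjectKey) error {
	for _, key := range keys {
		cl := &clusterv1.Cluster{}
		if err := c.Get(ctx, key, cl); err != nil {
			return err
		}
		cl.Spec.Paused = false
		if err := c.Update(ctx, cl); err != nil {
			return err
		}
		// Poll until the control plane is ready again (interval and timeout
		// are illustrative).
		err := wait.PollUntilContextTimeout(ctx, 30*time.Second, 60*time.Minute, true,
			func(ctx context.Context) (bool, error) {
				current := &clusterv1.Cluster{}
				if err := c.Get(ctx, key, current); err != nil {
					return false, nil // retry on transient errors
				}
				return current.Status.ControlPlaneReady, nil
			})
		if err != nil {
			return err
		}
	}
	return nil
}
```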
How to reproduce it (as minimally and precisely as possible): Upgrade a management cluster that has a workload cluster from v0.11.3 to v0.12.x (or the code in main at the time of writing this issue). The management cluster upgrade will trigger a rolling upgrade of the workload cluster.
Subtasks:
Environment: