aws / eks-anywhere

Run Amazon EKS on your own infrastructure 🚀
https://anywhere.eks.amazonaws.com
Apache License 2.0

Error manually scaling vsphere worker nodes using latest EKSA #5710

Open · dashkan opened 1 year ago

dashkan commented 1 year ago

What happened: Ran the command below to increase the worker node count from 1 to 3 nodes

eksctl anywhere upgrade cluster -f eksa/stg-ne-eks.yaml --kubeconfig stg-ne/stg-ne-eks-a-cluster.kubeconfig 
Warning: VSphereDatacenterConfig configured in insecure mode
Performing setup and validations
Warning: VSphereDatacenterConfig configured in insecure mode
✅ Connected to server
✅ Authenticated to vSphere
✅ Datacenter validated
✅ Network validated
✅ Datastore validated
✅ Folder validated
✅ Resource pool validated
✅ Datastore validated
✅ Folder validated
✅ Resource pool validated
✅ Datastore validated
✅ Folder validated
✅ Resource pool validated
✅ Machine config tags validated
✅ Control plane and Workload templates validated
✅ Vsphere provider validation
✅ Validate OS is compatible with registry mirror configuration
✅ Validate certificate for registry mirror
✅ Control plane ready
✅ Worker nodes ready
✅ Nodes ready
✅ Cluster CRDs ready
✅ Cluster object present on workload cluster
✅ Upgrade cluster kubernetes version increment
✅ Validate authentication for git provider
✅ Validate immutable fields
Ensuring etcd CAPI providers exist on management cluster before upgrade
Pausing EKS-A cluster controller reconcile
Pausing GitOps cluster resources reconcile
Upgrading core components
Upgrading workload cluster
collecting cluster diagnostics
collecting management cluster diagnostics
collecting workload cluster diagnostics
⏳ Collecting support bundle from cluster, this can take a while {"cluster": "stg-ne", "bundle": "stg-ne/generated/stg-ne-2023-04-25T21:40:42Z-bundle.yaml", "since": 1682448042771430226, "kubeconfig": "stg-ne/stg-ne-eks-a-cluster.kubeconfig"}
Support bundle archive created  {"path": "support-bundle-2023-04-25T21_40_43.tar.gz"}
Analyzing support bundle    {"bundle": "stg-ne/generated/stg-ne-2023-04-25T21:40:42Z-bundle.yaml", "archive": "support-bundle-2023-04-25T21_40_43.tar.gz"}
Analysis output generated   {"path": "stg-ne/generated/stg-ne-2023-04-25T21:41:32Z-analysis.yaml"}
Error: failed to upgrade cluster: applying capi control plane spec: executing apply: The kubeadmcontrolplanes "stg-ne" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update

I actually upgraded this cluster to the latest control plane last week.

What you expected to happen: Cluster scales up

How to reproduce it (as minimally and precisely as possible): The cluster is already running EKS-A v0.15.2. Increase the worker node count and rerun the upgrade, as sketched below.
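
For reference, a minimal sketch of the reproduction steps, assuming a single worker node group named md-0 in the cluster spec; the file paths and command come from the original report, the count change is illustrative:

# In eksa/stg-ne-eks.yaml, bump the worker node group count
workerNodeGroupConfigurations:
- count: 3            # previously 1
  machineGroupRef:
    kind: VSphereMachineConfig
    name: stg-ne
  name: md-0

# Re-run the upgrade against the management cluster kubeconfig
eksctl anywhere upgrade cluster -f eksa/stg-ne-eks.yaml \
  --kubeconfig stg-ne/stg-ne-eks-a-cluster.kubeconfig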

Anything else we need to know?:

Environment:

abhay-krishna commented 1 year ago

Thanks for reporting the issue @dashkan! Could you please provide your cluster config file?

dashkan commented 1 year ago

@abhay-krishna, I just built a new 1.26 image and will try to upgrade to v0.15.3 first. I'll close the issue if I can upgrade.

dashkan commented 1 year ago

Same issue as before.

My cluster config (I changed some private values):

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: stg-ne
  namespace: default
spec:
  bundlesRef:
    apiVersion: anywhere.eks.amazonaws.com/v1alpha1
    name: bundles-36
    namespace: eksa-system
  clusterNetwork:
    cniConfig:
      cilium: {}
    pods:
      cidrBlocks:
      - 192.168.0.0/16
    services:
      cidrBlocks:
      - 10.96.0.0/12
  controlPlaneConfiguration:
    count: 3
    endpoint:
      host: 100.99.84.9
    machineGroupRef:
      kind: VSphereMachineConfig
      name: stg-ne-cp
  datacenterRef:
    kind: VSphereDatacenterConfig
    name: stg-ne
  externalEtcdConfiguration:
    count: 3
    machineGroupRef:
      kind: VSphereMachineConfig
      name: stg-ne-etcd
  kubernetesVersion: "1.26"
  managementCluster:
    name: stg-ne
  workerNodeGroupConfigurations:
  - count: 3
    machineGroupRef:
      kind: VSphereMachineConfig
      name: stg-ne
    name: md-0
  - count: 1
    machineGroupRef:
      kind: VSphereMachineConfig
      name: stg-ne
    name: md-1
    taints:
    - effect: NoSchedule
      key: myapp/infra
  - count: 3
    machineGroupRef:
      kind: VSphereMachineConfig
      name: stg-ne
    name: md-2
    taints:
    - effect: NoSchedule
      key: myapp/app

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereDatacenterConfig
metadata:
  name: stg-ne
  namespace: default
spec:
  datacenter: default
  insecure: true
  network: /default/network/VM Network
  server: myvspherehostname
  thumbprint: ""

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  annotations:
    anywhere.eks.amazonaws.com/control-plane: "true"
  name: stg-ne-cp
  namespace: default
spec:
  cloneMode: fullClone
  datastore: /default/datastore/isilon-ds1
  diskGiB: 25
  folder: /default/vm/staging
  memoryMiB: 8192
  numCPUs: 4
  osFamily: ubuntu
  resourcePool: /default/host/myapp/Resources/staging
  template: /default/vm/Templates/ubuntu-2004-kube-v1.26.4
  users:
  - name: engineering
    sshAuthorizedKeys:
    - ssh-rsa MYSSHPUBLICKEY

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: stg-ne
  namespace: default
spec:
  cloneMode: fullClone
  datastore: /default/datastore/isilon-ds1
  diskGiB: 25
  folder: /default/vm/staging
  memoryMiB: 16384
  numCPUs: 8
  osFamily: ubuntu
  resourcePool: /default/host/myapp/Resources/staging
  template: /default/vm/Templates/ubuntu-2004-kube-v1.26.4
  users:
  - name: engineering
    sshAuthorizedKeys:
    - ssh-rsa MYSSHPUBLICKEY

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  annotations:
    anywhere.eks.amazonaws.com/etcd: "true"
  name: stg-ne-etcd
  namespace: default
spec:
  cloneMode: fullClone
  datastore: /default/datastore/isilon-ds1
  diskGiB: 25
  folder: /default/vm/staging
  memoryMiB: 8192
  numCPUs: 4
  osFamily: ubuntu
  resourcePool: /default/host/myapp/Resources/staging
  template: /default/vm/Templates/ubuntu-2004-kube-v1.26.4
  users:
  - name: engineering
    sshAuthorizedKeys:
    - ssh-rsa MYSSHPUBLICKEY

---
✅ Connected to server
✅ Authenticated to vSphere
✅ Datacenter validated
✅ Network validated
✅ Datastore validated
✅ Folder validated
✅ Resource pool validated
✅ Datastore validated
✅ Folder validated
✅ Resource pool validated
✅ Datastore validated
✅ Folder validated
✅ Resource pool validated
✅ Machine config tags validated
✅ Control plane and Workload templates validated
✅ Vsphere provider validation
✅ Validate OS is compatible with registry mirror configuration
✅ Validate certificate for registry mirror
✅ Control plane ready
✅ Worker nodes ready
✅ Nodes ready
✅ Cluster CRDs ready
✅ Cluster object present on workload cluster
✅ Upgrade cluster kubernetes version increment
✅ Validate authentication for git provider
✅ Validate immutable fields
Ensuring etcd CAPI providers exist on management cluster before upgrade
Pausing EKS-A cluster controller reconcile
Pausing GitOps cluster resources reconcile
Upgrading core components
Upgrading workload cluster
collecting cluster diagnostics
collecting management cluster diagnostics
collecting workload cluster diagnostics
⏳ Collecting support bundle from cluster, this can take a while {"cluster": "stg-ne", "bundle": "stg-ne/generated/stg-ne-2023-05-01T20:22:59Z-bundle.yaml", "since": 1682961779987645163, "kubeconfig": "stg-ne/stg-ne-eks-a-cluster.kubeconfig"}
Support bundle archive created  {"path": "support-bundle-2023-05-01T20_23_01.tar.gz"}
Analyzing support bundle    {"bundle": "stg-ne/generated/stg-ne-2023-05-01T20:22:59Z-bundle.yaml", "archive": "support-bundle-2023-05-01T20_23_01.tar.gz"}
Analysis output generated   {"path": "stg-ne/generated/stg-ne-2023-05-01T20:24:00Z-analysis.yaml"}
Error: failed to upgrade cluster: applying capi control plane spec: executing apply: The kubeadmcontrolplanes "stg-ne" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update

dashkan commented 1 year ago

@abhay-krishna any thoughts on what is going on?

abhay-krishna commented 1 year ago

I think we can get a better picture if we try running the upgrade with verbosity 9, so that we know which commands are being run.

Also, could you try describing that kubeadmcontrolplane object? It should be in the eksa-system namespace.
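
For example, a sketch using the kubeconfig path from the original report (the -v flag raises the eksctl anywhere log verbosity):

# Re-run the upgrade with increased log verbosity
eksctl anywhere upgrade cluster -f eksa/stg-ne-eks.yaml \
  --kubeconfig stg-ne/stg-ne-eks-a-cluster.kubeconfig -v 9

# Inspect the KubeadmControlPlane object on the management cluster
kubectl describe kubeadmcontrolplane stg-ne -n eksa-system \
  --kubeconfig stg-ne/stg-ne-eks-a-cluster.kubeconfig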

dashkan commented 1 year ago

@abhay-krishna Attached all requested outputs: describe-output.txt, upgrade-plan-verbose.txt, upgrade-verbose.txt

jplewes commented 1 year ago

Also running into this exact problem.

My scenario is that we added a new workerNodeGroupConfigurations entry, adjusted machine counts, and added labels:

  workerNodeGroupConfigurations:
  - count: 6
    machineGroupRef:
      kind: VSphereMachineConfig
      name: group1
    name: db-0
    labels:
      "labelone": "1"
  - count: 3
    machineGroupRef:
      kind: VSphereMachineConfig
      name: group2
    name: md-0
    labels:
      "labeltwo": "2"

Error happened during retry {"error": "executing apply: The kubeadmcontrolplanes \"testcluster\" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update\n", "retries": 30}

jplewes commented 1 year ago

@abhay-krishna Any update on your thoughts?

I re-attempted a cluster upgrade with the only change being adding labels to a workerNodeGroupConfiguration and got the same error.

Invalid value: 0x0: must be specified for an update

Attached are sanitized KubeadmControlPlane YAML/describe outputs: kubeadmcontrolplane-yamlout.txt, kubeadmcontrolplane-describe.txt

jplewes commented 1 year ago

@dashkan @abhay-krishna

So for a temporary workaround:

My KubeadmControlPlane version is v1.23.17-eks-1-23-19 and it is left with a kubectl.kubernetes.io/last-applied-configuration annotation in its metadata. On other clusters running v1.23.15-eks-1-23-12 this annotation is not present.

To get past this issue I edited the KubeadmControlPlane and removed the kubectl.kubernetes.io/last-applied-configuration annotation under metadata. This allows the cluster upgrade to complete.
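
A minimal sketch of that workaround (placeholders for the cluster name and kubeconfig path; the commenter used an interactive edit, but kubectl annotate with a trailing dash removes the annotation non-interactively):

# Remove the stale annotation from the KubeadmControlPlane ('-' after the key deletes it)
kubectl annotate kubeadmcontrolplane <cluster-name> \
  kubectl.kubernetes.io/last-applied-configuration- \
  -n eksa-system --kubeconfig <management-cluster-kubeconfig>

# Then re-run the upgrade
eksctl anywhere upgrade cluster -f <cluster-config>.yaml \
  --kubeconfig <management-cluster-kubeconfig>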

After the successful upgrade, the kubectl.kubernetes.io/last-applied-configuration annotation is present again on the KubeadmControlPlane. Oddly enough, subsequent upgrades now complete without throwing the error. Something must be broken when merging the KubeadmControlPlane configuration under certain circumstances.