eksctl-io / eksctl

The official CLI for Amazon EKS
https://eksctl.io

running `utils update-kube-proxy` right after `upgrade cluster` sometimes mistakenly skips the update #2434

Closed rndstr closed 4 years ago

rndstr commented 4 years ago

What happened? `eksctl utils update-kube-proxy` queries the server version through the raw Kubernetes client. Once `eksctl upgrade cluster` finishes, the version returned alternates between the new and the old version for a while.

$ eksctl upgrade cluster -f simple-15.yaml
…
[✔]  cluster "roli-dev" control plane has been upgraded to version "1.15"                                                                                                             
[ℹ]  you will need to follow the upgrade procedure for all of nodegroups and add-ons       
[ℹ]  re-building cluster stack "eksctl-roli-dev-cluster"                                   
[✔]  all resources in cluster stack "eksctl-roli-dev-cluster" are up-to-date               
[ℹ]  checking security group configuration for all nodegroups                              
[ℹ]  all nodegroups have up-to-date configuration                                                                                                                                     

$ while true; do ./eksctl utils update-kube-proxy --cluster=roli-dev --region=us-west-2; done                                                                        
[ℹ]  eksctl version 0.25.0-dev+e59bf81a.2020-07-13T13:16:20Z                               
[ℹ]  using region us-west-2                                                                
[ℹ]  imageParts = [602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/kube-proxy v1.14.6], desiredTag = v1.14.9                                                                         
[✖]  (plan) "kube-proxy" is not up-to-date
[!]  no changes were applied, run again with '--approve' to apply the changes              
[ℹ]  eksctl version 0.25.0-dev+e59bf81a.2020-07-13T13:16:20Z
[ℹ]  using region us-west-2                                                                
[ℹ]  imageParts = [602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/kube-proxy v1.14.6], desiredTag = v1.15.11                                                                        
[✖]  (plan) "kube-proxy" is not up-to-date
[!]  no changes were applied, run again with '--approve' to apply the changes
[ℹ]  eksctl version 0.25.0-dev+e59bf81a.2020-07-13T13:16:20Z
[ℹ]  using region us-west-2
[ℹ]  imageParts = [602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/kube-proxy v1.14.6], desiredTag = v1.14.9
[✖]  (plan) "kube-proxy" is not up-to-date
[!]  no changes were applied, run again with '--approve' to apply the changes
[ℹ]  eksctl version 0.25.0-dev+e59bf81a.2020-07-13T13:16:20Z
[ℹ]  using region us-west-2
[ℹ]  imageParts = [602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/kube-proxy v1.14.6], desiredTag = v1.15.11
[✖]  (plan) "kube-proxy" is not up-to-date
[!]  no changes were applied, run again with '--approve' to apply the changes
[ℹ]  eksctl version 0.25.0-dev+e59bf81a.2020-07-13T13:16:20Z
[ℹ]  using region us-west-2
[ℹ]  imageParts = [602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/kube-proxy v1.14.6], desiredTag = v1.14.9
[✖]  (plan) "kube-proxy" is not up-to-date
[!]  no changes were applied, run again with '--approve' to apply the changes
[ℹ]  eksctl version 0.25.0-dev+e59bf81a.2020-07-13T13:16:20Z
[ℹ]  using region us-west-2
[ℹ]  imageParts = [602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/kube-proxy v1.14.6], desiredTag = v1.14.9
[✖]  (plan) "kube-proxy" is not up-to-date
[!]  no changes were applied, run again with '--approve' to apply the changes

(see desiredTag output)

Therefore, depending on which version happens to be returned, the call to `eksctl utils update-kube-proxy` may say "no action required" even though we actually do want to update.
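A possible workaround, sketched under assumptions: poll the server version and only run `update-kube-proxy` once the expected version has come back several times in a row. `server_minor` and `wait_for_stable_version` are hypothetical helper names, not part of eksctl. The sketch assumes `kubectl` and `jq` are installed, and the default streak of 10 reads is an arbitrary guess, not a documented guarantee.

```shell
# Prints the server's "major.minor", e.g. "1.15" (EKS reports minor as "15+").
# Assumption: kubectl and jq are installed and the kubeconfig targets the cluster.
server_minor() {
  kubectl version -o json \
    | jq -r '.serverVersion | "\(.major).\(.minor)"' \
    | tr -d '+'
}

# wait_for_stable_version EXPECTED [REQUIRED_STREAK] [MAX_ATTEMPTS] [INTERVAL_SECONDS]
# Hypothetical helper: succeeds once EXPECTED has been seen REQUIRED_STREAK
# times in a row, fails after MAX_ATTEMPTS reads.
wait_for_stable_version() {
  expected=$1
  required=${2:-10}
  max_attempts=${3:-120}
  interval=${4:-5}
  streak=0
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    attempt=$((attempt + 1))
    if [ "$(server_minor)" = "$expected" ]; then
      streak=$((streak + 1))
      if [ "$streak" -ge "$required" ]; then
        return 0
      fi
    else
      streak=0 # an answer from a not-yet-upgraded endpoint resets the count
    fi
    sleep "$interval"
  done
  return 1
}

# Usage (not executed here):
#   wait_for_stable_version 1.15 && \
#     eksctl utils update-kube-proxy --cluster=roli-dev --region=us-west-2 --approve
```

A streak of consecutive matching reads only makes a stale answer less likely; it cannot rule one out, so this is a stopgap rather than a fix for the underlying inconsistency.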

What you expected to happen? Either a) eksctl upgrade cluster to wait until all control planes are updated, or b) eksctl utils update-kube-proxy to update the proxy to the newest control plane version

How to reproduce it?

simple-14.yaml

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: simple
  region: us-west-2
  version: "1.14"
nodeGroups:
  - name: ng-1
    instanceType: m5.large
    desiredCapacity: 1

simple-15.yaml is the same file with version: "1.15"

$ eksctl create cluster -f simple-14.yaml
$ eksctl utils write-kubeconfig --name=simple --region=us-west-2
$ eksctl upgrade cluster -f simple-15.yaml; while true; do kubectl version; done

Anything else we need to know?

Versions Please paste in the output of these commands:

$ eksctl version
0.25.0-dev+e59bf81a.2020-07-13T13:16:20Z
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.2", GitCommit:"59603c6e503c87169aea6106f57b9f242f64df89", GitTreeState:"clean", BuildDate:"2020-01-18T23:30:10Z", GoVersion:"go1.13.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.11-eks-14f01f", GitCommit:"14f01fe8f04411d5e187b220034ca2117d79f7de", GitTreeState:"clean", BuildDate:"2020-05-23T21:32:47Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}

Logs Update.Status from EKS::DescribeUpdate once eksctl proceeds from upgrade cluster:

 {
  Update: {
    CreatedAt: 2020-07-13 20:17:39 +0000 UTC, 
    Errors: [],
    Id: "b880fbe0-f8a6-4fcf-ba12-42db712b0edc",
    Params: [{
        Type: "Version",
        Value: "1.15"
      },{
        Type: "PlatformVersion",
        Value: "eks.3"
      }],
    Status: "Successful",
    Type: "VersionUpdate"
  }
}

rndstr commented 4 years ago

For: a) eksctl upgrade cluster to wait until all control planes are updated

https://docs.aws.amazon.com/eks/latest/APIReference/API_UpdateClusterVersion.html

Cluster updates are asynchronous, and they should finish within a few minutes. During an update, the cluster status moves to UPDATING (this status transition is eventually consistent). When the update is complete (either Failed or Successful), the cluster status moves to Active.

I don't see any mention that it means only part of the control plane has been updated.

saada commented 4 years ago

This is still an issue after the patch. Roughly 50% of the time the version reported back is inconsistent: sometimes it shows the old version and other times the new one.

Is there a way we can wait for the control plane to be fully upgraded?

saada commented 4 years ago

From https://github.com/aws/aws-sdk-go/blob/v1.34.25/service/eks/api.go#L2185

It shows

Cluster updates are asynchronous, and they should finish within a few minutes. During an update, the cluster status moves to UPDATING (this status transition is eventually consistent). When the update is complete (either Failed or Successful), the cluster status moves to Active.

Can we add a wait for the Active status?

michaelbeaumont commented 4 years ago

At the moment we already wait for the Update itself to move to Successful. I suppose we can add a wait for the cluster status to move to Active as well. (EDIT: although it isn't clear why one but not the other would ensure a complete update)
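For anyone who wants to try the extra wait from the command line, here is a minimal sketch that polls the cluster status until it reads ACTIVE. `cluster_status` and `wait_for_cluster_active` are hypothetical helper names, and the sketch assumes the AWS CLI is installed and configured. Nothing in this thread establishes that ACTIVE implies every API server endpoint already reports the new version.

```shell
# Prints the EKS cluster status, e.g. UPDATING or ACTIVE.
# Assumption: the AWS CLI is installed and has credentials for the account.
cluster_status() {
  aws eks describe-cluster --name "$1" --region "$2" \
    --query 'cluster.status' --output text
}

# wait_for_cluster_active NAME REGION [MAX_ATTEMPTS] [INTERVAL_SECONDS]
# Hypothetical helper: succeeds when the status is ACTIVE, fails when it is
# FAILED or when MAX_ATTEMPTS polls have been used up.
wait_for_cluster_active() {
  name=$1
  region=$2
  max_attempts=${3:-40}
  interval=${4:-30}
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    attempt=$((attempt + 1))
    status=$(cluster_status "$name" "$region")
    case "$status" in
      ACTIVE) return 0 ;;
      FAILED) return 1 ;;
    esac
    sleep "$interval"
  done
  return 1
}

# Usage (not executed here):
#   wait_for_cluster_active simple us-west-2
```

The AWS CLI also ships a built-in waiter, `aws eks wait cluster-active --name simple`, which does the same polling without a custom loop.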

saada commented 4 years ago

From the SDK code comment above

Cluster updates are asynchronous

Perhaps the successful return of the command only confirms that the async request was accepted, not that the upgrade has completed?