eksctl-io / eksctl

The official CLI for Amazon EKS
https://eksctl.io

eksctl 0.17.0 Timeout when upgrading to 1.14 -> 1.15 #2096

Closed (InAnimaTe closed this issue 4 years ago)

InAnimaTe commented 4 years ago

What happened?

2020-04-27T07:12:47-04:00 [▶]  waiting for requested "VersionUpdate" in cluster "arryn-staging-redux" to succeed
Error: waiting for requested "VersionUpdate" in cluster "arryn-staging-redux" to succeed: RequestCanceled: waiter context canceled
caused by: context deadline exceeded

A few minutes after this, the cluster does appear to be healthy and shows version 1.15.

What you expected to happen?

I expected the cluster to be upgraded within the ~25 minute timeframe, as the other 3 clusters I did this upgrade on completed within time. Example:

2020-04-27T07:11:42-04:00 [▶]  waiting for requested "VersionUpdate" in cluster "dev-sandbox-redux" to succeed
2020-04-27T07:11:42-04:00 [▶]  done after 24m12.255459373s of waiting for requested "VersionUpdate" in cluster "dev-sandbox-redux" to succeed
2020-04-27T07:11:42-04:00 [✔]  cluster "dev-sandbox-redux" control plane has been upgraded to version "1.15"
2020-04-27T07:11:42-04:00 [ℹ]  you will need to follow the upgrade procedure for all of nodegroups and add-ons

Additionally, I can't see anything in CloudFormation about the upgrade: the cluster stack has no new events, which I would have expected to see. I presume some of eksctl's post-upgrade process didn't run. Am I now in a bad state? How do I properly "finish" this upgrade with eksctl?
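One way to check the control plane independently of eksctl after the waiter times out is to ask EKS directly. The following is a minimal sketch, assuming the AWS CLI is installed and configured for the cluster's region. The function names are hypothetical and are not part of eksctl, and the check covers only the control plane, not the CloudFormation stack or any other post-upgrade step.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not eksctl or AWS CLI features.
# Assumes the AWS CLI is installed and has credentials for the cluster.

# Print the Kubernetes version the control plane currently reports.
control_plane_version() {
  aws eks describe-cluster --name "$1" --query 'cluster.version' --output text
}

# Print the cluster status (for example ACTIVE or UPDATING).
cluster_status() {
  aws eks describe-cluster --name "$1" --query 'cluster.status' --output text
}

# Succeed only if the cluster is ACTIVE and already at the wanted version.
control_plane_upgraded() {
  local name="$1" want="$2"
  [ "$(cluster_status "$name")" = "ACTIVE" ] &&
    [ "$(control_plane_version "$name")" = "$want" ]
}

# Example, using the cluster name from the command in this report:
#   control_plane_upgraded arryn 1.15 && echo "control plane is at 1.15"
```

A cluster that is still `UPDATING` has not failed; the update request may simply have outlived the client-side wait.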

How to reproduce it?

eksctl update cluster -n arryn -w -v 4 --approve

Anything else we need to know?

Versions

Post upgrade:

Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.1", GitCommit:"d224476cd0730baca2b6e357d144171ed74192d6", GitTreeState:"clean", BuildDate:"2020-01-15T15:50:25Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.11-eks-af3caf", GitCommit:"af3caf6136cd355f467083651cc1010a499f59b1", GitTreeState:"clean", BuildDate:"2020-03-27T21:51:36Z", GoVersion:"go1.12.17", Compiler:"gc", Platform:"linux/amd64"}

Logs https://gist.github.com/InAnimaTe/814af03d2a247b1667bbc10a8b05837e

martina-if commented 4 years ago

Hi @InAnimaTe, thanks for reporting this. What happens if you run the update command again? (In this case it should be safe since it won't upgrade to 1.16).

InAnimaTe commented 4 years ago

Looks like that worked. (I'd share the output, but it's massive and I don't feel like filtering out the private stuff.) It ran successfully, did some things, and I now see an updated CloudFormation stack:

2020-04-28 15:14:24 UTC-0400    eksctl-arryn-cluster    UPDATE_COMPLETE -
2020-04-28 15:14:23 UTC-0400    eksctl-arryn-cluster    UPDATE_COMPLETE_CLEANUP_IN_PROGRESS -
2020-04-28 15:14:20 UTC-0400    IngressDefaultClusterToNodeSG   CREATE_COMPLETE -
2020-04-28 15:14:20 UTC-0400    IngressNodeToDefaultClusterSG   CREATE_COMPLETE -
2020-04-28 15:14:20 UTC-0400    IngressDefaultClusterToNodeSG   CREATE_IN_PROGRESS  Resource creation Initiated
2020-04-28 15:14:19 UTC-0400    IngressDefaultClusterToNodeSG   CREATE_IN_PROGRESS  -
2020-04-28 15:14:19 UTC-0400    IngressNodeToDefaultClusterSG   CREATE_IN_PROGRESS  Resource creation Initiated
2020-04-28 15:14:19 UTC-0400    IngressNodeToDefaultClusterSG   CREATE_IN_PROGRESS  -
2020-04-28 15:14:13 UTC-0400    eksctl-arryn-cluster    UPDATE_IN_PROGRESS  User Initiated
2020-02-20 18:33:40 UTC-0500    eksctl-arryn-cluster    UPDATE_COMPLETE

@martina-if could you clarify what is actually being done here post-upgrade (or point me to the relevant documentation/functions)? It looks like some security group tuning for Ingress to function properly.

So basically, if this step times out, just re-run the command and I should be good?

martina-if commented 4 years ago

@InAnimaTe eksctl performs a few operations, mainly upgrading the control plane and the stacks, so I think rerunning it would perform the stack upgrade. This is not something to rely on, though, because other things happen too and, as I said, there is no continuation implemented.

I'm glad that it worked :+1:
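The workaround discussed in this thread (wait for the control plane to settle, then run the same command again) can be sketched as below. This is an illustration under assumptions, not a supported recovery path: `rerun_update` is a hypothetical wrapper, the AWS CLI is assumed to be installed and configured, and, as noted above, eksctl implements no continuation, so a second run is not guaranteed to cover every step the first run skipped.

```shell
#!/usr/bin/env bash
# Sketch only: rerun_update is a hypothetical wrapper, not an eksctl feature.

# Wait until EKS reports the cluster as ACTIVE, then re-run the update.
rerun_update() {
  local name="$1"
  # AWS CLI built-in waiter; returns non-zero if the cluster never becomes ACTIVE.
  aws eks wait cluster-active --name "$name" || return 1
  # Same command and flags as in the original report.
  eksctl update cluster -n "$name" -w -v 4 --approve
}

# Example:
#   rerun_update arryn
```

If the waiter fails, the wrapper stops without calling eksctl, which avoids starting a second update while the first is still in flight.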