Snowball Edge EKS-A Upgrade fails for Instance Type Upgrade. #5812

elamaran11 commented 1 year ago

What happened:

The EKS-A upgrade on Snowball Edge fails when changing the instance type. I tried to upgrade the nodes from the large instance type to the 2xlarge instance type, and the upgrade fails with the errors below:

```
2023-05-09T15:53:23.575Z    V6  Executing command   {"cmd": "/usr/bin/docker rm -f -v eksa_1683647549800094637"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x110 pc=0x2a2604c]

goroutine 1 [running]:
github.com/aws/eks-anywhere/pkg/providers/snow.(*SnowProvider).EnvMap(0x2dc0e40?, 0xc000570f80)
    github.com/aws/eks-anywhere/pkg/providers/snow/snow.go:197 +0x4c
github.com/aws/eks-anywhere/pkg/executables.(*Clusterctl).InstallEtcdadmProviders(0xc00037bc40, {0x3a65990, 0xc000126000}, 0x1c?, 0xc00041a330, {0x3a73eb0, 0xc0002eb860?}, {0xc0001415e0, 0x2, 0x2})
    github.com/aws/eks-anywhere/pkg/executables/clusterctl.go:422 +0x2df
github.com/aws/eks-anywhere/pkg/clusterapi.(*Installer).EnsureEtcdProvidersInstallation(0xc00056e618, {0x3a65990, 0xc000126000}, 0xc00041a330, {0x3a73eb0, 0xc0002eb860}, 0xc000570f80)
    github.com/aws/eks-anywhere/pkg/clusterapi/installer.go:49 +0x2bc
github.com/aws/eks-anywhere/pkg/workflows.(*ensureEtcdCAPIComponentsExistTask).Run(0xc000548e80?, {0x3a65990, 0xc000126000}, 0xc0001320f0)
    github.com/aws/eks-anywhere/pkg/workflows/upgrade.go:206 +0x96
github.com/aws/eks-anywhere/pkg/task.(*taskRunner).RunTask(0xc00041a360, {0x3a65990, 0xc000126000}, 0xc0001320f0)
    github.com/aws/eks-anywhere/pkg/task/task.go:155 +0x3e3
github.com/aws/eks-anywhere/pkg/workflows.(*Upgrade).Run(0xc000adfc80, {0x3a65990, 0xc000126000}, 0xc0002edd80, 0xc00041a330, 0xc00041a330, {0x3a32d80?, 0xc00056e640}, 0x0?)
    github.com/aws/eks-anywhere/pkg/workflows/upgrade.go:78 +0x435
github.com/aws/eks-anywhere/cmd/eksctl-anywhere/cmd.(*upgradeClusterOptions).upgradeCluster(0x52758c0, 0x5253c00)
    github.com/aws/eks-anywhere/cmd/eksctl-anywhere/cmd/upgradecluster.go:149 +0xa98
github.com/aws/eks-anywhere/cmd/eksctl-anywhere/cmd.glob..func26(0x5253c00?, {0x34ec5e9?, 0x3?, 0x3?})
    github.com/aws/eks-anywhere/cmd/eksctl-anywhere/cmd/upgradecluster.go:37 +0x26
github.com/spf13/cobra.(*Command).execute(0x5253c00, {0xc00058c360, 0x3, 0x3})
    github.com/spf13/cobra@v1.5.0/command.go:872 +0x694
github.com/spf13/cobra.(*Command).ExecuteC(0x5253200)
    github.com/spf13/cobra@v1.5.0/command.go:990 +0x3b4
github.com/spf13/cobra.(*Command).Execute(...)
    github.com/spf13/cobra@v1.5.0/command.go:918
github.com/spf13/cobra.(*Command).ExecuteContext(...)
    github.com/spf13/cobra@v1.5.0/command.go:911
github.com/aws/eks-anywhere/cmd/eksctl-anywhere/cmd.Execute()
    github.com/aws/eks-anywhere/cmd/eksctl-anywhere/cmd/root.go:59 +0x54
main.main()
    github.com/aws/eks-anywhere/cmd/eksctl-anywhere/main.go:29 +0x125
```

What you expected to happen:

The worker nodes upgrade successfully to the 2xlarge instance type.

How to reproduce it (as minimally and precisely as possible):

Create an EKS-A cluster on Snow with the large instance type and 3 nodes each for the control plane (CP) and data plane (DP), then try to upgrade the instance type to 2xlarge for the workers alone (see the spec sketch below).
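
For context, the change that triggers the panic is a single field edit in the SnowMachineConfig referenced by the worker node group. A minimal sketch of that fragment, assuming the standard EKS Anywhere Snow provider schema and sbe-c instance type naming (the metadata name and device IP are placeholders, not taken from this cluster):

```yaml
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: SnowMachineConfig
metadata:
  name: my-cluster-worker        # placeholder; referenced by the worker node group
spec:
  instanceType: sbe-c.2xlarge    # changed from sbe-c.large; this edit triggers the failing upgrade
  osFamily: ubuntu
  devices:
  - 192.168.1.1                  # placeholder Snowball Edge device IP
```

The upgrade would then be applied with `eksctl anywhere upgrade cluster -f eksa-cluster.yaml`, which is where the panic above surfaces.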

Anything else we need to know?:

Environment:

elamaran11 commented 1 year ago

Thank you @yxinchen @jiayiwang7 for being on the call. Since the cluster nodes are under disk pressure, the cluster is beyond the point of being recoverable. So, per your recommendation, I'm going ahead with tearing down the cluster and recreating it with xlarge instances and a 100 GB containers volume for the CP, and 2xlarge instances with a 500 GB containers volume for the DP (see the spec sketch below).
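
For reference, a sketch of what that recreated configuration could look like across the two SnowMachineConfigs, assuming the containersVolume field from the EKS Anywhere Snow provider spec takes its size in GiB (all names are placeholders):

```yaml
# Control plane machines: xlarge with a 100 GB containers volume
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: SnowMachineConfig
metadata:
  name: my-cluster-cp            # placeholder
spec:
  instanceType: sbe-c.xlarge
  containersVolume:
    size: 100                    # GiB; storage for the container runtime
---
# Data plane (worker) machines: 2xlarge with a 500 GB containers volume
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: SnowMachineConfig
metadata:
  name: my-cluster-worker        # placeholder
spec:
  instanceType: sbe-c.2xlarge
  containersVolume:
    size: 500                    # GiB
```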

csplinter commented 1 year ago

Hi @elamaran11 - was this issue resolved once you increased resources on your cluster?

elamaran11 commented 1 year ago

Hi @csplinter, we have been seeing this again for the last couple of weeks. I want to try upgrading to the latest version of K8s on Snow and see if the issue recurs (see the sketch below). If it doesn't, I will close this; if it does, I will reach back out.
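
For anyone following along, that kind of upgrade is driven by the top-level kubernetesVersion field in the Cluster spec. A minimal sketch, assuming the standard EKS Anywhere Cluster schema (the name and target version here are illustrative, not taken from this cluster):

```yaml
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: my-cluster               # placeholder
spec:
  kubernetesVersion: "1.27"      # illustrative target; bump from the currently deployed version
```

This would be applied with `eksctl anywhere upgrade cluster -f eksa-cluster.yaml`, the same entry point that panicked above.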