aws / amazon-vpc-cni-k8s

Networking plugin repository for pod networking in Kubernetes using Elastic Network Interfaces on AWS
Apache License 2.0

[EKS] Pods stuck in ContainerCreating status after upgrading to Kubernetes version 1.30 #2970

Open Gier32o opened 2 months ago

Gier32o commented 2 months ago

Pods are stuck in ContainerCreating status after upgrading to Kubernetes version 1.30 on EKS. We have the 'Security Groups for Pods' feature enabled, and when we try to upgrade from:

ami_id             = "ami-066d744867bb80fce"
vpc_cni_version    = "v1.16.2-eksbuild.1"
kubernetes_version = "1.29"

to

ami_id             = "ami-05e7e986227a095a9"
vpc_cni_version    = "v1.18.2-eksbuild.1"
kubernetes_version = "1.30"

we're getting failing pods:

  Warning  FailedCreatePodSandBox  19m                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "cffd4f13c293011d5f6e967bd5859c234ab1f83731fbf1e40c46330e6276fdd7": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  66s (x85 over 19m)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "38c6783c31d39443b9b0fe4873868fdf972c92d499176b5b44c9df42b4461865": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: failed to assign an IP address to container

There is no issue when the 'Security Groups for Pods' feature is turned off.

How to reproduce: https://github.com/Gier32o/k8s-upgrade-problem
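For reference, a quick way to locate the stuck pods and pull events like the ones above (a sketch; the pod and namespace names are placeholders):

# List pods stuck waiting on sandbox creation (they stay in the Pending phase)
kubectl get pods -A --field-selector=status.phase=Pending

# Show the FailedCreatePodSandBox events for one of them
kubectl describe pod <pod> -n <namespace>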

orsenthil commented 2 months ago

Hello @Gier32o, do the /var/log/aws-routed-eni/plugin.log or /var/log/aws-routed-eni/ipamd.log logs show any details about the IP assignment failure? Is the aws-node pod running? Usually the CNI version does not change during a K8s upgrade; we keep the CNI version the same while performing the K8s upgrade, and after the K8s upgrade you can do the CNI upgrade. Does this workflow give the desired outcome?
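A minimal way to run those checks (a sketch; assumes shell access to an affected node via SSH or SSM, and uses the CNI's default log paths):

# Confirm the aws-node DaemonSet pods are running on every node
kubectl get pods -n kube-system -l k8s-app=aws-node -o wide

# On an affected node, inspect the CNI plugin and ipamd logs
sudo tail -n 200 /var/log/aws-routed-eni/plugin.log
sudo tail -n 200 /var/log/aws-routed-eni/ipamd.log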

Gier32o commented 2 months ago

Hi, the aws-node pods are running fine. You were right: upgrading the addon version before, or at the same time as, Kubernetes and the worker AMIs results in this error. If I run the upgrade in two batches, 1. (K8s + AMIs) -> 2. (Addon), it works fine. Thanks! Is there any way to fix such a broken cluster afterwards?
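A sketch of that two-batch flow, assuming the values above are exposed as root-module Terraform variables as in the snippets (any other required variables omitted):

# Batch 1: bump the cluster version and worker AMI, keep the CNI pinned
terraform apply \
  -var 'kubernetes_version=1.30' \
  -var 'ami_id=ami-05e7e986227a095a9' \
  -var 'vpc_cni_version=v1.16.2-eksbuild.1'

# Batch 2: once the nodes are on the new AMI, upgrade the addon
terraform apply \
  -var 'kubernetes_version=1.30' \
  -var 'ami_id=ami-05e7e986227a095a9' \
  -var 'vpc_cni_version=v1.18.2-eksbuild.1'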

orsenthil commented 2 months ago

Is there any way to fix such a broken cluster afterwards?

I am not sure what could have led to this state, but you can downgrade the addon to the previous version, restart the pods, and then upgrade the addon again.
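A sketch of that rollback flow, assuming the EKS managed addon is in use (the cluster name is a placeholder; the versions are the ones quoted earlier in the thread):

# Roll the addon back to the last known-good version
aws eks update-addon --cluster-name <cluster> --addon-name vpc-cni \
  --addon-version v1.16.2-eksbuild.1 --resolve-conflicts OVERWRITE

# Restart the CNI pods so they pick up the rolled-back version
kubectl rollout restart daemonset aws-node -n kube-system

# Once pods schedule normally again, re-apply the target version
aws eks update-addon --cluster-name <cluster> --addon-name vpc-cni \
  --addon-version v1.18.2-eksbuild.1 --resolve-conflicts OVERWRITE

Note that, as a follow-up below shows, the UpdateAddon API may reject a direct downgrade of the addon version.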

hikouki-gumo commented 2 months ago

Hi @orsenthil, I upgraded in the order you recommended, (K8s + AMIs) first, then the addon, but got the same problem. It even fails randomly, not every time.

vpc_cni_version    = "v1.16.3-eksbuild.2"
kubernetes_version = "1.28"

to

vpc_cni_version    = "v1.18.2-eksbuild."
kubernetes_version = "1.29"

Gier32o commented 2 months ago

When trying to downgrade the plugin 1.18.2 -> 1.16.3:

Error: updating EKS Add-On (test:vpc-cni): operation error EKS: UpdateAddon, https response error StatusCode: 400, RequestID: 608f24fe-795a-4c7c-acba-8d11836aa01b, InvalidParameterException: Addon version specified is not supported

Nothing changed when I downgraded to 1.17.1.

Gier32o commented 1 month ago

So is it a bug, or is something wrong with the configuration?