songhohoon closed this issue 7 months ago
@songhohoon From the warning you shared:
Warning BranchENIAnnotationFailed 5m28s (x21 over 49m) vpc-resource-controller failed to annotate pod with branch ENI details: Pod "watch-api-79574c44db-klk7z" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`,`spec.initContainers[*].image`,`spec.activeDeadlineSeconds`,`spec.tolerations` (only additions to existing tolerations),`spec.terminationGracePeriodSeconds` (allow it to be set to 1 if it was previously negative)
It looks like the VPC Resource Controller (https://github.com/aws/amazon-vpc-resource-controller-k8s/blob/master/pkg/provider/branch/provider.go#L385) failed to annotate the pod with a branch ENI.
Based on the error message from the k8s API call, it sounds like this patch operation was blocked. Are you installing any pod validation or admission webhooks in your cluster? Are you running any tools that are modifying the ClusterRole objects installed by EKS? Have you ever had this Security Groups for Pods solution working?
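To see what might be intercepting the patch, you can list the admission webhooks registered in the cluster (generic commands; the configuration names will vary by installation):

```sh
# List all registered admission webhooks; any of these can intercept
# the vpc-resource-controller's pod patch before it is persisted
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations
```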
Hi @jdn5126, thanks for the reply.
Are you installing any pod validation or admission webhooks in your cluster? -> Yes. I installed Kyverno and I am using it for pod validation and to mutate some configuration, such as adding a preStop hook (a sketch of that kind of policy is below).
Are you running any tools that are modifying the ClusterRole objects installed by EKS? -> No, I am not.
Have you ever had this Security Groups for Pods solution working? -> Yes. I am using SGP (Security Groups for Pods) for most of my pods.
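For illustration, a Kyverno mutate policy of the kind I am describing might look like this; the policy name and the hook command here are hypothetical, not my actual configuration:

```sh
# Hypothetical sketch of a Kyverno ClusterPolicy that injects a preStop
# hook into all pod containers; the name and command are illustrative only
kubectl apply -f - <<'EOF'
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-prestop-hook
spec:
  rules:
    - name: inject-prestop
      match:
        any:
          - resources:
              kinds:
                - Pod
      mutate:
        patchStrategicMerge:
          spec:
            containers:
              # (name): "*" is a Kyverno anchor matching every container
              - (name): "*"
                lifecycle:
                  preStop:
                    exec:
                      command: ["sleep", "5"]
EOF
```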
Additional info: when this situation occurs, it is usually resolved by deleting and recreating the pod. However, if I leave the pod as is, the situation persists.
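When it happens, the affected pods can be spotted through their events (a sketch, using the event reason from the warning above):

```sh
# Surface pods whose branch ENI annotation failed
kubectl get events --all-namespaces \
  --field-selector reason=BranchENIAnnotationFailed
```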
@songhohoon Judging from the error message, it seems very likely that the patch operation is being blocked by a pod validation webhook. It is possible that Kyverno is playing that role, but since this is all happening in the control plane and not in the AWS VPC CNI, I think the best path forward is for you to create an AWS support case. Then we can investigate the control plane logs and figure out what is blocking this patching operation from time to time.
@jdn5126 Thank you for the reply.
I dug deeper into the problem and figured out that admission controller ordering was involved.
In my case, some of the admission controllers failed to inject their configuration, and once an admission controller has failed for a pod, the pod manifest is no longer annotatable, so the CNI controller cannot annotate the pod with the allocated IP address.
The failure itself was an AWS API rate limit being exceeded: this is a development environment where workloads are scheduled every morning, so a lot of pods are created at the same time.
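One way to check how each webhook behaves when it fails (a generic sketch; a webhook with `failurePolicy: Fail` rejects requests when it errors or times out, which matches the behavior I saw):

```sh
# Show the failure policy and timeout of every mutating webhook
kubectl get mutatingwebhookconfigurations -o \
  custom-columns='NAME:.metadata.name,FAILURE_POLICY:.webhooks[*].failurePolicy,TIMEOUT:.webhooks[*].timeoutSeconds'
```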
I tried `kubectl annotate pods ${pod_name} test=test`, and it failed on the stuck pod but succeeded on a regular pod.
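To see exactly what rejects the update, raising kubectl's verbosity prints the raw API request and response, including any admission denial message (a general debugging technique, not specific to this controller):

```sh
# -v=8 dumps the HTTP exchange; the rejection reason appears in the
# API server's response body when the PATCH is denied
kubectl annotate pods ${pod_name} test=test -v=8
```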
This issue is now closed. Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one.
@songhohoon ah I see, thank you for explaining, and glad you figured it out!
@jdn5126 In README.md, https://github.com/aws/amazon-vpc-cni-k8s/blame/87115cf204dafd148c765ea3c8d184ba73c3a09a/README.md#L498 still mentions:
Setting `ENABLE_POD_ENI` to `true` will allow IPAMD to add the `vpc.amazonaws.com/has-trunk-attached` label to the node if the instance has the capacity to attach an additional ENI.
Is this expected?
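For context, `ENABLE_POD_ENI` is set on the aws-node DaemonSet as part of the standard Security Groups for Pods setup, and the node label the README mentions can be checked directly (commands shown for reference):

```sh
# Enable trunk ENI support on the VPC CNI DaemonSet
kubectl set env daemonset aws-node -n kube-system ENABLE_POD_ENI=true

# Check which nodes carry the label mentioned in the README
kubectl get nodes -L vpc.amazonaws.com/has-trunk-attached
```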
What happened:
Pods get stuck in Init or ContainerCreating status.
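A quick way to list pods stuck in these states (a sketch):

```sh
kubectl get pods --all-namespaces | grep -E 'Init:|ContainerCreating'
```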
Attach logs
Sent the log file to k8s-awscni-triage@amazon.com from thdghgns@gmail.com.
What you expected to happen: I expected the pod to be created normally.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`): v1.27.9-eks-5e0fdde
- OS (e.g. `cat /etc/os-release`): Amazon Linux 2
- Kernel (e.g. `uname -a`): Linux ip-10-8-58-221.ap-northeast-2.compute.internal 5.10.199-190.747.amzn2.x86_64 #1 SMP Sat Nov 4 16:55:14 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux