aws / amazon-vpc-cni-k8s

Networking plugin repository for pod networking in Kubernetes using Elastic Network Interfaces on AWS

Apache License 2.0

2.28k stars 741 forks source link

AWS EKS - add cmd: failed to assign an IP address to container #1791

Closed GoGoPenguin closed 2 years ago

GoGoPenguin commented 2 years ago

What happened:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "efafab567a68ffb237e67050c4d70d2d1084bf1aca3631b74e5a0802146d150a" network for pod "xxx-9d8b7c98d-j8ldn": networkPlugin cni failed to set up pod "xxx-9d8b7c98d-j8ldn_default" network: add cmd: failed to assign an IP address to container

eks_i-089d9482970086cc5_2021-12-11_1011-UTC_0.6.2.tar.gz

Environment:

Kubernetes version (use kubectl version):

Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.1", GitCommit:"632ed300f2c34f6d6d15ca4cef3d3c7073412212", GitTreeState:"clean", BuildDate:"2021-08-19T15:38:26Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"darwin/arm64"}
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.2-eks-06eac09", GitCommit:"5f6d83fe4cb7febb5f4f4e39b3b2b64ebbbe3e97", GitTreeState:"clean", BuildDate:"2021-09-13T14:20:15Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}

CNI Version

amazon-k8s-cni-init:v1.10.1-eksbuild.1
amazon-k8s-cni:v1.10.1-eksbuild.1

OS (e.g: cat /etc/os-release):

NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"

Kernel (e.g. uname -a):

Linux ip-192-168-75-48.ap-northeast-2.compute.internal 5.4.156-83.273.amzn2.x86_64 #1 SMP Sat Oct 30 12:59:07 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Shreya027 commented 2 years ago

Hi, I went through the logs but I couldn't locate the sandbox container you mentioned in the issue. However, I did find similar scenarios for a number of sandbox containers.

It seems like the pattern of the error occurrence is of the following form in ipamd.log:

Initially there in an AddNetworkRequest received by IPAMD

{"level":"info","ts":"2021-12-10xxx","caller":"rpc/rpc.pb.go:486","msg":"Received AddNetwork for NS /proc/xxx/ns/net, Sandbox 7c7xxx, ifname xxx"}
{"level":"debug","ts":"2021-12-10xxx","caller":"rpc/rpc.pb.go:486","msg":"AddNetworkRequest: ClientVersion:\"v1.7.5\" K8S_POD_NAME:\"xxx" K8S_POD_NAMESPACE:\"xxx" K8S_POD_INFRA_CONTAINER_ID:\"xxx" ContainerID:\"7c7xxx" IfName:\"xxx\" NetworkName:\"aws-cni\" Netns:\"/proc/xxx/ns/net\" "}

The enis don't have available addresses to assign to the sandbox container

{"level":"debug","ts":"2021-12-10xxx","caller":"ipamd/rpc_handler.go:142","msg":"AssignIPv4Address: IP address pool stats: total: 15, assigned 14"}
{"level":"debug","ts":"2021-12-10xxx","caller":"ipamd/rpc_handler.go:142","msg":"AssignPodIPv4Address: ENI eni-027xxx does not have available addresses"}
{"level":"debug","ts":"2021-12-10xxx","caller":"ipamd/rpc_handler.go:142","msg":"AssignPodIPv4Address: ENI eni-090xxx does not have available addresses"}
{"level":"debug","ts":"2021-12-10xxx","caller":"ipamd/rpc_handler.go:142","msg":"AssignPodIPv4Address: ENI eni-0d5xxx does not have available addresses"}

No available IP addresses error message sent in AddNetworkReply as shown below

{"level":"error","ts":"2021-12-10xxx","caller":"ipamd/rpc_handler.go:142","msg":"DataStore has no available IP addresses"}
{"level":"debug","ts":"2021-12-10xxx","caller":"rpc/rpc.pb.go:486","msg":"VPC CIDR 192.xx.0.0/xx"}
{"level":"info","ts":"2021-12-10xxx","caller":"rpc/rpc.pb.go:486","msg":"Send AddNetworkReply: IPv4Addr , DeviceNumber: -1, err: assignPodIPv4AddressUnsafe: no available IP addresses"}

Next an DelNetworkRequest received by IPAMD with reason "PodDeleted". However, the pod was never assigned IP Address because of the error above, because of which the DelNetworkRequest should fail

{"level":"info","ts":"2021-12-10xxx","caller":"rpc/rpc.pb.go:504","msg":"Received DelNetwork for Sandbox 7c7xxx"}
{"level":"debug","ts":"2021-12-10xxx","caller":"rpc/rpc.pb.go:504","msg":"DelNetworkRequest: ClientVersion:\"v1.7.5\" K8S_POD_NAME:\"xxx\" K8S_POD_NAMESPACE:\"xxx" K8S_POD_INFRA_CONTAINER_ID:\"7c7xxx\" Reason:\"PodDeleted\" ContainerID:\"7c7xxx" IfName:\"eth0\" NetworkName:\"aws-cni\" "}
{"level":"debug","ts":"2021-12-10xxx","caller":"ipamd/rpc_handler.go:221","msg":"UnassignPodIPv4Address: IP address pool stats: total:15, assigned 14, sandbox aws-cni/7c7xxx/eth0"}
{"level":"debug","ts":"2021-12-10xxx","caller":"ipamd/rpc_handler.go:221","msg":"UnassignPodIPv4Address: Failed to find IPAM entry under full key, trying CRI-migrated version"}
{"level":"warn","ts":"2021-12-10xxx","caller":"ipamd/rpc_handler.go:221","msg":"UnassignPodIPv4Address: Failed to find sandbox _migrated-from-cri/7c7xxx/unknown"}
{"level":"info","ts":"2021-12-10xxx","caller":"rpc/rpc.pb.go:504","msg":"Send DelNetworkReply: IPv4Addr , DeviceNumber: 0, err: datastore: unknown pod"}

The above DelNetworkRequest is received more than one time because of retries.

Shreya027 commented 2 years ago

I'm looking into the reason as to why the enis does not have available addresses

GoGoPenguin commented 2 years ago

@Shreya027 Thanks for quick response.

jayanthvn commented 2 years ago

@GoGoPenguin -

Instance type is t3.medium which supports 3 ENIs and 5 secondary IPs so 15 IPs will be available.

Based on the state file - 192.168.68.129 is available hence the log line -

{"level":"debug","ts":"2021-12-11T10:11:49.740Z","caller":"ipamd/ipamd.go:2057",
"msg":"IP pool stats: total = 15, used = 14, 
IPs in Cooldown = 0, c.maxIPsPerENI = 5"}

Additional ENIs cannot be added since max of 3 ENIs are reached.

192.168.68.129 was last freed at 2021-12-10T15:50:55.701Z-

{"level":"info","ts":"2021-12-10T15:50:55.701Z","caller":"ipamd/rpc_handler.go:220","msg":"UnassignPodIPAddress: sandbox aws-cni/a70537939fce4d23a2ce65259f6a689438f75439045a879295fa93046d978458/eth0's ipAddr 192.168.68.129, DeviceNumber 2"}
{"level":"info","ts":"2021-12-10T15:50:55.701Z","caller":"rpc/rpc.pb.go:731","msg":"Send DelNetworkReply: IPv4Addr 192.168.68.129, DeviceNumber: 2, err: <nil>"}
{"level":"debug","ts":"2021-12-10T15:51:00.331Z","caller":"ipamd/ipamd.go:2057","msg":"IP pool stats: total = 15, used = 13, IPs in Cooldown = 2, c.maxIPsPerENI = 5"}

At this time 2 IPs are in cool down - Cooldown = 2

Out of the 2 IPs, one IP [192.168.85.235] was out of the cool down around 2021-12-10T15:51:16.676Z and got assigned to a pod -

plugin logs -

{"level":"info","ts":"2021-12-10T15:51:16.673Z","caller":"routed-eni-cni-plugin/cni.go:111","msg":"Received CNI add request: ContainerID(1a9bda7adf0eb0ddadca3cb0e40e806b558689a5f6efd972d6aee6ff4100f7ab) Netns(/proc/18799/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=cityfarm-development-7d8c4b764b-t6zcw;K8S_POD_INFRA_CONTAINER_ID=1a9bda7adf0eb0ddadca3cb0e40e806b558689a5f6efd972d6aee6ff4100f7ab) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.3.1\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"DEBUG\",\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"debug","ts":"2021-12-10T15:51:16.674Z","caller":"routed-eni-cni-plugin/cni.go:111","msg":"MTU value set is 9001:"}
{"level":"info","ts":"2021-12-10T15:51:16.679Z","caller":"routed-eni-cni-plugin/cni.go:111","msg":"Received add network response for container 1a9bda7adf0eb0ddadca3cb0e40e806b558689a5f6efd972d6aee6ff4100f7ab interface eth0: Success:true IPv4Addr:\"192.168.85.235\" DeviceNumber:2 VPCv4CIDRs:\"192.168.0.0/16\""}
{"level":"debug","ts":"2021-12-10T15:51:16.679Z","caller":"routed-eni-cni-plugin/cni.go:205","msg":"SetupNS: hostVethName=eni6a0b1b8b13c, contVethName=eth0, netnsPath=/proc/18799/ns/net, deviceNumber=2, mtu=9001"}
{"level":"debug","ts":"2021-12-10T15:51:16.679Z","caller":"driver/driver.go:280","msg":"v4addr: 192.168.85.235/32; v6Addr: <nil>\n"}

IPAMD logs -

{"level":"debug","ts":"2021-12-10T15:51:16.676Z","caller":"datastore/data_store.go:757","msg":"Returning Free IP 192.168.85.235"}
{"level":"debug","ts":"2021-12-10T15:51:16.676Z","caller":"datastore/data_store.go:680","msg":"New IP from CIDR pool- 192.168.85.235"}
{"level":"info","ts":"2021-12-10T15:51:16.676Z","caller":"datastore/data_store.go:784","msg":"AssignPodIPv4Address: Assign IP 192.168.85.235 to sandbox aws-cni/1a9bda7adf0eb0ddadca3cb0e40e806b558689a5f6efd972d6aee6ff4100f7ab/eth0"}
{"level":"debug","ts":"2021-12-10T15:51:16.677Z","caller":"rpc/rpc.pb.go:713","msg":"VPC CIDR 192.168.0.0/16"}
{"level":"info","ts":"2021-12-10T15:51:16.677Z","caller":"rpc/rpc.pb.go:713","msg":"Send AddNetworkReply: IPv4Addr 192.168.85.235, IPv6Addr: , DeviceNumber: 2, err: <nil>"}
{"level":"debug","ts":"2021-12-10T15:51:20.338Z","caller":"ipamd/ipamd.go:2057","msg":"IP pool stats: total = 15, used = 14, IPs in Cooldown = 1, c.maxIPsPerENI = 5"}

{"level":"debug","ts":"2021-12-10T15:51:25.343Z","caller":"ipamd/ipamd.go:2057","msg":"IP pool stats: total = 15, used = 14, IPs in Cooldown = 1, c.maxIPsPerENI = 5"}

At around 2021-12-10T15:51:30.348Z even 192.168.68.129 is out of cool down.

{"level":"debug","ts":"2021-12-10T15:51:30.348Z","caller":"ipamd/ipamd.go:2057","msg":"IP pool stats: total = 15, used = 14, IPs in Cooldown = 0, c.maxIPsPerENI = 5"}

But I don't see any more ADD requests after 2021-12-10T15:51:16.676Z hence 192.168.68.129 never got assigned to any pod. Can you retry scheduling the pod?

GoGoPenguin commented 2 years ago

@jayanthvn

Okay, I tried to redeploy my pod. I have sent the log file to aws-security@amazon.com

Re deploy

kubectl rollout restart deployment cityfarm-development

Describe

Events:
  Type     Reason                  Age                From               Message
  ----     ------                  ----               ----               -------
  Normal   Scheduled               14s                default-scheduler  Successfully assigned default/cityfarm-development-7997fb8976-gtgxp to ip-192-168-81-85.ap-northeast-2.compute.internal
  Warning  FailedCreatePodSandBox  12s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "ebb530230479f5cbbb6c2cbebadd117a09f954112064bb2ea7e7db3a7ae0a6ce" network for pod "cityfarm-development-7997fb8976-gtgxp": networkPlugin cni failed to set up pod "cityfarm-development-7997fb8976-gtgxp_default" network: add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  11s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "7bf9e05ac04f8234a7dcc6abc973bb46f7bf88e44fb601b00b979dc7ce038ee7" network for pod "cityfarm-development-7997fb8976-gtgxp": networkPlugin cni failed to set up pod "cityfarm-development-7997fb8976-gtgxp_default" network: add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  10s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "7c09e808295a4554e39bf133d8b8a992ad5becd7a29e1c58a1f0151694b74b40" network for pod "cityfarm-development-7997fb8976-gtgxp": networkPlugin cni failed to set up pod "cityfarm-development-7997fb8976-gtgxp_default" network: add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  9s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "eb6dfe5fc8aed5d494c47d6af38af39e1e0ffff0cc623d4284562a56b627cd07" network for pod "cityfarm-development-7997fb8976-gtgxp": networkPlugin cni failed to set up pod "cityfarm-development-7997fb8976-gtgxp_default" network: add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  8s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "8921569cf713a0a9486bd7ad16404379f399f2a90cd0c9ffc27975b5b4888fbc" network for pod "cityfarm-development-7997fb8976-gtgxp": networkPlugin cni failed to set up pod "cityfarm-development-7997fb8976-gtgxp_default" network: add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  7s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "77a1cd4815ecfa65187649a97c4113de8a88403bc69bba2a8b29c10f2f22d6d7" network for pod "cityfarm-development-7997fb8976-gtgxp": networkPlugin cni failed to set up pod "cityfarm-development-7997fb8976-gtgxp_default" network: add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  6s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "e4a7119500377a0339425e7e3c9fd48d33f03fa5d09bb6614d57d51cec8213d7" network for pod "cityfarm-development-7997fb8976-gtgxp": networkPlugin cni failed to set up pod "cityfarm-development-7997fb8976-gtgxp_default" network: add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  5s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "115ce7ea2070f0efa4a0823b07c9736eeb03a83ca013c6153705f98918c0e89c" network for pod "cityfarm-development-7997fb8976-gtgxp": networkPlugin cni failed to set up pod "cityfarm-development-7997fb8976-gtgxp_default" network: add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  4s                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "aa296b49488058e6a9857cf991557fbf9b9555fc8bf28aa933a64237f1c425f6" network for pod "cityfarm-development-7997fb8976-gtgxp": networkPlugin cni failed to set up pod "cityfarm-development-7997fb8976-gtgxp_default" network: add cmd: failed to assign an IP address to container
  Warning  FailedCreatePodSandBox  3s                 kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "56a700c514d4b769fa64f7af3daa671dbd4ff06dfbc5bd76e76505e86f50d623" network for pod "cityfarm-development-7997fb8976-gtgxp": networkPlugin cni failed to set up pod "cityfarm-development-7997fb8976-gtgxp_default" network: add cmd: failed to assign an IP address to container
  Normal   SandboxChanged          2s (x10 over 11s)  kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal   Pulling                 2s                 kubelet            Pulling image "ID.dkr.ecr.REGION.amazonaws.com/cityfarm-backend:main-cecff2b3-1636616305"

Shreya027 commented 2 years ago

Hi @GoGoPenguin, can you resend logs to k8s-awscni-triage@amazon.com instead. Thanks!

GoGoPenguin commented 2 years ago

Hi @GoGoPenguin, can you resend logs to k8s-awscni-triage@amazon.com instead. Thanks!

@Shreya027 Done. Thank you.

Shreya027 commented 2 years ago

Thanks, Looking into it, will get back soon.

Shreya027 commented 2 years ago

So, in the new logs, I see a similar pattern as mentioned by me above, however with the following error messages: Unable to get IP address from CIDR: no free IP available in the prefix and assignPodIPv4AddressUnsafe: no available IP/Prefix addresses

Shreya027 commented 2 years ago

Hi @GoGoPenguin , I see the pod is assigned successfully later, after the failed to assign IP address events:

After the IP address assignment fails for container 56a700c514d4b769fa64f7af3daaxxxxx for pod cityfarm-development-7997fb8976-gtgxp , the same pod gets IP successfully with container ID 79d32e3aae71f0xxxxx as seen below:

{"level":"info","ts":"2021-12-13T02:48:48.916Z","caller":"routed-eni-cni-plugin/cni.go:111","msg":"Received CNI add request: ContainerID(79d32e3aae71f0xxxxx) Netns(/proc/6756/ns/net) IfName(eth0) Args(IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=cityfarm-development-7997fb8976-gtgxp;K8S_POD_INFRA_CONTAINER_ID=79d32e3aae71f0xxxxx) Path(/opt/cni/bin) argsStdinData({\"cniVersion\":\"0.3.1\",\"mtu\":\"9001\",\"name\":\"aws-cni\",\"pluginLogFile\":\"/var/log/aws-routed-eni/plugin.log\",\"pluginLogLevel\":\"DEBUG\",\"type\":\"aws-cni\",\"vethPrefix\":\"eni\"})"}
{"level":"debug","ts":"2021-12-13T02:48:48.916Z","caller":"routed-eni-cni-plugin/cni.go:111","msg":"MTU value set is 9001:"}
{"level":"info","ts":"2021-12-13T02:48:48.925Z","caller":"routed-eni-cni-plugin/cni.go:111","msg":"Received add network response for container 79d32e3aae71f0xxxxx interface eth0: Success:true IPv4Addr:\"192.168.74.87\" DeviceNumber:1 VPCv4CIDRs:\"192.168.0.0/16\""}

Were you able to see the pod deployment running successfully later?

If you were wondering why the initial errors were seen before the successful deployment, I have stated the reason below:

The previous failures were because of 13 of 15 total IPs being used and 2 IPs being present in cool down throughout the failed to assign IP address period. When one of the IP from cool down is return to warm pool, it gets assigned to the pod mentioned above as seen below in ipamd.json logs. As instance type is t3.medium as @jayanthvn mentioned, it supports 3 ENIs and 5 secondary IPs so only 15 IPs will be available at any time.

{"level":"debug","ts":"2021-12-13T02:48:48.920Z","caller":"datastore/data_store.go:757","msg":"Returning Free IP 192.168.74.87"}
{"level":"debug","ts":"2021-12-13T02:48:48.920Z","caller":"datastore/data_store.go:680","msg":"New IP from CIDR pool- 192.168.74.87"}
{"level":"info","ts":"2021-12-13T02:48:48.920Z","caller":"datastore/data_store.go:784","msg":"AssignPodIPv4Address: Assign IP 192.168.74.87 to sandbox aws-cni/79d32e3aae71f0xxxxx"}

Note: I have replaced contained IDs with xxxxx at end. You could use the timestamps to find mappings in your log.

GoGoPenguin commented 2 years ago

@Shreya027 Sometimes the pod gets stuck in the ContainerCreating state and cannot get the ip from CNI.

Shreya027 commented 2 years ago

Hi @GoGoPenguin , could you please send me logs for this case you are referring to? It will be helpful for me to debug further. The logs you sent earlier had the IP assigned to pod eventually, so I didn't find any issues there.

jwitrick commented 2 years ago

I am seeing similar behaviors for my EKS cluster.

I looked through the above outputs and the error messages are nearly identical.

Im wondering if upgrading the version from 3.19.1 -> 3.21.x would help with the network latency and prevent pods from getting stuck in the failure state due to this networking issue.

Shreya027 commented 2 years ago

Hi @jwitrick, would it be possible for you to send your error logs to k8s-awscni-triage@amazon.com ? The logs sent earlier in the issue had the pod IP assigned eventually.

ecliptik commented 2 years ago

We are experiencing a similar issue and getting these same ipamd messages with host/pods losing networking for periods of time, even though the subnet has plenty of free IP addresses. I've sent an email to k8s-awscni-triage@amazon.com with additional details, case IDs, and logs from an affect EKS node.

marcellodesales commented 2 years ago

🐞 Similar problem: networkPlugin cni failed to set up pod

I see the same similar cases while deploying a service that uses CNI and POD Security Groups a described at https://docs.aws.amazon.com/eks/latest/userguide/security-groups-for-pods.html
After setting up everything, I tried to deploy a service
- Deployments with Security Groups with access to RDS https://www.eksworkshop.com/beginner/115_sg-per-pod/50_deploy/

The problem is that my pods can't find the RDS endpoints because they can't resolve the endpoint... I can see the same error about the sandbox as follows:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "d423cb5bb261338d384bf2266fbadc05bc074b432319df49b6011c7f954364f3" network for pod "x-y-service-aws-sae1-prdt-ppd-dev-789b656b462tbt4": networkPlugin cni failed to set up pod "x-y-service-aws-sae1-prdt-ppd-dev-789b656b462tbt4_x-aws-sae1-prdt-ppd-dev" network: add cmd: failed to assign an IP address to container

Events:
  Type     Reason                  Age                  From                     Message
  ----     ------                  ----                 ----                     -------
  Normal   Scheduled               56m                  default-scheduler        Successfully assigned x-aws-sae1-prdt-ppd-dev/x-y-service-aws-sae1-prdt-ppd-dev-789b656b462tbt4 to ip-172-16-3-127.sa-east-1.compute.internal
  Normal   SecurityGroupRequested  56m                  vpc-resource-controller  Pod will get the following Security Groups [sg-05c167fb067217b5e]
  Warning  FailedCreatePodSandBox  56m                  kubelet                  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "d423cb5bb261338d384bf2266fbadc05bc074b432319df49b6011c7f954364f3" network for pod "x-y-service-aws-sae1-prdt-ppd-dev-789b656b462tbt4": networkPlugin cni failed to set up pod "x-y-service-aws-sae1-prdt-ppd-dev-789b656b462tbt4_x-aws-sae1-prdt-ppd-dev" network: add cmd: failed to assign an IP address to container
  Normal   ResourceAllocated       56m                  vpc-resource-controller  Allocated [{"eniId":"eni-061b36f5ef1acac44","ifAddress":"0a:d1:a0:62:57:18","privateIp":"172.16.3.254","vlanId":1,"subnetCidr":"172.16.3.0/24"}] to the pod
  Normal   SandboxChanged          56m                  kubelet                  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled                  56m                  kubelet                  Successfully pulled image "registry.gitlab.com/x/y/x-service:354452df-develop" in 5.934951686s
  Normal   Pulled                  55m                  kubelet                  Successfully pulled image "registry.gitlab.com/x/y/x-service:354452df-develop" in 1.260281578s
  Normal   Pulled                  55m                  kubelet                  Successfully pulled image "registry.gitlab.com/x/y/x-service:354452df-develop" in 1.169348684s
  Normal   Pulled                  54m                  kubelet                  Successfully pulled image "registry.gitlab.com/x/y/x-service:354452df-develop" in 1.138185066s
  Normal   Created                 54m (x4 over 56m)    kubelet                  Created container x-service
  Normal   Started                 54m (x4 over 56m)    kubelet                  Started container x-service
  Normal   Pulling                 26m (x11 over 56m)   kubelet                  Pulling image "registry.gitlab.com/x/y/x-service:354452df-develop"
  Warning  BackOff                 69s (x231 over 55m)  kubelet                  Back-off restarting failed container

⏲️ VPC-CNI Plugin status still on `Creating`

Maybe something happened in the backend in AWS??? I just noticed, while writing this comment, that the VPC-CNI plugin is still on the state "Creating"...

$ aws eks describe-addon \
    --cluster-name eks-ppd-prdt-x-y \
    --addon-name vpc-cni

{
    "addon": {
        "addonName": "vpc-cni",
        "clusterName": "eks-ppd-prdt-x-y",
        "status": "CREATING",
        "addonVersion": "v1.10.1-eksbuild.1",
        "health": {
            "issues": []
        },
        "addonArn": "arn:aws:eks:sa-east-1:xxx:addon/eks-ppd-prdt-x-y/vpc-cni/f8bf5219-13e8-3d54-76da-0ef9415aad0e",
        "createdAt": "2022-01-28T20:57:37.678000-08:00",
        "modifiedAt": "2022-01-28T20:57:37.698000-08:00",
        "serviceAccountRoleArn": "arn:aws:iam::xxx:role/AmazonEKSCNIRole",
        "tags": {}
    }
}

❓ Potential Problem: Missing Role `AmazonEKSCNIRole`

I can't find the role in my cluster

$ aws iam list-roles | jq -r '.Roles[] |  select(.RoleName == "AmazonEKSCNIRole")'

📝 EDIT: A couple of things still

The pods for the aws-node have been crashing

$ kubectl get pods -n kube-system  -l k8s-app=aws-node
NAME             READY   STATUS             RESTARTS   AGE
aws-node-6gdnx   0/1     CrashLoopBackOff   319        22h
aws-node-6rj5d   0/1     Running            320        22h
aws-node-c9cxv   0/1     Running            320        22h
aws-node-cst8j   0/1     Running            321        22h
aws-node-j7gbn   0/1     CrashLoopBackOff   320        22h
aws-node-jtjmm   0/1     CrashLoopBackOff   320        22h
aws-node-k6bvl   0/1     Running            321        22h

I had a the `AmazonEKS_CNI_Policy` role missing and I manually attached

As described at https://docs.aws.amazon.com/eks/latest/userguide/cni-iam-role.html, I had to attach the policy to a role in the cluster

$ aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy --role-name eks-ppd-prdt-x-y20220128224446671500000001

☁️  aws-cli@2.2.32 🔖 aws-iam-authenticator@0.5.3
☸️  kubectl@1.22.4 📛 kustomize@v4.3.0 🎡 helm@3.7.0 👽 argocd@2.2.0 ✈️  glooctl@1.9.0
👤 axs-marcello-root 🗂️   🌎 sa-east-1
🏗   🔐 arn:aws:eks:sa-east-1:xxxxxx:cluster/eks-ppd-prdt-x-y 🍱 default
~/dev/gitlab.com/x/services-deploy/xy-service-deploy on  master! 📅 01-29-2022 ⌚20:09:04
$ aws iam list-attached-role-policies --role-name eks-ppd-prdt-x-y20220128224446671500000001
{
    "AttachedPolicies": [
        {
            "PolicyName": "eks-ppd-prdt-x-y-elb-sl-role-creation20220128224446672600000002",
            "PolicyArn": "arn:aws:iam::xxxxyyyzzz:policy/eks-ppd-prdt-x-yelb-sl-role-creation20220128224446672600000002"
        },
        {
            "PolicyName": "AmazonEKSClusterPolicy",
            "PolicyArn": "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
        },
        {
            "PolicyName": "AmazonEKSServicePolicy",
            "PolicyArn": "arn:aws:iam::aws:policy/AmazonEKSServicePolicy"
        },
        {
            "PolicyName": "AmazonEKS_CNI_Policy",
            "PolicyArn": "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
        },
        {
            "PolicyName": "AmazonEKSVPCResourceController",
            "PolicyArn": "arn:aws:iam::aws:policy/AmazonEKSVPCResourceController"
        }
    ]
}

Failure in Console

The error of the CNI install was as follows:

InsufficientNumberOfReplicas | The add-on is unhealthy because it doesn't have the desired number of replicas.

Screen Shot 2022-01-29 at 8 31 15 PM

marcellodesales commented 2 years ago

Installing from the UI: Works

It seems to be working when not specifying a ServiceAccountRole...
- the UI shows Service account role: Info Inherited from node

$ kubectl get pods -n kube-system  -l k8s-app=aws-node
NAME             READY   STATUS    RESTARTS   AGE
aws-node-769nf   1/1     Running   0          30s
aws-node-77r4w   1/1     Running   0          31s
aws-node-jp8tg   1/1     Running   0          27s
aws-node-pvsx9   1/1     Running   0          26s
aws-node-s24xp   1/1     Running   0          27s
aws-node-s54jd   1/1     Running   0          35s
aws-node-xnxvx   1/1     Running   0          29s

Installing from AWS CLI: Fails when specifying a Role

Specifying the role I had created has the pods restarting...
I had the following code to add the Add-on

   aws eks create-addon \
     --cluster-name ${EKS_CLUSTER_NAME} \
     --addon-name vpc-cni \
     --addon-version ${CNI_COMPATIBLE_VERSION} \
     --resolve-conflicts OVERWRITE
     --service-account-role-arn ${ROLE_ARN} \

The creation fails...

$ kubectl get pods -n kube-system  -l k8s-app=aws-node
NAME             READY   STATUS    RESTARTS   AGE
aws-node-4ntn7   0/1     Running   0          88s
aws-node-4rtwk   0/1     Running   0          89s
aws-node-7fcsw   0/1     Running   0          86s
aws-node-7p8cv   0/1     Running   0          86s
aws-node-mb6bf   0/1     Running   0          81s
aws-node-nndk8   0/1     Running   0          84s
aws-node-strhl   0/1     Running   0          85s

...
...

$ kubectl get pods -n kube-system  -l k8s-app=aws-node
NAME             READY   STATUS    RESTARTS   AGE
aws-node-4ntn7   0/1     Running   2          3m58s
aws-node-4rtwk   0/1     Running   2          3m59s
aws-node-7fcsw   0/1     Running   2          3m56s
aws-node-7p8cv   0/1     Running   2          3m56s
aws-node-mb6bf   0/1     Running   2          3m51s
aws-node-nndk8   0/1     Running   2          3m54s
aws-node-strhl   0/1     Running   2          3m55s

The logs show it is waiting on IPAM-D... Not sure what that is...

$ kubectl logs -n kube-system aws-node-strhl

{"level":"info","ts":"2022-01-30T04:36:56.122Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-01-30T04:36:56.124Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-01-30T04:36:56.141Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-01-30T04:36:56.145Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
I0130 04:36:57.229531      12 request.go:621] Throttling request took 1.040271681s, request: GET:https://10.100.0.1:443/apis/argoproj.io/v1alpha1?timeout=32s
{"level":"info","ts":"2022-01-30T04:36:58.154Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-01-30T04:37:00.161Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-01-30T04:37:02.168Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-01-30T04:37:04.174Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-01-30T04:37:06.181Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}

Changing to not specifying a role

Not sure if this is the new way, but not specifying the role worked.

   aws eks create-addon \
     --cluster-name ${EKS_CLUSTER_NAME} \
     --addon-name vpc-cni \
     --addon-version ${CNI_COMPATIBLE_VERSION} \
     --resolve-conflicts OVERWRITE
     #--service-account-role-arn ${ROLE_ARN} \

So far the pods came back up

$ kubectl get pods -n kube-system  -l k8s-app=aws-node
NAME             READY   STATUS    RESTARTS   AGE
aws-node-27p49   1/1     Running   0          23s
aws-node-k72xr   1/1     Running   0          16s
aws-node-m28lc   1/1     Running   0          15s
aws-node-smg5p   1/1     Running   0          22s
aws-node-whhtb   1/1     Running   0          11s
aws-node-xsc5f   1/1     Running   0          19s
aws-node-z9qz7   1/1     Running   0          17s

marcellodesales commented 2 years ago

Fixed

I properly set up a CNI Role with the policy AmazonEKS_CNI_Policy and it worked
- https://docs.aws.amazon.com/eks/latest/userguide/cni-iam-role.html
- Testing with example https://www.eksworkshop.com/beginner/115_sg-per-pod/50_deploy/ (Still can't get the DNS resolution to work)

$ kubectl describe pod -n zzz-aws-x-y-z-dev green-pod-5db68f6449-n6m8x
Name:         green-pod-5db68f6449-n6m8x
Namespace:    zzz-aws-x-y-z-dev
Priority:     0
Node:         ip-172-16-1-214.sa-east-1.compute.internal/172.16.1.214
Start Time:   Sun, 30 Jan 2022 01:17:32 -0800
Labels:       app=green-pod
              pod-template-hash=5db68f6449
Annotations:  kubernetes.io/psp: eks.privileged
              vpc.amazonaws.com/pod-eni:
                [{"eniId":"eni-0243826591076371e","ifAddress":"02:4a:df:8f:23:84","privateIp":"172.16.1.224","vlanId":2,"subnetCidr":"172.16.1.0/24"}]
Status:       Running
IP:           172.16.1.89
IPs:
  IP:           172.16.1.89
Controlled By:  ReplicaSet/green-pod-5db68f6449
Containers:
  green-pod:
    Container ID:   docker://bf1dd9d5fbad507bd0d503baa2cc2f65d2d96652b4c69da7a4d48bc2b8ff84c7
    Image:          fmedery/app:latest
    Image ID:       docker-pullable://fmedery/app@sha256:64fadcdfe9f826b842a8c576ae4b9dbc4e18a9865226e556baad71bfea239292
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sun, 30 Jan 2022 01:17:54 -0800
      Finished:     Sun, 30 Jan 2022 01:17:54 -0800
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sun, 30 Jan 2022 01:17:37 -0800
      Finished:     Sun, 30 Jan 2022 01:17:37 -0800
    Ready:          False
    Restart Count:  2
    Limits:
      cpu:                        512m
      memory:                     512Mi
      vpc.amazonaws.com/pod-eni:  1
    Requests:
      cpu:                        500m
      memory:                     256Mi
      vpc.amazonaws.com/pod-eni:  1
    Environment:
      HOST:      <set to the key 'host' in secret 'rds-postgres'>  Optional: false
      DBNAME:    dbnameeee
      USER:      <set to the key 'username' in secret 'rds-postgres'>  Optional: false
      PASSWORD:  <set to the key 'password' in secret 'rds-postgres'>  Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mp4k8 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-mp4k8:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             vpc.amazonaws.com/pod-eni:NoSchedule op=Exists
Events:
  Type     Reason                  Age                From                     Message
  ----     ------                  ----               ----                     -------
  Normal   Scheduled               34s                default-scheduler        Successfully assigned zzz-zzz-sae1-prdt-ppd-dev/green-pod-5db68f6449-n6m8x to ip-172-16-1-214.sa-east-1.compute.internal
  Normal   SecurityGroupRequested  34s                vpc-resource-controller  Pod will get the following Security Groups [sg-05c167fb067217b5e]
  Normal   ResourceAllocated       33s                vpc-resource-controller  Allocated [{"eniId":"eni-0243826591076371e","ifAddress":"02:4a:df:8f:23:84","privateIp":"172.16.1.224","vlanId":2,"subnetCidr":"172.16.1.0/24"}] to the pod
  Normal   Pulled                  31s                kubelet                  Successfully pulled image "fmedery/app:latest" in 1.485599101s
  Normal   Pulled                  29s                kubelet                  Successfully pulled image "fmedery/app:latest" in 1.462975983s
  Normal   Pulling                 13s (x3 over 33s)  kubelet                  Pulling image "fmedery/app:latest"
  Normal   Created                 12s (x3 over 31s)  kubelet                  Created container green-pod
  Normal   Started                 12s (x3 over 31s)  kubelet                  Started container green-pod
  Warning  BackOff                 12s (x3 over 29s)  kubelet                  Back-off restarting failed container
  Normal   Pulled                  12s                kubelet                  Successfully pulled image "fmedery/app:latest" in 1.439726978s

Interface

It was created "properly"
- I still can't access the RDS DB
- the Network Interface created for the POD is as follows: eni-0243826591076371e

$ aws ec2 describe-network-interfaces | jq -r '.NetworkInterfaces[] | select(.NetworkInterfaceId == "eni-0243826591076371e")'
{
  "AvailabilityZone": "sa-east-1a",
  "Description": "aws-k8s-branch-eni",
  "Groups": [
    {
      "GroupName": "conn-4-pod-rds-group",
      "GroupId": "sg-05c167fb067217b5e"
    }
  ],
  "InterfaceType": "branch",
  "Ipv6Addresses": [],
  "MacAddress": "02:4a:df:8f:23:84",
  "NetworkInterfaceId": "eni-0243826591076371e",
  "OwnerId": "806101772216",
  "PrivateDnsName": "ip-172-16-1-224.sa-east-1.compute.internal",
  "PrivateIpAddress": "172.16.1.224",
  "PrivateIpAddresses": [
    {
      "Primary": true,
      "PrivateDnsName": "ip-172-16-1-224.sa-east-1.compute.internal",
      "PrivateIpAddress": "172.16.1.224"
    }
  ],
  "RequesterId": "285275063451",
  "RequesterManaged": false,
  "SourceDestCheck": true,
  "Status": "in-use",
  "SubnetId": "subnet-0a130d65efd4f0071",
  "TagSet": [
    {
      "Key": "eks:eni:owner",
      "Value": "eks-vpc-resource-controller"
    },
    {
      "Key": "vpcresources.k8s.aws/trunk-eni-id",
      "Value": "eni-017f74c86c392f663"
    },
    {
      "Key": "kubernetes.io/cluster/eks-ppd-prdt-super-cash",
      "Value": "owned"
    },
    {
      "Key": "vpcresources.k8s.aws/vlan-id",
      "Value": "2"
    }
  ],
  "VpcId": "vpc-04858bd8c565075ae"
}

justin-obn commented 2 years ago

@GoGoPenguin Were you finally able to get a fix?

cgchinmay commented 2 years ago

@justin-obn are you facing a similar issue ? what version of vpc-cni are you using. Could you share your logs at k8s-awscni-triage@amazon.com

justin-obn commented 2 years ago

@cgchinmay Thanks for you quick reply. I'm not facing this issue anymore

jayanthvn commented 2 years ago

@GoGoPenguin - This is expected behavior on your cluster and we see the max IPs are reached and once freed we are seeing new pods are getting IPs. If the issue persists please feel free to open an issue.

github-actions[bot] commented 2 years ago

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.

riteshsonawane1372 commented 1 year ago

Having similar issue.

wdonne commented 5 months ago

I deployed an EKS cluster with Kubernetes 1.29 and I had to update the kube-proxy DaemonSet to the latest version to make it work.

aws / amazon-vpc-cni-k8s

AWS EKS - add cmd: failed to assign an IP address to container #1791

🐞 Similar problem: networkPlugin cni failed to set up pod

⏲️ VPC-CNI Plugin status still on Creating

❓ Potential Problem: Missing Role AmazonEKSCNIRole

📝 EDIT: A couple of things still

The pods for the aws-node have been crashing

I had a the AmazonEKS_CNI_Policy role missing and I manually attached

Failure in Console

Installing from the UI: Works

Installing from AWS CLI: Fails when specifying a Role

Changing to not specifying a role

Fixed

Interface

⚠️COMMENT VISIBILITY WARNING⚠️

⏲️ VPC-CNI Plugin status still on `Creating`

❓ Potential Problem: Missing Role `AmazonEKSCNIRole`

I had a the `AmazonEKS_CNI_Policy` role missing and I manually attached