aws / amazon-vpc-cni-k8s

Networking plugin repository for pod networking in Kubernetes using Elastic Network Interfaces on AWS
Apache License 2.0

`aws-node` pod stuck after starting #3016

Open carlosrejano opened 1 month ago

carlosrejano commented 1 month ago

What happened: We are running an EKS cluster on Kubernetes 1.28 that uses Karpenter to scale dynamically. Nodes churn constantly, so new nodes are appearing all the time.

We've found that in some cases, after a new node appears, the aws-node pod on that node gets stuck. New pods cannot start because aws-node is not reachable, so container networking is never configured. See the pod event:

Warning  FailedCreatePodSandBox  3m53s (x268 over 62m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "94ee9c38f7a821bf5abf88a511f2ec99b13c3743e3dfc6327e82ad84833a9e69": plugin type="aws-cni" name="aws-cni" failed (add): add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:50051: connect: connection refused"
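The event above boils down to a refused TCP connection to 127.0.0.1:50051. A minimal sketch for checking that condition directly is shown here. It assumes it is run on the affected node itself (or in a host-network pod on that node); the function name `ipamd_port_open` is hypothetical and not part of the CNI project.

```python
import socket


def ipamd_port_open(host="127.0.0.1", port=50051, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    The defaults mirror the address in the FailedCreatePodSandBox event.
    This only checks that something accepts TCP connections on the port;
    it does not speak gRPC or verify that the listener is healthy.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    if ipamd_port_open():
        print("127.0.0.1:50051 is accepting connections")
    else:
        print("127.0.0.1:50051 is refusing connections or unreachable")
```

A `False` result on a node whose aws-node pod reports `Running` would match the symptom described in this issue.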

Checking the aws-node pod, I see two things:

  1. The pod is not fully ready (only 1 of its 2 containers is ready):
    NAME             READY   STATUS    RESTARTS   AGE
    aws-node-4xbrr   1/2     Running   0          4h5m
  2. The logs show that it seems to get stuck:
    Defaulted container "aws-node" out of: aws-node, aws-eks-nodeagent, aws-vpc-cni-init (init)
    Installed /host/opt/cni/bin/aws-cni
    Installed /host/opt/cni/bin/egress-cni
    time="2024-08-28T05:11:45Z" level=info msg="Starting IPAM daemon... "
    time="2024-08-28T05:11:45Z" level=info msg="Checking for IPAM connectivity... "
    time="2024-08-28T05:11:50Z" level=info msg="Copying config file... "
    time="2024-08-28T05:11:50Z" level=info msg="Successfully copied CNI plugin binary and config file."
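Because the pod still reports `Running`, the only visible hint in the listing is the READY column (`1/2`). As a rough sketch, pods in this state could be picked out of a saved pod listing like this. It assumes the text has the same column layout as the listing above; `not_ready_pods` is a hypothetical helper name.

```python
def not_ready_pods(listing):
    """Return (name, ready, total) for pods whose READY count is below total.

    `listing` is the plain-text table with a header row followed by one
    row per pod, where the second column looks like "1/2".
    """
    pods = []
    for line in listing.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2 or "/" not in fields[1]:
            continue  # not a pod row in the expected layout
        ready, total = (int(n) for n in fields[1].split("/"))
        if ready < total:
            pods.append((fields[0], ready, total))
    return pods


if __name__ == "__main__":
    sample = """
    NAME             READY   STATUS    RESTARTS   AGE
    aws-node-4xbrr   1/2     Running   0          4h5m
    """
    for name, ready, total in not_ready_pods(sample):
        print(f"{name}: {ready}/{total} containers ready")
```

Filtering on READY rather than STATUS matters here, since a stuck pod in this report shows `Running` with zero restarts.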

I did not have debug logs enabled.

After restarting the pod it works again.

Node AMI: v1.28.11-eks-1552ad0
AWS CNI: v1.18.2-eksbuild.1

Thanks!