aws / amazon-vpc-cni-k8s

Networking plugin repository for pod networking in Kubernetes using Elastic Network Interfaces on AWS
Apache License 2.0

Pods stuck in ContainerCreating due to pull error unauthorized #2030

Closed gh-axel-czarniak closed 2 years ago

gh-axel-czarniak commented 2 years ago

We recently switched our cluster to EKS 1.22 with a managed node group, and since then we sometimes get this error when containers are created. We don't have a fix other than replacing the node where the pod is trying to be scheduled.

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.eu-north-1.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to pull image "602401143452.dkr.ecr.eu-north-1.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to pull and unpack image "602401143452.dkr.ecr.eu-north-1.amazonaws.com/eks/pause:3.1-eksbuild.1": failed to resolve reference "602401143452.dkr.ecr.eu-north-1.amazonaws.com/eks/pause:3.1-eksbuild.1": pulling from host 602401143452.dkr.ecr.eu-north-1.amazonaws.com failed with status code [manifests 3.1-eksbuild.1]: 401 Unauthorized   

I don't know if this is the right place to ask. If it is not, please tell me where I should post this issue.

jayanthvn commented 2 years ago

@axelczk - Can you please open a support ticket for this? The team should be able to check whether there is a permission issue pulling from ECR. It looks like you are getting a 401. This issue doesn't belong to the CNI.

gh-axel-czarniak commented 2 years ago

I know I'm getting a 401. The real question is why it works when the node has just started and I can pull this image, but after some days or hours it no longer works.

I don't know which service is responsible for this.

dotsuber commented 2 years ago

Hi, I'm having this exact issue too after upgrading EKS. Is there any solution for it?

juris commented 1 year ago

Just had the same issue and found this ticket. In my case, the pause image was gone after pruning unused images, and it turns out it could not be downloaded back by containerd, so I had to download it manually.

I'm using Bottlerocket OS, so it was not that trivial. Here's how to do it.

  1. Get your AWS ECR auth token first (I did this with my AWS access key / secret key on a laptop)
    aws ecr get-login-password --region <your-region>
  2. Login to the affected instance and get crictl
    cd /tmp
    yum install tar -y
    curl -fsL -o crictl.tar.gz https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.26.0/crictl-v1.26.0-linux-amd64.tar.gz
    tar zxf crictl.tar.gz
    chmod u+x crictl
  3. Pull the pause image
    ./crictl --runtime-endpoint=unix:///.bottlerocket/rootfs/run/dockershim.sock pull --creds "AWS:TOKEN_FROM_STEP_1" XXXXX.dkr.ecr.XXXXXXX.amazonaws.com/eks/pause:3.1-eksbuild.1

    Now you have that pause image in place, so pods should be able to start normally.
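
For convenience, steps 1 and 3 can be combined into a single script. This is only a sketch, assuming the same Bottlerocket dockershim socket as above; the region and account ID are just the values from the error message at the top of this issue, so substitute your own:

    # Fetch the ECR token and feed it straight to crictl.
    REGION=eu-north-1            # region from the failing image reference
    ACCOUNT_ID=602401143452      # account from the failing image reference
    TOKEN="$(aws ecr get-login-password --region "${REGION}")"   # run wherever you have AWS credentials
    ./crictl --runtime-endpoint=unix:///.bottlerocket/rootfs/run/dockershim.sock \
      pull --creds "AWS:${TOKEN}" \
      "${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/eks/pause:3.1-eksbuild.1"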

gh-axel-czarniak commented 1 year ago

Hi, I'm having this exact issue too after upgrading EKS. Is there any solution for it?

There is an error on their side on the EKS node. You need to add this bootstrap extra arg: '--pod-infra-container-image=602401143452.dkr.ecr.${var.region}.amazonaws.com/eks/pause:3.1-eksbuild.1'

With this, the garbage collector will not remove the pause image and you will not need to pull it again.
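
For anyone wondering where that argument goes: on the EKS-optimized Amazon Linux AMI it can be passed through the node user data. This is just a sketch, assuming the stock /etc/eks/bootstrap.sh script; the cluster name and region are placeholders for your own values:

    #!/bin/bash
    # Sketch of launch-template user data for an EKS Amazon Linux node.
    CLUSTER_NAME=my-cluster   # placeholder
    REGION=eu-north-1         # placeholder
    /etc/eks/bootstrap.sh "${CLUSTER_NAME}" \
      --kubelet-extra-args "--pod-infra-container-image=602401143452.dkr.ecr.${REGION}.amazonaws.com/eks/pause:3.1-eksbuild.1"

The kubelet skips the image named by --pod-infra-container-image during its image garbage collection, which is why the pause image stops disappearing.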

juris commented 1 year ago

As far as I know, the garbage collector only takes disk space into account. In my case, the server was running out of inodes, so I had to manually prune images.
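
If you want to check whether you are hitting the same thing, compare byte usage and inode usage on the container runtime volume (a plain df sketch; the mount point may differ, e.g. on Bottlerocket):

    # kubelet image GC thresholds are based on bytes used, so a volume that is
    # out of inodes but not out of bytes will not trigger it
    df -h /var/lib/containerd
    df -i /var/lib/containerd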

gh-axel-czarniak commented 1 year ago

We contacted AWS support on our side, and after days of exchange and debugging this was the explanation we found: the garbage collector was pruning images on the node and removing the pause image along with other images. I still have the ticket somewhere and can check for the full explanation if necessary.

dotsuber commented 1 year ago

Hi, I'm having the same issue after upgrading EKS to 1.25. Is this solution still valid? I think this feature flag is deprecated.

Hi, I'm having this exact issue too after upgrading EKS. Is there any solution for it?

There is an error on their side on the EKS node. You need to add this bootstrap extra arg: '--pod-infra-container-image=602401143452.dkr.ecr.${var.region}.amazonaws.com/eks/pause:3.1-eksbuild.1'

With this, the garbage collector will not remove the pause image and you will not need to pull it again.

ddl-slevine commented 1 year ago

Having the same issue in an EKS upgrade to 1.24

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5": failed to pull image "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5": pulling from host 602401143452.dkr.ecr.us-east-1.amazonaws.com failed with status code [manifests 3.5]: 401 Unauthorized

interair commented 1 year ago

Having the same issue in EKS after upgrading to 1.27, can anyone help me, please?

Aug 10 10:50:06  kubelet[3229]: E0810 10:50:06.304292    3229 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \ with CreatePodSandboxError: \"Failed to create sandbox for pod \\\: rpc error: code = Unknown desc = failed to get sandbox image \\\"602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5\\\": failed to pull image \\\"602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5\\\": failed to pull and unpack image \\\"602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5\\\": failed to resolve reference \\\"602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5\\\": pulling from host 602401143452.dkr.ecr.us-west-2.amazonaws.com failed with status code [manifests 3.5]: 401 Unauthorized\"" 

journalctl -xu kubelet.service --no-pager | grep -i credent

Aug 10 12:31:11 kubelet[3222]: W0810 12:31:11.489570    3222 feature_gate.go:241] Setting GA feature gate KubeletCredentialProviders=true. It will be removed in a future release.
Aug 10 12:31:11 kubelet[3222]: I0810 12:31:11.490173    3222 flags.go:64] FLAG: --image-credential-provider-bin-dir="/etc/eks/image-credential-provider"
Aug 10 12:31:11 kubelet[3222]: I0810 12:31:11.490179    3222 flags.go:64] FLAG: --image-credential-provider-config="/etc/eks/image-credential-provider/config.json"
Aug 10 12:31:11 kubelet[3222]: W0810 12:31:11.490930    3222 feature_gate.go:241] Setting GA feature gate KubeletCredentialProviders=true. It will be removed in a future release.
Aug 10 12:31:11 kubelet[3222]: I0810 12:31:11.490944    3222 feature_gate.go:249] feature gates: &{map[KubeletCredentialProviders:true RotateKubeletServerCertificate:true]}
Aug 10 12:31:11 kubelet[3222]: W0810 12:31:11.495660    3222 feature_gate.go:241] Setting GA feature gate KubeletCredentialProviders=true. It will be removed in a future release.
Aug 10 12:31:11 kubelet[3222]: I0810 12:31:11.495678    3222 feature_gate.go:249] feature gates: &{map[KubeletCredentialProviders:true RotateKubeletServerCertificate:true]}
Aug 10 12:31:11 kubelet[3222]: W0810 12:31:11.495793    3222 feature_gate.go:241] Setting GA feature gate KubeletCredentialProviders=true. It will be removed in a future release.
Aug 10 12:31:11 kubelet[3222]: I0810 12:31:11.495808    3222 feature_gate.go:249] feature gates: &{map[KubeletCredentialProviders:true RotateKubeletServerCertificate:true]}
Aug 10 12:31:17 kubelet[3222]: I0810 12:31:17.568804    3222 provider.go:102] Refreshing cache for provider: *credentialprovider.defaultDockerConfigProvider

ps aux | grep bin/kubelet | grep -v grep

root      3222  2.0  0.6 1812360 104096 ?      Ssl  12:31   3:34 /usr/bin/kubelet --config /etc/kubernetes/kubelet/kubelet-config.json --kubeconfig /var/lib/kubelet/kubeconfig --container-runtime-endpoint unix:///run/containerd/containerd.sock --image-credential-provider-config /etc/eks/image-credential-provider/config.json --image-credential-provider-bin-dir /etc/eks/image-credential-provider --node-ip=xxxxx --v=2 --hostname-override=ip-xxxxx.us-west-2.compute.internal --cloud-provider=external --node-labels=eks.amazonaws.com/nodegroup-image=ami-xxxxx,eks.amazonaws.com/capacityType=SPOT,environment=test,eks.amazonaws.com/nodegroup=testSpot --max-pods=58

cat /etc/eks/image-credential-provider/config.json

{
  "apiVersion": "kubelet.config.k8s.io/v1",
  "kind": "CredentialProviderConfig",
  "providers": [
    {
      "name": "ecr-credential-provider",
      "matchImages": [
        "*.dkr.ecr.*.amazonaws.com",
        "*.dkr.ecr.*.amazonaws.com.cn",
        "*.dkr.ecr-fips.*.amazonaws.com",
        "*.dkr.ecr.*.c2s.ic.gov",
        "*.dkr.ecr.*.sc2s.sgov.gov"
      ],
      "defaultCacheDuration": "12h",
      "apiVersion": "credentialprovider.kubelet.k8s.io/v1"
    }
  ]
}

ls -la /etc/eks/image-credential-provider

drwxr-xr-x 2 root root       56 Jul 28 04:18 .
drwxr-xr-x 5 root root      265 Jul 28 04:18 ..
-rw-r--r-- 1 root root      477 Jul 28 04:15 config.json
-rwxrwxr-x 1 root root 16072704 Jun 30 18:40 ecr-credential-provider

Getting a token works on the node: aws ecr get-login-password --region us-west-2

Fetching the image via ./crictl works fine from the node
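
One more check, in case it helps someone: the kubelet's credential provider plugin can be invoked directly, the same way the kubelet calls it, by writing a CredentialProviderRequest to its stdin (a sketch, using the binary path and image from the output above; the exact response format depends on the plugin version):

    echo -n '{
      "apiVersion": "credentialprovider.kubelet.k8s.io/v1",
      "kind": "CredentialProviderRequest",
      "image": "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5"
    }' | sudo /etc/eks/image-credential-provider/ecr-credential-provider
    # a healthy plugin prints a CredentialProviderResponse with an ECR auth token;
    # an instance-profile/IAM problem surfaces here as an error instead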

VikramPunnam commented 1 year ago

Hi @interair,

We are also having the same issue in our environment.

The kubelet is able to pull all system images (amazon-k8s-cni-init, amazon-k8s-cni) except the pause image, as shown below.

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": failed to pull image "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": failed to resolve reference "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": pulling from host 900889452093.dkr.ecr.ap-south-2.amazonaws.com failed with status code [manifests 3.5]: 401 Unauthorized

Fetching the image via ./crictl is not a feasible solution in a production environment.

Can anyone help me, please?

hamdallahjodah commented 1 year ago

Any updates here? Same issue!

jdn5126 commented 1 year ago

@VikramPunnam @hamdallahjodah @interair @ddl-slevine I am not familiar with this issue, and it is not an issue with the VPC CNI, so I suggest opening an AWS support case to get help. That will be the fastest way to a resolution, and you can share your findings here.

ohrab-hacken commented 10 months ago

I have the same issue after upgrading to 1.29. Some nodes can download the pause image, but some cannot, so all pods on those nodes just hang in the creating state. I don't understand why the pause image gets a 401 only some of the time.

elvishsu66 commented 9 months ago

We also have this issue after upgrading to 1.29. Are there any good hints so I can start digging?

nightmareze1 commented 9 months ago

I have the same issue with EKS 1.29 :(

pjanouse commented 9 months ago

I've observed the same after the v1.29 upgrade today too. I tried replacing the affected compute node with a fresh one and it seems to have helped (at least for a while). So far so good...

nightmareze1 commented 9 months ago

I think the problem happens after 12 hours, when the session token expires. Curiously, the instance where I tested it didn't have any inode/space problems.

cartermckinnon commented 9 months ago

If this is happening on the official EKS AMI, can you open an issue in our repo so we can look into it? https://github.com/awslabs/amazon-eks-ami

ohrab-hacken commented 9 months ago

The --pod-infra-container-image flag is set on the kubelet. I found that the disk on my node really did become full after some time, and the kubelet image garbage collector deleted the pause image. So instead of deleting other images, it deleted the pause image, and once the pause image was gone the node stopped working. I found the reason for the full disk: in my case I had ttlSecondsAfterFinished: 7200 for Dagster jobs, and they consumed all the disk space. I changed it to ttlSecondsAfterFinished: 120 so jobs are cleaned up more frequently, and we don't have this issue any more. It's strange because I didn't have this issue on 1.28, and I didn't change any Dagster configuration between version upgrades. My guess is that the kubelet image garbage collector works differently in 1.28 and 1.29.
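
For anyone who wants to confirm they are in the same situation, the kubelet's image GC thresholds and the disk usage they react to can be checked on the node (a sketch, assuming the kubelet config path shown earlier in this thread; jq is only used for readability):

    # image GC kicks in when disk usage crosses imageGCHighThresholdPercent
    # (kubelet defaults are 85/80 when the fields are absent)
    sudo jq '{imageGCHighThresholdPercent, imageGCLowThresholdPercent}' \
      /etc/kubernetes/kubelet/kubelet-config.json
    df -h /var/lib/containerd   # the usage the garbage collector is reacting to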

jdn5126 commented 9 months ago

@ohrab-hacken --pod-infra-container-image was deprecated in k8s 1.27. As I understand it, the container runtime will prune the image unless it is marked as pinned. From the EKS 1.28 AMI, it does seem like the pause image is not pinned for some reason. @cartermckinnon do you know if it should be?
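
If anyone wants to verify this on their own nodes, both the configured sandbox image and its pinned status are visible from the host. A sketch (newer containerd/crictl versions expose a pinned field, older ones do not; the image reference is the one from the logs above):

    # which image containerd treats as the sandbox/pause image
    sudo grep sandbox_image /etc/containerd/config.toml
    # whether that image is marked as pinned (pinned images survive image GC and pruning)
    sudo crictl inspecti 602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5 | grep -i pinned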

jdn5126 commented 9 months ago

This issue is being discussed at https://github.com/awslabs/amazon-eks-ami/issues/1597

avisaradir commented 9 months ago

Is there any new progress solving this matter?

jdn5126 commented 9 months ago

Is there any new progress solving this matter?

Did you follow the issue I linked to? This issue is in the EKS AMI, not the VPC CNI, so short- and long-term resolutions are being discussed there.

avisaradir commented 9 months ago

Is there any new progress solving this matter?

Did you follow the issue I linked to? This issue is in the EKS AMI, not the VPC CNI, so short- and long-term resolutions are being discussed there.

I will take another look at that.

ForbiddenEra commented 8 months ago

Just started running into this today?

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": failed to pull image "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": unexpected status from HEAD request to https://602401143452.dkr.ecr.ca-central-1.amazonaws.com/v2/eks/pause/manifests/3.5: 401 Unauthorized

2/3 replicas for my pod deployed; all were scheduled on different nodes, but all nodes are self-managed and running the same AMI. I thought maybe it was only affecting one AZ, tried re-deploying again, and now only 1/3 worked. Not sure yet if it's only affecting specific nodes or what...

Edit: So, I don't see any pattern with regard to node type, node group, AZ, or specific resources.

It seems to have started a few days ago. Not really AMI-related. Not sure if it's specifically VPC CNI related either, though it of course prevented me from updating that plugin.

Doing an instance refresh and/or terminating and re-creating the failing nodes/instances seems to have resolved the issue (for now?). They were all redeployed with the same AMI and everything. No idea WTH.
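
For completeness, the "instance refresh" above was just the stock ASG operation; a sketch, assuming the self-managed node group is backed by an Auto Scaling group whose name you substitute for the placeholder:

    # roll every instance in the node group's ASG; replacement nodes come up
    # with a freshly pulled pause image
    aws autoscaling start-instance-refresh \
      --auto-scaling-group-name <node-group-asg-name> \
      --preferences '{"MinHealthyPercentage": 90}'
    aws autoscaling describe-instance-refreshes \
      --auto-scaling-group-name <node-group-asg-name>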