@axelczk - Can you please open a support ticket for this? The team should be able to check whether there are any permission issues pulling from ECR. It looks like you are getting a 401. This issue doesn't belong to the CNI.
I know I'm getting a 401. The real question is why it works when the node has just started, so I can pull this image, and then stops working after some hours or days.
I don't know which service is responsible for this.
Hi, I'm having this exact issue too after upgrading EKS. Is there any solution to this?
Just had the same issue and found this ticket. In my case, the pause image was gone after pruning unused images, and it turns out containerd cannot pull it back on its own. So I had to download it manually.
I'm using Bottlerocket OS and it was not that trivial. Here's how to do it.
# 1. Get an ECR auth token (used as the password in step 3)
aws ecr get-login-password --region <your-region>
# 2. Download crictl into the admin container
cd /tmp
yum install tar -y
curl -fsL -o crictl.tar.gz https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.26.0/crictl-v1.26.0-linux-amd64.tar.gz
tar zxf crictl.tar.gz
chmod u+x crictl
# 3. Pull the pause image through the host's CRI socket, authenticating as user "AWS" with the token from step 1
./crictl --runtime-endpoint=unix:///.bottlerocket/rootfs/run/dockershim.sock pull --creds "AWS:TOKEN_FROM_STEP_1" XXXXX.dkr.ecr.XXXXXXX.amazonaws.com/eks/pause:3.1-eksbuild.1
Now you have that pause image in place, so pods should be able to start normally.
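To double-check, you can list the images through the same socket and confirm the pause image is back (a quick sketch reusing the crictl binary from the steps above):
./crictl --runtime-endpoint=unix:///.bottlerocket/rootfs/run/dockershim.sock images | grep pause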
Hi, I'm having this exact issue too after upgrading EKS. Is there any solution to this?
There is an error on their side on the EKS node. You need to add this bootstrap extra arg: '--pod-infra-container-image=602401143452.dkr.ecr.${var.region}.amazonaws.com/eks/pause:3.1-eksbuild.1'
With this, the garbage collector will not remove the pause container image and you will not need to pull the image again.
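In case it helps, here is a rough sketch of one way to pass that flag on a self-managed node using the EKS AMI's bootstrap script; the cluster name and region are placeholders, so adapt it to however your nodes are provisioned (launch template user data, the Terraform module's bootstrap_extra_args, etc.):
#!/bin/bash
# User data sketch for a node running the Amazon EKS optimized AMI.
# <cluster-name> and <region> are placeholders.
/etc/eks/bootstrap.sh <cluster-name> \
  --kubelet-extra-args '--pod-infra-container-image=602401143452.dkr.ecr.<region>.amazonaws.com/eks/pause:3.1-eksbuild.1'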
As far as I know, the garbage collector takes only disk space into account. In my case, the server was running out of inodes, so I had to prune images manually.
We contacted AWS support on our side, and after days of exchanges and debugging this is the explanation we found: the garbage collector was pruning images on the node and removing the pause image along with the other images. I still have the ticket somewhere and can check for the full explanation if necessary.
Hi, I'm having the same issue after upgrading EKS to 1.25. Is this solution still valid? I think this flag is deprecated.
Having the same issue in an EKS upgrade to 1.24
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5": failed to pull image "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause:3.5": pulling from host 602401143452.dkr.ecr.us-east-1.amazonaws.com failed with status code [manifests 3.5]: 401 Unauthorized
Having the same issue in EKS after upgrading to 1.27. Can anyone help me, please?
Aug 10 10:50:06 kubelet[3229]: E0810 10:50:06.304292 3229 pod_workers.go:1294] "Error syncing pod, skipping" err="failed to \"CreatePodSandbox\" for \ with CreatePodSandboxError: \"Failed to create sandbox for pod \\\: rpc error: code = Unknown desc = failed to get sandbox image \\\"602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5\\\": failed to pull image \\\"602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5\\\": failed to pull and unpack image \\\"602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5\\\": failed to resolve reference \\\"602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5\\\": pulling from host 602401143452.dkr.ecr.us-west-2.amazonaws.com failed with status code [manifests 3.5]: 401 Unauthorized\""
journalctl -xu kubelet.service --no-pager | grep -i credent
Aug 10 12:31:11 kubelet[3222]: W0810 12:31:11.489570 3222 feature_gate.go:241] Setting GA feature gate KubeletCredentialProviders=true. It will be removed in a future release.
Aug 10 12:31:11 kubelet[3222]: I0810 12:31:11.490173 3222 flags.go:64] FLAG: --image-credential-provider-bin-dir="/etc/eks/image-credential-provider"
Aug 10 12:31:11 kubelet[3222]: I0810 12:31:11.490179 3222 flags.go:64] FLAG: --image-credential-provider-config="/etc/eks/image-credential-provider/config.json"
Aug 10 12:31:11 kubelet[3222]: W0810 12:31:11.490930 3222 feature_gate.go:241] Setting GA feature gate KubeletCredentialProviders=true. It will be removed in a future release.
Aug 10 12:31:11 kubelet[3222]: I0810 12:31:11.490944 3222 feature_gate.go:249] feature gates: &{map[KubeletCredentialProviders:true RotateKubeletServerCertificate:true]}
Aug 10 12:31:11 kubelet[3222]: W0810 12:31:11.495660 3222 feature_gate.go:241] Setting GA feature gate KubeletCredentialProviders=true. It will be removed in a future release.
Aug 10 12:31:11 kubelet[3222]: I0810 12:31:11.495678 3222 feature_gate.go:249] feature gates: &{map[KubeletCredentialProviders:true RotateKubeletServerCertificate:true]}
Aug 10 12:31:11 kubelet[3222]: W0810 12:31:11.495793 3222 feature_gate.go:241] Setting GA feature gate KubeletCredentialProviders=true. It will be removed in a future release.
Aug 10 12:31:11 kubelet[3222]: I0810 12:31:11.495808 3222 feature_gate.go:249] feature gates: &{map[KubeletCredentialProviders:true RotateKubeletServerCertificate:true]}
Aug 10 12:31:17 kubelet[3222]: I0810 12:31:17.568804 3222 provider.go:102] Refreshing cache for provider: *credentialprovider.defaultDockerConfigProvider
ps aux | grep bin/kubelet | grep -v grep
root 3222 2.0 0.6 1812360 104096 ? Ssl 12:31 3:34 /usr/bin/kubelet --config /etc/kubernetes/kubelet/kubelet-config.json --kubeconfig /var/lib/kubelet/kubeconfig --container-runtime-endpoint unix:///run/containerd/containerd.sock --image-credential-provider-config /etc/eks/image-credential-provider/config.json --image-credential-provider-bin-dir /etc/eks/image-credential-provider --node-ip=xxxxx --v=2 --hostname-override=ip-xxxxx.us-west-2.compute.internal --cloud-provider=external --node-labels=eks.amazonaws.com/nodegroup-image=ami-xxxxx,eks.amazonaws.com/capacityType=SPOT,environment=test,eks.amazonaws.com/nodegroup=testSpot --max-pods=58
cat /etc/eks/image-credential-provider/config.json
{
  "apiVersion": "kubelet.config.k8s.io/v1",
  "kind": "CredentialProviderConfig",
  "providers": [
    {
      "name": "ecr-credential-provider",
      "matchImages": [
        "*.dkr.ecr.*.amazonaws.com",
        "*.dkr.ecr.*.amazonaws.com.cn",
        "*.dkr.ecr-fips.*.amazonaws.com",
        "*.dkr.ecr.*.c2s.ic.gov",
        "*.dkr.ecr.*.sc2s.sgov.gov"
      ],
      "defaultCacheDuration": "12h",
      "apiVersion": "credentialprovider.kubelet.k8s.io/v1"
    }
  ]
}
ls -la /etc/eks/image-credential-provider
drwxr-xr-x 2 root root 56 Jul 28 04:18 .
drwxr-xr-x 5 root root 265 Jul 28 04:18 ..
-rw-r--r-- 1 root root 477 Jul 28 04:15 config.json
-rwxrwxr-x 1 root root 16072704 Jun 30 18:40 ecr-credential-provider
Getting a token works on the node:
aws ecr get-login-password --region us-west-2
Fetching the image via ./crictl works fine from the node.
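One more thing that might be worth checking (a sketch based on how kubelet exec credential providers work, not an official AWS procedure): feed the ecr-credential-provider plugin a CredentialProviderRequest on stdin, the same way the kubelet invokes it, and see whether it returns credentials for the pause image.
# Paths and image ref match the config and error messages above; adjust region/account for your cluster.
echo '{"apiVersion":"credentialprovider.kubelet.k8s.io/v1","kind":"CredentialProviderRequest","image":"602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5"}' \
  | sudo /etc/eks/image-credential-provider/ecr-credential-provider
If that prints a CredentialProviderResponse with an auth token, the kubelet side looks fine; as I understand it, the sandbox image can also be pulled by containerd directly, and that path does not go through the kubelet credential provider at all.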
hi @interair
Hi, we are also having the same issue in our environment.
The kubelet is able to pull all system images (amazon-k8s-cni-init, amazon-k8s-cni) except the pause image, as shown below.
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": failed to pull image "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": failed to resolve reference "900889452093.dkr.ecr.ap-south-2.amazonaws.com/eks/pause:3.5": pulling from host 900889452093.dkr.ecr.ap-south-2.amazonaws.com failed with status code [manifests 3.5]: 401 Unauthorized
Fetching the image via ./crictl is not a viable solution in a production environment.
Can anyone help me, please?
Any updates here? Same issue!
@VikramPunnam @hamdallahjodah @interair @ddl-slevine I am not familiar with this issue, and it is not an issue with the VPC CNI, so I suggest opening an AWS support case to get help. That will be the fastest way to a resolution, and you can share your findings here.
I have the same issue after upgrading to 1.29. Some nodes can download the pause image, but some cannot, so all pods on those nodes just hang in the creating state. I don't understand why the pause image gets a 401 only sometimes.
We also have this issue after upgrading to 1.29. Are there any good hints for where I can start digging?
I have the same issue with EKS 1.29 :(
I've observed the same after the v1.29 upgrade today too. Tried replacing an affected compute node with a fresh one and it seems to have helped (at least for a while). So far so good...
I think the problem happens after 12 hours, when the session token expires. Curiously, the instance where I tested it didn't have any inode/disk space problems.
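That would line up with the defaultCacheDuration in the credential provider config posted above; a quick sketch to confirm the value on a node:
grep defaultCacheDuration /etc/eks/image-credential-provider/config.json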
If this is happening on the official EKS AMI, can you open an issue in our repo so we can look into it? https://github.com/awslabs/amazon-eks-ami
The --pod-infra-container-image flag is set on the kubelet. I found that the disk on the node really does fill up after some time, and the kubelet image garbage collector deletes the pause image. So instead of deleting other images, it deletes the pause image, and once the pause image is deleted the node doesn't work. I found the reason for the full disk: in my case I had ttlSecondsAfterFinished: 7200 for Dagster jobs, and they consumed all the disk space. I changed it to ttlSecondsAfterFinished: 120 so jobs are cleaned up more frequently, and we don't have this issue any more. It's strange because I didn't have this issue on 1.28, and I didn't change any Dagster configuration between version upgrades. My guess is that the kubelet image garbage collector works differently in 1.28 and 1.29.
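If anyone wants to check whether they're hitting the same thing, a quick sketch for an affected node (paths assume the default containerd layout on the EKS AMI):
# Image GC is driven by filesystem pressure, so look at both bytes and inodes
df -h /var/lib/containerd
df -i /var/lib/containerd
# See which images are still on the node (if crictl is available there)
sudo crictl images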
@ohrab-hacken --pod-infra-container-image was deprecated in k8s 1.27. As I understand it, the container runtime will prune the image unless it is marked as pinned. From the EKS 1.28 AMI, it does seem like the pause image is not pinned for some reason. @cartermckinnon, do you know if it should be?
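For anyone debugging this, a rough way to check whether containerd considers the pause image pinned on a given node (a sketch; newer containerd/CRI versions expose a pinned flag in the image status, and containerd is supposed to label pinned images):
sudo crictl inspecti 602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5 | grep -i pinned
sudo ctr --namespace k8s.io images list | grep pause   # assumption: pinned images carry an io.cri-containerd.pinned label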
This issue is being discussed at https://github.com/awslabs/amazon-eks-ami/issues/1597
Is there any new progress on this matter?
Did you follow the issue I linked to? This issue is in the EKS AMI, not the VPC CNI, so short- and long-term resolutions are being discussed there.
I will take another look at that.
Just started running into this today?
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": failed to pull image "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.ca-central-1.amazonaws.com/eks/pause:3.5": unexpected status from HEAD request to https://602401143452.dkr.ecr.ca-central-1.amazonaws.com/v2/eks/pause/manifests/3.5: 401 Unauthorized
2/3 replicas of my pod deployed; all were scheduled on different nodes, but all nodes are self-managed and running the same AMI. Thought maybe it was only affecting one AZ, tried redeploying again, and now only 1/3 worked. Not sure yet if it's only affecting specific nodes or what...
Edit: I don't see any pattern with regard to node type, node group, AZ, specific resources, or anything else.
It seems to have started a few days ago. It's not really AMI-related. Not sure if it's specifically VPC CNI related either, though it of course prevented me from updating that plugin.
Doing an instance refresh and/or terminating/re-creating the nodes/instances that were failing seems to have resolved the issue (for now?) - they were all redeployed with the same AMI and everything. No idea WTH.
We recently switched our cluster to EKS 1.22 with a managed node group, and since then we sometimes get this error when containers are created. We don't have a fix other than replacing the node where the pod is trying to be scheduled.
I don't know if this is the right place to ask. If it's not, please tell me where I can post this issue.