javilaadevinta opened this issue 1 day ago
> pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials
That means containerd is attempting to pull the sandbox image itself (without ECR credentials), which won’t work. Do you see ImageDelete events in the logs or was the sandbox image never present on this node?
There’s a systemd unit on AL2 that pulls the image (using ECR credentials) that may have failed; you can check:
journalctl -u sandbox-image
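If that unit did fail, a rough manual check/recovery looks like this. This is a sketch, not an official procedure: it assumes the node's instance role can read ECR and uses the eu-central-1 pause image URI reported elsewhere in this thread, so adjust region and URI for your cluster.

```bash
# Check whether the one-shot pull unit failed on this node
systemctl status sandbox-image
journalctl -u sandbox-image --no-pager

# Re-pull the sandbox image with ECR credentials instead of letting
# containerd attempt an anonymous pull (region/URI are assumptions taken
# from the FailedCreatePodSandBox events in this thread)
TOKEN=$(aws ecr get-login-password --region eu-central-1)
sudo ctr --namespace k8s.io image pull \
  --user "AWS:${TOKEN}" \
  602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5
```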
We faced the same issue yesterday when fetching the pause image 602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5; in the pod events we got this error:
kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to pull image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials
We run EKS 1.29 on both the control plane and the nodes, but it sounds like something unrelated to the AMI itself happened.
Our workaround for now, on the nodes that hit this issue, is to edit /etc/containerd/config.toml, change the pause image to public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest, and then restart containerd with sudo systemctl restart containerd.
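For reference, a sketch of that edit, assuming the stock containerd CRI plugin config layout on the node (the sed one-liner is just one way to apply it; editing the file by hand works equally well):

```bash
# /etc/containerd/config.toml -- the relevant key lives under the CRI plugin:
#   [plugins."io.containerd.grpc.v1.cri"]
#     sandbox_image = "public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest"

# Apply the change in place and restart containerd
sudo sed -i \
  's|^\([[:space:]]*sandbox_image[[:space:]]*=[[:space:]]*\).*|\1"public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest"|' \
  /etc/containerd/config.toml
sudo systemctl restart containerd
```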
Do you know of a global issue in eu-central-1?
> pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials
>
> That means containerd is attempting to pull the sandbox image itself (without ECR credentials), which won’t work. Do you see ImageDelete events in the logs or was the sandbox image never present on this node?
>
> There’s a systemd unit on AL2 that pulls the image (using ECR credentials) that may have failed; you can check:
>
> journalctl -u sandbox-image
That's a good point. Sadly, we have no system logs left for this specific node, but going by the other metrics we still have, the node shouldn't have met any GC condition.
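For anyone comparing numbers, the image GC thresholds the kubelet actually applies can be checked directly on a node. A sketch follows; the config path is the AL2 AMI default and is an assumption that may differ in other setups:

```bash
# Standard KubeletConfiguration fields governing image GC; if unset, the
# defaults are imageGCHighThresholdPercent=85 and imageGCLowThresholdPercent=80,
# so ~10% disk usage should never trigger image garbage collection.
grep -iE 'imageGC|imageMinimumGCAge' /etc/kubernetes/kubelet/kubelet-config.json
```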
What happened:
One node in one of our clusters hit an error related to the sandbox container: the sandbox image could not be pulled. AMI: amazon-eks-node-1.30-v20241109
Warning FailedCreatePodSandBox 3m31s (x210 over 48m) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to pull image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials
Disk usage was around 10%. We no longer have access to the node, as it was deleted.
This is our config on the containerd side:
And this is our kubelet flag:
Could this somehow be a regression of this issue? https://github.com/awslabs/amazon-eks-ami/issues/1597
What you expected to happen: Kubelet shouldn't GC the image, or at least nodes should be able to pull it again if it was somehow deleted.
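Not part of the original report, but one mitigation idea along these lines: recent containerd releases treat images labeled as pinned as exempt from CRI image garbage collection, so the sandbox image can be labeled explicitly. A sketch, with the image URI taken from the events above; verify that your containerd version honors the pinned label before relying on this:

```bash
# Label the sandbox image as pinned so image GC should leave it alone
sudo ctr --namespace k8s.io image label \
  602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5 \
  io.cri-containerd.pinned=pinned

# Confirm the label is present
sudo ctr --namespace k8s.io image ls | grep pause
```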
How to reproduce it (as minimally and precisely as possible): I have no idea how to reproduce it, as this is the first time we have observed it in one of our clusters.
Environment: