awslabs / amazon-eks-ami

Packer configuration for building a custom EKS AMI
https://awslabs.github.io/amazon-eks-ami/
MIT No Attribution

Sandbox container image being GC'd in 1.30 #2061

Open javilaadevinta opened 1 day ago

javilaadevinta commented 1 day ago

What happened:

One node in one of our clusters hit an error related to the sandbox container image, which it was unable to pull. AMI: amazon-eks-node-1.30-v20241109

Warning FailedCreatePodSandBox 3m31s (x210 over 48m) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to pull image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials

The disk usage was around 10%. We no longer have access to the node, as it has since been deleted.

This is our config on the containerd side:

[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5"

And this is our kubelet flag:

--pod-infra-container-image=602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5
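
For reference, a quick way to verify which sandbox image containerd is actually configured with on a node (a sketch, not part of the original report; the config path is the usual one on the EKS AL2 AMI, and the crictl field name may vary by containerd version):

# Show the sandbox_image setting in the containerd config file
grep -n 'sandbox_image' /etc/containerd/config.toml

# containerd's CRI plugin also reports its effective config via crictl
crictl info | grep -i sandbox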

Could this somehow be a regression of this issue? https://github.com/awslabs/amazon-eks-ami/issues/1597

What you expected to happen: Kubelet shouldn't GC the image, or at least nodes should be able to pull it again if it was somehow deleted.

How to reproduce it (as minimally and precisely as possible): I have no idea how to reproduce it, as this is the first time we have observed this happening in one of our clusters.

Environment:

cartermckinnon commented 22 hours ago

pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials

That means containerd is attempting to pull the sandbox image itself (without ECR credentials), which won’t work. Do you see ImageDelete events in the logs or was the sandbox image never present on this node?

There’s a systemd unit on AL2 that pulls the image (using ECR credentials) that may have failed, you can check:

journalctl -u sandbox-image
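
For anyone debugging a live node, a sketch of checking that unit and re-pulling the image by hand, assuming the AWS CLI is available on the node (the account and region are the ones from this issue; this is not an official procedure):

# Check whether the AL2 sandbox-image unit ran and what it logged
systemctl status sandbox-image
journalctl -u sandbox-image --no-pager

# Manually re-pull the pause image using an ECR auth token
REGION=eu-central-1
IMAGE="602401143452.dkr.ecr.${REGION}.amazonaws.com/eks/pause:3.5"
ctr --namespace k8s.io image pull \
  --user "AWS:$(aws ecr get-login-password --region "$REGION")" \
  "$IMAGE"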
hany-mhajna-payu-gpo commented 5 hours ago

We faced the same issue yesterday fetching the pause image: 602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5

In the pod events we got this error:

kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to pull image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials

We run EKS 1.29 for both the control plane and the nodes, but it sounds like something unrelated to the AMI itself happened. Our workaround for now, on the nodes that had this issue, is to edit /etc/containerd/config.toml to point the pause image at public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest and then restart containerd: sudo systemctl restart containerd
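
As shell commands, that workaround looks roughly like this (a sketch of the steps above, not an official fix; back up the original config first):

# Keep a copy of the original containerd config
sudo cp /etc/containerd/config.toml /etc/containerd/config.toml.bak

# Point sandbox_image at the public EKS Distro pause image named above
sudo sed -i 's|^\( *sandbox_image *= *\).*|\1"public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest"|' /etc/containerd/config.toml

# Restart containerd so the new sandbox image takes effect
sudo systemctl restart containerd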

Do you know of a global issue in eu-central-1?

javilaadevinta commented 1 hour ago

pull access denied, repository does not exist or may require authorization: authorization failed: no basic auth credentials

That means containerd is attempting to pull the sandbox image itself (without ECR credentials), which won’t work. Do you see ImageDelete events in the logs or was the sandbox image never present on this node?

There’s a systemd unit on AL2 that pulls the image (using ECR credentials) that may have failed, you can check:

journalctl -u sandbox-image

That's a good point. Sadly, we no longer have system logs for that specific node, but judging from the other metrics this could fit, as the node didn't meet any of the GC conditions.
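
For anyone hitting this later, a quick sanity check that a node is nowhere near kubelet's image GC thresholds (a sketch; kubelet's defaults are imageGCHighThresholdPercent=85 and imageGCLowThresholdPercent=80, and the config path below is the usual EKS AL2 AMI location):

# Disk usage on the volume containerd stores images on
df -h /var/lib/containerd

# Image GC settings only appear here if they were overridden;
# otherwise kubelet falls back to its built-in defaults
grep -E 'imageGC|imageMinimumGCAge' /etc/kubernetes/kubelet/kubelet-config.json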