awslabs / amazon-eks-ami

Packer configuration for building a custom EKS AMI
https://awslabs.github.io/amazon-eks-ami/
MIT No Attribution

Sandbox container image being GC'd in 1.29 #1597

Closed nightmareze1 closed 8 months ago

nightmareze1 commented 8 months ago

AMI: amazon-eks-node-1.29-v20240117

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5": failed to pull image "602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5": unexpected status from HEAD request to https://602401143452.dkr.ecr.eu-west-2.amazonaws.com/v2/eks/pause/manifests/3.5: 401 Unauthorized

This started one day after upgrading EKS to 1.29.

cartermckinnon commented 8 months ago

It sounds like something deleted your pause container image.

I would check:

  1. Make sure that the --pod-infra-container-image flag passed to kubelet matches the sandbox_image in /etc/containerd/config.toml. This will prevent kubelet from deleting it during its image garbage collection process.
  2. Look for RemoveImage CRI calls in your containerd logs. It's likely that some other CRI client (not kubelet) is deleting the image.
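
For example, a quick way to check both of the above on a node (a rough sketch; whether RemoveImage shows up in the journal depends on containerd's log level):

    # 1. compare the kubelet flag with containerd's configured sandbox image
    ps -ef | grep [k]ubelet | tr ' ' '\n' | grep pod-infra-container-image
    grep sandbox_image /etc/containerd/config.toml

    # 2. look for image removals issued through CRI
    journalctl -u containerd | grep -i removeimage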
nightmareze1 commented 8 months ago

[~]# systemctl status kubelet

          └─3729 /usr/bin/kubelet --config /etc/kubernetes/kubelet/kubelet-config.json --kubeconfig /var/lib/kubelet/kubeconfig --container-runtime-endpoint unix:///run/containerd/containerd.sock --image-credential-provider-config /etc/eks/image-credential-provider/config.json --image-credential-provider-bin-dir /etc/eks/image-credential-provider --pod-infra-container-image=602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5 --v=2 

[~]# cat /etc/containerd/config.toml |grep 602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5

sandbox_image = "602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5"
jrsparks86 commented 8 months ago

We have also noticed this issue after updating to 1.29. If we rotate out the nodes, it recovers for some time, then the issue comes back a day later.

nightmareze1 commented 8 months ago

I'm using a temporary workaround proposed in an issue in the aws-node repo (I modified it a little, but it works):

# install crictl
curl -fsL -o crictl.tar.gz https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.29.0/crictl-v1.29.0-linux-amd64.tar.gz
tar zxf crictl.tar.gz
chmod u+x crictl
mv crictl /usr/bin/crictl

# write a small script that re-pulls the pause image with fresh ECR credentials;
# @@@ is a placeholder that the sed below turns into $, so the command substitution
# runs when the script executes rather than when this heredoc is written
cat <<EOF > /etc/eks/eks_creds_puller.sh
IMAGE_TOKEN=@@@(aws ecr get-login-password --region eu-west-2)
crictl --runtime-endpoint=unix:///run/containerd/containerd.sock pull --creds "AWS:\$IMAGE_TOKEN" 602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5
EOF

sed -i 's/@@@/\$/g' /etc/eks/eks_creds_puller.sh

chmod u+x /etc/eks/eks_creds_puller.sh

# run it every 5 minutes
echo "*/5 * * * * /etc/eks/eks_creds_puller.sh >> /var/log/eks_creds_puller 2>&1" | crontab -
ohrab-hacken commented 8 months ago

I'm experiencing the same issue. The --pod-infra-container-image flag is set on kubelet. I found that the disk on the node really does fill up after some time, and the kubelet garbage collector deletes the pause image. So, instead of deleting other images, it deletes the pause image, and once the pause image is gone the node doesn't work. I found the reason for the full disk: in my case I had ttlSecondsAfterFinished: 7200 for Dagster jobs, and they consumed all the disk space. I changed it to ttlSecondsAfterFinished: 120 so jobs are cleaned up more frequently, and we don't have this issue any more. It's strange because I didn't have this issue on 1.28, and I didn't change any Dagster configuration between version upgrades. My guess is that the kubelet image garbage collector works differently in 1.28 and 1.29.
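
For reference, this is the field being described above — a minimal sketch of a Job with a short TTL (the Job name and container are placeholders, not Dagster's actual manifests):

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: example-job            # placeholder name
    spec:
      ttlSecondsAfterFinished: 120 # the finished Job and its pods are deleted 2 minutes after completion
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: main
              image: busybox:1.36
              command: ["sh", "-c", "echo done"]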

ghost commented 8 months ago

We're experiencing the same issue as well.

wiseelf commented 8 months ago

I'm having that same issue after upgrading to 1.29 on both AL2 and Bottlerocket nodes.

havilchis commented 8 months ago

The kubelet flag --pod-infra-container-image is deprecated in 1.27+ [https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/]. The current implementation is that image GC reads the image properties set by the container runtime.

In the case of containerd, GC should skip images tagged with the property "pinned: true".

And containerd should flag the sandbox_image as pinned [https://github.com/containerd/containerd/pull/7944].

I believe this issue is related to containerd and the sandbox_image.

Although it is set in config.toml, the image is not flagged as "pinned: true".

I don't know if this is a general issue in containerd, but at least in my EKS cluster on 1.29 the sandbox image appears as "pinned": false:

./crictl images | grep pause | grep us-east-1 | grep pause
602401143452.dkr.ecr-fips.us-east-1.amazonaws.com/eks/pause                    3.5                          6996f8da07bd4       299kB
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause                         3.5                          6996f8da07bd4       299kB

./crictl inspecti 6996f8da07bd4 | grep pinned
    "pinned": false
cartermckinnon commented 8 months ago

It definitely seems like image pinning is the problem here. I'm trying to put a fix together 👍

cartermckinnon commented 8 months ago

I think the issue here is that the version of containerd used by Amazon Linux does not have pinned image support, which was added in 1.7.3: https://github.com/containerd/containerd/compare/v1.7.2...v1.7.3

I'm verifying that this hasn't been cherry-picked by the AL team. We'll probably have to do a hotfix in the immediate term.

cartermckinnon commented 8 months ago

AL intends to push containerd-1.7.11 to the package repositories soon, but I'll go ahead and put together a hotfix on our end.

cartermckinnon commented 8 months ago

I think the best bandaid for now is to periodically pull the sandbox image (if necessary); that's what #1601 does. @mmerkes @suket22 PTAL.
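
Not the actual #1601 change, but a rough sketch of the "pull only if missing" idea, assuming crictl is installed and the image/region match your cluster:

    SANDBOX_IMAGE="602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5"  # adjust account/region
    ENDPOINT="unix:///run/containerd/containerd.sock"
    # only re-pull if the image is no longer present on the node
    if ! crictl --runtime-endpoint="$ENDPOINT" inspecti "$SANDBOX_IMAGE" >/dev/null 2>&1; then
      TOKEN="$(aws ecr get-login-password --region eu-west-2)"
      crictl --runtime-endpoint="$ENDPOINT" pull --creds "AWS:$TOKEN" "$SANDBOX_IMAGE"
    fi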

Idan-Lazar commented 8 months ago

any updates?

StefanoMantero commented 8 months ago

We're experiencing the same issue as well, pretty randomly though. Any updates?

dekelummanu commented 8 months ago

+1

spatelwearpact commented 8 months ago

None of our applications or jobs are running in the cluster now! This is literally the highest priority issue with 1.29!

Tenzer commented 8 months ago

A small workaround I've done on our end to help alleviate the issue is to give the nodes in the cluster a bigger disk. This means it will take longer for the nodes to use enough disk space to trigger the garbage collection that deletes the pause image.
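
For anyone managing node groups with eksctl (an assumption on my part; names and sizes below are placeholders), a bigger root volume looks roughly like this:

    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    metadata:
      name: my-cluster        # placeholder
      region: eu-west-2       # placeholder
    managedNodeGroups:
      - name: workers         # placeholder
        instanceType: m5.large
        volumeSize: 200       # GiB; a larger root volume delays hitting kubelet's image GC thresholds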

wiseelf commented 8 months ago

> A small workaround I've done on our end to help alleviate the issue is to give the nodes in the cluster a bigger disk. This means it will take longer for the nodes to use enough disk space to trigger the garbage collection that deletes the pause image.

I did the same; it just increases the time it takes for the issue to occur and adds cost. Agree that this is a top-priority issue, because it is impossible to downgrade to 1.28 without recreating the cluster.

cartermckinnon commented 8 months ago

The way we pull the image is part of the problem: this label is only applied (with containerd 1.7.3+) at pull time by the CRI server in containerd, so ctr pull won't do the trick.
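
In other words (a sketch, assuming containerd 1.7.3+ on the node and leaving credentials out for brevity), the pull has to go through containerd's CRI server for the pinned label to be applied:

    # goes through the CRI server, which pins the configured sandbox image (ECR still needs --creds, as shown earlier)
    crictl --runtime-endpoint=unix:///run/containerd/containerd.sock pull 602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5

    # bypasses the CRI server, so no pinned label is applied
    ctr -n k8s.io images pull 602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5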

dims commented 8 months ago

cc @henry118

cartermckinnon commented 8 months ago

While we work to get a fix out, swapping out the sandbox container image for one that doesn't require ECR credentials is another workaround:

  • registry.k8s.io/pause:3.9
  • public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest

mlagoma commented 8 months ago

> While we work to get a fix out, swapping out the sandbox container image for one that doesn't require ECR credentials is another workaround:
>
>   • registry.k8s.io/pause:3.9
>   • public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest

Greetings, does anybody have any guidance on how I can make this modification to my EKS cluster? Is it part of the Dockerfile build of the container image? The kube deployment manifest (which uses my container image)? Somewhere else? Better to just wait it out for the fix?

dims commented 8 months ago

@mlagoma /etc/containerd/config.toml is the configuration file for containerd; you will see a key/value entry for sandbox_image, which usually points to an image in ECR. @cartermckinnon was talking about switching that.

However, it is better to talk to AWS support and get help if you are not comfortable.
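
For the hands-on version, a sketch of the swap on a node (using the public pause image mentioned above in this thread; note the AMI's bootstrap generates config.toml, so a manual edit can be overwritten the next time the node bootstraps):

    # point containerd at a sandbox image that needs no ECR credentials, then restart containerd
    sudo sed -i 's|sandbox_image = ".*"|sandbox_image = "public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest"|' /etc/containerd/config.toml
    sudo systemctl restart containerd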

sebastianplawner commented 8 months ago

According to a comment on https://github.com/awslabs/amazon-eks-ami/pull/1601, it is possible to use the 1.28 AMI version so that pods can be created while this issue is resolved in 1.29.

In my case I use Karpenter and I added the following code to the EC2NodeClass, within the spec block:

amiSelectorTerms:
  - name: amazon-eks-node-1.28-*

I got the image name from here, and the documentation on how to modify the NodeClass is here.

This way, all nodes started rotating to the latest available 1.28 version.
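
For context, a minimal sketch of where that lands in a full EC2NodeClass (assuming the karpenter.k8s.aws/v1beta1 API; the name, role, and discovery tags are placeholders):

    apiVersion: karpenter.k8s.aws/v1beta1
    kind: EC2NodeClass
    metadata:
      name: default                        # placeholder
    spec:
      amiFamily: AL2
      amiSelectorTerms:
        - name: amazon-eks-node-1.28-*     # keep workers on the 1.28 AMI for now
      role: KarpenterNodeRole-my-cluster   # placeholder IAM role
      subnetSelectorTerms:
        - tags:
            karpenter.sh/discovery: my-cluster   # placeholder tag
      securityGroupSelectorTerms:
        - tags:
            karpenter.sh/discovery: my-cluster   # placeholder tag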

victor-chan-groundswell commented 8 months ago

I verified that rolling the node AMI (I use AL2) back to a 1.28 version is a good workaround as well. I have everything tied to an ASG/launch template, so it's just a matter of going back to a previous version of the launch template.
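
If it's useful, a sketch of how to look up the latest 1.28 AL2 AMI and cut a new launch template version for it (region, launch template name, and source version are placeholders):

    # latest EKS-optimized AL2 AMI for Kubernetes 1.28 in this region
    AMI_ID=$(aws ssm get-parameter \
      --name /aws/service/eks/optimized-ami/1.28/amazon-linux-2/recommended/image_id \
      --region eu-west-2 --query 'Parameter.Value' --output text)

    # base a new launch template version on an existing one (version 1 here), overriding only the AMI,
    # then point the ASG at the new version and recycle the instances
    aws ec2 create-launch-template-version \
      --launch-template-name my-eks-nodes \
      --source-version 1 \
      --launch-template-data "{\"ImageId\":\"$AMI_ID\"}"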

dims commented 8 months ago

@victor-chan-groundswell thanks for confirming!

mlagoma commented 8 months ago

Edit: @macnibblet Disregard my previous comment below. The update (1.29.0-20240129) just got it working again temporarily (due to the node restart); the issue came up again later.

Updated the managed EKS node group under Clusters -> Compute -> Node groups, and it was working fine per the merge of #1601.

macnibblet commented 8 months ago

@mlagoma Which EKS AMI version are you on? We updated this morning and we are still facing the same issue.

Edit: The latest version doesn't work, because the version released today fixes something else; see the changelog.

The best bet is to downgrade to the 1.28 AMI, as it's compatible with the 1.29 API server.

ghost commented 8 months ago

Thank you @macnibblet for the changelog link; I upgraded this morning and wondered why we were still seeing the sandbox error. (Still using EKS 1.29.)

cartermckinnon commented 8 months ago

This was closed by a PR hook; we'll re-close it once an AMI release is available with a fix. 👍

tzneal commented 8 months ago

As an immediate workaround to prevent this from occurring until the new AMI is released, you can:

  1. Determine the sandbox image that is configured for use in the cluster by looking at the /etc/containerd/config.toml on a node in the cluster:

    $ grep sandbox /etc/containerd/config.toml
    sandbox_image = "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5"
  2. Modify and deploy the following DaemonSet to reference the sandbox image that containerd uses. Replace <sandbox image source> with the image reference that was in containerd's config.toml file:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: prevent-sandbox-gc
      namespace: kube-system
      labels:
        app: prevent-sandbox-gc
    spec:
      selector:
        matchLabels:
          name: prevent-sandbox-gc
      template:
        metadata:
          labels:
            name: prevent-sandbox-gc
        spec:
          tolerations:
          # run everywhere regardless of taints
          - operator: Exists
          containers:
          - name: pause
            image: <sandbox image source>
            resources:
              requests:
                cpu: 1m
                memory: 1Mi

Note: This will not fix nodes that are already broken, but it will mark the sandbox image as in use from kubelet's perspective and prevent the image from being garbage collected in the future on nodes where it is running.

wiseelf commented 8 months ago

What about Bottlerocket?

imuneeeb commented 8 months ago

any luck with the issue?

dims commented 8 months ago

> any luck with the issue?

Hang in there please! :)

foluso-adewumi commented 8 months ago

@tzneal please, how do I remove your workaround when a new AMI is released that fixes the issue?

tzneal commented 8 months ago

> @tzneal please, how do I remove your workaround when a new AMI is released that fixes the issue?

You can just delete that DaemonSet with kubectl delete daemonset -n kube-system prevent-sandbox-gc.

cartermckinnon commented 8 months ago

This issue should be fixed in AMI release v20240202. We were able to include containerd-1.7.11, which properly reports the sandbox_image as pinned to kubelet, after the changes in #1605.
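
If you want to double-check on a node running the new AMI, the same inspection from earlier in the thread should now report the image as pinned (the image ID will differ per region/AMI):

    crictl --runtime-endpoint=unix:///run/containerd/containerd.sock images | grep pause
    crictl --runtime-endpoint=unix:///run/containerd/containerd.sock inspecti <image-id> | grep pinned
    # expected output on a fixed node:
    #     "pinned": true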

dims commented 8 months ago

Follow up to https://github.com/awslabs/amazon-eks-ami/pull/1601#issuecomment-1919628687

> @dims your comment is not helpful, the issue we're having is in production and is directly related to this PR. So how about you guys get cracking on this and get it merged instead of telling me to report things to my TAM!

@spatelwearpact Please see note above from @cartermckinnon

arvindpunk commented 8 months ago

Can confirm v20240202 fixed the issue. Unfortunately, since GC occurs at random times, we definitely incurred downtime before we noticed the issue. Perhaps sending out a notification to everyone who's on the affected AMI version + EKS 1.29 before their GC occurs is a good idea.

dims commented 8 months ago

@arvindpunk yep! +1 to the suggestion. in the works.. waiting to make sure it sticks.

doramar97 commented 8 months ago

any updates on Bottlerocket?

marcin99 commented 8 months ago

@doramar97 downgrade workers with image v1.28 and it's better to wait a few weeks with the update, because they don't test anything (i.e. they test it in production with customers)

doramar97 commented 8 months ago

@marcin99 I'm not sure that I can downgrade the EKS version without replacing the cluster with a new one. It is a production cluster, and I'm looking for a reliable workaround until they issue a fix.

dims commented 8 months ago

> @doramar97 downgrade workers with image v1.28 and it's better to wait a few weeks with the update, because they don't test anything (i.e. they test it in production with customers)

you are welcome to do what works for you. please bear with us as this was a tricky one.

dims commented 8 months ago

> any updates on Bottlerocket?

@marcin99 if you need a solid ETA for production, it's better to approach via support escalation channels. suffice to say, it's in progress.

tzneal commented 8 months ago

> @marcin99 I'm not sure that I can downgrade the EKS version without replacing the cluster with a new one. It is a production cluster, and I'm looking for a reliable workaround until they issue a fix.

The Bottlerocket team confirmed that the DaemonSet prevention solution I posted above works for Bottlerocket as well.

marcin99 commented 8 months ago

@doramar97 you don't need to downgrade the cluster version; you can use the image from the previous version for the workers.

RamazanBiyik77 commented 8 months ago

> This issue should be fixed in AMI release v20240202. We were able to include containerd-1.7.11, which properly reports the sandbox_image as pinned to kubelet, after the changes in #1605.

How can I apply these changes to my existing AMI?

I can confirm that my sandbox image is still 602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5.

RamazanBiyik77 commented 8 months ago

Okay, found it in the AWS EKS Compute section. There was a notification for the new AMI release.
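
For anyone who prefers the CLI, the same update can be triggered by asking the managed node group to move to the latest AMI release for its Kubernetes version (cluster and node group names are placeholders):

    aws eks update-nodegroup-version \
      --cluster-name my-cluster \
      --nodegroup-name my-nodegroup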

odellcraig commented 7 months ago

After reading through the thread, I see that this is fixed with v20240202. To apply this change, do you have to update the launch template to point at the new AMI? I see that a new EKS cluster I created yesterday via Terraform is using the latest AMI (ami-0a5010afd9acfaa26 - amazon-eks-node-1.29-v20240227), but a cluster I created about a month ago before this change is still on ami-0c482d7ce1aa0dd44 (amazon-eks-node-1.29-v20240117). Is there a way to tell my existing clusters to use the latest AMI?