It sounds like something deleted your pause container image. I would check:
1. That the --pod-infra-container-image flag passed to kubelet matches the sandbox_image in /etc/containerd/config.toml. This will prevent kubelet from deleting it during its image garbage collection process.
2. For RemoveImage CRI calls in your containerd logs. It's likely that some other CRI client (not kubelet) is deleting the image.

[~]# systemctl status kubelet
└─3729 /usr/bin/kubelet --config /etc/kubernetes/kubelet/kubelet-config.json --kubeconfig /var/lib/kubelet/kubeconfig --container-runtime-endpoint unix:///run/containerd/containerd.sock --image-credential-provider-config /etc/eks/image-credential-provider/config.json --image-credential-provider-bin-dir /etc/eks/image-credential-provider --pod-infra-container-image=602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5 --v=2
[~]# cat /etc/containerd/config.toml |grep 602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5
sandbox_image = "602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5"
We have also noticed this issue after updating to 1.29. If we rotate out the nodes, it recovers for some time, then comes back a day later.
I'm using a temporary workaround proposed by someone in the issue created in the aws-node repo (I modified it a little, but it works):
# Install crictl
curl -fsL -o crictl.tar.gz https://github.com/kubernetes-sigs/cri-tools/releases/download/v1.29.0/crictl-v1.29.0-linux-amd64.tar.gz
tar zxf crictl.tar.gz
chmod u+x crictl
mv crictl /usr/bin/crictl

# Write a script that re-pulls the pause image with fresh ECR credentials.
# @@@ is a placeholder so the command substitution isn't expanded while writing the heredoc;
# the sed below turns it back into $.
cat <<EOF > /etc/eks/eks_creds_puller.sh
IMAGE_TOKEN=@@@(aws ecr get-login-password --region eu-west-2)
crictl --runtime-endpoint=unix:///run/containerd/containerd.sock pull --creds "AWS:\$IMAGE_TOKEN" 602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5
EOF
sed -i 's/@@@/\$/g' /etc/eks/eks_creds_puller.sh
chmod u+x /etc/eks/eks_creds_puller.sh

# Re-pull the image every 5 minutes via cron (note: this replaces root's existing crontab)
echo "*/5 * * * * /etc/eks/eks_creds_puller.sh >> /var/log/eks_creds_puller 2>&1" | crontab -
I am experiencing the same issue. The --pod-infra-container-image flag is set on kubelet. I found that the disk on the node really does become full after some time, and the kubelet garbage collector deletes the pause image. So, instead of deleting other images, it deletes the pause image, and after the pause image is deleted, the node doesn't work.
I found the reason for the full disk. In my case, I had ttlSecondsAfterFinished: 7200 for Dagster jobs, and they consumed all the disk space. I've changed it to ttlSecondsAfterFinished: 120, so jobs are cleaned up more frequently, and we don't have this issue any more.
It's strange, because I didn't have this issue on 1.28 and I didn't change any Dagster configuration between the version upgrades. My guess is that the kubelet image garbage collector works differently in 1.28 and 1.29.
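For reference, a minimal sketch of where that field lives on a Job (names and image are illustrative; Dagster generates these Jobs, so in practice the value comes from its run launcher configuration):

apiVersion: batch/v1
kind: Job
metadata:
  name: example-run                # illustrative
spec:
  ttlSecondsAfterFinished: 120     # finished Jobs and their Pods are deleted after 2 minutes, freeing node disk
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: busybox           # illustrative
          command: ["sh", "-c", "echo done"]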
We're experiencing the same issue as well.
I'm having the same issue after upgrading to 1.29, on both AL2 and Bottlerocket nodes.
The kubelet flag --pod-infra-container-image is deprecated in 1.27+ [https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/]. The current implementation is that image GC reads the properties of the image set by the container runtime.
In the case of containerd, GC should skip images tagged with the property "pinned": true, and containerd should flag the sandbox_image as pinned [https://github.com/containerd/containerd/pull/7944].
I believe this issue is related to containerd and the sandbox_image: although it is set in config.toml, it is not flagged as "pinned": true.
I don't know if this is a general issue in containerd, but at least in my EKS cluster on 1.29 the sandbox image appears as "pinned": false:
./crictl images | grep pause | grep us-east-1 | grep pause
602401143452.dkr.ecr-fips.us-east-1.amazonaws.com/eks/pause 3.5 6996f8da07bd4 299kB
602401143452.dkr.ecr.us-east-1.amazonaws.com/eks/pause 3.5 6996f8da07bd4 299kB
./crictl inspecti 6996f8da07bd4 | grep pinned
"pinned": false
It definitely seems like image pinning is the problem here. I'm trying to put a fix together 👍
I think the issue here is that the version of containerd being used by Amazon Linux does not have pinned-image support, which was added in 1.7.3: https://github.com/containerd/containerd/compare/v1.7.2...v1.7.3
I'm verifying that this hasn't been cherry-picked by the AL team. We'll probably have to do a hotfix in the immediate term.
AL intends to push containerd-1.7.11
to the package repositories soon, but I'll go ahead and put together a hotfix on our end.
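To check whether a given node already has a containerd with pinned-image support (anything >= 1.7.3), something like this works on an AL2 node:

containerd --version
rpm -q containerd    # shows the version shipped by the Amazon Linux package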
I think the best bandaid for now is to periodically pull the sandbox image (if necessary), that's what #1601 does. @mmerkes @suket22 PTAL.
any updates?
We're experiencing the same issue as well; it's pretty random though. Any updates?
+1
None of our applications or jobs are running in the cluster now! This is literally the highest priority issue with 1.29!
A small workaround I've done on our end to help alleviate the issue is to give the nodes in the cluster a bigger disk. This means it will take longer for the nodes to use enough disk space to trigger the garbage collection which deletes the pause image.
I did the same; it just increases the time before the issue occurs and brings additional expense. I agree that it is a top-priority issue, because it is impossible to downgrade to 1.28 without recreating the cluster.
The way we pull the image is part of the problem: this label is only applied (with containerd 1.7.3+) at pull time in the cri-containerd server, so ctr pull won't do the trick.
cc @henry118
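To illustrate the difference (a sketch, using the sandbox image from earlier in the thread; ECR credentials are omitted for brevity):

# Pulled via the CRI endpoint: with containerd 1.7.3+, the cri plugin labels the configured sandbox image as pinned.
crictl --runtime-endpoint unix:///run/containerd/containerd.sock pull 602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5

# Pulled via ctr: bypasses the CRI plugin, so the image is not marked as pinned.
ctr --namespace k8s.io images pull 602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5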
While we work to get a fix out, swapping out the sandbox container image to one that doesn't require ECR credentials is another workaround:
registry.k8s.io/pause:3.9
public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest
Greetings, does anybody have any guidance on how I can make this modification to my EKS cluster? Is it part of the Dockerfile build of the container image? The kube deployment manifest (which uses my container image)? Somewhere else? Better to just wait it out for the fix?
@mlagoma /etc/containerd/config.toml is the configuration file for containerd; you will see an entry (key/value) for sandbox_image, which usually points to an image in ECR. @cartermckinnon was talking about switching that.
However, it is better to talk to AWS support and get help if you are not comfortable.
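For illustration, a minimal sketch of that swap on a node (assuming the stock EKS containerd config; the bootstrap process normally manages this file, so the change may be lost when nodes are replaced):

# Point containerd at a sandbox image that does not require ECR credentials, then restart containerd.
sed -i 's|sandbox_image = .*|sandbox_image = "registry.k8s.io/pause:3.9"|' /etc/containerd/config.toml
systemctl restart containerd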
Reading a comment at https://github.com/awslabs/amazon-eks-ami/pull/1601, it is possible to use the 1.28 AMI version, so pods can be created while this issue is resolved in 1.29.
In my case I use Karpenter, and I added the following to the EC2NodeClass, within the spec block:
amiSelectorTerms:
- name: amazon-eks-node-1.28-*
I got the image name from here, and the documentation on how to modify the NodeClass is here.
This way, all nodes started rotating to the latest available version of 1.28.
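For context, a fuller EC2NodeClass sketch with that selector (apiVersion as of Karpenter v1beta1; the role and discovery tags are illustrative and must match your cluster):

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiFamily: AL2
  role: "KarpenterNodeRole-my-cluster"            # illustrative
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"      # illustrative
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "my-cluster"      # illustrative
  amiSelectorTerms:
    - name: amazon-eks-node-1.28-*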
I verified that rolling the node AMI (I use AL2) back to a 1.28 version is a good workaround as well. I have everything tied to an ASG/launch template, so it's just a matter of going back to a previous version of the launch template.
@victor-chan-groundswell thanks for confirming!
Edit: @macnibblet Disregard my previous comment below. The update (1.29.0-20240129) just got it working again temporarily (due to the node restart); then the issue came up again later.
Updated the managed EKS node group in Clusters -> Compute -> Node groups and it's working fine, per the merge of #1601.
@mlagoma Which EKS AMI version are you on? We updated this morning and we are still facing the same issue.
Edit: The latest version doesn't work, because the version released today fixes something else; see the changelog.
The best bet is to downgrade to the 1.28 AMI, as it's compatible with the 1.29 API server.
Thank you @macnibblet for the changelog link; I upgraded this morning and wondered why we were still seeing the sandbox error. (Still using EKS 1.29.)
This was closed by a PR hook; we'll re-close it once an AMI release is available with a fix. 👍
As an immediate workaround to prevent this from occurring until the new AMI is released, you can:
Determine the sandbox image that is configured for use in the cluster by looking at the /etc/containerd/config.toml on a node in the cluster:
$ grep sandbox /etc/containerd/config.toml
sandbox_image = "602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/pause:3.5"
Modify and deploy the following DaemonSet to reference the sandbox image that containerd uses. Replace <sandbox image source>
with the image reference that was in containerd's config.toml file:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prevent-sandbox-gc
  namespace: kube-system
  labels:
    app: prevent-sandbox-gc
spec:
  selector:
    matchLabels:
      name: prevent-sandbox-gc
  template:
    metadata:
      labels:
        name: prevent-sandbox-gc
    spec:
      tolerations:
        # run everywhere regardless of taints
        - operator: Exists
      containers:
        - name: pause
          image: <sandbox image source>
          resources:
            requests:
              cpu: 1m
              memory: 1Mi
Note: This will not fix nodes that are already broken, but it will mark the sandbox image as in use from kubelet's perspective and prevent the image from being garbage collected in the future on nodes where it is running.
what about bottlerocket?
any luck with the issue?
Hang in there please! :)
@tzneal Please, how do I remove your workaround when a new AMI that fixes the issue is released?
You can just delete that DaemonSet with kubectl delete daemonset -n kube-system prevent-sandbox-gc.
This issue should be fixed in AMI release v20240202. We were able to include containerd-1.7.11, which properly reports the sandbox_image as pinned to kubelet, after the changes in #1605.
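On a node that has picked up the new AMI, the same check from earlier in the thread should now report the sandbox image as pinned:

./crictl inspecti 6996f8da07bd4 | grep pinned
# expected: "pinned": true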
Follow up to https://github.com/awslabs/amazon-eks-ami/pull/1601#issuecomment-1919628687
@dims your comment is not helpful, the issue we're having is in production and is directly related to this PR. So how about you guys get cracking on this and get it merged instead of telling me to report things to my TAM!
@spatelwearpact Please see note above from @cartermckinnon
Can confirm v20240202 fixed the issue. Unfortunately, since GC occurs at random times, we definitely incurred downtime before we noticed the issue. Perhaps sending out a notification to everyone who's on this AMI version + EKS 1.29 before their GC occurs is a good idea.
@arvindpunk yep! +1 to the suggestion. in the works.. waiting to make sure it sticks.
Any updates on Bottlerocket?
@doramar97 Downgrade the workers to the 1.28 image, and it's better to wait a few weeks before updating, because they don't test anything (i.e. they test it in production with customers).
@marcin99 I'm not sure that I can downgrade the EKS version without replacing the cluster with a new one. It is a production cluster, and I'm looking for a reliable workaround until they issue a fix.
you are welcome to do what works for you. please bear with us as this was a tricky one.
@marcin99 if you need a solid ETA for production, it's better to approach via support escalation channels. suffice to say, it's in progress.
The Bottlerocket team confirmed that the DaemonSet prevention solution I posted above works for Bottlerocket as well.
@doramar97 You don't need to downgrade the cluster version; you can use the image from the previous version for the workers.
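For example, the latest EKS-optimized 1.28 AL2 AMI ID for a region can be looked up from the public SSM parameter (region is illustrative):

aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/1.28/amazon-linux-2/recommended/image_id \
  --region eu-west-2 --query 'Parameter.Value' --output text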
How can I apply the fix from AMI release v20240202 to my existing nodes? I can confirm that my sandbox image is still 602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5.
Okay, found it in the AWS EKS Compute section. There was a notification for the new AMI release.
After reading through the thread, I see that this is fixed with v20240202. To apply this change, do you have to update the launch template to point at the new AMI? I see that a new EKS cluster I created yesterday via Terraform is using the latest AMI (ami-0a5010afd9acfaa26 - amazon-eks-node-1.29-v20240227), but a cluster I created about a month ago, before this change, is still on ami-0c482d7ce1aa0dd44 (amazon-eks-node-1.29-v20240117). Is there a way to tell my existing clusters to use the latest AMI?
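For managed node groups, one option (a sketch; the cluster name, node group name, and release version are placeholders) is to trigger a node group update to a newer AMI release; self-managed nodes behind a launch template instead need the template pointed at the new AMI ID and the ASG refreshed:

# Roll a managed node group to a specific EKS-optimized AMI release.
# Omit --release-version to move to the latest release available for the cluster's Kubernetes version.
aws eks update-nodegroup-version \
  --cluster-name my-cluster \
  --nodegroup-name my-nodegroup \
  --release-version 1.29.0-20240202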
AMI: amazon-eks-node-1.29-v20240117
1 day after upgrading EKS to 1.29