awslabs / amazon-eks-ami

Packer configuration for building a custom EKS AMI
https://awslabs.github.io/amazon-eks-ami/
MIT No Attribution

Sandbox container image being GC'd in 1.29 #1597

Closed: nightmareze1 closed this issue 8 months ago

nightmareze1 commented 8 months ago

AMI: amazon-eks-node-1.29-v20240117

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5": failed to pull image "602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5": failed to pull and unpack image "602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5": failed to resolve reference "602401143452.dkr.ecr.eu-west-2.amazonaws.com/eks/pause:3.5": unexpected status from HEAD request to https://602401143452.dkr.ecr.eu-west-2.amazonaws.com/v2/eks/pause/manifests/3.5: 401 Unauthorized

This started one day after upgrading EKS to 1.29.
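
A quick way to confirm this on an affected node (a rough sketch; it assumes shell access to the node and that crictl is available):

    # is the pause/sandbox image still cached on the node?
    sudo crictl images | grep -i pause

    # recent kubelet activity around image garbage collection and the sandbox image
    sudo journalctl -u kubelet --since "1 hour ago" | grep -iE 'garbage|pause'

If the first command returns nothing while kubelet keeps logging the 401 above, the cached sandbox image was garbage-collected and the node is failing to re-pull it.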

bryantbiggs commented 7 months ago

@odellcraig you do that via the release_version

odellcraig commented 7 months ago

@bryantbiggs Thank you.

For anyone using Terraform and eks_managed_node_groups, you can specify the release version like this:

eks_managed_node_groups = {
  initial = {
    ami_release_version = "1.29.0-20240227" # this is the latest version as of this comment
    name           = "..."
    instance_types = [...]
    min_size       = ...
    max_size       = ...
    desired_size   = ...
    # ...
  }
}
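
If it helps, one way to look up the current recommended release version instead of pinning one by hand is the public SSM parameter (a sketch for the AL2 EKS 1.29 AMI; adjust the Kubernetes version and AMI family to match your node group):

    # prints something like "1.29.0-20240227"
    aws ssm get-parameter \
      --name /aws/service/eks/optimized-ami/1.29/amazon-linux-2/recommended/release_version \
      --query 'Parameter.Value' \
      --output text
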
korncola commented 4 months ago

Can you please fix the damn issue? After half a year it still happens with an EKS managed node group and AMI amazon/amazon-eks-node-1.29-v20240522.

Error in kubelet on the node: unexpected status from HEAD request to https://602401143452.dkr.ecr.eu-central-1.amazonaws.com/v2/eks/pause/manifests/3.5: 403 Forbidden

Our migration to EKS is halted here.

shamallah commented 4 months ago

Same error with amazon-eks-node-1.29-v20240315:

failed to pull and unpack image "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5": failed to copy: httpReadSeeker: failed open: unexpected status code https://602401143452.dkr.ecr.eu-central-1.amazonaws.com/v2/eks/pause/blobs/sha256:6996f8da07bd405c6f82a549ef041deda57d1d658ec20a78584f9f436c9a3bb7: 403 Forbidden

tzneal commented 4 months ago

Are the permissions on your node role correct per https://docs.aws.amazon.com/eks/latest/userguide/create-node-role.html? Specifically, does it have the AmazonEC2ContainerRegistryReadOnly policy?
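
If you want to double-check from the CLI, something like this works (a sketch; substitute the actual name of your node group's instance role):

    # list the managed policies attached to the node role; AmazonEC2ContainerRegistryReadOnly
    # (or an equivalent policy granting ecr:GetAuthorizationToken plus pull permissions) should appear
    aws iam list-attached-role-policies --role-name <your-node-instance-role>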

shamallah commented 4 months ago

AmazonEC2ContainerRegistryReadOnly policy?

The policy is attached.

korncola commented 4 months ago

Policy AmazonEC2ContainerRegistryReadOnly is attached here as well. Can't you just use a REAL public repo instead of this half-baked, half-private/public repo in the configs and init scripts? Hacking the bootstrap script to use public.ecr.aws/eks-distro/kubernetes/pause:v1.29.0-eks-1-29-latest works, but only until reboot, because the init scripts will always put this damn non-working URL back in /etc/containerd/config.toml.
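
For anyone else debugging this on a live node, two checks that separate a config problem from an auth problem (a rough sketch; it assumes the AWS CLI and crictl are on the node, and the account ID/region are just the ones from the error above):

    # which sandbox image is containerd configured to use?
    grep sandbox_image /etc/containerd/config.toml

    # try pulling that exact image by hand with the node's own credentials
    PAUSE_IMAGE="602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5"
    sudo crictl pull --creds "AWS:$(aws ecr get-login-password --region eu-central-1)" "$PAUSE_IMAGE"

If the manual pull succeeds, the image and the node role's permissions are fine, and the failure is likely happening when the sandbox image is re-pulled on its own after garbage collection, outside kubelet's ECR credential flow. If it fails with 401/403, the problem is on the IAM/credentials side.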

cartermckinnon commented 4 months ago

@korncola can you open a ticket with AWS support so we can look into the specifics of your environment?

korncola commented 4 months ago

Thanks @cartermckinnon, will do that. But still: why no true public repo?!

I created a cluster via Terraform and via the console, triple-checked the policies, and also disabled all SCPs. Still the same error. Node groups with the AL2023 image or AL2 had no success either.

cartermckinnon commented 4 months ago

ECR Public is only hosted in a few regions, so we still use regional ECR repositories for lower latency and better availability. ECR Public also has a monthly bandwidth limit for anonymous pulls that cannot be increased, so if you're using it in production, make sure you're not sending anonymous requests.
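
For completeness, if you do pull from ECR Public somewhere, authenticating avoids the anonymous-pull bandwidth cap mentioned above. A minimal sketch (the ecr-public API is only served from us-east-1, regardless of where the pull happens):

    aws ecr-public get-login-password --region us-east-1 \
      | docker login --username AWS --password-stdin public.ecr.aws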

korncola commented 4 months ago

[...] and better availability. [...]

Yeah, I see the availability in this ticket and the others...

ECR Public also has a monthly bandwidth limit for anonymous pulls that cannot be increased;

As I said above, use a real public service... and AWS owns that service, so make it worthwhile... These are bad excuses for this design decision. Sorry for my rant, but I don't get these decisions: when I look at the scripts, with all the hardcoded account IDs used to compose an ECR repo URL, and scripts calling scripts calling scripts, I mean, come on, AWS can do better.

But as always, in the end it will turn out to be some typo or other mistake on my side causing my ECR pull error, and you will all laugh at me :-)

bryantbiggs commented 4 months ago

@korncola let's keep it professional. The best course of action is to work with the team through the support ticket. There are many factors that go into these decisions that users are usually not aware of. The team is very responsive in terms of investigating and getting a fix rolled out (as needed).

korncola commented 4 months ago

Yep, you are right 👍 The team here is very helpful and responsive, thank you for the support! I will report back when the issue is resolved, so others can use that info.

mlagoma commented 4 months ago

If I understand correctly, the same bug (or a similar one, in that it will definitely occur over time) was perhaps introduced or reintroduced? Should users be advised not to upgrade nodes, or is this a separate issue (e.g. anonymous pulls)?

There is no sign of the issue on an older version (1.29.0-20240202).

cartermckinnon commented 4 months ago

No, at this point we don’t have evidence of a new bug or a regression.

I’m going to lock this thread to avoid confusion; please open a new issue for follow-ups.