awslabs / amazon-eks-ami

Packer configuration for building a custom EKS AMI
https://awslabs.github.io/amazon-eks-ami/
MIT No Attribution

Broken 1.30.6 AWS image - Not being able to pull sleep image #2066

Closed thomaspeitz closed 2 days ago

thomaspeitz commented 2 days ago

What happened: Spawned a new Kubernetes node via Karpenter. The node did not come up.

Saw in the error logs:

Events:
  Type     Reason                  Age               From               Message
  ----     ------                  ----              ----               -------
  Normal   Scheduled               27s               default-scheduler  Successfully assigned kube-system/aws-node-scbkb to ip-192-168-167-69.eu-central-1.compute.internal
  Warning  FailedCreatePodSandBox  1s (x3 over 26s)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "localhost/kubernetes/pause": failed to pull image "localhost/kubernetes/pause": failed to pull and unpack image "localhost/kubernetes/pause:latest": failed to resolve reference "localhost/kubernetes/pause:latest": failed to do request: Head "https://localhost/v2/kubernetes/pause/manifests/latest": dial tcp 127.0.0.1:443: connect: connection refused
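
Reading the event message: the kubelet's runtime could not find the sandbox image locally, so it tried to pull `localhost/kubernetes/pause` as if `localhost` were a registry, defaulting to the `latest` tag and HTTPS on port 443. The sketch below reproduces that expansion in a simplified form. It is not containerd's actual reference parser; `manifest_url` is a hypothetical helper, and it assumes the reference carries an explicit host and no digest.

```python
def manifest_url(ref: str) -> str:
    """Expand an image reference into a registry manifest URL (simplified)."""
    # The first path component is treated as the registry host.
    host, _, remainder = ref.partition("/")
    # An explicit tag follows the last ":"; without one, default to "latest".
    repo, sep, tag = remainder.rpartition(":")
    if not sep:
        repo, tag = remainder, "latest"
    return f"https://{host}/v2/{repo}/manifests/{tag}"


print(manifest_url("localhost/kubernetes/pause"))
```

The printed URL matches the one in the `Head` request above, which is why the failure surfaces as a refused connection on `127.0.0.1:443` rather than as a "missing image" error.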

What you expected to happen: Node comes up.

How to reproduce it (as minimally and precisely as possible):

How I could fix it:

Node comes up again. Seems odd. Not 100% sure yet whether it is Karpenter's fault or the AWS AMI's fault.

Environment:

thomaspeitz commented 2 days ago

Solved it ourselves. We had a custom NVMe mount script that had been working until now.

But it remounted /var/lib/containerd, which caused these problems.

cartermckinnon commented 1 day ago

@thomaspeitz makes sense. Please let us know if you run into any more issues with the pause image. We're trying to remove the hard runtime dependency on ECR; it's a frequent source of flakiness/footguns (more info in #2000).

thomaspeitz commented 1 day ago

@cartermckinnon - No, works perfectly on all clusters. Very happy about it. Thanks for this improvement. The seconds saved in bringing nodes up faster are well worth refactoring some nodes, on top of the win when ECR is down.

It only affected our NVMe nodes. We have a blog post about NVMe plus a gist. Updated the gist as well to ensure it rsyncs the content before remounting: https://gist.github.com/thomaspeitz/eaef87c714418eb1fe1a732655d4637b
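
The copy-before-remount idea described above can be sketched as an ordered command plan. This is not the content of the linked gist: `remount_plan`, the staging path, and the device name are hypothetical, and the sketch assumes the device is already formatted and that the steps run as root before the kubelet starts.

```python
def remount_plan(device: str,
                 target: str = "/var/lib/containerd",
                 staging: str = "/mnt/containerd-staging") -> list[list[str]]:
    """Return the commands, in order, for moving `target` onto `device`."""
    return [
        ["systemctl", "stop", "containerd"],
        ["mkdir", "-p", staging],
        # Mount the new volume somewhere temporary first.
        ["mount", device, staging],
        # Trailing slash on the source copies the directory's contents,
        # so the images baked into the AMI end up on the new volume.
        ["rsync", "-a", f"{target}/", f"{staging}/"],
        ["umount", staging],
        # Only now mount over the original path; the content is already there.
        ["mount", device, target],
        ["systemctl", "start", "containerd"],
    ]


for cmd in remount_plan("/dev/nvme1n1"):  # hypothetical device name
    print(" ".join(cmd))
```

The ordering is the point: mounting an empty volume directly over `/var/lib/containerd` hides whatever the AMI shipped there, so the copy has to happen before the final mount.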

bryantbiggs commented 1 day ago

Added a note on your gist: you don't need to do all of that in user data. The EKS AL2 and AL2023 AMIs both have provisions to mount the instance store volumes. See the AL2023 version in the comment: https://gist.github.com/thomaspeitz/eaef87c714418eb1fe1a732655d4637b?permalink_comment_id=5293458#gistcomment-5293458