Broken 1.30.6 AWS image - Not being able to pull sleep image

thomaspeitz commented 2 days ago

What happened: Spawned a new Kubernetes node via Karpenter. The node did not come up.

Saw in the error logs:

Events:
  Type     Reason                  Age               From               Message
  ----     ------                  ----              ----               -------
  Normal   Scheduled               27s               default-scheduler  Successfully assigned kube-system/aws-node-scbkb to ip-192-168-167-69.eu-central-1.compute.internal
  Warning  FailedCreatePodSandBox  1s (x3 over 26s)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox image "localhost/kubernetes/pause": failed to pull image "localhost/kubernetes/pause": failed to pull and unpack image "localhost/kubernetes/pause:latest": failed to resolve reference "localhost/kubernetes/pause:latest": failed to do request: Head "https://localhost/v2/kubernetes/pause/manifests/latest": dial tcp 127.0.0.1:443: connect: connection refused

What you expected to happen: Node comes up.

How to reproduce it (as minimally and precisely as possible):

How I could fix it:

SSH to the node.

Fix the pause (sandbox) image reference in the containerd config, which points to the wrong source:

# from
[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "localhost/kubernetes/pause"
# to
[plugins."io.containerd.grpc.v1.cri"]
sandbox_image = "k8s.gcr.io/pause:3.9"

service containerd restart
ctr images pull k8s.gcr.io/pause:3.9
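
The manual steps above could be scripted roughly as follows. This is a minimal sketch, not something shipped with the AMI: the helper name `set_sandbox_image` is hypothetical, the config path `/etc/containerd/config.toml` is an assumption (the report does not name the file), and it assumes GNU sed.

```shell
# Hypothetical helper: point containerd's sandbox_image at a different pause image.
#   $1 = path to the containerd config (assumed: /etc/containerd/config.toml)
#   $2 = image reference to use
set_sandbox_image() {
  local config="$1" image="$2"
  # Rewrite the sandbox_image line in place, keeping any leading indentation.
  sed -i -E "s|^([[:space:]]*)sandbox_image[[:space:]]*=.*|\1sandbox_image = \"${image}\"|" "$config"
}

# Usage on the affected node (as root), mirroring the manual steps:
#   set_sandbox_image /etc/containerd/config.toml "k8s.gcr.io/pause:3.9"
#   service containerd restart
#   ctr images pull k8s.gcr.io/pause:3.9
```

One caveat: `ctr` operates on the `default` namespace unless `-n` is given, whereas the CRI plugin keeps its images in `k8s.io`, so a pre-pull intended for the kubelet would likely need `ctr -n k8s.io images pull ...`.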

Node comes up again. Seems odd. Not 100% sure yet whether this is a Karpenter fault or an AWS AMI fault.

Environment:

AWS Region: eu-central-1
Instance Type(s): c6id.24xlarge
Cluster Kubernetes version: 1.30
Node Kubernetes version: 1.30.6
AMI Version: ami-0d6c630f239d638a6 / amazon-eks-node-al2023-x86_64-standard-1.30-v20241115
Karpenter version: 1.0.7

thomaspeitz commented 2 days ago

Solved it ourselves. We had a custom NVMe mount script that had worked until now.

But it remounted /var/lib/containerd, which caused these problems.
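
For anyone hitting the same symptom, a quick check is whether something has been mounted over containerd's data directory. The sketch below looks the path up in the mount table. `is_mount_target` is a hypothetical helper. The likely explanation, not confirmed in this thread, is that the new mount hid a pause image the AMI had already loaded under `/var/lib/containerd`, so containerd tried to pull `localhost/kubernetes/pause` over the network instead.

```shell
# Hypothetical helper: succeed if path $1 is listed as a mount point in table $2.
# $2 defaults to /proc/mounts, whose second whitespace-separated field is the mount point.
is_mount_target() {
  local path="$1" table="${2:-/proc/mounts}"
  awk -v p="$path" '$2 == p { found = 1 } END { exit !found }' "$table"
}

# Usage on the node:
#   if is_mount_target /var/lib/containerd; then
#     echo "/var/lib/containerd is a separate mount"
#   fi
```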

cartermckinnon commented 1 day ago

@thomaspeitz makes sense, please let us know if you run into any more issues with the pause image. We're trying to remove the runtime hard dependency on ECR; it's a frequent source of flakiness/footguns (more info in #2000).

thomaspeitz commented 1 day ago

@cartermckinnon - No, it works perfectly on all clusters. Very happy about it. Thanks for this improvement. The seconds saved in getting nodes up faster are well worth refactoring some nodes, on top of the win when ECR is down.

It only affected our NVMe nodes. I have a blog post about NVMe plus a gist, which I have updated so that it rsyncs the content before remounting: https://gist.github.com/thomaspeitz/eaef87c714418eb1fe1a732655d4637b
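
The "rsync before remounting" idea can be sketched as below. This is not the gist's code: `seed_volume` and the example device and staging paths are hypothetical, and it assumes containerd is stopped while the copy runs. It falls back to `cp -a` when `rsync` is not installed.

```shell
# Hypothetical helper: copy the current contents of directory $1 into $2
# (for example, a freshly formatted volume mounted at a temporary location),
# so nothing is lost when that volume is later mounted over $1.
seed_volume() {
  local src="$1" dst="$2"
  mkdir -p "$dst"
  if command -v rsync >/dev/null 2>&1; then
    # -a preserves ownership, permissions, timestamps and symlinks;
    # the trailing slashes copy the contents of src rather than src itself.
    rsync -a "${src}/" "${dst}/"
  else
    # Fallback: "src/." copies the directory contents, including dotfiles.
    cp -a "${src}/." "${dst}/"
  fi
}

# Usage (as root; /dev/md0 and /mnt/staging are placeholders):
#   systemctl stop containerd
#   mount /dev/md0 /mnt/staging
#   seed_volume /var/lib/containerd /mnt/staging
#   umount /mnt/staging
#   mount /dev/md0 /var/lib/containerd
#   systemctl start containerd
```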

bryantbiggs commented 1 day ago

Added a note on your gist - you don't need to do all of that in user data. The EKS AL2 and AL2023 AMIs both have provisions to mount the instance store volumes - see the AL2023 version in the comment: https://gist.github.com/thomaspeitz/eaef87c714418eb1fe1a732655d4637b?permalink_comment_id=5293458#gistcomment-5293458
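
For AL2023, that provision is configured through nodeadm's `NodeConfig`. The fragment below is a sketch based on the nodeadm documentation, not a copy of the linked comment. `print_local_storage_config` is a hypothetical wrapper, and the fragment is assumed to be merged with the rest of the node's configuration (cluster name, endpoint, and so on) supplied elsewhere in user data.

```shell
# Hypothetical wrapper: print a NodeConfig fragment that asks nodeadm to set up
# the instance store volumes itself (RAID0 strategy) instead of a custom script.
print_local_storage_config() {
  cat <<'EOF'
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  instance:
    localStorage:
      strategy: RAID0
EOF
}

# Usage: include the output as a NodeConfig part of the node's user data, e.g.
#   print_local_storage_config > nodeconfig-local-storage.yaml
```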

awslabs / amazon-eks-ami

Broken 1.30.6 AWS image - Not being able to pull sleep image #2066