Closed by thomaspeitz 2 days ago
Solved it ourselves. We had a custom NVMe mount script that had worked until now, but it remounted /var/lib/containerd, which caused these problems.
@thomaspeitz makes sense. Please let us know if you run into any more issues with the pause image. We're trying to remove the hard runtime dependency on ECR, since it's a frequent source of flakiness and footguns (more info in #2000).
@cartermckinnon - No, it works perfectly on all clusters. Very happy about it, thanks for this improvement. The seconds saved in getting nodes up faster are well worth refactoring some nodes, on top of the win when ECR is down.
It only affected our NVMe nodes. I have a blog post about NVMe plus a gist, which I have also updated so that it rsyncs the content before remounting: https://gist.github.com/thomaspeitz/eaef87c714418eb1fe1a732655d4637b
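For anyone hitting the same thing, here is a minimal sketch of the "copy first, then remount" idea in Python. It is not the script from the gist (that one uses rsync). The function names and the staging directory are hypothetical, and `shutil.copytree` does not preserve file ownership the way `rsync -a` run as root would. The mount command is only built, not executed, because running it needs root and a real NVMe mount.

```python
import shutil
import tempfile
from pathlib import Path


def preserve_then_stage(src, staging):
    """Copy everything under `src` into `staging`; return the copied file paths.

    `src` stands in for /var/lib/containerd and `staging` for a directory on the
    freshly mounted NVMe volume (both are assumptions for illustration).
    """
    src, staging = Path(src), Path(staging)
    staging.mkdir(parents=True, exist_ok=True)
    # dirs_exist_ok lets the copy land in a directory that already exists.
    shutil.copytree(src, staging, symlinks=True, dirs_exist_ok=True)
    return sorted(
        p.relative_to(staging).as_posix() for p in staging.rglob("*") if p.is_file()
    )


def bind_mount_command(staging, target):
    """Build (but do not run) the bind mount that would follow the copy."""
    return ["mount", "--bind", str(staging), str(target)]


def demo():
    """Run the copy step against throwaway directories instead of real paths."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "containerd"
        (src / "cached").mkdir(parents=True)
        # Stand-in for content that shipped with the AMI.
        (src / "cached" / "pause-image").write_text("placeholder")
        staging = Path(tmp) / "nvme" / "containerd"
        return preserve_then_stage(src, staging)
```

The point of the ordering is that whatever already lives in the original directory is on the new volume before the mount hides the original.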
Added a note on your gist: you don't need to do all of that in user data. The EKS AL2 and AL2023 AMIs both have provisions to mount the instance store volumes. See the AL2023 version in the comment: https://gist.github.com/thomaspeitz/eaef87c714418eb1fe1a732655d4637b?permalink_comment_id=5293458#gistcomment-5293458
What happened: Spawned a new Kubernetes node via Karpenter. The node did not come up.
Saw in the error logs:
What you expected to happen: Node comes up.
How to reproduce it (as minimally and precisely as possible):
How I could fix it:
Node comes up again. Seems odd. Not 100% sure yet whether it is Karpenter's fault or the AWS AMI's fault.
Environment: