Unfortunate Dev Disaster because of Eraser

116davinder commented 2 weeks ago

What kind of request is this?

Improvement of existing experience

What is your request or suggestion?

Please include some default exclusions for image removal. In my case Eraser removed a very important image, and now I am wondering how to fix it without destroying my worker nodes.

Environment: AWS EKS (1.31). Worker: Bottlerocket, latest at the time of writing.

Default Eraser Rule Log {"level":"info","ts":1731308363.6721404,"logger":"collector","msg":"no images to exclude"}

Eraser in Action Log {"level":"info","ts":1731308367.6420007,"logger":"remover","msg":"removed image","given":"sha256:60eb709f2e5c30f4067e605271d1b1bfff0e32f633a5a02f55a74aa448bfafbc","imageID":"sha256:60eb709f2e5c30f4067e605271d1b1bfff0e32f633a5a02f55a74aa448bfafbc","name":{"image_id":"sha256:60eb709f2e5c30f4067e605271d1b1bfff0e32f633a5a02f55a74aa448bfafbc","names":["localhost/kubernetes/pause:0.1.0"]}}
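To see at a glance which images a run removed, log lines like the one above can be filtered with a few lines of Python. This is only a sketch: it assumes the remover keeps emitting JSON entries with the same `logger`, `msg`, and `name.names` fields shown in this issue, and `removed_image_names` is a hypothetical helper, not part of Eraser.

```python
import json


def removed_image_names(log_lines):
    """Collect image names from 'removed image' entries of the remover logger.

    Assumes the JSON shape seen in the log line quoted in this issue.
    """
    names = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not JSON
        if not isinstance(entry, dict):
            continue
        if entry.get("logger") == "remover" and entry.get("msg") == "removed image":
            name_info = entry.get("name") or {}
            names.extend(name_info.get("names") or [])
    return names


# The remover log line from this issue, used as sample input.
sample = [
    '{"level":"info","ts":1731308367.6420007,"logger":"remover","msg":"removed image",'
    '"given":"sha256:60eb709f2e5c30f4067e605271d1b1bfff0e32f633a5a02f55a74aa448bfafbc",'
    '"imageID":"sha256:60eb709f2e5c30f4067e605271d1b1bfff0e32f633a5a02f55a74aa448bfafbc",'
    '"name":{"image_id":"sha256:60eb709f2e5c30f4067e605271d1b1bfff0e32f633a5a02f55a74aa448bfafbc",'
    '"names":["localhost/kubernetes/pause:0.1.0"]}}'
]

print(removed_image_names(sample))
```

Any name that shows up here and that the node cannot pull again is a candidate for an exclusion.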

After Eraser ran, all my deployment and pod startups are stuck.

Are you willing to submit PRs to contribute to this feature request?

[ ] Yes, I am willing to implement it.

sozercan commented 2 weeks ago

@116davinder sorry to hear that. Eraser already supports skipping images through exclusions: https://eraser-dev.github.io/eraser/docs/exclusion

The official pause image is from registry.k8s.io/pause. Unfortunately, there's no local default for the pause image. I would recommend making the pause image available to pull from a registry and adding it to the exclusion list (so it doesn't get pulled every time).

If you can connect to the nodes, you can pull the pause image and retag it to what the cluster is looking for, or update the kubelet sandbox image config (https://kubernetes.io/docs/setup/production-environment/container-runtimes/#override-pause-image-containerd) as a mitigation.

This is related to #380, which defines pinned images at the containerd level. I believe that is closer to what you are looking for.

116davinder commented 2 weeks ago

@sozercan, this image, localhost/kubernetes/pause, comes from Bottlerocket. They do allow setting a different pause image, but most of our clusters were already built when this happened, so we can't change it.

Unfortunately, all my worker nodes (Bottlerocket OS) are locked down, so no one can access them :(

Luckily, I later found out that I had destroyed only one cluster, and my team managed to recreate the worker nodes to fix it. For now, I am using an exclusion policy to skip the pause image until my team moves to the registry.k8s.io/pause image.
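For anyone hitting the same problem, here is a rough sketch of generating such an exclusion file. It assumes the `{"excluded": [...]}` JSON shape described in the exclusion docs linked above; `build_exclusion_list` and the file name `excluded.json` are hypothetical, and the only entry is the image name taken from the remover log in this issue.

```python
import json


def build_exclusion_list(images):
    """Serialize image references into an {"excluded": [...]} JSON document.

    Duplicates are dropped and entries are sorted so the output is stable.
    """
    return json.dumps({"excluded": sorted(set(images))}, indent=2)


# Image name as it appears in the remover log from this issue.
pause_images = ["localhost/kubernetes/pause:0.1.0"]

exclusion_json = build_exclusion_list(pause_images)
print(exclusion_json)

# Write the file so it can be loaded into the cluster afterwards.
with open("excluded.json", "w", encoding="utf-8") as fh:
    fh.write(exclusion_json + "\n")
```

As I understand the docs, the file is then loaded into a ConfigMap in Eraser's namespace and labeled so Eraser treats it as an exclusion list; please check the linked page for the exact commands for your Eraser version.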

This is similar to #380, the issue you mentioned. If you like, we can close this issue and continue the discussion in #380.

sozercan commented 2 weeks ago

Closing, we'll track it in #380

eraser-dev / eraser