enix / kube-image-keeper

kuik is a container image caching system for Kubernetes
MIT License
459 stars 36 forks source link

fix: ignore pod patching failures during startup #392

Closed kppullin closed 3 months ago

kppullin commented 3 months ago

Summary

Patch failures in pod_webhook.Start() abort the controller process and may result in crash looping.

We've experienced this on a test cluster, where we have 8 pods out of ~1100 that fail return an error on the "no op" patch attempt in Start().

As I believe the code within Start() is an optimization to eagerly create CachedImages, but otherwise not required to function properly, this patch will instead log the err and continue on to the next pod.

Patch Failure Details

Here's a sample of the patch failure:

The Pod "xxx-yyy-zzz" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`,`spec.initContainers[*].image`,`spec.activeDeadlineSeconds`,`spec.tolerations` (only additions to existing tolerations),`spec.terminationGracePeriodSeconds` (allow it to be set to 1 if it was previously negative)
  core.PodSpec{
    ... // 15 identical fields
    Subdomain:         "",
    SetHostnameAsFQDN: nil,
    Affinity: &core.Affinity{
        NodeAffinity: &core.NodeAffinity{
            RequiredDuringSchedulingIgnoredDuringExecution: &core.NodeSelector{
                NodeSelectorTerms: []core.NodeSelectorTerm{
                    {
                        MatchExpressions: []core.NodeSelectorRequirement{
                            {Key: "xxx/availability", Operator: "In", Values: {"on-demand"}},
                            {Key: "xxx/arch", Operator: "In", Values: {"amd64"}},
+                           {Key: "xxx/availability", Operator: "In", Values: []string{"on-demand"}},
                        },

I've no idea why the patch logic occasionally is converting from Values: {"on-demand"} to Values: []string{"on-demand"} only on a subset of pods. A first thought was the match expressions objects being reordered & sorted based on the Key, however I was unable to replicate the error. This style of Affinity config exists in many other pods in the cluster.

paullaffitte commented 3 months ago

Hello and thanks for your submission.

As you understood, the goal of this piece of code is to eagerly create CachedImages on start, and failing to do so is delaying the usefulness of kuik, which we would like to avoid. However, crashing at start is even worse, thus I can agree to merge this PR until we find a more acceptable solution. I just would like to ask you to rename your commit in order to follow the conventional commit spec, the title of this PR would fit perfectly.

It is very weird because it looks like the empty patch sometimes just randomly mutate the pod. I would be interested in finding out if a retry mechanism could solve the issue and more importantly understand why this happens. Since it is related to #371, I suggest that we continue the discussion there if you're willing to help us understand why this happens and how to properly fix it.

Thanks again, I'm waiting for you to reword your commit before merging.