enix / kube-image-keeper

kuik is a container image caching system for Kubernetes
MIT License
459 stars 36 forks source link

fix: Ignore pod patching failures during startup #397

Closed kppullin closed 2 months ago

kppullin commented 3 months ago

fix: Ignore pod patching failures during startup

Note: this is a copy of #392, which auto-closed when I renamed my branch in an attempt to follow conventionalcommits.org

Summary

Patch failures in pod_webhook.Start() abort the controller process and may result in crash looping.

We've experienced this on a test cluster, where we have 8 pods out of ~1100 that fail return an error on the "no op" patch attempt in Start().

As I believe the code within Start() is an optimization to eagerly create CachedImages, but otherwise not required to function properly, this patch will instead log the err and continue on to the next pod.

Patch Failure Details

Here's a sample of the patch failure:

The Pod "xxx-yyy-zzz" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`,`spec.initContainers[*].image`,`spec.activeDeadlineSeconds`,`spec.tolerations` (only additions to existing tolerations),`spec.terminationGracePeriodSeconds` (allow it to be set to 1 if it was previously negative)
  core.PodSpec{
    ... // 15 identical fields
    Subdomain:         "",
    SetHostnameAsFQDN: nil,
    Affinity: &core.Affinity{
        NodeAffinity: &core.NodeAffinity{
            RequiredDuringSchedulingIgnoredDuringExecution: &core.NodeSelector{
                NodeSelectorTerms: []core.NodeSelectorTerm{
                    {
                        MatchExpressions: []core.NodeSelectorRequirement{
                            {Key: "xxx/availability", Operator: "In", Values: {"on-demand"}},
                            {Key: "xxx/arch", Operator: "In", Values: {"amd64"}},
+                           {Key: "xxx/availability", Operator: "In", Values: []string{"on-demand"}},
                        },

I've no idea why the patch logic occasionally is converting from Values: {"on-demand"} to Values: []string{"on-demand"} only on a subset of pods. A first thought was the match expressions objects being reordered & sorted based on the Key, however I was unable to replicate the error. This style of Affinity config exists in many other pods in the cluster.

monkeynator commented 1 month ago

:tada: This PR is included in version 1.10.0 :tada:

The release is available on GitHub release

Your semantic-release bot :package::rocket: