kppullin closed this 3 months ago
Hello and thanks for your submission.
As you understood, the goal of this piece of code is to eagerly create CachedImages
on start; failing to do so delays the usefulness of kuik, which we would like to avoid. However, crashing at start is even worse, so I agree to merge this PR until we find a more acceptable solution. I would just ask you to rename your commit to follow the conventional commit spec; the title of this PR would fit perfectly.
It is very weird: it looks like the empty patch sometimes just randomly mutates the pod. I would be interested in finding out whether a retry mechanism could solve the issue and, more importantly, in understanding why this happens. Since it is related to #371, I suggest we continue the discussion there if you're willing to help us get to the bottom of it and fix it properly.
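For illustration, a minimal sketch of what such a retry might look like, using client-go's `retry.OnError` helper; `patchPod` is a hypothetical stand-in for the actual patch call performed in `Start()`, and the choice of retriable errors is an assumption:

```go
import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/util/retry"
)

// patchPodWithRetry retries the "no op" patch with backoff before giving
// up, in case the failure is transient. patchPod is a hypothetical
// stand-in for the actual patch call in Start().
func patchPodWithRetry(ctx context.Context, patchPod func(context.Context) error) error {
	return retry.OnError(retry.DefaultBackoff, func(err error) bool {
		// Assumption: only retry conflicts and transient server errors;
		// an invalid patch would fail the same way every time.
		return apierrors.IsConflict(err) || apierrors.IsServerTimeout(err) || apierrors.IsTooManyRequests(err)
	}, func() error {
		return patchPod(ctx)
	})
}
```

Of course, if the patch itself is invalid rather than racing with something, retrying won't help, which is why understanding the root cause matters more.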
Thanks again, I'm waiting for you to reword your commit before merging.
Summary
Patch failures in `pod_webhook.Start()` abort the controller process and may result in crash looping. We've experienced this on a test cluster, where 8 pods out of ~1100 return an error on the "no op" patch attempt in `Start()`. Since the code within `Start()` appears to be an optimization to eagerly create `CachedImages`, and is otherwise not required for correct operation, this patch instead logs the `err` and continues on to the next pod.
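For clarity, the shape of the change is roughly the following sketch; the loop body, logger, and `patchPod` helper are paraphrased stand-ins, not the exact code in `pod_webhook.Start()`:

```go
import (
	"context"

	"github.com/go-logr/logr"
	corev1 "k8s.io/api/core/v1"
)

// eagerlyPatchPods paraphrases the fix: on a patch failure, log the
// error and continue to the next pod instead of returning it, which
// would abort the controller process. patchPod is a hypothetical
// stand-in for the actual "no op" patch call.
func eagerlyPatchPods(ctx context.Context, log logr.Logger, pods []corev1.Pod,
	patchPod func(context.Context, *corev1.Pod) error) {
	for i := range pods {
		pod := &pods[i]
		if err := patchPod(ctx, pod); err != nil {
			// Previously: return err (crash-looping the controller).
			log.Error(err, "failed to patch pod, skipping", "pod", pod.Namespace+"/"+pod.Name)
			continue
		}
	}
}
```

This matches the assumption above that the eager pass is only an optimization: skipping a failing pod should only defer its `CachedImages` creation rather than break anything.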
Patch Failure Details
Here's a sample of the patch failure:
I've no idea why the patch logic occasionally converts from `Values: {"on-demand"}` to `Values: []string{"on-demand"}` on only a subset of pods. A first thought was that the match expression objects were being reordered and sorted based on the `Key`, but I was unable to replicate the error. This style of `Affinity` config exists in many other pods in the cluster.
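For context, a sketch of the style of affinity being referenced; the node label key is a hypothetical placeholder, and only the "on-demand" value comes from the failure above:

```go
import corev1 "k8s.io/api/core/v1"

// Roughly the shape of the Affinity config discussed above. The label
// key "example.com/capacity-type" is a hypothetical placeholder; only
// the "on-demand" value appears in the actual failure.
var affinity = &corev1.Affinity{
	NodeAffinity: &corev1.NodeAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
			NodeSelectorTerms: []corev1.NodeSelectorTerm{{
				MatchExpressions: []corev1.NodeSelectorRequirement{{
					Key:      "example.com/capacity-type", // hypothetical key
					Operator: corev1.NodeSelectorOpIn,
					Values:   []string{"on-demand"},
				}},
			}},
		},
	},
}
```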