Note: this is a copy of #392, which auto-closed when I renamed my branch in an attempt to follow conventionalcommits.org
Summary
Patch failures in pod_webhook.Start() abort the controller process and may result in crash looping.
We've experienced this on a test cluster, where 8 pods out of ~1100 return an error on the "no-op" patch attempt in Start().
Since the code in Start() appears to be an optimization to eagerly create CachedImages, and is otherwise not required for the controller to function properly, this patch instead logs the error and continues on to the next pod.
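For illustration, here is a minimal sketch of the shape of that change, assuming Start() lists pods with a controller-runtime client and issues a "no-op" patch against each one. The PodWebhook fields, helper structure, and log message below are assumptions for the example, not the actual implementation:

```go
package webhook

import (
	"context"

	"github.com/go-logr/logr"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// PodWebhook is a stand-in for the real webhook type; only the fields this
// sketch needs are included.
type PodWebhook struct {
	Client client.Client
	Log    logr.Logger
}

// Start issues a "no-op" patch against every pod so the mutating webhook runs
// and CachedImages are created eagerly. With this patch, a failure is logged
// and the pod is skipped instead of aborting the controller.
func (w *PodWebhook) Start(ctx context.Context) error {
	var pods corev1.PodList
	if err := w.Client.List(ctx, &pods); err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		patch := client.MergeFrom(pod.DeepCopy())
		if err := w.Client.Patch(ctx, pod, patch); err != nil {
			// Before: returning err here aborted the process and could
			// leave it crash looping. Now: log the error and move on.
			w.Log.Error(err, "failed no-op patch, skipping pod",
				"pod", client.ObjectKeyFromObject(pod))
			continue
		}
	}
	return nil
}
```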
Patch Failure Details
Here's a sample of the patch failure:
The Pod "xxx-yyy-zzz" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`,`spec.initContainers[*].image`,`spec.activeDeadlineSeconds`,`spec.tolerations` (only additions to existing tolerations),`spec.terminationGracePeriodSeconds` (allow it to be set to 1 if it was previously negative)
core.PodSpec{
    ... // 15 identical fields
    Subdomain:         "",
    SetHostnameAsFQDN: nil,
    Affinity: &core.Affinity{
        NodeAffinity: &core.NodeAffinity{
            RequiredDuringSchedulingIgnoredDuringExecution: &core.NodeSelector{
                NodeSelectorTerms: []core.NodeSelectorTerm{
                    {
                        MatchExpressions: []core.NodeSelectorRequirement{
                            {Key: "xxx/availability", Operator: "In", Values: {"on-demand"}},
                            {Key: "xxx/arch", Operator: "In", Values: {"amd64"}},
+                           {Key: "xxx/availability", Operator: "In", Values: []string{"on-demand"}},
                        },
I've no idea why the patch logic occasionally converts Values: {"on-demand"} to Values: []string{"on-demand"} on only a subset of pods. My first thought was that the match expression objects were being reordered and sorted by Key, but I was unable to reproduce the error. The same style of Affinity config exists in many other pods in the cluster.
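One way to dig further (not part of this patch) would be to dump the patch body that a "no-op" MergeFrom actually computes for one of the failing pods, to confirm whether it is really empty ("{}"). A minimal sketch using the controller-runtime client; the namespace, pod name, and helper name are placeholders:

```go
package webhook

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// dumpNoOpPatch prints the JSON merge patch that would be sent when patching a
// pod against an identical deep copy of itself. For a true no-op this should
// be "{}"; anything else would explain the Forbidden error from the API server.
func dumpNoOpPatch(ctx context.Context, c client.Client) error {
	var pod corev1.Pod
	key := client.ObjectKey{Namespace: "default", Name: "xxx-yyy-zzz"} // placeholders
	if err := c.Get(ctx, key, &pod); err != nil {
		return err
	}
	patch := client.MergeFrom(pod.DeepCopy())
	data, err := patch.Data(&pod)
	if err != nil {
		return err
	}
	fmt.Printf("patch type=%s body=%s\n", patch.Type(), data)
	return nil
}
```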