corang opened this issue 7 months ago
This would be a common problem anytime you have validating webhooks with a `failurePolicy` of `Fail` and kill all nodes at once running those workloads. I think some preferred anti-affinities by default in core would help and would probably be a sane thing to add to uds-core, but I do think this is also somewhat on end users to do more careful node rotation. Perhaps we can make some changes to make this a bit nicer, while also documenting some practices around node rotation and the specific issues you could hit, like this one.
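Something like the soft (preferred) anti-affinity below is the shape of what I have in mind as a default on istiod and the Pepr pods; this is only a sketch, and the label selector and weight are illustrative rather than anything core ships today:

```yaml
# Sketch of a preferred podAntiAffinity on the pod spec, so the scheduler tries to
# spread the webhook backends across nodes (label values are illustrative):
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: istiod
```

Being "preferred" rather than "required", this still lets everything land on a single node in a pinch, it just biases the scheduler toward spreading replicas out.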
Definitely open to other suggestions on ways we could approach this... changing the `failurePolicy` to `Ignore` for Pepr is a non-starter (it would make our policy engine bypassable), and I think the same is true for Istiod (it's been a while, but in the past bad Istio resources could crash istiod if the validator was skipped). Even if it were just a temporary switch of the `failurePolicy`, I'm not sure how we could do that safely.
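For context, the `failurePolicy` in question lives on the webhook configuration itself. A trimmed-down illustrative example is below; the names and rules are placeholders, not the actual Pepr or Istio resources:

```yaml
# Illustrative only: a validating webhook with failurePolicy: Fail. If the backing
# Service has no ready endpoints, matching API requests (e.g. pod creation) are rejected.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-policy-webhook
webhooks:
  - name: validate.example.dev
    failurePolicy: Fail   # flipping this to Ignore would make the policy engine bypassable
    clientConfig:
      service:
        name: example-webhook
        namespace: example-system
        path: /validate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods"]
    admissionReviewVersions: ["v1"]
    sideEffects: None
```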
This idea applies to either Istio or Pepr, but I'll just give one example: bake whatever Pepr would mutate for istiod directly into the istiod manifests and have the istiod pod be ignored by the Pepr webhook. The inverse could be done for Pepr too?
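Concretely I'm picturing something like an `objectSelector` exclusion on the Pepr mutating webhook so istiod pods are never sent to it; the label key/value and names below are placeholders, not what Pepr actually uses:

```yaml
# Sketch only: exclude istiod pods from a mutating webhook via objectSelector.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: example-pepr-mutator
webhooks:
  - name: mutate.example.dev
    failurePolicy: Fail
    objectSelector:
      matchExpressions:
        - key: app
          operator: NotIn
          values: ["istiod"]   # istiod pods skip this webhook entirely
    clientConfig:
      service:
        name: example-pepr
        namespace: pepr-system
        path: /mutate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    admissionReviewVersions: ["v1"]
    sideEffects: None
```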
@corang hm, might be something we could do there. I don't think pre-baking Pepr with the Istio mutation is really feasible (since that's the sidecar injection). Pre-baking istiod with Pepr's mutation(s) would be doable for sure, although it also means excluding Istio entirely from policy checks, which isn't ideal.
May have to give this one some more thought...
Revisiting this, I think we have two paths forward:
I've been running into this issue as well. If some event wipes out all nodes, I would very much appreciate a workaround, script, task, or UDS CLI command that allows the istiod/pepr pods to be scheduled and rectifies the cluster.
If nodes are killed without a proper drain, the failing pepr and istiod webhooks keep any pods from scheduling to the cluster: istiod can't schedule because the pepr webhook is failing, and pepr can't schedule because the istiod webhook is failing.
Steps to reproduce
Expected result
Pods, albeit slowly, schedule onto new nodes
Actual Result
It's impossible for any pods to schedule onto the cluster going forward
Severity/Priority
Med/High
Additional Context
We were able to work around this by giving istiod as many replicas as there are nodes and a preferred podAntiAffinity, so that istiod does its best to be scheduled onto every node, but that doesn't feel like a proper solution.
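Roughly what we did, expressed as the relevant fields on the rendered istiod Deployment (how they actually get set depends on your overrides, and the replica count has to be bumped by hand whenever the node count changes, which is part of why this feels hacky):

```yaml
# Workaround sketch: one istiod replica per node plus a soft anti-affinity so they
# spread out, so that losing a subset of nodes still leaves an istiod replica running.
spec:
  replicas: 3   # set to the current number of nodes in the cluster
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    app: istiod
```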