corang opened this issue 7 months ago
This would be a common problem anytime you have validating webhooks with a `failurePolicy` of `Fail` and kill all nodes at once running those workloads. I think some preferred anti-affinities by default in core would help and would probably be a sane thing to add to uds-core, but I do think this is also somewhat on end users to do more careful node rotation. Perhaps we can make some changes to make this a bit nicer, while also documenting some practices around node rotation and the specific issues you could hit, like this one.
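Something like the soft (preferred) anti-affinity below is the shape of what I have in mind as a default on istiod and the Pepr pods; this is only a sketch, and the label selector and weight are illustrative rather than anything core ships today:

```yaml
# Sketch of a preferred podAntiAffinity on the pod spec, so the scheduler tries to
# spread the webhook backends across nodes (label values are illustrative):
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: istiod
```

Being "preferred" rather than "required", this still lets everything land on a single node in a pinch, it just biases the scheduler toward spreading replicas out.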
Definitely open to other suggestions on ways we could approach this... changing the `failurePolicy` to `Ignore` for Pepr is a non-starter (it would make our policy engine bypassable), and I think the same is true for Istiod (it's been a while, but in the past bad Istio resources could crash istiod if the validator was skipped). Even if it were just a temporary switch of the `failurePolicy`, I'm not sure how we could do that safely.
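For context, the `failurePolicy` in question lives on the webhook configuration itself. A trimmed-down illustrative example is below; the names and rules are placeholders, not the actual Pepr or Istio resources:

```yaml
# Illustrative only: a validating webhook with failurePolicy: Fail. If the backing
# Service has no ready endpoints, matching API requests (e.g. pod creation) are rejected.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-policy-webhook
webhooks:
  - name: validate.example.dev
    failurePolicy: Fail   # flipping this to Ignore would make the policy engine bypassable
    clientConfig:
      service:
        name: example-webhook
        namespace: example-system
        path: /validate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods"]
    admissionReviewVersions: ["v1"]
    sideEffects: None
```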
This idea applies to either Istio or Pepr, but I'll just give one example: bake whatever Pepr would mutate for istiod directly into the istiod manifests and have the istiod pod be ignored by the Pepr webhook. The inverse could be done for Pepr too?
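Concretely I'm picturing something like an `objectSelector` exclusion on the Pepr mutating webhook so istiod pods are never sent to it; the label key/value and names below are placeholders, not what Pepr actually uses:

```yaml
# Sketch only: exclude istiod pods from a mutating webhook via objectSelector.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: example-pepr-mutator
webhooks:
  - name: mutate.example.dev
    failurePolicy: Fail
    objectSelector:
      matchExpressions:
        - key: app
          operator: NotIn
          values: ["istiod"]   # istiod pods skip this webhook entirely
    clientConfig:
      service:
        name: example-pepr
        namespace: pepr-system
        path: /mutate
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    admissionReviewVersions: ["v1"]
    sideEffects: None
```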
@corang hm, might be something we could do there. I don't think pre-baking Pepr with the Istio mutation is really feasible (since that's the sidecar injection). Pre-baking istiod with Pepr's mutation(s) would be doable for sure, although it also means excluding Istio entirely from policy checks, which isn't ideal.
May have to give this one some more thought...
Revisiting this, I think we have two paths forward:
I've been running into this issue as well. If some event wipes out all nodes, I would very much appreciate a workaround, script, task, or UDS CLI command that allows the istiod/pepr pods to be scheduled and rectifies the cluster.
If nodes are killed without a proper drain, the failing pepr and istiod webhooks keep any pods from scheduling to the cluster: istiod can't schedule because the pepr webhook is failing, and pepr can't schedule because the istiod webhook is failing.
Steps to reproduce
Expected result
Pods, albeit slowly, schedule onto new nodes
Actual Result
It's impossible for any pods to schedule onto the cluster going forward
Severity/Priority
Med/High
Additional Context
We were able to work around this by giving istiod as many replicas as there are nodes and a preferred podAntiAffinity, so that istiod does its best to be scheduled onto every node, but that doesn't feel like a proper solution.
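Roughly what we did, expressed as the relevant fields on the rendered istiod Deployment (how they actually get set depends on your overrides, and the replica count has to be bumped by hand whenever the node count changes, which is part of why this feels hacky):

```yaml
# Workaround sketch: one istiod replica per node plus a soft anti-affinity so they
# spread out, so that losing a subset of nodes still leaves an istiod replica running.
spec:
  replicas: 3   # set to the current number of nodes in the cluster
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    app: istiod
```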