Open jawnsy opened 11 months ago
My rule of thumb for patches is that they are for things that the operator itself has no opinion about (i.e. no operator behavior changes with different topology constraints).
It can be hard to make good defaults for things like this. The ones you have here, for example, don't match what we use in some of our production environments, and are probably not what you'd want for test/staging SpiceDBClusters.
I'm open to making this easier to configure, but scheduling-related API fields tend to have more churn than other bits of kube, so using patches for this keeps the operator forward-compatible for longer.
Thanks for triaging, @ecordell! I understand that none of these constraints are going to be suitable for everyone.
We're okay carrying patches for this indefinitely, but if you have some thoughts on an approach that would make this easier to configure, I think that would be useful. I'd be happy to contribute a feature, but I don't have a straightforward way to test changes to the operator (getting a development environment for this stuff can be tricky).
Summary
Add a default pod topology spread constraint to the cluster deployment to prefer scheduling on different nodes, so that failure of a node or zone does not result in an outage of the SpiceDB cluster.
Background
When running in a high availability configuration (multiple replicas of SpiceDB), the Kubernetes scheduler may place all the replicas in the same failure domain, such as a particular node or a particular availability zone. In rare instances, this can cause a brief outage:
Workaround
A solution to this is to add pod topology spread constraints to the resulting pods, with a whenUnsatisfiable setting of ScheduleAnyway to prevent issues on single-zone clusters. Nodes and availability zones have standard well-known labels for this purpose: kubernetes.io/hostname for nodes and topology.kubernetes.io/zone for availability zones.
The following SpiceDBCluster patches can do this:
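A minimal sketch of what such a patch might look like, assuming the operator's spec.patches field accepts a strategic-merge-style patch against the generated Deployment. The cluster name `example` and the `app.kubernetes.io/instance` pod label are illustrative assumptions and should be checked against the labels the operator actually sets; other required spec fields (such as config and the secret reference) are omitted for brevity.

```yaml
apiVersion: authzed.com/v1alpha1
kind: SpiceDBCluster
metadata:
  name: example                      # hypothetical cluster name
spec:
  patches:
    - kind: Deployment               # patch the Deployment the operator generates
      patch:
        spec:
          template:
            spec:
              topologySpreadConstraints:
                # Prefer spreading replicas across nodes.
                - maxSkew: 1
                  topologyKey: kubernetes.io/hostname
                  whenUnsatisfiable: ScheduleAnyway   # soft constraint: never blocks scheduling
                  labelSelector:
                    matchLabels:
                      app.kubernetes.io/instance: example-spicedb   # assumed pod label
                # Prefer spreading replicas across availability zones.
                - maxSkew: 1
                  topologyKey: topology.kubernetes.io/zone
                  whenUnsatisfiable: ScheduleAnyway
                  labelSelector:
                    matchLabels:
                      app.kubernetes.io/instance: example-spicedb   # assumed pod label
```

Because both constraints use ScheduleAnyway, the scheduler treats them as preferences, so single-node or single-zone clusters can still schedule every replica.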