[feature request] multiple metacontroller with object sharding strategy

moving the slack discussion over to here to continue in a more formal manner

Obligatory

Metacontroller is freaking great, thank you for enabling us to build custom controllers in a matter of days.

The Problem

As our cluster grows in scale, we are noticing metacontroller isn't able to keep up during large volume and/or volatile events such as a major deployment, or new cluster provision. Metacontroller is responsible for about 2500 objects at this point, and the time it takes for all the "update loop"s to resolve could be about 20-30 mins. This is especially problematic for parents whose children may be conditional on the state of other children and/or objects. As these would require at least two "update loops". Metacontoller and the webhooks are in no way resource constrained, never going above 200m, and memory usage is negligible.

~The~ A Solution

Clearly we don't want to break metacontroller's simple interactions with the cluster and users. Cluster scoped controller objects, and backwards compatible are really important. In essence, there is a cluster scoped pool of work, and the idea is to safely parallelize the processing of that pool across an arbitrary number of workers. Keeping that in mind, I propose the following:

Allow for multiple replicas of metacontoller, each with their own shard of the worker pool
All metacontroller's register watches on all controller parents/children, as they do now
...but only process objects in their designated shard
Sharding algorithm would be just a simple math function:

update object if metaconroller statefulset host number == object uid % number of metacontroller replicas
This would still work as metacontroller scales up and down, and would be compatible with an hpa (I believe, though I never created an hpa for a statefulset before)
Metacontroller can look up the number replicas by watching it's parent statefulset status

Challenges

Need to be careful that two metacontrollers don't attempt to update the same object, this should be mathematically impossible however, latency on the api could cause moments when the exact replica count may be inaccurate
Metacontoller needs health and ready probes and should never start processing its object shard unless it is reported as ready
Though the processing of objects would scale, it is impossible to shard watches and thus, every metacontroller would still watch for everything. This should have no material effect on the k8s api, but it is worth mentioning.

Other nice side effects

Multiple replicas means we can improve our HA posture by spreading out our metacontrollers with anti-affinity rules
Auto scaling is now possible, tuning would be just like any other service in the cluster
It would be possible to preemptively scale up metaconroller before high volume / volatility events

Hi @hypergig , very good suggestion! i'd try to move the request to https://github.com/AmitKumarDas/metac - it's a fork as this project is no longer maintained.

about HPA with stateful sets, it's possible - but a graceful termination is required to free resources and actually clean the state well (for example, perform removal from cluster, or de-provision a volume).. there's a nice article of a graceful shutdown: https://medium.com/@marko.luksa/graceful-scaledown-of-stateful-apps-in-kubernetes-2205fc556ba9

If you don't need to scale down a specific shard (let's say, you have 3 shards, but now shard #1 is irrelevant) - then a graceful shutdown is not really necessary and regular HPA is fine.

about challenge #1 - 3 thoughts:

why does this matter? even if you have 2 replicas that handle the same object - only one of the replicas should grab it and perform the update - i'm not 100% sure though). if not, just from top of my head, a shared queue per shard can be used so each replica can read from it, worst case (if 2 replicas are working on the same object) - only one will catch the update.
maybe it doesn't really matter, as you're usually syncing to the end state of the object, i don't really see how 2 instances can grab 2 different versions of the same object, as you'd need to deploy twice super fast (while scaling the controller).
because you're switching to stateful set, you can basically notify all shards about the new number of replicas while a new shard is initializing, before it starts to process. just open an API that shard-3 will call (shard-0, shard-1, shard-2) and notify them about the changes in deployment size. each one that is called stops processing. when all shards are updated - notify that they can continue, it's basically some kind of "resharding" strategy. you can also have some timeout in case of errors to rollback, etc... there might be better approaches (https://medium.com/harmony-one/understanding-harmonys-cuckoo-rule-for-resharding-215766f4ca50), but you can have freedom there because you control and know exactly when a scale up happens

GoogleCloudPlatform / metacontroller