Octops / gameserver-ingress-controller

Automatic Ingress configuration for Game Servers managed by Agones
https://octops.io
Apache License 2.0

Multiple Replicas for controller #15

Closed · craftyc0der closed this issue 2 years ago

craftyc0der commented 2 years ago

Let's say I wanted to deploy multiple replicas of the controller for scalability and redundancy. What design would you use for this? The watcher is going to notify each GameServer handler, so do we add a simple mutex to prevent duplicated work? This seems like a standard problem. Is there a standard solution?

danieloliveira079 commented 2 years ago

That is a very good question. Let me break the answer down:

  1. Scalability:

The controller already uses a high QPS setting, which allows many more requests to the K8S API than the client defaults. I am following the same approach as applications like Prometheus when communicating with the K8S API. However, I haven't had the chance to test the ingress controller against a huge Fleet of GameServers, since I don't have the cloud resources available. That said, I would not expect scalability issues given the nature of this application: the reconcile process is pretty simple and only creates a couple of resources, with no heavy business logic or external calls to DBs, APIs or other services. I do have a few tricks up my sleeve to improve the Reconcile process a bit, namely splitting OnAdd and OnUpdate into 2 different queues/channels and having 2 types of workers dealing with the messages (see the sketch after this list). As an example of another application following a similar pattern, Agones itself runs a single controller replica and relies on K8S to keep that replica up and running.

Changing the Deployment manifest of the gameserver ingress controller to set replicas > 1 will not make much difference; it will only put extra load on the API, because each controller replica keeps its own internal cache/watcher subscription.

  2. Redundancy: It is on my roadmap to implement leader election, which will allow more than one controller to run at the same time while only one handles the OnAdd/OnUpdate events. However, I would need to justify this effort with evidence that the time to elect a new leader is lower than the time for a crashed controller pod to be rescheduled.
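To make the OnAdd/OnUpdate split mentioned under the Scalability point concrete, here is a minimal sketch using plain Go channels with one dedicated worker per channel. The event type, handler bodies, and buffer sizes are illustrative assumptions, not the controller's actual code.

```go
// Sketch only: splitting OnAdd and OnUpdate events into two channels,
// each drained by its own worker. Types and handlers are placeholders.
package main

import (
	"context"
	"fmt"
	"time"
)

type gameServerEvent struct {
	name string // name of the GameServer the event refers to
}

// runWorker drains a single event channel and applies the given handler.
func runWorker(ctx context.Context, events <-chan gameServerEvent, handle func(gameServerEvent)) {
	for {
		select {
		case <-ctx.Done():
			return
		case ev := <-events:
			handle(ev)
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Separate buffered channels for add and update events, so a burst of
	// updates cannot starve the creation path (and vice versa).
	addCh := make(chan gameServerEvent, 100)
	updateCh := make(chan gameServerEvent, 100)

	// One dedicated worker per channel; more could be started per channel.
	go runWorker(ctx, addCh, func(ev gameServerEvent) {
		fmt.Println("reconcile OnAdd:", ev.name) // e.g. create Service + Ingress
	})
	go runWorker(ctx, updateCh, func(ev gameServerEvent) {
		fmt.Println("reconcile OnUpdate:", ev.name) // e.g. patch existing resources
	})

	// The watcher callbacks would then only enqueue onto the matching channel.
	addCh <- gameServerEvent{name: "gs-1"}
	updateCh <- gameServerEvent{name: "gs-1"}

	// Give the workers time to drain in this toy example.
	time.Sleep(100 * time.Millisecond)
}
```

The point of the split is isolation: a flood of update events cannot delay the creation path, and each worker type can be tuned or scaled independently.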

I hope that all makes sense and let me know if you need any other information.

Feel free to close the issue if that answers your question.

craftyc0der commented 2 years ago

It makes perfect sense. I've been hacking on this code base for a while. I am quite familiar with it as a result.

Appreciate the thoughts. Happy Turkey Day!

craftyc0der commented 2 years ago

If we want to rely on the service being online and have K8s replace it if it's down, should we consider adding a solid health check?

danieloliveira079 commented 2 years ago

Great suggestion. I can add the health checks.
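For reference, a minimal sketch of what such checks could look like if the controller is wired through controller-runtime's manager (an assumption here; the bind address and check names are placeholders):

```go
// Sketch only, assuming a controller-runtime manager; bind address and
// check names are example values.
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

func main() {
	// Expose /healthz and /readyz on a dedicated probe port.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		HealthProbeBindAddress: ":8081",
	})
	if err != nil {
		os.Exit(1)
	}

	// Ping is a trivial check; a real check could verify that the informer
	// cache has synced or that the K8S API is reachable.
	if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
		os.Exit(1)
	}
	if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
		os.Exit(1)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

The Deployment's livenessProbe and readinessProbe would then point at /healthz and /readyz on that port, so Kubernetes restarts or stops routing to an unhealthy replica.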

danieloliveira079 commented 2 years ago

It will be part of release 0.1.5.

craftyc0der commented 2 years ago

I have implemented LeaderElection. I'll test it in my cloud environment next week; Minikube is working nicely already. Pretty straightforward, really. The only complication was getting RBAC just right for my rather involved CI/CD pipeline.
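Not claiming this is your implementation, but for anyone following along, bare-bones client-go leader election with a Lease lock usually looks roughly like this; the lease name, namespace, and identity below are placeholders, and the RBAC piece boils down to get/create/update on leases.coordination.k8s.io.

```go
// Sketch only: plain client-go leader election with a Lease lock.
// Lease name, namespace, and identity are placeholder values.
package main

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Each replica identifies itself, typically by pod name.
	id, _ := os.Hostname()

	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "gameserver-ingress-controller", // placeholder lease name
			Namespace: "octops-system",                 // placeholder namespace
		},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Only the elected replica starts handling OnAdd/OnUpdate events.
			},
			OnStoppedLeading: func() {
				// Lost the lease: stop reconciling, or exit and let K8S restart the pod.
				os.Exit(0)
			},
		},
	})
}
```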

danieloliveira079 commented 2 years ago

Cool, I can also give it a try this weekend using the built-in support from Controller Runtime.
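For comparison, the Controller Runtime route mostly comes down to a couple of manager options; the ID and namespace below are placeholders, not the project's actual configuration.

```go
// Sketch of controller-runtime's built-in leader election.
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		LeaderElection:          true,
		LeaderElectionID:        "gameserver-ingress-controller", // placeholder lease name
		LeaderElectionNamespace: "octops-system",                 // placeholder namespace
	})
	if err != nil {
		panic(err)
	}

	// Controllers registered with the manager only start on the replica that
	// holds the lease, which gives the "many replicas, one active event
	// handler" behaviour discussed above.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```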