Make it possible to run branch planner in HA configuration (replicas > 1)

flux-iac / tofu-controller

A GitOps OpenTofu and Terraform controller for Flux

https://flux-iac.github.io/tofu-controller/

Apache License 2.0


Issue #781 · Open · opened by yitsushi 1 year ago

yitsushi commented 1 year ago

Right now, by design, the branch planner can't be replicated. To make replication possible, we need a polling mechanism that can share work between instances.
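A minimal sketch of one way replicas could share polling work without fighting, assuming each replica knows its own index and the total replica count (for example from a StatefulSet pod ordinal). `ownsPR` is a hypothetical helper, not part of tofu-controller: every replica hashes the repo and branch the same way, so each PR maps to exactly one replica.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ownsPR reports whether the replica with 0-based index replicaIndex is
// responsible for the PR identified by repo and branch, when replicaCount
// replicas are running. The hash is deterministic, so every replica reaches
// the same answer without talking to the others.
// Hypothetical helper for illustration only.
func ownsPR(repo, branch string, replicaIndex, replicaCount int) bool {
	h := fnv.New32a()
	h.Write([]byte(repo))
	h.Write([]byte{0}) // NUL separator keeps ("ab","c") and ("a","bc") distinct
	h.Write([]byte(branch))
	return int(h.Sum32()%uint32(replicaCount)) == replicaIndex
}

func main() {
	prs := []struct{ repo, branch string }{
		{"org/infra", "feature-a"},
		{"org/infra", "feature-b"},
		{"org/network", "fix-dns"},
	}
	const replicas = 2
	for _, pr := range prs {
		for i := 0; i < replicas; i++ {
			if ownsPR(pr.repo, pr.branch, i, replicas) {
				fmt.Printf("%s@%s -> replica %d\n", pr.repo, pr.branch, i)
			}
		}
	}
}
```

Static sharding like this does not give HA on its own: if a replica dies, its share of PRs goes unplanned until the replica count is adjusted, and changing the count reshuffles ownership.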

=========

User Story:

As a Branch Planner developer, I'd like to make the branch planner scalable, so it can run as multiple instances and be configured for high availability (HA) through the replica count.

Acceptance Criteria:

[ ] Ensure the branch planner can be replicated without affecting existing functionality and without instances fighting over branch planner resources.

squaremo commented 11 months ago

I think just being able to run more goroutines would be good enough for now (I'm not clear whether that is what's being suggested). Making it scale by sharing work among several pods is much more involved. Running more worker goroutines would be simple if the branch planner were ported to run as a controller-runtime controller.

yitsushi commented 11 months ago

For HA systems, users may want to run the branch planner controller with at least 2 instances, preferably on different nodes. It's not about scaling to manage more resources, but about scaling for availability. Right now, if they set the controller to replicas > 1, every instance fights over each repo and branch: one creates the resources, the others error on that PR, and then they all move on to the next PR at roughly the same time.

squaremo commented 11 months ago

> if they set the controller to replica > 1 each controller will fight for each repo and branch

It's a good argument for porting to a controller-runtime Manager, so it can use leader election conveniently. If a pod crashes and another takes over, is there any state lost that would stop the second pod working properly?

squaremo commented 11 months ago

I think you'll need to revisit the acceptance criteria, if the point was HA rather than horizontal scalability.