argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
14.88k stars 3.17k forks source link

proposal: splitting the controller into multiple deployments #13517

Open Joibel opened 3 weeks ago

Joibel commented 3 weeks ago

Summary

Currently the workflow controller is a monolith with no sharding capabilities.

We would like to be able to shard it for scalability, but that is some way off. This proposal is not about sharding.

This is a discussion around splitting it into multiple controllers. For examples:

Scalability

This should, overall improve scalability. It would allow us to experiment with sharding in less risky and complex ways as most of these only run one off events unlike the main controller.

It may put more read pressure on the k8s API as we will be running separate informers. Their information doesn't overlap though, workflows transition between them.

Strategies

I'd like to discuss which strategy we take with splitting:

Single binary, multiple uses

Many projects, including ArgoCD release a single binary which have many personalities depending upon the name of the binary that is invoked.

An alternative would be that you could specify the roles that this controller would take on. A single deployment could take on all the roles, or it could be fully split, depending upon your needs.

Single binary and image per controller

As a software engineer, this is my preferred approach. The pain of getting there is higher, but separation of concerns become much stronger and the binaries should benefit from the splitting.

The downside is that either we also produce a single binary version for testing/small installations or we no-longer support the current deployment method of a single pod. This is mostly not of that much concern to many users who just install a manifest from kustomize or helm.

My choice

I would split into multiple binaries and images, and that would be the only deployment option. If we want to reverse this and deploy multiple controllers in one image in future we could do this via a supervisor daemon running all the binaries.

This would ensure separation of concerns, which I think is the big win here.

agilgur5 commented 3 weeks ago

UX complexity is significant -- Istio as an example of prior art of the opposite

I've thought about this before, but there is actually prior art in the opposite direction. Of particular note is Istio, which in its 1.5 release, specifically transitioned to a single binary (istiod) and single Controller. It used to have like 5, and it was absolutely much more complicated to deploy back then. There are still multiple pieces of Istio despite that (the sidecars or ambient mesh daemons, and the optional Ingress and Egress Gateways) and users do have to be aware that they scale differently and have different configurations. Misconfigurations were extraordinarily common in the earlier versions of Istio and do still happen, but substantially less so after continuous simplifications.

Given my experience with Istio, other Istio users' experience, as well as the maintainers' experience in continuing to simplify the deployment (e.g. in 1.6 with SDS, agents, and all the way through to Ambient Mesh, which was beta in 1.22), I would suggest against this unless there is specific user demand for it. It does add substantial deployment complexity, and for many users, that complexity is not necessary.

Note that k8s itself has been infamous for deployment complexity, and that has similarly reduced over the years. The amount of services/Charts people install per cluster has increased though, enough to be its own source of complexity -- this would add to that.

Also note that this will substantially complicate UX when it comes to support or reporting issues. Our issue templates already don't handle the difference between Controller, Executor, and Server issues well, and this will multiplicatively complicate that. Similarly, users are not always aware of these different components and how they interact, so splitting into multiple more will make it more complicated than it already is for them as well.

memory increase due to multiple informers watching the same resources

It may put more read pressure on the k8s API as we will be running separate informers. Their information doesn't overlap though, workflows transition between them.

If you were to fully split things like TTL, archiving, and PodGC, there indeed would be duplicative informers as they watch the same resources. This could actually result in a substantial memory increase, since the Informers tend to be one of the largest proportions of memory. The memory increase is multiplicative with each separate Controller that watches the same set of resources.

We already have users with very high memory usage (100GB+), so I don't think they would like to further increase that. k8s API issues are even more common, so taking care when increasing that usage is also important.

For smaller installations, this and splitting into multiple Pods would also be significant, as the baseline of running a process, the Go runtime, and potentially duplicate informers, becomes higher.

It would be better to split per Informer / per resource as such. That way the difference is more minimal for larger installations (instead of multiplicative), although could still be quite significant for smaller installations.

can optimize memory usage of an informer, but would not recommend it unless absolutely necessary

Caveat: there is a way to optimize informer memory usage of the same resources but disjoint information within the same resource. I did research that quite a bit for my sharding proposal and then eventually found a way to do it and implemented a variant of that in https://github.com/argoproj/argo-workflows/pull/11855#discussion_r1556461731.

It's not ideal though, and there is still higher read pressure on the k8s API and higher network usage. So I wouldn't necessarily suggest regularly using this method. In that PR it is primarily a workaround due to lack of labels on Semaphore ConfigMaps, so as a result the Informer is on all ConfigMaps, which could lead to very high memory usage without optimization. My suggestion in that PR was to require labels on all Argo-used ConfigMaps moving forward (we do for many of them, but not quite all of them, like Semaphores) to avoid this issue in future versions (it's a breaking change otherwise, so can't be done in a patch)

supervisors are a can of worms

If we want to reverse this and deploy multiple controllers in one image in future we could do this via a supervisor daemon running all the binaries.

I would not recommend this either -- k8s and Docker do not lend well to multiple processes per image and it's a cause of lots of issues in the community -- I have literally seen a variant of this in every job since I started using k8s in 2017. As the most basic example, a healthcheck is either incomplete and only representative of a single process, or becomes substantially more complicated as an aggregate. The most common issue I've seen is using the older approach of a dynamic appserver like Passenger / Puma / Unicorn / etc splitting requests between replicated processes that worked well when it was the only thing deployed on a single EC2 server, and just throwing that all in one image and deploying it as a Pod on k8s. The healthcheck gets routed to only one of the processes, and so a successful response can be incomplete as the other processes may be failing -- this often causes issues where 1/x requests fail (x being the number of processes).

Supervisors themselves are also their own can of worms -- see also Emissary as an example of one of those cans of worms.

Multiple goroutines are also substantially more efficient than multiple processes.

build complexity of multiple binaries + images

The downside is that either we also produce a single binary version

There is also the build complexity of multiple binaries and multiple images. The release pipeline is a bit complex as-is and does take upwards of an hour to run already, which is not good in its delayed feedback (and does occasionally cause release issues already as such). Splitting these and still producing binaries and images for multiple architectures will certainly increase this build time, and likely significantly, especially as we'd probably cap out on matrix parallelization and so some will run in serial.

specific examples

  • CronWorkflow controller. This is a relatively (esp. compared to the main controller) simple controller which was where I started thinking about this. The CWF controller can use many GB of RAM in high CWF count scenarios.

I would probably say this is the best use-case, as it is a separate resource, can have high resource usage depending on the use-case, and is technically an optional resource too. Similar to CD's AppSet Controller. Whether it's worth separating for those scenarios though I think is a trade-off (80/20 problem). The maintenance effort may not be worthwhile in particular.

  • Validating Admission Webhook Validating Admission Webhook #13503. If tied to the controller it cannot horizontally scale because the rest of the controller cannot, but otherwise it could.

Not the most intuitive, but since the controller can have replicas (hot standbys currently), a webhook actually would be able to scale 😅 I believe there are existing cases in the ecosystem of the webhook responder being split into its own deployment though (can't remember off the top of my head, but pretty sure I have seen and scaled at least one before). Given that it's a server with an API and requires k8s-trusted TLS, it can be worth splitting to reduce complexity in its own right. Webhook scenario is an outlier though compared to the rest of the Controller, so likely not a good comparison. We do have WorkflowEventBindings as well, although I don't think I've actually ever touched that part of the codebase 🤔


Summary -- suggest separate files and packages primarily, rather than separate deployments

Each runs quite a simple controller for their specific task, but having this code separated seems worthy

In summary, while this could sound like a codebase simplification via separation of concerns, it actually would add substantial complexity in several other areas, resulting in a net increase in complexity rather than decrease. Given our existing issues with complexity, this will increase those rather than decrease them.

For the purpose of codebase simplification, I would just suggest continuing to split controllers into their own files and maybe packages, as both of us have suggested before (e.g. https://github.com/argoproj/argo-workflows/pull/13419#discussion_r1715196438). That will get all of the codebase-specific benefits of this without any downsides.

shard it

GH Markdown note: you can't use a short URL link within an in-line link -- this actually acted like an anchor tag that would link to a section of this issue, ending up linking to itself. I've edited this to include the full link