argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

proposal: Support postgres for synchronization #13276

Open Joibel opened 1 week ago

Joibel commented 1 week ago

Summary

Semaphores (and mutexes) can be used to represent and restrict access to external systems which cannot or should not be accessed by multiple workflows at once.

To facilitate this, it would be nice to be able to store semaphores somewhere other than a local in-cluster ConfigMap.

The suggestion is to use a database.

Use Cases

In a highly available scenario with multiple Argo Workflows instances in different clusters, you could share semaphores by storing them off-cluster in an HA system.

Supported offboard system

Argo Workflows already has the option to use PostgreSQL or MySQL for various tasks such as node-status offloading and archiving. It would be nice to make this feature work with those databases too. However, this is problematic: when a semaphore is released, the workflow-controller SHOULD be informed of this via some notification mechanism. This is possible using LISTEN/NOTIFY in PostgreSQL, but it is not possible in MySQL. There are a number of add-ons that provide this kind of functionality for MySQL.

Any system which uses notifications should also poll for changes, because notifications may not get dispatched or may be lost during downtime. We could document this as a downside and simply recommend the use of PostgreSQL instead.

For this reason, I don't propose to support MySQL at the moment.

Alternative to a database

Alternatively, we could support Redis and Valkey, which have specific support for this. The downside is that they are a new, different kind of infrastructure.

I don't think we should actually do this, as these are usually deployed as non-critical caching systems rather than as formal persistence stores. I also don't think we should add a new dependency.

Note: the reason for supporting Redis despite its open source licensing stance is that enterprise Workflows users may not be able to use Valkey yet.

How to use

The workflow controller would be configured to talk to the infrastructure component. There would only be one offboard semaphore subsystem per controller.

You can specify something like:

```yaml
synchronization:
  semaphore:
    database:
      key: workflow
```

Semaphores and mutexes would be namespace-keyed unless they add `global: true`, in which case they would be global to the offboard store.
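From a Workflow, usage might then look something like the following. This is purely hypothetical syntax mirroring the shape of the existing `configMapKeyRef`-based semaphores; the `database` and `global` fields are assumptions from this proposal, not an existing API:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: db-semaphore-
spec:
  entrypoint: main
  synchronization:
    semaphore:
      database:        # hypothetical: offboard store instead of a ConfigMap
        key: workflow
        global: true   # shared across every controller using this store
  templates:
    - name: main
      container:
        image: alpine:3.20
        command: [echo, hello]
```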

tooptoop4 commented 1 week ago

even without notify it could be a periodic query

agilgur5 commented 1 week ago

Mmmm given that it would only support a single database type (and note that we do have existing requests for more database types) and that the single mentioned use-case is for multi-cluster, which is not officially supported at this time, I wouldn't support this proposal.

A multi-cluster solution to this may entail replicating ConfigMaps, rather than assuming a shared database. In the unofficial OCM solution, if you wanted to do this, you would create the ConfigMap in the control plane cluster and have it distribute it to managed clusters. Whereas if you only wanted to apply to a single Argo instance, you'd only create the ConfigMap in its cluster. In Alex's old PoC for native multi-cluster, you might have different Workflows reference the same ConfigMap by specifying the cluster: thiscluster.

As such, in the absence of native multi-cluster, I'd actually recommend a user-land workaround that does replication or similar instead.

Joibel commented 6 days ago

This is a workaround for scaling up workflows, but it isn't a workaround for high availability; it reduces availability. Any solution to kubernetes being unavailable is not improved by adding another single-point-of-failure cluster.

Replication also won't work for semaphores unless it performs write-delayed replication (where the write won't complete until replication is complete). I'm not aware of this being possible, but I haven't looked deeply. Nonetheless, it still reduces availability.

External highly available systems such as databases, Redis-like caches, or even etcd seem like a good solution to this problem. I don't think that, for multi-cluster, we should support attempting to do this with kubernetes directly.

This is part of a multi-cluster for high availability, which yes, isn't supported at this time. This proposal brings that closer, not further away, and I feel would be a step on the way to that.

I am happy if the answer is that we support MySQL as well, but without notify and just polling.

agilgur5 commented 5 days ago

I think we have some wires crossed here -- there's two different pieces to a Semaphore:

- the ConfigMap, which defines the semaphore and its limit
- the state, i.e. which Workflows currently hold it, which today lives in the Controller's memory

It sounds like state is the main thing you want to be able to share across clusters that use the same DB? that's actually sharing between multiple instances of the Controller, they don't need to be in multiple clusters.

The difference here is not moving the ConfigMap then, but moving the state from in-memory to in-DB. That will incur latency trade-offs and needs to be done atomically as a transaction in the DB as well (since the existing in-process locks don't affect other Controllers). I feel like there's a more optimal way of doing this 🤔

This is part of a multi-cluster for high availability, which yes, isn't supported at this time. This proposal brings that closer, not further away, and I feel would be a step on the way to that.

This approach also requires at least a partially shared DB between Controllers, which is a bit of a niche use-case. A fully shared DB amongst multiple Controllers would require cluster: be specified for every row and lots of other nuances. That's also not the use-case for the native multi-cluster feature requests, which are more to allow you to schedule across clusters (and a lot of users want that with 1 Controller, not multiple. although I really wouldn't recommend a single controller approach).

This is a workaround for scaling up workflows, but it isn't a workaround for high availability, it reduces availability.

Multi-cluster HA with a hard limit on Semaphores? If you have semaphores, I wouldn't think you'd be seeking HA per se, since you will be queuing Workflows.

write delayed replication (as in the write won't complete until replication is complete)

or it could be eventually consistent if you don't need a hard limit.

Any solution to kubernetes being unavailable is not improved by adding another single point of failure cluster.

This is also resolved with eventual consistency; replicated writes would be delayed if a cluster is unavailable. Also can have partial consensus. A lot of "it depends"

agilgur5 commented 5 days ago

The difference here is not moving the ConfigMap then, but moving the state from in-memory to in DB. That will have latency trade-offs incurred and needs to be ensured to be done atomically as a transaction in the DB as well (since the existing in-process locks don't affect other Controllers).

Right, this is just distributed locking. There are purpose-built systems for that which it would make more sense to integrate with for multi-cluster. They could also be integrated in user-land as well.

I'm seeing a lot of non-k8s native complexity for a very specific use case... 😕

I feel like there's a more optimal way of doing this 🤔

Also, yes, this can be done locklessly in a single cluster by just incrementing a counter in a ConfigMap (or other resource): retry on conflict, requeue if full.

In multi-cluster you also wouldn't necessarily have to replicate or have a shared database, just read/write the ConfigMap from the Control Plane cluster (or some chosen cluster), either by giving access to the resource via SA + kubeconfig, or by delegating to a Control Plane "Super" Controller (i.e. it locks on your behalf)

External highly available systems such as databases

There is still an SPoF here, as the DB is shared; I'm not sure that's necessarily better than k8s resources as storage. Sharing is a requirement in this case, so it's not something we can architect around (which I do quite often to make for simpler & more resilient systems)

Joibel commented 4 days ago

Ok, re-reading my original proposal I can understand I haven't conveyed what I intended to convey.

The proposal is supposed to:

Semaphores are implementable in user-land, definitely. Existing semaphores are a convenience feature with added UX benefits.

HA systems for doing this are much more readily available from many cloud vendors or internally within enterprises. They are a SPoF, but they are designed for resilience and persistence in a way that kubernetes is not.

I know of users already sharing a database between multiple controllers for other Workflows purposes. It saves on cost and gives a multi-cluster view of archived workflows, for example.

I don't think that our eventual multi-cluster solution should revolve around a single-controller, multiple-cluster design. Users are already hitting the limits of a single controller on a single cluster. I'm expecting us to run cluster-local controller(s) doing the heavy lifting in a multi-cluster solution. You can already run the controller in a different cluster from the workflows and pods, and I think it would be a poor solution to just enhance this to split the cluster with the workflows from the cluster with the pods.

So this enhancement proposal is supposed to help with the eventual goal of multiple, disparate controllers that would like to share semaphores, such as we might end up with in a multi-cluster solution. It isn't apparent to me that multi-cluster solutions automatically solve this problem, and it feels like adding support for delegating semaphores to a control-plane cluster isn't as good for reliability or performance.

Cluster-local semaphores for non-global resources would be much more commonplace in a multi-cluster scenario, I suspect, so they wouldn't require this setup.

The choice of using a database rather than any other offboard solution is: