argoproj-labs / argo-rollouts-manager

Kubernetes Operator for Argo Rollouts controller.
https://argo-rollouts-manager.readthedocs.io/en/latest/
Apache License 2.0

Cluster-scoped Rollouts installs #20

Closed jgwest closed 5 months ago

jgwest commented 8 months ago

At present, the operator only supports namespace-scoped installs: the Argo Rollouts controller's --namespaced parameter is hardcoded to always be enabled.
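As a rough sketch of what lifting that hardcoding could look like, in Go (the `RolloutManagerSpec` shape and the `ClusterScoped` field below are hypothetical, not the operator's current API):

```go
package main

import "fmt"

// Hypothetical, simplified spec shape; the ClusterScoped field is illustrative
// only and is not part of the current RolloutManager CRD.
type RolloutManagerSpec struct {
	ClusterScoped bool
}

// rolloutsControllerArgs builds the argo-rollouts container arguments, adding
// --namespaced only for namespace-scoped installs instead of hardcoding it.
func rolloutsControllerArgs(spec RolloutManagerSpec) []string {
	args := []string{}
	if !spec.ClusterScoped {
		args = append(args, "--namespaced")
	}
	return args
}

func main() {
	fmt.Println(rolloutsControllerArgs(RolloutManagerSpec{}))                    // [--namespaced]
	fmt.Println(rolloutsControllerArgs(RolloutManagerSpec{ClusterScoped: true})) // []
}
```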

Work required

jgwest commented 8 months ago

Link to Red Hat Issue Tracker: https://issues.redhat.com/browse/GITOPS-3847

jgwest commented 8 months ago

Following up on an internal discussion about how to handle the mix of namespace-scoped and cluster-scoped Rollouts controllers, here's a brainstorm of how it all fits together.

Supported Scenarios

When installing the Argo Rollouts controller on a cluster, there are three possible scenarios:

A) Cluster-scoped: 1 rollouts controller is watching for Rollouts CRs at cluster scope

B) Namespace-scoped: 1 or more Rollouts controllers are watching for Rollouts CRs at namespace scope (there may be multiple controllers on the cluster, each watching a single namespace)

C) Hybrid: multiple Rollouts controllers on the cluster, with at most one being cluster-scoped and the rest being namespace-scoped. The cluster-scoped install would ignore a RolloutManager in a Namespace that already has a namespace-scoped install.

Thus, only the first two of these scenarios are supported by the Argo Rollouts controller: A) a single cluster-scoped Rollouts controller, XOR B) one or more namespace-scoped Rollouts controllers.
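To make the rule concrete, here is a minimal sketch of it in isolation (the types below are local stand-ins for illustration, not operator code):

```go
package main

import "fmt"

// Local stand-in for "one Rollouts controller install on the cluster"; this is
// a sketch of the supported-state rule only.
type install struct {
	namespace     string
	clusterScoped bool
}

// supportedState returns true only for the two supported scenarios:
// (A) exactly one cluster-scoped controller and nothing else, XOR
// (B) zero or more namespace-scoped controllers and no cluster-scoped one.
func supportedState(installs []install) bool {
	clusterScoped, namespaceScoped := 0, 0
	for _, i := range installs {
		if i.clusterScoped {
			clusterScoped++
		} else {
			namespaceScoped++
		}
	}
	if clusterScoped > 1 {
		return false // more than one cluster-scoped controller is never valid
	}
	if clusterScoped == 1 && namespaceScoped > 0 {
		return false // hybrid (scenario C) is not supported
	}
	return true
}

func main() {
	fmt.Println(supportedState([]install{{clusterScoped: true}}))                     // true  (A)
	fmt.Println(supportedState([]install{{namespace: "ns1"}, {namespace: "ns2"}}))    // true  (B)
	fmt.Println(supportedState([]install{{clusterScoped: true}, {namespace: "ns1"}})) // false (C)
}
```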

How argo-rollouts-manager can ensure that the cluster is always in a supported state

When reconciling RolloutManager CRs, the Rollouts Operator can examine the current list of all RolloutManagers on the cluster, and use that list to ensure that the cluster is in a valid state:

Reconcile() should ONLY set the .status field of the particular RolloutManager CR that it is reconciling. Don't worry about setting the status field of other RolloutManager CRs that might exist on the cluster. (These will eventually be reconciled on the K8s controller resync, which occurs every X hours, so they will eventually have an error set as well.)
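A rough controller-runtime sketch of that flow: the v1alpha1 import path and RolloutManagerList type are assumed to match the repo's API package, and Spec.ClusterScoped plus the "Failure" phase value are hypothetical, used only to show the shape of the check and the status-only update.

```go
package controllers

import (
	"context"
	"fmt"

	rmv1alpha1 "github.com/argoproj-labs/argo-rollouts-manager/api/v1alpha1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type RolloutManagerReconciler struct {
	client.Client
}

func (r *RolloutManagerReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var cr rmv1alpha1.RolloutManager
	if err := r.Get(ctx, req.NamespacedName, &cr); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Examine every RolloutManager on the cluster before reconciling this one.
	var all rmv1alpha1.RolloutManagerList
	if err := r.List(ctx, &all); err != nil {
		return ctrl.Result{}, err
	}

	// A conflict exists if this CR coexists with any other CR while either of
	// them is cluster-scoped: that covers scenario C (hybrid) and the case of
	// two cluster-scoped installs, both of which are unsupported.
	conflict := false
	for _, other := range all.Items {
		if other.Name == cr.Name && other.Namespace == cr.Namespace {
			continue // skip the CR currently being reconciled
		}
		if other.Spec.ClusterScoped || cr.Spec.ClusterScoped {
			conflict = true
			break
		}
	}

	if conflict {
		// Only update the .status of the CR being reconciled; other conflicting
		// CRs will report their own error when they are next resynced.
		cr.Status.Phase = "Failure"
		if err := r.Status().Update(ctx, &cr); err != nil {
			return ctrl.Result{}, err
		}
		return ctrl.Result{}, fmt.Errorf("unsupported combination of RolloutManager CRs on the cluster")
	}

	// ...continue with the normal namespace-scoped or cluster-scoped reconcile...
	return ctrl.Result{}, nil
}
```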

Preventing denial of service (DoS)

Can we prevent a DoS in which a malicious user creates a RolloutManager in their own Namespace, moving the Rollouts controller install on the cluster into an unsupported state?

An interesting question. While this is theoretically possible, after some thought I think it is already reasonably mitigated:

In the future, we could possibly do something even fancier, perhaps with admission webhooks, or perhaps by adding support to upstream Rollouts, but I think the case above is strong enough that it's fair to wait for a customer/user request before we work on this.