Multi-cluster section

scottrigby commented 3 years ago

Follow-up after #608. See https://github.com/fluxcd/flux2/pull/608#pullrequestreview-553880064

Can include multiple environments, production setup, and promotion.

stealthybox commented 3 years ago

I agree -- there are several paths for multi-cluster topologies. We should help folks looking to do multi-cluster understand the concepts that can compose for their environment:

single repo + multi-path vs. branches vs. multi-repo
flux bootstrap independent clusters on different paths
remote apply mechanism / kubeConfig constraints
use a management cluster to reconcile child clusters
bootstrap flux in a child cluster using a management cluster
cluster API
differentiating metrics / alerts
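As a rough illustration of the remote apply / kubeConfig item above, a management cluster can reconcile a path into a child cluster with a Flux Kustomization that references a kubeconfig Secret. This is only a sketch: the names `child-cluster-apps` and `child-cluster-kubeconfig` and the path are hypothetical, and the apiVersion depends on the Flux release in use.

```yaml
# Sketch only: resource names and the path are hypothetical placeholders.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: child-cluster-apps
  namespace: flux-system
spec:
  interval: 10m
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  # Directory in the repo that holds the child cluster's manifests
  path: ./clusters/child/apps
  # Apply to the remote cluster described by this Secret instead of the
  # cluster where the controller runs
  kubeConfig:
    secretRef:
      name: child-cluster-kubeconfig
```

The Secret lives in the same namespace as the Kustomization and, by default, is expected to hold the kubeconfig under a `value` key. The kubeconfig should be self-contained, since the controller cannot rely on local CLI helpers for authentication.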

For promotion, there are several strategies:

tagging
paths
branching

^ These can be combined, and they have a significant impact on how the platform is managed and how software is released. We should also link out to the notification / alerts docs and to observability.

Will add more detail later

stefanprodan commented 3 years ago

There are two examples in fluxcd org on multi-cluster that show different approaches:

As a starting point, we could write a doc page that introduces the 2 examples and links to the docs inside each repo.

gerardgorrion commented 3 years ago

Tagging the repo commits is a flux v1 feature that we need in order to upgrade to flux v2.

benjimin commented 6 months ago

I think the docs could do a better job of advising on how to manage multiple environments and phased promotions between them.

Particularly, should everything be in one branch? The monorepo has two obvious downsides:

Changes to common components (kustomize base) will immediately deploy to production, rather than being tested first in the staging environment. (There's a workaround whereby shared changes are temporarily refactored into overlay patches for phased testing in a specific environment, but this is onerous and makes promotion error-prone.)
The merge controls cannot distinguish production from non-production. For example, using GitHub to enforce stringent review/approval processes for production will, as a side effect, cause unnecessary friction when making changes in the development environment.

Having two branches (e.g. production and staging) makes promotions more difficult to perform using git tools. If the branches are totally separate then divergence accumulates, making the repo generally difficult to manage. If the content is factored into a common base and overlays, it is easier to identify and remove divergence between the branch heads, but the histories nonetheless become unrelated (e.g., if a hotfix must be promoted to one app ahead of other already-staged changes for other apps, assuming multiple collaborators use the repo) which makes merges nontrivial.

Should there be a second repo where all of the applications are gathered? For applications represented as helm charts, this is viable (compare bitnami's chart repo) so long as the apps repo has CI to package and release updated charts into an OCI registry, and flux image tag automation is used to propagate releases into appropriate parts of the environments repo. It can also work for kustomize if using a scheme of app-specific release tags (as the flux gitrepo resource can refer to a specific subpath and a specific version pattern). This approach relies on refactoring the apps to minimise what config lives in the env overlay or helmrelease manifest, otherwise environments will still exhibit duplication (and be prone to unintended divergence). The downside is that this requires either intricate CI (for chart packaging) or is reliant on tags for promotions (note tags are less safely preserved than the commit history).

Having a separate repo for each application is hardly different from having one separate repo for all apps.

Another option is trunk based development with no separate staging environment, either by:

Instead relying on additional technologies like flagger to manage phased promotions for individual apps (so deployments are staged within the same cluster as production). An outline for using flagger would be useful, especially if it is the recommended approach?
Alternatively, using CI to dynamically spin up (and tear down) test environments for feature branches.
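For the flagger option, a minimal outline might look like the Canary below. It is a sketch under assumptions, not a recommended setup: the app name, namespace, port and thresholds are hypothetical, and Flagger also needs a supported service mesh or ingress controller plus a metrics provider, none of which are shown.

```yaml
# Sketch only: names, port and thresholds are hypothetical placeholders.
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-app
  namespace: apps
spec:
  # The Deployment whose new revisions are rolled out progressively
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  service:
    port: 80
  analysis:
    interval: 1m      # how often the analysis runs
    threshold: 5      # failed checks before rollback
    maxWeight: 50     # maximum share of traffic sent to the canary
    stepWeight: 10    # traffic increment per successful check
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
```

With this in place, Flux still applies the Deployment change from the trunk, and Flagger shifts traffic to the new revision step by step, rolling back if the success rate drops below the threshold.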

There are so many common pitfalls that this would be a great place to advise on best practice, since the problem originates in part from flux orchestrating an entire environment from a repo.

stefanprodan commented 6 months ago

I've been working on a reference architecture for using Flux to manage the continuous delivery of Kubernetes infrastructure and applications on multi-cluster multi-tenant environments.

https://github.com/controlplaneio-fluxcd/d1-fleet

The setup comprises multiple repositories and has the following properties:

it offers a clean separation between platform teams and dev teams, and between infra (cluster add-ons) and apps
it allows customising the infra and apps releases for each environment
changes to the base overlay do not affect the production clusters
the staging cluster runs the automation for updating the Helm charts to the latest versions
promotions for infra and apps from staging to production are done via PRs (merging main into the production branch)
dev teams get their own repos which are reconciled under a restricted service account
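To illustrate the last point, reconciling a dev team repo under a restricted service account could look roughly like this. The tenant name `team-a`, the service account `team-a-reconciler` and the path are hypothetical and not taken from the d1-fleet repo; the RBAC bound to the service account is not shown.

```yaml
# Sketch only: tenant, service account and path names are hypothetical.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: team-a-apps
  namespace: team-a
spec:
  interval: 10m
  prune: true
  # Apply with the tenant's service account rather than the controller's
  # own identity, so the tenant is limited to what its RBAC allows
  serviceAccountName: team-a-reconciler
  # Force all resources into the tenant's namespace
  targetNamespace: team-a
  sourceRef:
    kind: GitRepository
    name: team-a-apps
  path: ./deploy
```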

benjimin commented 6 months ago

@stefanprodan am I correct in understanding that your reference architecture for continuous delivery has the app repos each use separate branches for production and staging? In other words, it adopts the "git flow" branching model documented here and here. Note that both those references discourage the use of that branching model for modern DevOps and continuous delivery. Is there anywhere that you've addressed those concerns or presented more discussion regarding the choice?

stefanprodan commented 6 months ago

There is a big difference between code and infra: in my setup the main branch contains both overlays (which is never the case with code). Merging into another branch like production, or tagging the main branch with a semver tag, is just a way to say "promote this". The actual work is done in main; this is trunk-based development, not git flow. One improvement I would like to make is to use the production branch as a canary that gets synced by a single production cluster; promoting the changes to the whole fleet would then happen after tagging that branch with a semver tag.
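A rough sketch of what "promote this" can mean on the Flux side: staging follows the main branch, while production only picks up commits that carry a semver tag. The repository URL, resource names and version range are hypothetical, and each GitRepository would be applied to its own cluster.

```yaml
# Sketch only: URL, names and the semver range are hypothetical.
# Staging cluster: follows every commit on main.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 5m
  url: https://github.com/example-org/apps
  ref:
    branch: main
---
# Production cluster: follows the latest tag matching the semver range.
# Using "branch: production" here instead would give the merge-based
# variant of promotion.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 5m
  url: https://github.com/example-org/apps
  ref:
    semver: ">=1.0.0"
```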

fluxcd / flux2

Multi-cluster section #611