fluxcd / flux2

Open and extensible continuous delivery solution for Kubernetes. Powered by GitOps Toolkit.
https://fluxcd.io
Apache License 2.0

Github Flux bootstrap Upgrade using Github Actions does not work #1852

Open mmckane opened 2 years ago

mmckane commented 2 years ago

Describe the bug

Following the guide at https://fluxcd.io/docs/installation/#bootstrap-upgrade, using https://github.com/fluxcd/flux2/tree/main/action#automate-flux-updates to update the manifest in the repo, and then running the commands does not upgrade Flux.

I don't appear to be able to automatically update the system components using gitops/github.

Steps to reproduce

  1. Start with Flux CLI 0.16.2 and bootstrap the cluster to branch main, following these instructions: https://fluxcd.io/docs/installation/#github-and-github-enterprise
  2. Run the GitHub Action to update repo//flux-system/gotk-components.yaml. A PR will be created against main.
  3. Merge the PR.
  4. Run flux reconcile source git flux-system, per https://fluxcd.io/docs/installation/#bootstrap-upgrade
  5. Run flux check. All images and labels are still at Flux 0.16.2 instead of the current 0.17.2 (as of writing).

Expected behavior

Updating gotk-components.yaml in Git automatically results in the Flux controllers being updated.

Screenshots and recordings

The GitHub PR is done at this point, so I don't have any output from it. But here is the output from the two commands run from the steps in the instructions. The fetched commit matches the PR, which has references to 0.17.2 in the gotk-components.yaml file.

$:~/repos/k8s-mgmt/cluster-setup$ flux reconcile source git flux-system
► annotating GitRepository flux-system in flux-system namespace
✔ GitRepository annotated
◎ waiting for GitRepository reconciliation
✔ GitRepository reconciliation completed
✔ fetched revision main/3b05ded560cfa0e0da6efad5943485682b4ccd04

# Note: run with the Flux 0.17.2 CLI (kustomize-controller should be at ghcr.io/fluxcd/kustomize-controller:v0.14.1)
$:~/repos/k8s-mgmt/cluster-setup$ flux check
► checking prerequisites
✔ kubectl 1.20.8-dispatcher >=1.18.0-0
✔ Kubernetes 1.20.8-gke.1500 >=1.16.0-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.11.2
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.13.3
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.15.1
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.15.4
✔ all checks passed
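
The mismatch described here (Git references the new images, the cluster still runs the old ones) can be checked mechanically. The following is a minimal sketch, assuming the manifest references images in the ghcr.io/fluxcd/<name>:vX.Y.Z form seen in the flux check output above; the function names manifest_images and missing_from_cluster are hypothetical helpers, not part of the flux CLI.

```shell
# List the distinct Flux controller images referenced in a manifest file.
# Usage: manifest_images <cluster-path>/flux-system/gotk-components.yaml
manifest_images() {
  grep -Eo 'ghcr\.io/fluxcd/[a-z-]+:v[0-9]+\.[0-9]+\.[0-9]+' "$1" | sort -u
}

# Print every image that the manifest references but that is absent from
# the list of running images supplied on stdin (one image per line).
missing_from_cluster() {
  running=$(sort -u)
  manifest_images "$1" | while read -r image; do
    printf '%s\n' "$running" | grep -qxF "$image" || printf '%s\n' "$image"
  done
}

# Example (requires kubectl and cluster access):
#   kubectl -n flux-system get deployments \
#     -o jsonpath='{range .items[*]}{.spec.template.spec.containers[0].image}{"\n"}{end}' \
#     | missing_from_cluster <cluster-path>/flux-system/gotk-components.yaml
```

Any line printed by missing_from_cluster is an image that was committed to Git but never rolled out, which is the symptom reported in this issue.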

OS / Distro

Ubuntu 20.04

Flux version

0.17.2

Flux check

► checking prerequisites
✔ kubectl 1.20.8-dispatcher >=1.18.0-0
✔ Kubernetes 1.20.8-gke.1500 >=1.16.0-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.11.2
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.13.3
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.15.1
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.15.4
✔ all checks passed

Git provider

github

Container Registry provider

No response

Additional context

No response

stefanprodan commented 2 years ago

Can you please post here your workflow? My guess is that the path given to --export doesn't match your cluster path.

mmckane commented 2 years ago

When the PR is created, it updates <cluster-path>/flux-system/gotk-components.yaml with the new manifests. After we merge the PR, the only way to get the cluster to update is to run flux bootstrap again; it doesn't happen by itself. I can provide screenshots of the file being modified, including the comment added at the top of the file with the version it is changing to, if that helps.

We have multiple clusters, so we modified the example from https://github.com/fluxcd/flux2/tree/main/action#readme to find all the gotk-components.yaml files, put them into a JSON array, and feed that into a matrix that creates a separate PR per environment.

name: update-flux

on:
  workflow_dispatch:
  #schedule:
  #  - cron: "0 * * * *"

jobs:
  flux-clusters:
    runs-on: gcp
    outputs:
      matrix: ${{ steps.content.outputs.matrix }}
    container: stedolan/jq:latest
    steps:
      - uses: actions/checkout@v2
      - name: get content
        id: content
        run: |
          echo ::set-output name=matrix::$(find config/clusters -iname 'gotk-components.yaml' | jq --slurp --raw-input 'split("\n")[:-1]' | jq '{"include": (map( { ("file"): .,"name":((split("/")[2]) + "-" + (split("/")[3]))  } ))}')
          echo "Output Array For Troubleshooting Purposes"
          echo $(find config/clusters -iname 'gotk-components.yaml' | jq --slurp --raw-input 'split("\n")[:-1]' | jq '{"include": (map( { ("file"): .,"name":((split("/")[2]) + "-" + (split("/")[3]))  } ))}')

  update-flux:
    needs: flux-clusters
    runs-on: gcp
    strategy:
      matrix: ${{ fromJson(needs.flux-clusters.outputs.matrix) }}
      fail-fast: false
    steps:
      - name: Check out code
        uses: actions/checkout@v2
        with:
          submodules: true
      - name: Setup Flux CLI
        uses: fluxcd/flux2/action@main
      - name: Check for updates
        id: update
        run: |
          echo "cluster: ${{ matrix.name }}"
          FULL_VERSION="$(~/flux -v)"
          ~/flux install --export > ${{ matrix.file }}
          IFS=' '
          read -a split_version <<< "$FULL_VERSION"
          echo "::set-output name=flux_version::${split_version[2]}"
      - name: Create Pull Request
        uses: peter-evans/create-pull-request@v3
        with:
            token: ${{ secrets.GITHUB_TOKEN }}
            branch: flux/${{ matrix.name }}-${{ steps.update.outputs.flux_version }}
            base: main
            commit-message: Update ${{ matrix.name }} to ${{ steps.update.outputs.flux_version }}
            title: Flux Update ${{ matrix.name }} to ${{ steps.update.outputs.flux_version }}
            body: |
              Update ${{ matrix.name }} to ${{ steps.update.outputs.flux_version }}

I don't think --export is wrong, unless it's not supposed to modify <clusterpath>/flux-system/gotk-components.yaml.
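
As a side note on the "Check for updates" step: it relies on bash's read -a and on the ::set-output workflow command, which GitHub has since deprecated in favor of appending name=value lines to the file named by the GITHUB_OUTPUT environment variable. Below is a minimal sketch of the same version extraction in portable shell, assuming flux -v prints a line of the form "flux version X.Y.Z" (consistent with the workflow taking the third word). The function names are hypothetical.

```shell
# Extract the version from `flux -v` output such as "flux version 0.17.2".
flux_version_from() {
  printf '%s\n' "$1" | awk '{ print $3 }'
}

# Record a step output by appending name=value to the file that
# GitHub Actions exposes through $GITHUB_OUTPUT.
set_step_output() {
  printf '%s=%s\n' "$1" "$2" >> "$GITHUB_OUTPUT"
}

# Inside the workflow step this would be used as:
#   FULL_VERSION="$(~/flux -v)"
#   set_step_output flux_version "$(flux_version_from "$FULL_VERSION")"
```

This does not affect the upgrade problem itself; it only keeps the PR branch name and title populated on runners where set-output is no longer honored.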

mmckane commented 2 years ago

Note: I updated to the latest version, 0.20.0, and was still seeing this issue. I tried the workaround below, though, and it seems to have fixed it. I'm not sure if this is the official way I should be doing this.

@stefanprodan I was able to get a cluster to upgrade from Git by copying the Kustomization spec created in my <cluster path>/flux-system folder and applying it to my cluster. Is this the supported way to do this, in which case the documentation needs to be updated? Or should the kustomize controller be syncing the flux-system namespace automatically behind the scenes somewhere in the Go code, and it isn't?

Example gotk-sync.yaml from repo and flux bootstrap command used:

flux bootstrap github \
  --owner=${GITHUB_ORG} \
  --repository=${GITHUB_REPO} \
  --branch=${GITHUB_BRANCH} \
  --path="config/clusters/platform/sandbox"

Output of flux bootstrap: the gotk-sync.yaml pushed automatically into the repo's flux-system folder by the Flux CLI:

---
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m0s
  ref:
    branch: main
  secretRef:
    name: flux-system
  url: ssh://git@github.com/<redacted>/k8s-mgmt
---
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 10m0s
  path: ./config/clusters/platform/sandbox
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system

This is the file I modified and applied to the cluster to get the automatic upgrade from the action to work. After creating and reconciling this resource, the resources in flux-system appear to have been upgraded. As you can see, I just made a new resource and added /flux-system to the end of the path.

apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: flux-system-components
  namespace: flux-system
spec:
  interval: 10m0s
  path: ./config/clusters/platform/sandbox/flux-system
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system

Alan01252 commented 2 years ago

I am curious to know the answer here too. I can't work out how the flux-system/kustomization file is applied.

Updating the gotk-components directly for the cluster doesn't directly "apply" them to the cluster, so I presume the bootstrap command does something else?

I can see why the above file modification would work, as we're then creating a kustomization which will get applied like anything else within Flux, but it "feels" like I must be missing something here.

Help appreciated ;)

mmckane commented 2 years ago

@stefanprodan any feedback on how this should work? It seems like nothing references the flux-system folder and runs the kustomization file there to apply the gotk-* files to the cluster on an ongoing basis, which breaks using the Git PR flow to update Flux inside the cluster. The workaround appears to be either to add a reference to the flux-system folder in the kustomization file at the path that you feed the bootstrap command, or to create a separate kustomize.toolkit.fluxcd.io/v1beta2 Kustomization that references the flux-system folder. Is this how we should be doing it?

Further evidence: from my testing, it seems that currently the only way to get Flux to sync to the flux-system folder's state is to run flux bootstrap using the flux.exe CLI, which defeats the purpose of the Git update flow, as that also updates the files in the flux-system folder. Unless I am missing a setting here, or am just on an older version where this isn't fixed?

kingdonb commented 2 years ago

The behavior of flux bootstrap is to create a flux-system directory in the cluster path that you specified, with three files in it, one of which is a Flux kustomization pointing to the empty cluster directory at the parent location of flux-system.

It's significant that the cluster directory is empty at this point, because the Flux Kustomization pointing at it (all Flux kustomizations do this) by default runs kustomize create with parameters for auto-detecting sub-resources whenever the path has no kustomization.yaml, as described in this FAQ entry about plain YAML.

Tl;dr: this behavior searches for plain YAML files and Kustomization YAMLs (overlays) beneath the cluster root. When you run kustomize create --autodetect --recursive in the cluster root yourself, if you still have no kustomization.yaml there, you will see that the generated file includes an entry for flux-system and any other plain YAML files you have left there.

(Don't commit it!)
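
One way to look at the generated file without any risk of committing it is to run the command against a throwaway copy of the cluster root. This is a minimal sketch, assuming the kustomize CLI is installed and that the cluster root has no kustomization.yaml yet (otherwise kustomize create refuses to overwrite it); preview_autodetect is a hypothetical helper name.

```shell
# Preview the kustomization.yaml that autodetect would generate for a cluster
# root, working on a scratch copy so nothing lands in the Git checkout.
# Usage: preview_autodetect path/to/cluster-root
preview_autodetect() {
  scratch=$(mktemp -d) || return 1
  cp -R "$1/." "$scratch/" || return 1
  # Run in a subshell so the caller's working directory is left untouched.
  ( cd "$scratch" && kustomize create --autodetect --recursive && cat kustomization.yaml )
  status=$?
  rm -rf "$scratch"
  return "$status"
}
```

If flux-system does not appear in the printed resources list, Flux is not going to apply the manifests in that folder.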

Users should expect Flux to slurp up any YAMLs inside of the directory you point it at, including subdirectories, and users must know that kustomization.yaml being placed in any path has the side-effect of blocking this recursive autodetect behavior from proceeding through any sub-paths that are not listed there in the resources section.

If you put a kustomization.yaml at the top of your cluster root directory and did not mention flux-system in it, then you excluded the Flux controllers from being applied and managed by Flux within the cluster, and you are no longer managing Flux with Flux. If this is not handled carefully, Flux might even delete itself (depending on whether pruning is enabled).
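
The rule above lends itself to a quick check in CI or before bootstrapping. The following is a rough sketch, not an official Flux check: flux_system_reachable is a hypothetical helper that does a plain text match for a flux-system list entry, only looks at a file named kustomization.yaml, and does not parse YAML or verify that the entry sits under resources.

```shell
# Return 0 if Flux would still reach flux-system/ under the given cluster root:
# either there is no kustomization.yaml (so autodetect applies), or the file
# contains a list entry for flux-system. Return non-zero otherwise.
flux_system_reachable() {
  file="$1/kustomization.yaml"
  [ -f "$file" ] || return 0
  grep -Eq '^[[:space:]]*-[[:space:]]*(\./)?flux-system/?[[:space:]]*$' "$file"
}

# Example:
#   flux_system_reachable config/clusters/cluster-1 \
#     || echo "kustomization.yaml hides flux-system from Flux"
```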

I suspect that you do not have prune: true and that is how your resources were able to become unmanaged without your noticing.

This is one of the most non-obvious behaviors of Flux, and it's really unfortunate how many people trip over this, because in my view at least it has been built this way to allow users to avoid learning about Kustomize for as long as possible. Learning Kustomize and Flux at once is a lot to take in. But as folks progress with the usage of Flux, this and other misunderstandings about how Kustomize works are common FAQs, and since Kustomize itself is an external dependency, we really have little control over how confusing it is or how difficult it is to learn. The best we can do is hope to cover with enough examples, and add features to make it easier to see what Flux is doing.

$ flux trace -n flux-system deploy kustomize-controller

Object:        Deployment/kustomize-controller
Namespace:     flux-system
Status:        Managed by Flux
---
Kustomization: flux-system
Namespace:     flux-system
Path:          ./clusters/moo-cluster
Revision:      staging/474d117f8694be2a1e2f297bc944de2e5c187013
Status:        Last reconciled at 2022-04-26 07:26:42 -0400 EDT
Message:       Applied revision: staging/474d117f8694be2a1e2f297bc944de2e5c187013
---
GitRepository: flux-system
Namespace:     flux-system
URL:           ssh://git@github.com/kingdonb/bootstrap-repo
Branch:        staging
Revision:      staging/474d117f8694be2a1e2f297bc944de2e5c187013
Status:        Last reconciled at 2022-04-24 14:44:27 -0400 EDT
Message:       stored artifact for revision 'staging/474d117f8694be2a1e2f297bc944de2e5c187013'

This is my test cluster, and the flux trace command should help you find out what is managing any given resource. As an experiment, I disabled prune, then wrote a kustomization.yaml in the parent of the flux-system directory that excludes flux-system, removing it from Flux's management as I've described.

It still shows up in flux trace as being managed by Flux. That might be a bug, or it might be intentional (because it was created by Flux in the first place, it's still technically accurate, at least in a sense).

When I run flux tree ks flux-system, I can see that the flux-system kustomization is no longer managing itself. I have a lot of things listed there, but none of them are GitRepository/flux-system/flux-system or Kustomization/flux-system/flux-system. If this is you, then besides inspecting the kustomize directory tree to understand exactly what Kustomize is doing in relation to how it is set up by default, this is one other way you can find out what's going on.

kingdonb commented 2 years ago

(I just noticed that flux tree ks flux-system never shows Kustomization/flux-system/flux-system managing Kustomization/flux-system/flux-system – which is good, because that would be a circular dependency, and this is a recursive operation, you don't want to see an infinite loop in your tree! But in the difference, you can still see that GitRepository/flux-system/flux-system is either there or not there, since that is a different object and it is not recursed through, so that one is not circular.)
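
The tree check described in these two comments can be reduced to a one-line filter. This is a minimal sketch that looks only for the GitRepository/flux-system/flux-system entry, since the Kustomization entry never appears in its own tree; tree_has_flux_source is a hypothetical helper name, and the exact layout of the tree output is not assumed beyond that entry appearing somewhere in it.

```shell
# Read `flux tree ks flux-system` output on stdin and succeed only if the
# GitRepository created by bootstrap is still part of the tree.
tree_has_flux_source() {
  grep -q 'GitRepository/flux-system/flux-system'
}

# Example (requires the flux CLI and cluster access):
#   if flux tree ks flux-system | tree_has_flux_source; then
#     echo "flux-system is managed by Flux"
#   else
#     echo "flux-system is NOT managed by Flux"
#   fi
```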

mmckane commented 2 years ago

@kingdonb this makes some sense as to what may be happening for us and I think clarified some of my suspicions. Here is exactly what is happening from our side of things and where we need to possibly adjust.

  1. We pre-load a kustomization.yaml in /config/clusters/cluster-1/. We specify the following kustomization file in that folder because we handle most of our transforms using the transformers file method of kustomize:

    apiVersion: kustomize.config.k8s.io/v1beta1
    kind: Kustomization
    resources:
    - ../../../base/cluster/gcp/stable-gke/

    transformers:
    - transformers/argocd.yaml
    - transformers/cert-manager.yaml
    - transformers/deployment.yaml
    - transformers/goldilocks.yaml
    - transformers/istio.yaml
    - transformers/metadata-api.yaml
    - transformers/oidc.yaml
  2. We then run the following:

    flux bootstrap github \
    --owner=${GITHUB_ORG} \
    --repository=${GITHUB_REPO} \
    --branch=${GITHUB_BRANCH} \
    --path="config/clusters/cluster-1"
  3. The bootstrap auto-commits and creates a flux-system folder at /config/clusters/cluster-1/flux-system containing the following gotk-sync file, which is set up to sync the cluster folder:

    # This manifest was generated by flux. DO NOT EDIT.
    ---
    apiVersion: source.toolkit.fluxcd.io/v1beta2
    kind: GitRepository
    metadata:
      name: flux-system
      namespace: flux-system
    spec:
      interval: 1m0s
      ref:
        branch: main
      secretRef:
        name: flux-system
      url: ssh://<redacted>
    ---
    apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
    kind: Kustomization
    metadata:
      name: flux-system
      namespace: flux-system
    spec:
      interval: 10m0s
      path: ./config/clusters/cluster-1
      prune: true
      sourceRef:
        kind: GitRepository
        name: flux-system
  4. End of install. The end result is that the cluster is not constantly synchronizing the flux-system folder, but it is applying our changes from the Git repo. Upgrades by re-running flux bootstrap seem to work fine, I am guessing because flux bootstrap reads the files from the repo or generates them again at runtime and pushes them out to the cluster again, kubectl apply style.

The above seems to have resulted in the state we are now in, because we created the kustomization.yaml out of band. I am guessing that to fix this, when we pre-create a kustomization file in this folder, we should update our resources section to look like this:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../../base/cluster/gcp/stable-gke/
- flux-system

transformers:
...

or alternatively adjust our directory structure to something like this and bootstrap /config/clusters/cluster-1/flux:

|__config
    |__clusters
       |__cluster-1
           |__flux
           |   |__flux-system (created on bootstrap)
           |   |__cluster-1.yaml (specify flux kustomization to /config/clusters/cluster-1)
           |__kustomization.yaml (with transformers etc from above)
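
The layout above could be scaffolded along the following lines. This is a sketch under assumptions: the Kustomization name cluster-1 and the interval and prune values are illustrative (copied from the gotk-sync.yaml shown earlier), and scaffold_flux_dir is a hypothetical helper. The flux-system folder itself is not created here, since flux bootstrap generates it.

```shell
# Create config/clusters/cluster-1/flux/cluster-1.yaml under a repo root
# (default: current directory). The file is a Flux Kustomization that points
# back at the cluster folder holding the hand-written kustomization.yaml.
scaffold_flux_dir() {
  flux_dir="${1:-.}/config/clusters/cluster-1/flux"
  mkdir -p "$flux_dir"
  cat > "$flux_dir/cluster-1.yaml" <<'EOF'
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: cluster-1
  namespace: flux-system
spec:
  interval: 10m0s
  path: ./config/clusters/cluster-1
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
EOF
}

# Afterwards, bootstrap against the flux subfolder rather than the cluster root:
#   flux bootstrap github ... --path="config/clusters/cluster-1/flux"
```

With this split, the bootstrap path contains no hand-written kustomization.yaml, so autodetect picks up both flux-system and cluster-1.yaml, while the transformers stay in the cluster root.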

I would be OK with either way from a technical standpoint. I think it's slightly frustrating that the documentation isn't clear about what the folder passed to flux bootstrap --path should or should not contain in terms of kustomization.yaml. From a transparency perspective, the documentation should be updated to make this clear.

Also, as a note: you can see that even with prune: true, Flux currently does not prune itself using the gotk-sync.yaml. I am guessing this is because Flux never learns about the files in the flux-system folder in the repo: the flux bootstrap command basically kubectl-applies the manifests the first time to bootstrap the cluster, and Flux never knows it is supposed to run the kustomization.yaml in the flux-system folder, so it doesn't know it should prune anything.

kingdonb commented 2 years ago

This makes sense now. You cannot have a kustomization.yaml in your cluster root that does not also include flux-system, unless you want for Flux to not manage itself, which is apparently what you're getting.

If you leave the kustomization.yaml out of the cluster root, the way Flux creates the bootstrap directory (empty except for flux-system/, if you haven't pre-loaded it with anything), you can create as many sibling directories alongside flux-system as you need, and they will all be evaluated as a kustomize overlay if they include a kustomization.yaml, or as a directory of plain YAMLs if they do not.

We could improve the documentation to make this clearer. Thanks for your feedback! Does this help to unblock you?