digitalocean / Kubernetes-Starter-Kit-Developers

Hands-on tutorial and Automation stack for an operations-ready DigitalOcean Kubernetes (DOKS) cluster.

error GitRepository/flux-system.flux-system #87

Closed Analect closed 2 years ago

Analect commented 2 years ago

I have been following the automation tutorial. https://github.com/digitalocean/Kubernetes-Starter-Kit-Developers/tree/main/15-automate-with-terraform-flux

I've re-run it a few times (recreating clusters), but it seems to get stuck on creating all the necessary flux-system components

When I run flux get all, then I get:

NAME                            READY   MESSAGE                                                         REVISION        SUSPENDED 
gitrepository/flux-system       False   auth error: knownhosts: illegal base64 data at input byte 5                     False 

And flux logs gives:

2021-12-09T16:07:35.606Z error GitRepository/flux-system.flux-system - Reconciler error auth secret error: Secret "flux-system" not found
2021-12-09T16:07:35.713Z error GitRepository/flux-system.flux-system - Reconciler error auth secret error: Secret "flux-system" not found
2021-12-09T16:07:35.897Z error GitRepository/flux-system.flux-system - Reconciler error auth secret error: Secret "flux-system" not found
2021-12-09T16:07:36.243Z error GitRepository/flux-system.flux-system - Reconciler error auth secret error: Secret "flux-system" not found
2021-12-09T16:07:36.913Z error GitRepository/flux-system.flux-system - Reconciler error auth secret error: Secret "flux-system" not found
2021-12-09T16:07:38.216Z error GitRepository/flux-system.flux-system - Reconciler error auth secret error: Secret "flux-system" not found
2021-12-09T16:07:40.812Z error GitRepository/flux-system.flux-system - Reconciler error auth error: knownhosts: illegal base64 data at input byte 5
2021-12-09T16:07:45.950Z error GitRepository/flux-system.flux-system - Reconciler error auth error: knownhosts: illegal base64 data at input byte 5
2021-12-09T16:07:56.233Z error GitRepository/flux-system.flux-system - Reconciler error auth error: knownhosts: illegal base64 data at input byte 5
2021-12-09T16:08:16.751Z error GitRepository/flux-system.flux-system - Reconciler error auth error: knownhosts: illegal base64 data at input byte 5
2021-12-09T16:08:57.749Z error GitRepository/flux-system.flux-system - Reconciler error auth error: knownhosts: illegal base64 data at input byte 5
2021-12-09T16:10:19.710Z error GitRepository/flux-system.flux-system - Reconciler error auth error: knownhosts: illegal base64 data at input byte 5
2021-12-09T16:13:03.599Z error GitRepository/flux-system.flux-system - Reconciler error auth error: knownhosts: illegal base64 data at input byte 5

It seems the various git credentials added in the main.tf file are right since files got added to the git_repository_sync_path that I supplied. However, these logs above suggest a related problem, where it can't access the GitRepository for other purposes.

In the Github PAT, I granted these permissions in scope. Maybe that's not sufficient?

image

If I look in .terraform/modules/create-doks-with-terraform-flux/provider.tf I see:

provider "github" {
  owner = var.github_user
  token = var.github_token
}

There is no base64 encoding/decoding suggested here.

Googling here suggests that if a GitHub user is a person rather than an org, then the --personal flag should be passed. I'm not sure if that's relevant here or whether it is handled in this starter kit. It also suggests checking the content of the flux-system secret on the cluster, which should equate to an encoded GitHub PAT supplied in the main.tf. It's not clear to me how that is best done.
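One way to check the content of that secret, as a minimal sketch: it assumes the secret is named flux-system in the flux-system namespace and carries a known_hosts key, as later replies in this thread indicate. The function name show_known_hosts is made up for illustration, and some platforms spell the decode flag --decode instead of -d.

```shell
# Print the decoded known_hosts entry stored in the Flux source secret.
# Assumption: secret "flux-system" in namespace "flux-system" has a
# "known_hosts" key (taken from this thread, not verified elsewhere).
show_known_hosts() {
  kubectl get secret flux-system -n flux-system \
    -o jsonpath='{.data.known_hosts}' | base64 -d
}
```

Calling show_known_hosts against the cluster should print a single line of the form "github.com <key-type> <base64-key>"; anything else in the key field would be consistent with the knownhosts parsing error reported above.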

Any thoughts on how I might get over this stumbling block? Tks

mtiutiu-heits commented 2 years ago

Hi @Analect,

I think I may have an idea of why you ran into this situation, but I need to confirm first. Let me try to reproduce this issue and get back to you.

Thanks.

v-ctiutiu commented 2 years ago

@Analect Found the issue. A fix is pending via https://github.com/digitalocean/container-blueprints/issues/18.

Analect commented 2 years ago

Thanks @mtiutiu-heits .

I was probably wrong when I said the flux-system components were not created.

On checking kubectl get pods -n flux-system I see:

NAME                                       READY   STATUS    RESTARTS   AGE
helm-controller-55896d6ccf-scnnd           1/1     Running   0          18h
kustomize-controller-76795877c9-nqwrn      1/1     Running   0          18h
notification-controller-7ccfbfbb98-slsxx   1/1     Running   0          18h
source-controller-6b8d9cb5cc-cjcs5         1/1     Running   0          18h

However the other flux get all errors above persist.

Also, I tried running flux bootstrap github --owner=<my-github-org> --repository=<my-repo> thinking that might rectify any misconfiguration. The response suggests all is OK, but those errors on flux get all persist.

Please enter your GitHub personal access token (PAT): 
► connecting to github.com
► cloning branch "main" from Git repository "https://github.com/xxx/xxx.git"
✔ cloned repository
► generating component manifests
✔ generated component manifests
✔ committed sync manifests to "main" ("xxxed2b84")
► pushing component manifests to "https://github.com/xxx/xxx.git"
✔ installed components
✔ reconciled components
► determining if source secret "flux-system/flux-system" exists
✔ source secret up to date
✗ sync path configuration ("") would overwrite path ("./clusters/dev") of existing Kustomization
Analect commented 2 years ago

@Analect Found the issue. A fix is pending via digitalocean/container-blueprints#18.

Thanks @mtiutiu-heits ... are there any manual steps I can take to fix and reload/apply to the cluster?

v-ctiutiu commented 2 years ago

@Analect I don't know if it's ok to use both methods, meaning bootstrapping Flux CD via Terraform and then via the flux CLI.

Did you uninstall Flux CD first via flux uninstall before bootstrapping again? Otherwise, some things are still left behind, like CRDs.

To fix it manually, please follow the steps below:

  1. Edit the flux-system secret:
kubectl edit secret flux-system -n flux-system
  2. Find the known_hosts field and change the value to:
Z2l0aHViLmNvbSBlY2RzYS1zaGEyLW5pc3RwMjU2IEFBQUFFMlZqWkhOaExYTm9ZVEl0Ym1semRIQXlOVFlBQUFBSWJtbHpkSEF5TlRZQUFBQkJCRW1LU0VOalFFZXpPbXhrWk15N29wS2d3RkI5bmt0NVlScllNak51RzVOODd1UmdnNkNMcmJvNXdBZFQveTZ2MG1LVjBVMncwV1oyWUIvKytUcG9ja2c9Cg==
  3. Save the secret, and reconcile the flux-system git repository resource:
flux reconcile source git flux-system

The above value for known_hosts is the base64 encoded form of:

github.com ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBEmKSENjQEezOmxkZMy7opKgwFB9nkt5YRrYMjNuG5N87uRgg6CLrbo5wAdT/y6v0mKV0U2w0WZ2YB/++Tpockg=
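For reference, a small sketch of how such a value could be produced from a known_hosts line. The helper name encode_known_hosts is hypothetical; the trailing newline is included because the encoded value quoted in this comment ends in one, and tr strips the line wrapping that some base64 implementations add.

```shell
# Encode one known_hosts line (plus a trailing newline) as unwrapped base64,
# suitable for pasting into the "data" section of a Kubernetes secret.
encode_known_hosts() {
  printf '%s\n' "$1" | base64 | tr -d '\n'
}
```

Passing the github.com line quoted above as the single argument should reproduce the encoded value; decoding the result with base64 -d is a quick way to confirm the round trip.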

Let me know if it helps.

P.S.: Sorry for replying with both GitHub accounts (I'm a contractor, and one account is associated with the company that I work for). I forgot to switch accounts, chrome profiles, etc. Too many things to do sometimes 😄 .

Analect commented 2 years ago

Thanks @v-ctiutiu That seems to have partially worked. Having followed steps above, I then run flux get all. Any idea what the error on the kustomization/flux-system might be related to?

NAME                            READY   MESSAGE                                                         REVISION                                        SUSPENDED 
gitrepository/flux-system       True    Fetched revision: main/97ba7b0970fcbeadbfe2fe44fb639d2254ed2b84 main/97ba7b0970fcbeadbfe2fe44fb639d2254ed2b84   False    

NAME                            READY   MESSAGE                                                                                                                                                                                                                                                                                             REVISION        SUSPENDED 
kustomization/flux-system       False   CustomResourceDefinition/kustomizations.kustomize.toolkit.fluxcd.io dry-run failed, reason: Invalid, error: CustomResourceDefinition.apiextensions.k8s.io "kustomizations.kustomize.toolkit.fluxcd.io" is invalid: status.storedVersions[1]: Invalid value: "v1beta2": must appear in spec.versions                 False
v-ctiutiu commented 2 years ago

@Analect You can also override the Terraform module value for that variable in your main.tf file, like this (notice the last line):

module "doks_flux_cd" {
  source = "github.com/digitalocean/container-blueprints/create-doks-with-terraform-flux"

  # DOKS 
  do_api_token                 = "<YOUR_DO_API_TOKEN_HERE>"               # DO API TOKEN (string value)
  doks_cluster_name            = "<YOUR_DOKS_CLUSTER_NAME_HERE>"          # Name of this `DOKS` cluster ? (string value)
  doks_cluster_region          = "<YOUR_DOKS_CLUSTER_REGION_HERE>"        # What region should this `DOKS` cluster be provisioned in ? (string value)
  doks_cluster_version         = "1.21.3-do.0"                            # What Kubernetes version should this `DOKS` cluster use ? (string value)
  doks_cluster_pool_size       = "<YOUR_DOKS_CLUSTER_POOL_SIZE_HERE>"     # What machine type to use for this `DOKS` cluster ? (string value)
  doks_cluster_pool_node_count = <YOUR_DOKS_CLUSTER_POOL_NODE_COUNT_HERE> # How many worker nodes this `DOKS` cluster should have ? (integer value)

  # GitHub
  # Important notes:
  #  - This module expects your Git `repository` and `branch` to be created beforehand
  #  - Currently, the `github_token` doesn't work with SSO
  github_user               = "<YOUR_GITHUB_USER_HERE>"               # Your `GitHub` username
  github_token              = "<YOUR_GITHUB_TOKEN_HERE>"              # Your `GitHub` personal access token
  git_repository_name       = "<YOUR_GIT_REPOSITORY_NAME_HERE>"       # Git repository where `Flux CD` manifests should be stored
  git_repository_branch     = "<YOUR_GIT_REPOSITORY_BRANCH_HERE>"     # Branch name to use for this `Git` repository (e.g.: `main`)
  git_repository_sync_path  = "<YOUR_GIT_REPOSITORY_SYNC_PATH_HERE>"  # Git repository path where the manifests to sync are committed (e.g.: `clusters/dev`)
  github_ssh_pub_key        = "ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBEmKSENjQEezOmxkZMy7opKgwFB9nkt5YRrYMjNuG5N87uRgg6CLrbo5wAdT/y6v0mKV0U2w0WZ2YB/++Tpockg="
}
v-ctiutiu commented 2 years ago

Thanks @v-ctiutiu That seems to have partially worked. Having followed steps above, I then run flux get all. Any idea what the error on the kustomization/flux-system might be related to?

NAME                            READY   MESSAGE                                                         REVISION                                        SUSPENDED 
gitrepository/flux-system       True    Fetched revision: main/97ba7b0970fcbeadbfe2fe44fb639d2254ed2b84 main/97ba7b0970fcbeadbfe2fe44fb639d2254ed2b84   False    

NAME                            READY   MESSAGE                                                                                                                                                                                                                                                                                             REVISION        SUSPENDED 
kustomization/flux-system       False   CustomResourceDefinition/kustomizations.kustomize.toolkit.fluxcd.io dry-run failed, reason: Invalid, error: CustomResourceDefinition.apiextensions.k8s.io "kustomizations.kustomize.toolkit.fluxcd.io" is invalid: status.storedVersions[1]: Invalid value: "v1beta2": must appear in spec.versions                 False

I suspect that you now have a Flux CD environment with mixed content, including old CRDs from the previous installation. I see that it complains about the CRD version. What works best, if it's not a big issue for you, is to uninstall Flux CD completely via flux uninstall and then bootstrap it again.
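The storedVersions complaint can be looked at directly on the CRD. The following is only a sketch: crd_versions is an illustrative helper name, and it assumes kubectl is pointed at the affected cluster. It prints the versions the CRD currently declares next to the versions the API server has recorded as stored.

```shell
# Compare the versions declared in the Kustomization CRD spec with the
# versions recorded in status.storedVersions.
crd_versions() {
  crd="kustomizations.kustomize.toolkit.fluxcd.io"
  printf 'spec.versions:   '
  kubectl get crd "$crd" -o jsonpath='{.spec.versions[*].name}'
  printf '\nstoredVersions:  '
  kubectl get crd "$crd" -o jsonpath='{.status.storedVersions[*]}'
  printf '\n'
}
```

If storedVersions lists a version (here v1beta2) that is missing from spec.versions, that matches the dry-run error shown above: the manifest being applied declares fewer versions than the cluster has already stored.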

Let me do this first in my current setup, and see if overriding the github_ssh_pub_key parameter fixes all the issues first. Then, I will try to find a manual fix for your environment as well if possible.

Thanks.

v-ctiutiu commented 2 years ago

@Analect Can you try this and let me know if it works:

flux reconcile kustomization flux-system -n flux-system --with-source
Analect commented 2 years ago

Tried flux reconcile kustomization flux-system -n flux-system --with-source ... and got this:

► annotating GitRepository flux-system in flux-system namespace
✔ GitRepository annotated
◎ waiting for GitRepository reconciliation
✔ fetched revision main/97ba7b0970fcbeadbfe2fe44fb639d2254ed2b84
► annotating Kustomization flux-system in flux-system namespace
✔ Kustomization annotated
◎ waiting for Kustomization reconciliation
✗ Kustomization reconciliation failed: CustomResourceDefinition/kustomizations.kustomize.toolkit.fluxcd.io dry-run failed, reason: Invalid, error: CustomResourceDefinition.apiextensions.k8s.io "kustomizations.kustomize.toolkit.fluxcd.io" is invalid: status.storedVersions[1]: Invalid value: "v1beta2": must appear in spec.versions

So you suggest running flux uninstall. By bootstrapping again, do you mean rerunning these:

terraform plan -out starter_kit_flux_cluster.out
terraform apply "starter_kit_flux_cluster.out"

Does that require me to tear down the existing cluster?

Analect commented 2 years ago

OK. Ran:

flux uninstall
terraform plan -out starter_kit_flux_cluster.out
terraform apply "starter_kit_flux_cluster.out"

It recreated the flux-system pods, but on running flux get all, I don't see any reference to kustomization/flux-system.

NAME                            READY   MESSAGE                                                         REVISION                                        SUSPENDED 
gitrepository/flux-system       True    Fetched revision: main/97ba7b0970fcbeadbfe2fe44fb639d2254ed2b84 main/97ba7b0970fcbeadbfe2fe44fb639d2254ed2b84   False

I ran flux reconcile kustomization flux-system -n flux-system --with-source ... but got:

✗ no matches for kind "Kustomization" in version "kustomize.toolkit.fluxcd.io/v1beta2"
v-ctiutiu commented 2 years ago

@Analect

To start fresh, and without deleting the whole cluster you need to:

  1. Uninstall Flux CD via:
flux uninstall
  2. Make sure that the flux-system namespace gets deleted. Check with: kubectl get ns. If it's still there because finalizers are holding up its deletion, then you need to use this script to delete it forcibly.
  3. Override the github_ssh_pub_key TF variable as explained previously.
  4. TF plan and apply, as you already mentioned (before applying, please inspect the plan carefully, and notice the changes - should be Flux CD related mostly):
terraform plan -out starter_kit_flux_cluster.out
terraform apply "starter_kit_flux_cluster.out"

Terraform should see the differences and re-create the missing parts only, meaning Flux CD components (if you still have the state file in your working directory, or on the S3 bucket).
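The namespace check in step 2 could be scripted roughly as follows. This is a sketch under assumptions: the helper name wait_for_flux_namespace_gone and the 120s default timeout are illustrative choices, not something the Starter Kit prescribes.

```shell
# Block until the flux-system namespace is gone, or until the timeout expires.
# Returns immediately if the namespace no longer exists.
wait_for_flux_namespace_gone() {
  if kubectl get namespace flux-system >/dev/null 2>&1; then
    kubectl wait --for=delete namespace/flux-system --timeout="${1:-120s}"
  fi
}
```

A non-zero exit from wait_for_flux_namespace_gone would suggest the namespace is stuck in Terminating, which is the case where the forced cleanup mentioned in step 2 comes in.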

Let me know how it goes and if it fixes your issue. Thanks.

Analect commented 2 years ago

@v-ctiutiu followed your instructions:

flux uninstall
Are you sure you want to delete Flux and its custom resource definitions: y
► deleting components in flux-system namespace
✔ Deployment/flux-system/helm-controller deleted 
✔ Deployment/flux-system/kustomize-controller deleted 
✔ Deployment/flux-system/notification-controller deleted 
✔ Deployment/flux-system/source-controller deleted 
✔ Service/flux-system/notification-controller deleted 
✔ Service/flux-system/source-controller deleted 
✔ Service/flux-system/webhook-receiver deleted 
✔ NetworkPolicy/flux-system/allow-egress deleted 
✔ NetworkPolicy/flux-system/allow-scraping deleted 
✔ NetworkPolicy/flux-system/allow-webhooks deleted 
✔ ServiceAccount/flux-system/helm-controller deleted 
✔ ServiceAccount/flux-system/kustomize-controller deleted 
✔ ServiceAccount/flux-system/notification-controller deleted 
✔ ServiceAccount/flux-system/source-controller deleted 
✔ ClusterRole/crd-controller-flux-system deleted 
✔ ClusterRoleBinding/cluster-reconciler-flux-system deleted 
✔ ClusterRoleBinding/crd-controller-flux-system deleted 
► deleting toolkit.fluxcd.io finalizers in all namespaces
✔ GitRepository/flux-system/flux-system finalizers deleted 
► deleting toolkit.fluxcd.io custom resource definitions
✔ CustomResourceDefinition/alerts.notification.toolkit.fluxcd.io deleted 
✔ CustomResourceDefinition/buckets.source.toolkit.fluxcd.io deleted 
✔ CustomResourceDefinition/gitrepositories.source.toolkit.fluxcd.io deleted 
✔ CustomResourceDefinition/helmcharts.source.toolkit.fluxcd.io deleted 
✔ CustomResourceDefinition/helmreleases.helm.toolkit.fluxcd.io deleted 
✔ CustomResourceDefinition/helmrepositories.source.toolkit.fluxcd.io deleted 
✔ CustomResourceDefinition/kustomizations.kustomize.toolkit.fluxcd.io deleted 
✔ CustomResourceDefinition/providers.notification.toolkit.fluxcd.io deleted 
✔ CustomResourceDefinition/receivers.notification.toolkit.fluxcd.io deleted 
✔ Namespace/flux-system deleted 
✔ uninstall finished

The namespace flux-system was stuck in a Terminating state for some time, so I went ahead and ran:

(
NAMESPACE=flux-system
kubectl proxy &
kubectl get namespace $NAMESPACE -o json |jq '.spec = {"finalizers":[]}' >temp.json
curl -k -H "Content-Type: application/json" -X PUT --data-binary @temp.json 127.0.0.1:8001/api/v1/namespaces/$NAMESPACE/finalize
)

This appeared to work. I notice this at the end of the output on running that script.

{
        "type": "NamespaceContentRemaining",
        "status": "True",
        "lastTransitionTime": "2021-12-10T18:01:39Z",
        "reason": "SomeResourcesRemain",
        "message": "Some resources are remaining: kustomizations.kustomize.toolkit.fluxcd.io has 1 resource instances"
      },
      {
        "type": "NamespaceFinalizersRemaining",
        "status": "True",
        "lastTransitionTime": "2021-12-10T18:01:39Z",
        "reason": "SomeFinalizersRemain",
        "message": "Some content in the namespace has finalizers remaining: finalizers.fluxcd.io in 1 resource instances"
      }
    ]
  }

I reran:

terraform plan -out starter_kit_flux_cluster.out
terraform apply "starter_kit_flux_cluster.out"

... but running flux get all, I'm still not seeing this kustomization/flux-system. I tried flux reconcile kustomization flux-system -n flux-system --with-source again, but just get back:

✗ no matches for kind "Kustomization" in version "kustomize.toolkit.fluxcd.io/v1beta2"
v-ctiutiu commented 2 years ago

@Analect

OK, I reproduced your issue, and it seems that the main TF module from the container-blueprints repo is a bit outdated with regard to the Flux CD provider. So I updated the Flux CD Terraform provider in my GitHub fork of the container-blueprints repo to use the latest version. The container-blueprints repo holds the main Terraform module code, by the way, which is then used in the Starter Kit.

I assume that you have the latest version of the flux CLI installed locally, right? (Or at least a very recent version.)

If so, the Flux CD provider used by the TF module in the container-blueprints repo needs an update as well, because it's old. I suspect that the Kustomization controller issue is caused by this. The versions of the TF provider for Flux and of the flux CLI should not be too far apart.
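To compare the two sides, one option is to list the controller images running in the cluster and set them against the local CLI version. This is a sketch only; flux_controller_images is an illustrative name, and it assumes the controllers run as Deployments in the flux-system namespace, as the pod listings in this thread suggest.

```shell
# List each Flux controller Deployment with the image (and tag) of its
# first container, one per line, tab-separated.
flux_controller_images() {
  kubectl get deployments -n flux-system \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'
}
```

The image tags printed by flux_controller_images can then be compared with what flux --version reports locally; a large gap between the two would fit the version skew described here.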

So, I uninstalled Flux CD again via flux uninstall, and then I used my updated version for the module source like this (in the main.tf file):

module "doks_flux_cd" {
  source = "github.com/v-ctiutiu/container-blueprints/create-doks-with-terraform-flux"
...
}

After planning again and then applying, I got both resources:

flux get all

And the output is:

NAME                            READY   MESSAGE                                                         REVISION                                        SUSPENDED 
gitrepository/flux-system       True    Fetched revision: main/95ae1bd47e4ce8cefd5e0bd409e3fe520ff748e1 main/95ae1bd47e4ce8cefd5e0bd409e3fe520ff748e1   False    

NAME                            READY   MESSAGE                                                         REVISION                                        SUSPENDED 
kustomization/flux-system       True    Applied revision: main/95ae1bd47e4ce8cefd5e0bd409e3fe520ff748e1 main/95ae1bd47e4ce8cefd5e0bd409e3fe520ff748e1   False

Please test and let me know if it works for you as well. If it does, then I will create another PR for the container-blueprints repo to address this issue as well.

Thanks.

Analect commented 2 years ago

@v-ctiutiu . Thanks for your efforts with this. Having gone through all your steps above, unfortunately I still can't get this kustomization/flux-system to 'show up'.

$ kubectl get pods -n flux-system
NAME                                       READY   STATUS    RESTARTS   AGE
helm-controller-779b58df6b-f4lmj           1/1     Running   0          98s
kustomize-controller-5db6bfc56d-cqwzh      1/1     Running   0          98s
notification-controller-7ccfbfbb98-lrqb4   1/1     Running   0          98s
source-controller-565f8fbbff-g6ptc         1/1     Running   0          98s
$ flux get all
NAME                            READY   MESSAGE                                                         REVISION                                        SUSPENDED 
gitrepository/flux-system       True    Fetched revision: main/3fe239b88ce7725d7867215884940adf77dde94a main/3fe239b88ce7725d7867215884940adf77dde94a   False    

$ flux reconcile kustomization flux-system -n flux-system --with-source
✗ no matches for kind "Kustomization" in version "kustomize.toolkit.fluxcd.io/v1beta2"

Back in my github repo, flux-system/gotk-sync.yaml updated as follows. This would suggest that the v1beta2 for kustomize that keeps getting complained about was updated, but maybe that hasn't been properly enforced on the cluster.

image

The upgrade of the fluxcd provider to 0.8.1 required me to run terraform init -upgrade, which appears to have upgraded various other components too in the flux-system/gotk-components.yaml file.

image

Sorry, I realise I'm fumbling around a bit blindly here, but would be good to get this running as per starter-kit demo. Tks.

Analect commented 2 years ago

I think this might be relevant to what is going wrong. "When using the Terraform provider for Flux, you have to manually remove the v1beta1 Kustomization from the TF state" with: terraform state rm 'kubectl_manifest.sync["kustomize.toolkit.fluxcd.io/v1beta1/kustomization/flux-system/flux-system"]'

I got:

Error: Invalid target address
│ 
│ No matching objects found. To view the available instances, use "terraform state list". Please modify the address to reference a specific instance.

When I run terraform state list, I get:

module.doks_flux_cd.data.digitalocean_kubernetes_cluster.primary
module.doks_flux_cd.data.flux_install.main
module.doks_flux_cd.data.flux_sync.main
module.doks_flux_cd.data.github_repository.main
module.doks_flux_cd.data.kubectl_file_documents.install
module.doks_flux_cd.data.kubectl_file_documents.sync
module.doks_flux_cd.digitalocean_kubernetes_cluster.primary
module.doks_flux_cd.github_repository_deploy_key.main
module.doks_flux_cd.github_repository_file.install
module.doks_flux_cd.github_repository_file.kustomize
module.doks_flux_cd.github_repository_file.sync
module.doks_flux_cd.kubectl_manifest.install["apiextensions.k8s.io/v1/customresourcedefinition/alerts.notification.toolkit.fluxcd.io"]
module.doks_flux_cd.kubectl_manifest.install["apiextensions.k8s.io/v1/customresourcedefinition/buckets.source.toolkit.fluxcd.io"]
module.doks_flux_cd.kubectl_manifest.install["apiextensions.k8s.io/v1/customresourcedefinition/gitrepositories.source.toolkit.fluxcd.io"]
module.doks_flux_cd.kubectl_manifest.install["apiextensions.k8s.io/v1/customresourcedefinition/helmcharts.source.toolkit.fluxcd.io"]
module.doks_flux_cd.kubectl_manifest.install["apiextensions.k8s.io/v1/customresourcedefinition/helmreleases.helm.toolkit.fluxcd.io"]
module.doks_flux_cd.kubectl_manifest.install["apiextensions.k8s.io/v1/customresourcedefinition/helmrepositories.source.toolkit.fluxcd.io"]
module.doks_flux_cd.kubectl_manifest.install["apiextensions.k8s.io/v1/customresourcedefinition/kustomizations.kustomize.toolkit.fluxcd.io"]
module.doks_flux_cd.kubectl_manifest.install["apiextensions.k8s.io/v1/customresourcedefinition/providers.notification.toolkit.fluxcd.io"]
module.doks_flux_cd.kubectl_manifest.install["apiextensions.k8s.io/v1/customresourcedefinition/receivers.notification.toolkit.fluxcd.io"]
module.doks_flux_cd.kubectl_manifest.install["apps/v1/deployment/flux-system/helm-controller"]
module.doks_flux_cd.kubectl_manifest.install["apps/v1/deployment/flux-system/kustomize-controller"]
module.doks_flux_cd.kubectl_manifest.install["apps/v1/deployment/flux-system/notification-controller"]
module.doks_flux_cd.kubectl_manifest.install["apps/v1/deployment/flux-system/source-controller"]
module.doks_flux_cd.kubectl_manifest.install["networking.k8s.io/v1/networkpolicy/flux-system/allow-egress"]
module.doks_flux_cd.kubectl_manifest.install["networking.k8s.io/v1/networkpolicy/flux-system/allow-scraping"]
module.doks_flux_cd.kubectl_manifest.install["networking.k8s.io/v1/networkpolicy/flux-system/allow-webhooks"]
module.doks_flux_cd.kubectl_manifest.install["rbac.authorization.k8s.io/v1/clusterrole/crd-controller-flux-system"]
module.doks_flux_cd.kubectl_manifest.install["rbac.authorization.k8s.io/v1/clusterrolebinding/cluster-reconciler-flux-system"]
module.doks_flux_cd.kubectl_manifest.install["rbac.authorization.k8s.io/v1/clusterrolebinding/crd-controller-flux-system"]
module.doks_flux_cd.kubectl_manifest.install["v1/namespace/flux-system"]
module.doks_flux_cd.kubectl_manifest.install["v1/service/flux-system/notification-controller"]
module.doks_flux_cd.kubectl_manifest.install["v1/service/flux-system/source-controller"]
module.doks_flux_cd.kubectl_manifest.install["v1/service/flux-system/webhook-receiver"]
module.doks_flux_cd.kubectl_manifest.install["v1/serviceaccount/flux-system/helm-controller"]
module.doks_flux_cd.kubectl_manifest.install["v1/serviceaccount/flux-system/kustomize-controller"]
module.doks_flux_cd.kubectl_manifest.install["v1/serviceaccount/flux-system/notification-controller"]
module.doks_flux_cd.kubectl_manifest.install["v1/serviceaccount/flux-system/source-controller"]
module.doks_flux_cd.kubectl_manifest.sync["kustomize.toolkit.fluxcd.io/v1beta2/kustomization/flux-system/flux-system"]
module.doks_flux_cd.kubectl_manifest.sync["source.toolkit.fluxcd.io/v1beta1/gitrepository/flux-system/flux-system"]
module.doks_flux_cd.kubernetes_namespace.flux_system
module.doks_flux_cd.kubernetes_secret.main
module.doks_flux_cd.tls_private_key.main

I can see module.doks_flux_cd.kubectl_manifest.sync["kustomize.toolkit.fluxcd.io/v1beta2/kustomization/flux-system/flux-system"] in there. But this still doesn't explain why I'm getting this no matches for kind "Kustomization" in version "kustomize.toolkit.fluxcd.io/v1beta2"
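A shorter way to find the exact addresses is to filter the state list down to the sync manifests. This is a minimal sketch; list_sync_manifests is a made-up helper name and it assumes it is run from the Terraform working directory that holds the state.

```shell
# Show only the kubectl_manifest.sync entries from the Terraform state,
# including their full module-prefixed addresses.
list_sync_manifests() {
  terraform state list | grep -F 'kubectl_manifest.sync['
}
```

Two things stand out when the filtered output is compared with the terraform state rm command quoted above: the addresses in this state carry the module.doks_flux_cd. prefix, and the Kustomization entry is already recorded at v1beta2, so there is no v1beta1 entry for that command to match.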

v-ctiutiu commented 2 years ago

@Analect

First of all - Great job!

These are my latest notes and findings, after doing some more debugging and re-reading your replies. Before moving on with other explanations, let me emphasize two important things:

  1. Terraform is using its private state machine to keep track of changes, and stores state in a file on your local machine (or remotely, via S3).
  2. Kubernetes has its private state machine, and stores current system state in the etcd database.

So far so good, but not quite. Sometimes I hate state machines, especially when there is more than one and they need to be kept in sync. The problem is that if you use some other tool to alter one of the two state machines, the other one is not aware of the changes. In your case, Terraform is not aware that you bootstrapped Flux CD again via the CLI (flux bootstrap github --owner=<gh_owner> --repository=<flux_repo>). If you run flux bootstrap, existing Flux API definitions may be updated, or new ones added, in your Kubernetes cluster.

Before moving further, what I did was to re-create the initial scenario that you ran into:

  1. I started fresh, and bootstrapped DOKS and Flux CD. I also overrode the GitHub public key, because the main TF module is broken in this regard.
  2. I ran flux get all, and got everything except the kustomization resource:

    flux get all
    
    # The actual result:
    NAME                            READY   MESSAGE                        REVISION       SUSPENDED 
    gitrepository/flux-system       True    Fetched revision: main/1b43... main/1b43...   False

I reproduced your issue - great!

Now, what I did was to list the supported API versions for Flux CD:

kubectl api-versions | grep flux

And I got:

helm.toolkit.fluxcd.io/v2beta1
kustomize.toolkit.fluxcd.io/v1beta1
notification.toolkit.fluxcd.io/v1beta1
source.toolkit.fluxcd.io/v1beta1

Looking at the above, you can see that kustomize.toolkit.fluxcd.io is present at version v1beta1. If I list the kustomization objects via kubectl directly, it's there:

kubectl get kustomizations -A

# The actual result:
NAMESPACE     NAME          READY   STATUS                                                            AGE
flux-system   flux-system   True    Applied revision: main/1b43faf02da567e415aae57a7ecda865fd5b8063   4m46s

But then the question remains: why doesn't flux get all see it? The next steps should give you some hints at least.

What I did next was to run flux bootstrap on an existing Flux CD installation. So, what's different here? I have the latest flux CLI installed on my local machine (or at least a newer version than the one used when the Starter Kit automation chapter was written). On the Flux CD side, I have the old version deployed in the cluster, via the Starter Kit Terraform module (provider is at version 0.2.x).

Before I move on, let me quote the command and the output that you pasted in a previous reply:

Also, I tried running flux bootstrap github --owner=<my-github-org> --repository=<my-repo> thinking that might rectify any misconfiguration. The response suggests all is OK, but those errors on flux get all persist.

Please enter your GitHub personal access token (PAT):
► connecting to github.com
► cloning branch "main" from Git repository "https://github.com/xxx/xxx.git"
✔ cloned repository
► generating component manifests
✔ generated component manifests
✔ committed sync manifests to "main" ("xxxed2b84")
► pushing component manifests to "https://github.com/xxx/xxx.git"
✔ installed components
✔ reconciled components
► determining if source secret "flux-system/flux-system" exists
✔ source secret up to date
✗ sync path configuration ("") would overwrite path ("./clusters/dev") of existing Kustomization

What happens after you run the above is that the flux client (the CLI counterpart of Flux CD) creates new API definitions for the Flux components in your cluster, alongside the existing ones. In your case, that is kustomize.toolkit.fluxcd.io. Why? Because you have a newer flux client version installed on your machine, and it wants you to have the latest version of the Flux CD components deployed in your cluster, meaning v1beta2. And this makes sense after all. But, and this is very important, it will not update your Git repository manifests from the sync path. You can see that in the last line of the above output: ✗ sync path configuration ("") would overwrite path ("./clusters/dev") of existing Kustomization.

On the other hand, Terraform is not aware of this change, and it thinks that the Kustomization is still at v1beta1. I think the two can be synchronized by changing the logic in the main TF module or in the Flux CD provider. Terraform can also import state via a separate command. But that's out of scope for the current discussion.

So, after running flux bootstrap, I was hit by the same issue as yours:

flux get all

# The actual result:

NAME                            READY   MESSAGE                        REVISION        SUSPENDED 
gitrepository/flux-system       True    Fetched revision: main/685e... main/685ea...   False    

NAME                            READY   MESSAGE                                                                                                                                                                               REVISION                                        SUSPENDED 
kustomization/flux-system       False   apply failed: The CustomResourceDefinition "kustomizations.kustomize.toolkit.fluxcd.io" is invalid: status.storedVersions[1]: Invalid value: "v1beta2": must appear in spec.versions  main/1b43faf02da567e415aae57a7ecda865fd5b8063   False 

Nothing new so far, but if I run kubectl api-versions | grep flux, something is revealed:

helm.toolkit.fluxcd.io/v2beta1
kustomize.toolkit.fluxcd.io/v1beta1
kustomize.toolkit.fluxcd.io/v1beta2
notification.toolkit.fluxcd.io/v1beta1
source.toolkit.fluxcd.io/v1beta1

Looking at the above output, you can see that I now have two versions for kustomize.toolkit.fluxcd.io. So what I think is that the flux CLI expects a v1beta2 resource of the kustomize.toolkit.fluxcd.io CRD type to exist, but it doesn't. Why? Because when you bootstrapped Flux again via the CLI, the new API version was defined in the cluster, but no object of that version was created. And Flux expects a resource at the new v1beta2 version to be available. Going further, you would also need to manually update the YAML manifests in the sync path of your Git repo to use the latest API version and specs. Because you were already using the updated TF module from my Git repository, TF handled this part for you automatically.
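As a rough sketch (not part of the Starter Kit), the same check can be scripted: list every API group that the cluster serves under more than one version. The helper name `multi_version_groups` is hypothetical, and it only post-processes the text that `kubectl api-versions` prints.

```shell
# Hypothetical helper: reads `kubectl api-versions` output on stdin and prints
# every API group that is served under more than one version.
multi_version_groups() {
  awk -F/ '
    NF == 2 {                       # skip core entries such as "v1" (no group part)
      count[$1]++
      versions[$1] = versions[$1] " " $2
    }
    END {
      for (g in count)
        if (count[g] > 1) print g ":" versions[g]
    }
  ' | sort
}

# Against a live cluster (assumes kubectl points at the affected cluster):
#   kubectl api-versions | grep flux | multi_version_groups
# The versions recorded on the CRD itself can be inspected with:
#   kubectl get crd kustomizations.kustomize.toolkit.fluxcd.io -o jsonpath='{.status.storedVersions}'
```

Fed the `kubectl api-versions | grep flux` output above, the helper should report only kustomize.toolkit.fluxcd.io, with v1beta1 and v1beta2.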

Before moving further, I'm curious what the output of the command below is in your environment:

kubectl get kustomizations -A

To stay consistent with the Starter Kit, you need to downgrade the flux client (the CLI counterpart). As you already pointed out in your last reply, where you mentioned the Upgrade Flux to the v1beta2 API discussion from the official Flux CD repo, this is an upgrade scenario issue. Currently, the Starter Kit doesn't deal with upgrade scenarios because we wanted to keep things simple (it's a "starter" after all).

Getting back to the main issue, the only viable solution that I see now is to downgrade the Flux client. I still don't know why it doesn't create the new Kustomization resource, after the manifests in your Git repository were updated to the newest version as well.

On our end, I should add a note about this in the prerequisites section for the affected chapter (meaning to use an older flux client version). On the other hand, we plan to upgrade all the Starter Kit components very soon, so an upgrade section for each chapter is necessary after all.

To fix your current installation this time (hopefully), please follow the steps below:

  1. Uninstall Flux CD via:

    flux uninstall
  2. Revert the module source in main.tf file to point to the original:

    module "doks_flux_cd" {
      source = "github.com/digitalocean/container-blueprints/create-doks-with-terraform-flux"
      ...
    }
  3. Run TF init (when asked, run the upgrade as well). Then, plan and apply.

  4. Uninstall the current version of flux CLI (or just make a backup of it, although you can download it anytime).

  5. Install an older version of the flux CLI that is compatible with the Starter Kit (e.g. 0.17.0, or any release dating from July or August):

    curl -s https://fluxcd.io/install.sh | sudo FLUX_VERSION=0.17.0 bash

After I ran the above steps it started to work immediately. Let me know if it does the same for you.
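To confirm that the downgrade from step 5 took effect, a small check like the one below can help. It is only a sketch: `flux_cli_matches` is a hypothetical helper, and it assumes `flux --version` prints a line ending in the version number (for example "flux version 0.17.0").

```shell
# Hypothetical helper: succeeds only if the given `flux --version` output
# ends with the expected CLI version.
flux_cli_matches() {
  expected="$1"
  reported="$2"                         # assumed shape: "flux version 0.17.0"
  [ "${reported##* }" = "$expected" ]   # compare the last space-separated word
}

# Usage against the installed CLI:
#   flux_cli_matches 0.17.0 "$(flux --version)" || echo "unexpected flux CLI version" >&2
```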

Although I don't have a final answer to your last question, I hope that I was able to give you some hints about why it behaves the way it does now.

Thanks a lot for your patience and time.

Analect commented 2 years ago

@v-ctiutiu . I greatly appreciate your efforts to explain what might be going on. I can see the power of the TF/flux combination, in terms of managing complexities in a Kubernetes cluster, but using these tools can also introduce a whole new set of complications!

Before running your steps above, I ran these commands, as per your explanation. It seems the kustomizations resources were absent from the cluster, and maybe that was down to me overriding things with my flux CLI bootstrap:

$ kubectl api-versions | grep flux
helm.toolkit.fluxcd.io/v2beta1
notification.toolkit.fluxcd.io/v1beta1
source.toolkit.fluxcd.io/v1beta1
$ kubectl get kustomizations -A
error: the server doesn't have a resource type "kustomizations"

On running those steps above and downgrading the Flux CLI to version 0.17, I now get this on calling flux get all.

NAME                            READY   MESSAGE                                                         REVISION                                        SUSPENDED 
gitrepository/flux-system       True    Fetched revision: main/3fe239b88ce7725d7867215884940adf77dde94a main/3fe239b88ce7725d7867215884940adf77dde94a   False    

NAME                            READY   MESSAGE                                                         REVISION                                        SUSPENDED 
kustomization/flux-system       True    Applied revision: main/3fe239b88ce7725d7867215884940adf77dde94a main/3fe239b88ce7725d7867215884940adf77dde94a   False 

And now I see:

$ kubectl api-versions | grep flux
helm.toolkit.fluxcd.io/v2beta1
kustomize.toolkit.fluxcd.io/v1beta1
kustomize.toolkit.fluxcd.io/v1beta2
notification.toolkit.fluxcd.io/v1beta1
source.toolkit.fluxcd.io/v1beta1

$ kubectl get kustomizations -A
NAMESPACE     NAME          READY   STATUS                                                            AGE
flux-system   flux-system   True    Applied revision: main/3fe239b88ce7725d7867215884940adf77dde94a   4m5s

I suppose what intrigues me a bit is that this latest uninstall/init/plan & apply does result in an update to the terraform.state file in my DigitalOcean Spaces, yet it doesn't make any new commits or alterations to the GitHub repo that captures my flux state. It's not clear to me how different versions of flux manifest themselves in this GitHub repo.

If one gets into a tangle like this in future, is it ever a solution to delete either the flux state in the github repo or the TF state in the digitalocean spaces, as a means of resetting?

v-ctiutiu commented 2 years ago

@Analect To be honest, I don't have a real answer for it.

What happened here in the end is more or less a migration issue, I assume. There are two questions I don't have an answer for yet (we have not verified these scenarios, as they were a little out of scope for the Starter Kit):

  1. What happens if I upgrade Flux CD from an older version to a newer version (not a major release though)?
  2. What happens if I want to revert to the old version again?

What I found on the Flux CD documentation site about migration, is this: https://fluxcd.io/docs/migration.

Maybe @stefanprodan, who is one of the main contributors to Flux CD, can give us some hints about what happened, or how to prevent this from happening in the future?

To summarize:

  1. We provisioned Flux CD via a custom Terraform module, using an older provider version (0.2.x). In the Starter Kit (meaning this repo), we lock down versions (no major releases) for each component that we use (be it Helm releases, Terraform providers, etc.). This was an internal decision, to have consistent and predictable results.
  2. Then, the GitHub public key changed, so I created a fix, which was merged. But a mistake slipped in: I added a redundant "github.com" string, which rendered the fluxcd secret that holds the SSH known hosts unusable. As a result, the Flux CD gitrepository source component refused to work.
  3. I fixed the above via PR #19 from our container-blueprints repo, so that Flux CD gitrepository source component was functional again.
  4. Meanwhile, without knowing the real issue (meaning the above point), we bootstrapped Flux CD again via the CLI. But the flux client (the CLI counterpart) was at a newer version (the latest one). As a consequence, it pushed new API versions for the Flux CD Kustomization component to our Kubernetes cluster.
  5. Continuing the saga, we ran into other issues, deviating from the initial one. This time, the Kustomization component refused to appear in the list when invoking: flux get all.
  6. In the end, I managed to reproduce the scenario (or at least part of it) in a previous reply from this thread. After reverting Flux CD deployment and the CLI counterpart to an older version that the Starter Kit was compatible with, things started to work again, meaning the Kustomization component (@Analect, correct me if I'm wrong here).

My final conclusion, and the way to avoid all of the above, is that the Starter Kit should clearly state that you need a flux CLI version compatible with the TF provider used for deploying Flux CD (meaning 0.2.x). My bad here, for not thinking of it at the time of writing.

As a side note, we constrain all TF provider versions in a stricter (or pessimistic) way, like this (patch version upgrades only):

flux = {
    source  = "fluxcd/flux"
    version = "~> 0.2.0"
}
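To make the constraint concrete: in Terraform, `~> 0.2.0` lets only the rightmost version component increase, so it accepts 0.2.0, 0.2.1 and later patch releases, but not 0.3.0. The helper below is a hypothetical illustration of that rule in shell, not how Terraform implements it, and it handles plain MAJOR.MINOR.PATCH strings only (no pre-release suffixes).

```shell
# Hypothetical helper: succeeds if CANDIDATE is allowed by "~> BASE",
# where both are plain MAJOR.MINOR.PATCH version strings.
satisfies_patch_only() {
  base="$1"
  candidate="$2"
  base_prefix="${base%.*}"        # MAJOR.MINOR of the constraint
  base_patch="${base##*.}"        # PATCH of the constraint
  cand_prefix="${candidate%.*}"
  cand_patch="${candidate##*.}"
  # Same MAJOR.MINOR, and a patch level at or above the constraint's.
  [ "$base_prefix" = "$cand_prefix" ] && [ "$cand_patch" -ge "$base_patch" ]
}

# Usage:
#   satisfies_patch_only 0.2.0 0.2.7 && echo "allowed"
#   satisfies_patch_only 0.2.0 0.3.0 || echo "rejected"
```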

@stefanprodan - If someone runs into this situation in the future, is it possible to "reset" states in a clean way as @Analect already mentioned in the previous post, for both the Terraform state file (we already read this discussion) and Flux CD deployment?

Thanks a lot.