fluxcd / terraform-provider-flux

Terraform and OpenTofu provider for bootstrapping Flux
https://registry.terraform.io/providers/fluxcd/flux/latest
Apache License 2.0

[Bug]: patching `GitRepository` to use `.spec.ref.semver` via `kustomization_override` hangs #695

Closed Jonesus closed 4 months ago

Jonesus commented 6 months ago

Describe the bug

When using terraform-provider-flux with flux_bootstrap_git, if one changes the generated GitRepository source in gotk-sync.yaml to use a .spec.ref other than branch (for example, semver) via kustomization_override, terraform apply hangs until it times out and exits with:

with flux_bootstrap_git.this,
│   on main.tf line 12, in resource "flux_bootstrap_git" "this":
│   12: resource "flux_bootstrap_git" "this" {
│ 
│ bootstrap failed with 2 health check failure(s): [error while waiting for GitRepository to be ready: 'context canceled', error while waiting for Kustomization to be ready:
│ 'client rate limiter Wait returned an error: context canceled']

Despite this, the changes have been properly applied and flux get all -A --status-selector ready=false shows nothing, and no logs indicate any problems.

If the kustomization_override patch doesn't change the GitRepository, everything works and the command exits successfully; kustomization_override can still be used to change other configs, such as image-reflector-controller.

Steps to reproduce

  1. Set up terraform with flux_bootstrap_git
  2. Create a kustomization_override that changes the GitRepository to have a .spec.ref with something other than branch
  3. Run terraform apply
  4. Observe the command running until it times out, although during the run the changes get applied to the Flux installation

Expected behavior

Running terraform apply should exit with a success instead of timing out

Screenshots and recordings

No response

Terraform and provider versions

Terraform v1.8.3
on linux_amd64
+ provider registry.terraform.io/azure/azapi v1.13.1
+ provider registry.terraform.io/fluxcd/flux v1.3.0
+ provider registry.terraform.io/hashicorp/azuread v2.50.0
+ provider registry.terraform.io/hashicorp/azurerm v3.104.2
+ provider registry.terraform.io/hashicorp/helm v2.13.2
+ provider registry.terraform.io/hashicorp/kubernetes v2.30.0
+ provider registry.terraform.io/hashicorp/random v3.6.2
+ provider registry.terraform.io/hashicorp/time v0.9.1
+ provider registry.terraform.io/integrations/github v6.2.1

Terraform provider configurations

Relevant Terraform config:

provider "flux" {
  kubernetes = {
    host                   = var.kubernetes_host
    client_certificate     = var.kubernetes_client_certificate
    client_key             = var.kubernetes_client_key
    cluster_ca_certificate = var.kubernetes_cluster_ca_certificate
  }
  git = {
    url = "https://github.com/${var.github_org}/${var.github_repository}.git"
    http = {
      username = "git"
      password = var.github_token
    }
    author_name  = "fluxcdbot"
    author_email = "fluxcdbot@users.noreply.github.com"
  }
}

flux_bootstrap_git resource

resource "flux_bootstrap_git" "this" {
  components_extra = [
    "image-reflector-controller",
    "image-automation-controller"
  ]
  path = "deploy"
  kustomization_override = file("github-sync-patch.yaml")
}

Contents of github-sync-patch.yaml:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      - op: replace
        path: /spec/ref
        value:
          semver: "*"
    target:
      kind: GitRepository
      name: flux-system
      namespace: flux-system

Flux version

v2.3.0

Additional context

The behavior can be observed in this repository containing reproducing configuration: https://github.com/Jonesus/flux-terraform-repro

The folder github-with-customizations was based on https://github.com/fluxcd/terraform-provider-flux/tree/main/examples/github-with-customizations, with the only change being the addition of patches[0] in resources/flux-kustomization-patch.yaml

Would you like to implement a fix?

None

stefanprodan commented 6 months ago

You can't bootstrap with semver, tags are immutable.

Jonesus commented 6 months ago

I'm not sure I understand. I'm supposed to be able to configure Flux to reconcile based on git tags ranked with semver (https://fluxcd.io/flux/components/source/gitrepositories/#semver-example), and the file that configures reconciliation based on branch, semver, etc. is generated by flux_bootstrap_git. Does our desired configuration force us to not use the Terraform provider at all?

I also don't understand what tag immutability has to do with the GitRepository configuration, as it should just read all the tags from the repo and choose the newest one based on semantic versioning.

stefanprodan commented 6 months ago

Bootstrap can only be run for a branch and can only be run once per cluster. You run bootstrap for a dedicated repo, then in that repo you can add GitRepository/Kustomization objects for the other repos where your apps are; there you can use semver.
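
A minimal sketch of what such an app source could look like when committed to the bootstrap repository. The resource name my-app, the example-org URL, the ./deploy path, the intervals, and the semver range are all hypothetical placeholders, not taken from this thread; a private repository would additionally need a spec.secretRef.

```yaml
# Hypothetical example: names, URL, path, intervals, and semver range are placeholders.
# The bootstrap GitRepository (flux-system) keeps tracking a branch;
# only this separate app source follows semver-ranked tags.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 5m
  url: https://github.com/example-org/my-app.git
  ref:
    semver: ">=1.0.0"   # pick the highest tag matching this range
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: my-app        # refers to the semver-tracked source above
  path: ./deploy
  prune: true
```

With this layout, the kustomization_override patch on the flux-system GitRepository would no longer be needed.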

Jonesus commented 6 months ago

All right, seems that my approach of having a monorepo both for the bootstrap target and the apps to be deployed wasn't really an intended use case, thanks for the explanation :+1:

The failure mode was still somewhat obscure and unexpected. I wonder if the intended way could be clarified in the docs, or if the provider could notice and alert the user when it detects such usage? This might also be off-topic now, so if you deem the problem irrelevant the issue can be closed.

stefanprodan commented 6 months ago

It should error out after the default timeout expires. Did you wait 5 minutes?

Jonesus commented 6 months ago

Yeah I did, but the error message that produces –

with flux_bootstrap_git.this,
│   on main.tf line 12, in resource "flux_bootstrap_git" "this":
│   12: resource "flux_bootstrap_git" "this" {
│ 
│ bootstrap failed with 2 health check failure(s): [error while waiting for GitRepository to be ready: 'context canceled', error while waiting for Kustomization to be ready:
│ 'client rate limiter Wait returned an error: context canceled']

– Doesn't really point towards the configuration not being intended, especially when the changes actually do get reflected in the target repository even though the Terraform command hangs and times out.

stefanprodan commented 6 months ago

The provider doesn't allow setting anything else but a Git branch. To catch this misconfiguration, we would need to parse the Kustomize patch YAMLs...

Jonesus commented 6 months ago

Yeah, detecting what the user has configured on the fly probably isn't worth it, thus it's probably better to ensure that users don't try to configure the bootstrapping as I did.

On the error message side, I wonder why the Terraform provider times out even though the changes do seem to get applied properly?

On the documentation side, a scenario of a user's thought process when setting up a configuration (based on personal experiences):

  1. A user wants to set up Flux, they start by reading through Core Concepts, Getting Started and Installation.
  2. The user settles on using Terraform to bootstrap Flux, pointing the bootstrap to their application monorepo as that's where they store their kube manifests for their app
  3. Everything works well, the user sees that the bootstrap created configurations that look at the app monorepo's main branch for updates and updates happen on push
  4. The user wants to make Flux not use the main branch of the app repo, instead to target tags for better control
  5. The generated GitRepository file has a warning that it shouldn't be edited, so the user digs through the docs to find that the generated file can be augmented with kustomization_override (https://fluxcd.io/flux/installation/configuration/boostrap-customization/)
  6. They write a kustomization patch to edit the file such that the GitRepository points to a tag instead of a branch
  7. They apply the changes with Terraform, which makes the command time out claiming that the GitRepository and Kustomization faced errors when waiting to be ready
  8. The user checks flux get all -A and sees that Flux reports the resources ready anyway
  9. The user checks that the Terraform changes have been applied to the repo despite the timeout and errors
  10. The user is left quite confused as to what went wrong

Further, reading through the Ways of structuring your repositories page, I find it difficult to infer that, even when it talks about a monorepo approach, the monorepo in this context only means a monorepo of infrastructure, and that the Flux bootstrap is supposed to be pointed to the infra repo instead of an application monorepo.

swade1987 commented 4 months ago

@Jonesus previously I have had multiple repos for different parts of my overall GitOps setup (for better or for worse).

See the example here. I know @stefanprodan has reservations about this approach due to this file existing there.

However, it worked for me very well and may be useful for you in the future.

swade1987 commented 4 months ago

@Jonesus you happy for me to close this issue?

Jonesus commented 4 months ago

Yes, the original issue wasn't really a bug so this issue can be closed.

I still think that the error message or documentation for this use case (even if an unintended one) could be more clear, though.

swade1987 commented 4 months ago

Feel free to submit a PR for the doc changes; I would be happy to review it and get it merged for the next release.