fluxcd / terraform-provider-flux

Terraform and OpenTofu provider for bootstrapping Flux
https://registry.terraform.io/providers/fluxcd/flux/latest
Apache License 2.0

flux_bootstrap_git destroy should actually delete resources created by flux #569

Closed cdenneen closed 7 months ago

cdenneen commented 1 year ago

Currently, destroy performs roughly the same function as a flux uninstall. While retaining workloads makes sense for a flux uninstall, a terraform destroy is not meant to retain workloads, so the current behavior leaves dangling resources behind.

For example, say you bootstrap a cluster and Flux then provisions things like cert-manager, kube-prometheus-stack, and prometheus-adapter.

After flux_bootstrap_git is destroyed, the resources left dangling by HelmReleases and the like (no longer managed by anything) cause problems for the rest of the destruction, e.g. kubernetes_namespace deletion hangs:

  NamespaceContentRemaining                    True    Thu, 26 Oct 2023 21:09:47 +0000  SomeResourcesRemain   Some resources are remaining: challenges.acme.cert-manager.io has 1 resource instances
  NamespaceFinalizersRemaining                 True    Thu, 26 Oct 2023 21:09:47 +0000  SomeFinalizersRemain  Some content in the namespace has finalizers remaining: finalizer.acme.cert-manager.io in 1 resource instances

and

status:
  conditions:
  - lastTransitionTime: "2023-10-26T21:10:16Z"
    message: 'Discovery failed for some groups, 2 failing: unable to retrieve the
      complete list of server APIs: custom.metrics.k8s.io/v1beta1: stale GroupVersion
      discovery: custom.metrics.k8s.io/v1beta1, metrics.k8s.io/v1beta1: stale GroupVersion
      discovery: metrics.k8s.io/v1beta1'
    reason: DiscoveryFailed
    status: "True"
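A minimal sketch of the kind of Terraform setup that hits this (resource names and paths are illustrative, not from the issue):

```hcl
# Namespace managed by Terraform; Flux-provisioned workloads (e.g.
# cert-manager) end up running inside it.
resource "kubernetes_namespace" "cert_manager" {
  metadata {
    name = "cert-manager"
  }
}

resource "flux_bootstrap_git" "this" {
  path = "clusters/my-cluster"
}

# On `terraform destroy`, flux_bootstrap_git removes only Flux's own
# components (mirroring `flux uninstall`). Workloads installed via
# HelmRelease keep running, so the kubernetes_namespace deletion hangs
# on leftover content and finalizers such as finalizer.acme.cert-manager.io,
# as in the conditions shown above.
```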

Therefore the destroy behavior of flux_bootstrap_git shouldn't follow the uninstall logic of removing its finalizers from its resources, but should actually remove all the resources Flux was managing (HelmRepository, HelmRelease, GitRepository, Kustomization, etc.). I understand this is a change, but I can't see how the current destroy behavior is useful in a production capacity when tearing down a cluster whose workloads were managed by Flux and are now more or less "abandoned".

If changing the behavior away from the flux uninstall logic seems too severe, then maybe there should be a switch/flag that makes this resource removal opt-in rather than the new default. I suppose there are scenarios where you are destroying just flux_bootstrap_git, not a larger cluster in the same setup, and so "might" want to retain those workloads. In most cases, however, you bootstrap a cluster at build time, so when you're now destroying the cluster along with flux_bootstrap_git there is no logical reason to "retain workloads" the way you would with a flux uninstall (where perhaps you simply decided to stop using Flux).
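The opt-in flag proposed above might look something like this (the attribute name is hypothetical; no such option exists in the provider today):

```hcl
resource "flux_bootstrap_git" "this" {
  path = "clusters/my-cluster"

  # Hypothetical attribute: if true, destroy would delete the
  # Flux-managed custom resources (Kustomization, HelmRelease, ...)
  # and wait for their garbage collection before removing the Flux
  # components, instead of mirroring `flux uninstall` and leaving
  # workloads behind.
  delete_workloads_on_destroy = true
}
```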

Discussion here for context: https://cloud-native.slack.com/archives/CLAJ40HV3/p1698355353067829

swade1987 commented 7 months ago

@stefanprodan this feels like an upstream flux bug potentially.

stefanprodan commented 7 months ago

Flux uninstall can't remove workloads from a cluster; this is impossible.

cdenneen commented 7 months ago

@stefanprodan that wasn't the ask. The ask here is that the TF module not follow the uninstall method. The point of "destroy" in TF is to remove things, so I can't see a case where you'd use this module and want to retain deployments. It's normally coupled in a module with other resources (repositories, keys, k8s clusters) that are being destroyed, so you run into issues destroying a cluster that still has resources running on it.

For example, when an EKS cluster is built and sending logs to a CloudWatch log group and you are in the process of destroying everything, there are cases where the log group gets destroyed and then, because "workloads" are still running on the cluster, gets recreated with the wrong permissions or retention. That blocks you if you try to create a new stack with the same name, since the resource already exists.

So the ask here was an opt-in on this module so that, upon destroy, it runs some hook to delete the CR resources created by the CRDs instead of just the CRDs.
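One way to approximate the requested hook today, sketched with a destroy-time provisioner (resource names are illustrative; this assumes kubectl is on PATH with a kubeconfig pointing at the cluster, and it is a workaround, not provider functionality):

```hcl
resource "null_resource" "flux_teardown" {
  # Delete the Flux-managed CRs (and wait for their garbage collection)
  # before Flux itself is uninstalled. `|| true` keeps destroy moving
  # if the cluster is already gone.
  provisioner "local-exec" {
    when    = destroy
    command = "kubectl delete helmreleases,kustomizations --all -A --wait=true || true"
  }

  # Because this resource depends on flux_bootstrap_git, Terraform
  # destroys it first, so the CRs are removed while the Flux
  # controllers are still running to process the deletions.
  depends_on = [flux_bootstrap_git.this]
}
```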

stefanprodan commented 7 months ago

We can't delete CRs as there is no wait mechanism in Flux for that; also, lots of things can't be deleted, for example the Helm SDK leaves PVCs and CRDs behind. Given this, flux uninstall and this provider will never remove workloads, just Flux's own workloads.

cdenneen commented 7 months ago

Sorry, I was just thinking you'd delete the CR, e.g. flux delete hr foo. Wouldn't that delete the workload? Does flux delete ks delete the HR under it? It can all be worked around.