fluxcd / helm-controller

The GitOps Toolkit Helm reconciler, for declarative Helming
https://fluxcd.io
Apache License 2.0
404 stars 161 forks source link

HelmRelease deletion not waiting for `helm uninstall` #1021

Open cwrau opened 2 months ago

cwrau commented 2 months ago

Describe the bug

When you delete a HelmRelease, even with uninstall.deletionPropagation=foreground, flux, via helm or otherwise, marks all created resources to be deleted and then removes the finalizer, resulting in the HelmRelease disappearing even though the uninstallation hasn't finished.

This, at least sometimes, leaves the "real" helm release to be stuck in uninstalling which has to be manually resolved.

Steps to reproduce

Create a helm chart like https://github.com/teutonet/teutonet-helm-charts/tree/main/charts/t8s-cluster, which creates CRs that are not just "no-ops" for deletion, like cluster-api resources.

Install it, wait for everything to be ready, uninstall.

Expected behavior

That flux leaves the finalizer on the HelmRelease until the "real" helm release is really uninstalled.

I suppose what I mean in a non-flux way would be for flux to wait until helm uninstall finishes successfully.

Screenshots and recordings

No response

OS / Distro

N/A

Flux version

N/A

Flux check

► checking prerequisites ✗ Kubernetes version v1.27.14 does not match >=1.28.0-0 ► checking version in cluster ✔ distribution: flux-2.3.0 ✔ bootstrapped: true ► checking controllers ✔ helm-controller: deployment ready ► ghcr.io/fluxcd/helm-controller:v1.0.1 ✔ image-automation-controller: deployment ready ► ghcr.io/fluxcd/image-automation-controller:v0.38.0 ✔ image-reflector-controller: deployment ready ► ghcr.io/fluxcd/image-reflector-controller:v0.32.0 ✔ kustomize-controller: deployment ready ► ghcr.io/fluxcd/kustomize-controller:v1.3.0 ✔ notification-controller: deployment ready ► ghcr.io/fluxcd/notification-controller:v1.3.0 ✔ source-controller: deployment ready ► ghcr.io/fluxcd/source-controller:v1.3.0 ► checking crds ✔ alerts.notification.toolkit.fluxcd.io/v1beta3 ✔ buckets.source.toolkit.fluxcd.io/v1beta2 ✔ gitrepositories.source.toolkit.fluxcd.io/v1 ✔ helmcharts.source.toolkit.fluxcd.io/v1 ✔ helmreleases.helm.toolkit.fluxcd.io/v2 ✔ helmrepositories.source.toolkit.fluxcd.io/v1 ✔ imagepolicies.image.toolkit.fluxcd.io/v1beta2 ✔ imagerepositories.image.toolkit.fluxcd.io/v1beta2 ✔ imageupdateautomations.image.toolkit.fluxcd.io/v1beta2 ✔ kustomizations.kustomize.toolkit.fluxcd.io/v1 ✔ ocirepositories.source.toolkit.fluxcd.io/v1beta2 ✔ providers.notification.toolkit.fluxcd.io/v1beta3 ✔ receivers.notification.toolkit.fluxcd.io/v1 ✗ check failed

Git provider

No response

Container Registry provider

No response

Additional context

No response

Code of Conduct

// "moved" from https://github.com/fluxcd/flux2/issues/4872

cwrau commented 2 months ago

I've tried debugging this, and this line is the root cause.

Because the digest seems to change during the uninstall, which in fact returns an error, that error is not returned, helm-controller thinks it finished without an error and removes the finalizer, therefore removing the HelmRelease.

Is there a reason for the digest check? I would just open a PR removing that, if that's ok?

stefanprodan commented 1 month ago

Because the digest seems to change during the uninstall

Why would it change? Are you using GitRepo as the source of the HelmRelease?

cwrau commented 1 month ago

Because the digest seems to change during the uninstall

Why would it change?

I don't know, I debugged helm-controller and the .Digest of req.Object.Status.History.Latest() between https://github.com/fluxcd/helm-controller/blob/7f78cdc36836f0ff3b652e518550932c9e890560/internal/reconcile/uninstall.go#L83 and https://github.com/fluxcd/helm-controller/blob/7f78cdc36836f0ff3b652e518550932c9e890560/internal/reconcile/uninstall.go#L141 is different.

I tried looking through the code how the req.Object could get changed, but I couldn't figure it out😅

Are you using GitRepo as the source of the HelmRelease?

Nope, using HelmRepository, no update occured during these 60 lines 😅

cwrau commented 1 month ago

@stefanprodan, ah, I think I've found it;

During helm uninstall, after changing info.Status, helm updates the release

After the update, flux notifies itself on the HelmRelease object about the history "change"

During the observation, ObservedToSnapshot calculates a new Digest with the updated info.Status, which of course changes the digest

I hope I didn't miss anything 😅

cwrau commented 1 month ago

Is there any news on this? Maybe my PR, #1024, is even acceptable?

Would be great to have this fixed 😊

cwrau commented 1 month ago

Ola, is there something I can do to help get this resolved? Do you need more information?

stefanprodan commented 2 days ago

@cwrau we've talked about removing the check in a dev meeting and the conclusion was that it's safe to do so. For the PR to be merged you'll need to adapt the unit tests.