HaveFun83 closed this issue 11 months ago.
This also happens in v0.18.1
I've noticed that `helm list` does not show the release in question in that situation, but the pods are actually running.
Also, I've tried removing the last Helm secret for the release and then reconciling the HelmRelease, and the reconciliation was successful.
@aliusmiles you can use `helm ls -a` to list all Helm releases in any state, not only those with status `deployed`.
@HaveFun83 Thanks, now I've learned something new about Helm :) Guess my previous comment should say: "the helm release in question is not in `deployed` or `failed` state".
Sorry it took a while for me to get to this; these were busy weeks for other Flux parts, and then KubeCon.
When you say it worked up till `0.16.1`, is this the CLI only, or does it include the helm-controller version that matches this release?
Hello @hiddeco, no problem, it is not urgent, and I know how busy KubeCon weeks are :beers: Thanks for having a look into it. The helm-controller is updated constantly; I tested with flux CLI 0.16.1 but the helm-controller version from the fluxcd v0.17.2 release, and currently we are running fluxcd v0.18.3. It looks like either the way the flux CLI triggers HelmRelease reconciles changed from v0.16.1 to v0.16.2 onward, or the way the helm-controller reacts to reconciles from different flux CLI versions did.
Hello!
I can confirm this issue is still relevant for the latest version of the helm-controller. The workaround for now is this:
flux suspend hr <release_name>
flux resume hr <release_name>
This will reconcile broken release states such as "exhausted" and "another rollback/release is in progress". Works for me.
Hopefully this helps other people facing the same issue.
Thanks a lot for this workaround.
Actually, this seems to be even worse currently: I can't even get the suspend/resume workaround to free up the resource (even if the resource is updated in the source).
In my experience, the important setting to control (or the problematic setting if you miss it) is `spec.timeout`.

If you haven't set a value for `spec.timeout`, you might have trouble diagnosing problematic HelmReleases. Historically they would fail to post errors as events, because if the HelmRelease never times out, the error is never formally raised in the Helm package that helm-controller uses as the upstream logic for its Helm-related activities. I'm not sure if that's still the case, but I still recommend setting `spec.timeout` to everyone as soon as they report trouble with helm-controller, because it makes the failure mode and behavior more predictable.

I'm not sure what happened in Flux 0.17 that might be relevant to this issue, but if you set `spec.timeout` to some reasonable value like `2m0s` and wait at least that long, you should start to see errors that will lead you towards a solution. (The errors would generally appear in the `kubectl describe HelmRelease` output, listed as an event.)

If this does not immediately resolve your issue @siegenthalerroger, maybe post the content of your `HelmRelease` and we can have a look at the details? Without more information, I'm afraid we won't be able to tell whether this is the same issue or help much with finding the root cause.
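For reference, a minimal sketch of where that field lives on a HelmRelease (the release name is illustrative):

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: my-release   # illustrative name
spec:
  interval: 5m
  timeout: 2m0s      # Helm actions (install/upgrade/rollback) fail after this long
  # ... chart and values as usual
```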
Hi @kingdonb, to be clear, it was my HelmRelease that was broken; it wasn't an issue with the helm-controller in any way. My issue is that once the HelmRelease is in the "upgrade retries exhausted" state, I have no way of getting it to try again when it changes in the source, other than deleting the HelmRelease. The workaround above, which I personally have used before, didn't work in my case anymore ^^.

Thanks for the tip about `spec.timeout` though, that'll prove useful when I'm debugging a different HelmRelease.
This is another suggestion; although I don't like it as much, it may also have worked for you:

When you need to trigger a new HelmRelease reconciliation after "upgrade retries exhausted" and you aren't in a position to run `helm rollback` or `helm uninstall`, try editing `spec.values` – this is one place where untyped `values` come in handy: you can invent a new value that doesn't mean anything, say `spec.values.nonce`, and just update it.

Helm does not type `values.yaml`, so it has no way of knowing that a change to `nonce` doesn't actually update anything when it is substituted into the templates, and it cannot know, because there is no mechanism in Helm to detect what kinds of changes any post-install or post-upgrade hooks in a given Helm chart might make. (Any hooks might care about the value of `nonce`, as they can be running processes that manipulate the state of the release in post.)

Helm will be forced to run the upgrade again each time you update the nonce value. Hope this helps as well!
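A minimal sketch of the trick described above (`nonce` is an invented key that means nothing to the chart):

```yaml
spec:
  values:
    nonce: "2"   # bump this to any new value to force Helm to run the upgrade again
```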
I could have sworn I tried that, but it seems to work now, so I'm not sure. What I know for sure is that downgrading the chart version did not save it; I had to delete the helm release and have the source controller recreate it.
We've been seeing this behavior as well. Occasionally we'll have an HR fail for some reason or another (e.g. the service took too long to start up and we didn't have retries set accordingly). All I want is to be able to retry the HR once I've resolved the issue. Flux 0.16.1 allowed me to do that with `flux reconcile hr`, but with later versions `flux reconcile hr` appears to do nothing other than tell me it failed. I keep a copy of Flux 0.16.1 around for retrying the HRs, as it's the only way I'm aware of to do it without making superfluous commits to our repo.
I've had success working around this by doing a suspend followed by a resume.
Also happening with Flux CLI version `0.24.1`:

`flux reconcile hr <name>` --> `HelmRelease reconciliation failed: install retries exhausted`

The workaround suggested above works: `flux suspend hr <name>` followed by `flux resume hr <name>`.
@snukone The Flux 0.26.1 release out this week has lots of Helm updates that will make Helm fail less often, according to reports we've received.

I have heard mixed reports about whether suspend/resume will actually retry a failed HelmRelease that has exhausted its retries; it may depend on how it failed. I'd be surprised if `install retries exhausted` were solved that way, in fact, since a failed install leaves a secret behind, and I think the secret records the release as failed? I guess I'm in the minority here if this doesn't work for me.

In any case, I think you'd have to configure the `remediationStrategy` settings for your preferred number of retries and/or remediation method. It sounds like the days are long gone when your best option was running `helm uninstall` and trying again. 👍
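For anyone looking for those knobs, a sketch of the remediation settings on a HelmRelease (fields per the `helm.toolkit.fluxcd.io/v2beta1` API; the counts are example values):

```yaml
spec:
  install:
    remediation:
      retries: 3                  # retry a failed install up to 3 times
  upgrade:
    remediation:
      retries: 3                  # retry a failed upgrade up to 3 times
      remediateLastFailure: true  # roll back after the final failed attempt
```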
Hi @kingdonb, all right, thanks a lot for the explanation. You're right, fixing errors with just uninstall + install was a long time ago ;) In our case we sometimes have services based on Helm charts in dev environments which aren't important and haven't been used for a long time. That's when we ignore installation errors, because in 99% of cases they occur just because of outdated Helm chart versions (e.g. the image isn't available anymore). Just knowing that suspending and resuming does the same as reconcile did in the past is OK for me. On important environments, where the services have to run all the time, we get informed immediately by alerting, and an "install retries exhausted" error could hardly happen.
On a first-time install I get `install retries exhausted`; after the first time I get `upgrade retries exhausted`.

I have tried the above solutions and none have worked for me. `flux suspend` and `resume` did not work.

I had added:

upgrade:
  remediation:
    remediateLastFailure: true

but that did not work either.

Thanks to @onedr0p I have fixed it with `helm uninstall <app>` and, after that, making a commit or reconciling.
Should work then :)
I've run into situations where `helm uninstall ...` or `flux delete hr ...` was the only way to resolve this issue as well; suspend/resume had no effect. Next time it happens I'll try to gather more information. It seems like Flux gets stuck trying to install or upgrade, and only a fresh install of the Helm release fixes it.
@kingdonb where do I set `spec.timeout`? For me at least, every time I run into `install retries exhausted`, one workaround is to manually edit the HelmRelease with `kubectl edit hr <helmreleasename>` and add `spec.timeout` by hand. Where can I set it such that it always sets a timeout?
I am creating my helm releases like this:
flux create hr myhr \
--target-namespace=mynamespace \
--chart=mychart \
--source=gitrepository/mysource \
--timeout 15m
but `--timeout 15m` is not being replicated to the HelmRelease spec; rather, it seems that timeout applies to the flux command itself. Is the only way to set `spec.timeout` to use a YAML file instead of the flux CLI for creating a HelmRelease?
> Is the only way to set spec.timeout to use a yaml file instead of flux cli for creating a helm release?

I'm fairly sure that's correct. `--timeout` is a global option on the flux CLI; it does not pass through to the HelmRelease.
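For reference, a sketch of the YAML route, translating the `flux create hr` flags from above into a manifest (same resource names as in that command; `timeout` here is the `spec.timeout` the CLI flag does not set):

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: myhr
spec:
  interval: 5m                 # reconciliation interval
  timeout: 15m                 # spec.timeout for Helm actions
  targetNamespace: mynamespace
  chart:
    spec:
      chart: mychart
      sourceRef:
        kind: GitRepository
        name: mysource
```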
In my case, I had the wrong RBAC permissions used by the HelmRelease and it got stuck in `install retries exhausted`. After fixing the RBAC, it didn't try to install again; `suspend` and `resume` gave it a kick and retried the install, which succeeded. Maybe a new command is needed instead of this workaround?
Hey, got exactly the same issue while upgrading `kube-prometheus-stack`. The status was `HelmRelease reconciliation failed: upgrade retries exhausted` and there was no way to make progress other than executing the suspend/resume workaround.
flux: v0.28.2
helm-controller: v0.18.1
kustomize-controller: v0.22.1
notification-controller: v0.23.1
source-controller: v0.22.2
(deleted) The underlying issue here is the error message. Users that experience this error have misconfigured a Helm chart.
Hey folks,
Just wanted to echo the same here as I did in https://github.com/fluxcd/helm-controller/issues/149#issuecomment-1111860111. That message is from a little more than a week ago, and after #477 I am now at the point of starting to rewrite the release logic. While doing this, I will take this long-standing issue into account and ensure it's covered with a regression test.
Thanks! At the moment, this seems to be the only option for releases up to v0.21.0.
@rabbice suspend and resume do not work for me. I always need to delete the HelmRelease and reconcile again.
It didn't work for me either. I'm still using an old copy of Flux 0.16.1 to manually trigger reconciliation of failed HRs. Removing the HR is not an option, as that would result in downtime.
In my case, I didn't see any error message telling me that I was trying to use a service account in a different namespace. This interacts with eksctl, since on AWS you have to declare the service accounts in a meaningful way outside of your manifests. The issue is really error messaging.
Also experiencing this issue. From my side, the expected behaviour would be that when I change something in the Flux HelmRelease manifest file, for example the value of an env variable, a redeployment with a fresh retry counter takes place.

Is there a way to just reset the number of install retries?
The issue still exists in flux release v0.31.3.
This issue still being open means it still exists; otherwise we would have closed it (hopefully) :-).

But! This has been actively worked on since https://github.com/fluxcd/helm-controller/issues/454#issuecomment-1120945094. The latest update was https://github.com/fluxcd/helm-controller/pull/503 (merged two days ago), which is the foundation for the new solution. I will continue to focus on finishing this in the following weeks, including release candidates at some point.

Thank you for your patience :bow:
@hiddeco I can see your changes are merged. Would you be able to share with us the Flux version in which these fixes will be available? Thanks
Yes, I already mentioned in the message itself that the changes were merged. Followed by:

> I will continue to focus on finishing this in the following weeks, including release candidates at some point.

Given this, there are no updates since my last message, and I am still working on finalizing the new setup, while others are currently focused on landing `OCIRepository` support for the next MINOR Flux release.

After this, the helm-controller changes will become more relevant for the next-next MINOR.

Thank you for your patience :bow:
@hiddeco From the Flux docs I can't find the difference between "flux reconcile" and "flux suspend/resume", but indeed the latter works for me most of the time for "install retries exhausted". In another situation, I use a HelmRelease to create a Flux Kustomization, and I defined in the Helm template that if the Kubernetes version is upgraded, it will use another path to install newer versions of components like the ingress / vault webhook. But after we upgraded the version, e.g. 1.19 --> 1.22, the helm-controller did nothing until I manually suspended/resumed the HelmRelease (I tried reconcile; it did not work). So I think there must be a difference between "flux reconcile" and "flux suspend/resume". What I want is "everything is automatic" once the error is fixed or a condition changes. On Flux version 0.31.5. Thanks in advance!
> difference between "flux reconcile" and "flux suspend/resume"

When you run `flux reconcile`, it annotates the Flux resource with a patch that sets the `ReconcileRequestAnnotation` to the current time, so that Flux triggers the periodic reconciliation ahead of the regularly scheduled interval.

When you run `flux suspend`, it also patches the Flux resource, setting `suspend: true` in the spec. This stops reconciliation altogether until it's reversed by `flux resume`. I'm not sure what mechanism causes this, but an `upgrade` is always triggered on resume. This is why the drift gets reverted then and the upgrade retries get a new starting count.
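For illustration, roughly what those two patches look like on the resource (a sketch; the timestamp is an example value):

```yaml
# flux reconcile: bump the request annotation so the controller reconciles now
metadata:
  annotations:
    reconcile.fluxcd.io/requestedAt: "2022-08-01T10:00:00Z"
---
# flux suspend: stop reconciliation; flux resume patches this back to false
spec:
  suspend: true
```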
We have the Flux-E2E guide, which doesn't cover this very well and could probably stand to get an update soon, as it still describes HelmRepository as "the preferred source" for HelmReleases (hopefully we can say that OCI sources are preferred soon enough... but that's a digression here). To update the text right now in detail would be less than advisable because of the rewrite in progress; important details are being worked out and will be changing soon to accommodate this:
> what I want is "everything is automatic"
That's what we're all here for 👍 It's coming; please be patient, this is rocket surgery as far as I'm concerned 🚀⎈🏋️♂️ I mean, this is the subject of the issue, if I understand the whole thread.

The general behavior of Flux is to always correct and control drift of any kind. The helm-controller behavior is different today because of the mechanics of Helm: a secret tracks every "upgrade", and it should only be incremented/duplicated once for every time a change is made (or, in the case of suspend/resume, at least today, whenever an upgrade has been done, regardless of whether any changes happened).
It's a priority to resolve this, stay tuned.
@kingdonb Thanks for your detailed explanation; waiting for the good news that the drift issue is corrected perfectly :+1:
I might recommend a simple solution, which would be to add a `flux reset-retries` option and have it just wipe the retries counter wherever it's set.
I read the example from https://fluxcd.io/flux/guides/helmreleases/#define-a-helm-release and don't see a timeout:
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: podinfo
  namespace: default
spec:
  interval: 5m # <- is it here or not here to increase the interval / set the interval if it does not exist?
  chart:
    spec:
      chart: <name|path>
      version: '4.0.x'
      sourceRef:
        kind: <HelmRepository|GitRepository|Bucket>
        name: podinfo
        namespace: flux-system
      interval: 1m # <- is it here or not here to increase the interval / set the interval if it does not exist?
  values:
    replicaCount: 2
Is timeout an interval? If timeout is an interval, where do I set the interval to 2m0s?
Timeout is a timeout; it's `spec.timeout`:

https://fluxcd.io/flux/components/helm/helmreleases/#reconciliation

You can also find it under `HelmRelease.spec` here:

https://fluxcd.io/flux/components/helm/api/#helm.toolkit.fluxcd.io/v2beta1.HelmRelease
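To make the placement concrete, a minimal sketch answering the inline questions in the example above (field placement per the HelmRelease v2beta1 API docs linked here):

```yaml
spec:
  interval: 5m    # how often the HelmRelease itself is reconciled
  timeout: 2m0s   # spec.timeout: how long a single Helm action may run before failing
  chart:
    spec:
      interval: 1m  # how often the chart source is checked for a new chart version
```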
I think I'm here due to the same issue. I don't quite like having to suspend/resume a failed HelmRelease, so I configured the options below, thinking this would address my issue of releases getting stuck with `upgrade retries exhausted` every now and then.
interval: 24h
upgrade:
  remediation:
    retries: 14
I hoped this would attempt a reconciliation once a day for 14 days, but that is not the behavior I got. A release failed yesterday, it's been one day, and all 14 attempts have been used according to the output of describe.

It's quite annoying, considering the point of GitOps is not to have to manually look for failed releases to try them again. Am I understanding `retries` wrong in relation to `interval`? Are retries attempted immediately after `timeout`?

If so, I could remediate this by setting infinite retries, but then that is going to put unnecessary load on the controller. I think there should be one attempt per reconciliation; if not, another setting needs to be provided to space out retries, and I don't see such a setting.

My understanding of timeout is that Flux will wait a specific amount of time during reconciliation before failing (e.g. a pod taking too long to start); it doesn't mean that if a release fails immediately, it will still wait. Correct?
I also do not configure `spec.timeout`, because by default it is meant to be 5 minutes; if that's not the case, the schema will need to be updated, as that's what it tells me.
The most predictable way I have addressed this is `kubectl delete secret -l owner=helm,name=[release name],status=pending-upgrade && flux reconcile hr -n [release namespace] [release name]`.
That's a good hack :-) thank you 👍
Been several months since the last update -- any fixes already in or planned for this?
Our organization is still encountering this bug regularly, and it's becoming more disruptive to our GitOps setup. Just curious if there are any plans for addressing this at some point.
Faced this issue today and fixed it using the proposed workaround: `flux suspend` / `flux resume`.
Facing the same issue. Really causing some headaches :( Any idea on some sort of timeline for when this will be tackled? Many thanks!
@MKruger777 There's already great progress, but it's behind a feature flag, and you need to pay some attention to monitoring in order to set it up properly and reap the benefits you're looking for. The tl;dr is: if you are monitoring HelmReleases appropriately, then you can set:
upgrade:
  remediation:
    retries: -1
and Flux will not give up; it will retry indefinitely, and you should not see "upgrade retries exhausted" anymore. (But by itself this answer leaves quite a few gaps, and it also misses or glosses over some of the quite important developments in progress/already delivered/scheduled for Q3 and "Flux 2.1"... altogether that may explain why this answer is so long.)

The key issue here is multi-faceted and is going to be hard to "fix" per se: Helm is natively an imperative process, at least due to the way the Helm CLI is normally expected to be used, but also due to the concept of lifecycle hooks. Helm uses a large amount of resources instantaneously during its install/upgrade attempts, so repeated install/upgrade attempts are definitely a thing to minimize, especially when you have many going concurrently. You can imagine a situation where a simple error that a human operator could easily see "can never fix itself, not worth retrying" unfortunately triggers a failure loop that crashes repeatedly, and this resource-hungry process takes down the entire cluster... so it should be clear why the behavior listed above is not the default.
Adding to this, Helm counts upgrades in a way that facilitates some manner of "release accounting" with Helm's built-in secrets. These secrets are used to tell `helm history` and `helm rollback` how to work, which Flux abides by and can work alongside. So, to prevent trampling over the history, helm-controller only upgrades when it's absolutely necessary.
Now, this is all configurable as well. In the default behavior of HelmRelease, it currently achieves this minimal upgrading in two ways: by having a configurable number of retries (by default it's 0, i.e. it doesn't retry), and additionally by tracking inputs: the HelmRelease spec, the chart itself (values, template), and any external secrets or configmaps that the HelmRelease refers to.
When "upgrade retries exhausted" appears, if you haven't done any of this configuration, it just means an upgrade failed or timed out. If you are monitoring for these conditions, it might be perfectly acceptable to retry indefinitely until it succeeds, (so that nobody needs to "kick the release" once the condition leading up to the failure is resolved. That's flux suspend helmrelease
and flux resume helmrelease
– this doesn't always work, as it often depends on unknown external factors)
By default, on each reconcile, HelmRelease will only trigger an upgrade if some of these inputs have actually changed – that is, up until this PR:

Now, in the feature-flagged behavior, it also tracks drift. You need to enable this feature flag for now, since drift detection has the same foot-guns as the infinite retries mentioned above:

https://fluxcd.io/flux/cheatsheets/bootstrap/#enable-helm-drift-detection
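For orientation, a sketch of the kind of kustomize patch that cheatsheet describes, adding the feature gate to the helm-controller Deployment (follow the linked cheatsheet for the authoritative version; the `--feature-gates=DetectDrift=true` flag name is taken from the drift detection docs for the helm-controller versions discussed here):

```yaml
# flux-system kustomization.yaml (sketch)
patches:
  - patch: |
      - op: add
        path: /spec/template/spec/containers/0/args/-
        value: --feature-gates=DetectDrift=true
    target:
      kind: Deployment
      name: helm-controller
```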
Hopefully it's clear why this drift correction is part of the solution: we need Helm to retry an "upgrade" whenever some resource has changed from the template in the Helm chart, too. This closes the loop, making helm-controller behave in all the ways GitOps practitioners expect declarative appliers to work: correcting any drift that gets introduced unexpectedly, and never landing in a "stuck" or stalled condition like this (unless the inputs are actually invalid!).

I think successful adoption of this feature flag by enough users will help us better understand how to provide this feature in a way that "solves the issue", so we can turn the feature on for all users. But at present, it's a complicated issue, and users need to know some of these details in order to solve it.

For now, in order to enable drift detection safely, you need to have Flux's Alert and Provider resources configured to tell you about HelmRelease events. This will ensure you have a way to know when Flux is "trapped in an upgrade loop", which is something you can see happening a bit more often while drift detection is enabled, because of features like lifecycle hooks, and other things like Kubernetes operators installed via Helm that may occasionally write updates back to some of the resources the Helm template installed. That's when you need a (human) operator to be notified and intervene ASAP.
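A minimal sketch of that monitoring setup, using the notification-controller's Provider and Alert kinds (the channel and secret names are placeholders; newer Flux releases may serve these under a later API version):

```yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta1
kind: Provider
metadata:
  name: slack
  namespace: flux-system
spec:
  type: slack
  channel: flux-alerts        # placeholder channel
  secretRef:
    name: slack-webhook-url   # placeholder Secret holding the webhook address
---
apiVersion: notification.toolkit.fluxcd.io/v1beta1
kind: Alert
metadata:
  name: helmrelease-alerts
  namespace: flux-system
spec:
  providerRef:
    name: slack
  eventSeverity: error
  eventSources:
    - kind: HelmRelease
      name: '*'               # watch all HelmReleases in this namespace
```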
So, since we're now "correcting drift" in Helm templates, and helm-controller sees a number of perfectly normal things as drift, we need to mark some drift as "allowable". This is all covered in the cheatsheet link above, which links out to this doc:

https://fluxcd.io/flux/components/helm/helmreleases/#drift-detection
tl;dr: This issue should be "solved" in Flux Helm GA, scheduled for next quarter, which is the next entry on the roadmap after Flux GitOps GA. I think the intention is that it will be solved in Flux Helm GA, but caveat that: it remains to be seen how much of this problem can really be abstracted away from users, and how much will change from the solution that's already available in existing Flux releases.
Describe the bug

When a HelmRelease is stuck in `reconciliation failed: upgrade retries exhausted`, only flux CLI v0.16.1 can trigger a successful reconciliation.

Steps to reproduce

When a HelmRelease is stuck in `helm-controller reconciliation failed: upgrade retries exhausted`, this can normally be fixed by running `./flux reconcile helmrelease` from the command line, but only up to flux CLI v0.16.1.

Expected behavior

flux reconcile should trigger a helm upgrade when the HelmRelease is stuck in `upgrade retries exhausted`.
Screenshots and recordings
This time I upgraded the kube-prometheus-stack HelmRelease. I tried different versions, v0.17.2 and v0.16.2, but only v0.16.1 triggered a successful helm upgrade.

helm-controller logs:
helmrelease spec:
OS / Distro
Ubuntu 20.04
Flux version
0.17.2
Flux check
❯ flux check
► checking prerequisites
✔ kubectl 1.20.11 >=1.18.0-0
✔ Kubernetes 1.19.15 >=1.16.0-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.11.2
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.14.1
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.16.0
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.15.4
✔ all checks passed
Git provider
No response
Container Registry provider
No response
Additional context
Maybe we can collect some kind of documentation on how to get out of this "upgrade exhausted" situation?