Subetov opened this issue 3 months ago
Similar conversation: https://github.com/fluxcd/flux2/discussions/1000#discussioncomment-406907. I'm not sure the previous Helm Operator supported interruption. Could you mention the API version?
I can't tell you the API version, but here is the operator version.
It is not possible to interrupt the reconciler in Flux v2.
@stefanprodan, can a HelmRelease with the following annotations interrupt an existing release and start a new one?

  reconcile.fluxcd.io/requestedAt=$TOKEN
  reconcile.fluxcd.io/forceAt=$TOKEN

https://fluxcd.io/flux/components/helm/helmreleases/#forcing-a-release
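For reference, the mechanism on that doc page works by setting both annotations to the same unique token. A minimal sketch, assuming a hypothetical HelmRelease named `podinfo` in namespace `default`:

```shell
# Generate a unique token; any value not used before will do.
TOKEN=$(date +%s)

# Set both annotations to the same token so the next reconciliation
# is requested AND forced (upgrade is attempted even after a failure).
kubectl annotate --overwrite -n default helmrelease/podinfo \
  "reconcile.fluxcd.io/requestedAt=$TOKEN" \
  "reconcile.fluxcd.io/forceAt=$TOKEN"
```

Whether this can interrupt a wait that is already in progress is exactly the question here, though.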
I've felt this issue in kustomize-controller as well. It might be to our benefit to introduce a way for new reconciliations to interrupt a timeout-in-progress, but it's hard to imagine that it wouldn't also be a breaking change.

The best recommendation I can make without breaking anything is to set your timeout lower. Do you expect deployments to take a full hour? If not, then why set `timeout: 60m`?
In my case, if I reduce the timeout, I get a release in the failed state. And if I remove the wait, then I can't rely on the release statuses.
Why people set one timeout or another is not so important (there are reasons for it). What is important is that you cannot start a new installation while the Helm controller is waiting for a timeout. =(

Even if we assume you have a smaller timeout, say 15 minutes: you make an update that has an error and will never succeed, and you are forced to wait those 15 (or however many) minutes until you can fix it. Clearly, the smaller the timeout, the less painful it is, but there are cases where a long timeout is necessary. So each attempt to fix a faulty release costs you a lot of time (and that can be critical). The old Helm Operator initiated the installation of a new Helm release immediately, regardless of timeouts.
> I've felt this issue, in kustomize controller as well. It might be to our benefit to introduce a way for new reconciliations to interrupt a timeout-in-progress, but it's hard to imagine that it wouldn't also be a breaking change.
>
> The best recommendation I can make without breaking anything is to set your timeout lower. Do you expect deployments to take a full hour? If not, then why set `timeout: 60m`?
@kingdonb This has troubled me before. I do think some of us used the timeout of 60m as it is actually mentioned in the docs as a recommendation:
https://fluxcd.io/flux/components/kustomize/kustomizations/#working-with-kustomizations
Maybe a note can be added to inform users of the tradeoffs in setting it high vs low? :-)
I'm not sure we are looking at the same doc, this one says:
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: webapp
  namespace: apps
spec:
  interval: 60m0s # detect drift and undo kubectl edits every hour
  wait: true      # wait for all applied resources to become ready
  timeout: 3m0s   # give up waiting after three minutes
Additionally it says:

  retryInterval: 2m0s # retry every two minutes on apply or waiting failures

So when the Kustomization times out at 3m or fails early due to an error, it will retry after 2m (not 60m).
This is the recommended configuration, because otherwise with an interval of 60m you will get a timeout of the same value, 60m, and that leads to the undesirable state of the Kustomization being locked, waiting for the timeout, until the 60 minutes run out.
% kubectl explain kustomization.spec.timeout
KIND:     Kustomization
VERSION:  kustomize.toolkit.fluxcd.io/v1
FIELD:    timeout <string>
DESCRIPTION:
    Timeout for validation, apply and health checking operations. Defaults to
    'Interval' duration.
Whoops, you are definitely right. I had a hard time with this some time ago, probably because I was not setting `timeout` to anything, and thus hit the "undesirable state of Kustomization locked", due to it taking the value of `interval` as the default.
Hmm, I might not understand it fully still:
The paradigm is, roughly: either set the interval on your Sources low, or instrument them with a Receiver, so that the Source updates within a short amount of time. The webhook receiver is used to avoid the need to poll the source frequently. A Kustomization (or a HelmRelease) is notified by the source whenever the source content changes, so there is no need to reconcile appliers when the source updates, or to reconcile them frequently at all (except to correct drift that occurs in the cluster).
So your GitRepository source can have a short interval, and it doesn't cost a dry run of the Kustomization every time the git repository reconciles without any update. Makes sense?
The Kustomization timeout should be at least long enough for a deployment to complete in normal circumstances, but not more than about 2x that long. The interval at 10m is a good default, but it can be increased to reduce the load created on the cluster control plane by the Kustomization dry runs that happen every interval. If you do that, you had better set the timeout shorter, because there is rarely any good reason to lengthen the timeout past the default 10m value.

If you're in an environment where a rollback can cause a disaster, then you should keep the timeout equal to the interval. Usually you can shorten it, and use retryInterval to keep the duration of a rollback short when a deployment has timed out. (When interval is equal to timeout, the duration of a rollback is zero, i.e. the controller just proceeds directly into another reconcile attempt.)
Thanks for the thorough info here! It makes sense :-)
I have used the Receiver for one project, but my projects rely on OCI as the source most of the time now. I am not sure Receivers support OCIRepository yet.
However, I played around with the settings a bit in my homelab, and found that using the recommended settings while triggering reconciliation with a call to `flux reconcile source oci flux-system` works well:
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
...
spec:
  interval: 60m
  timeout: 3m
  retryInterval: 2m
...
This triggers an instant update, and fits well into workflows where reconciliation is triggered from a pipeline rather than from the interval. Before today I relied on low interval values (without setting timeout) to accomplish the same, but that often put me in a situation where either it was inconvenient to wait for the timeout, or deployments failed because the timeout was too low.
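A pipeline step for this flow might look roughly like the following sketch (registry path, directory, and CI variables are hypothetical; assumes the flux CLI is installed):

```shell
# Publish manifests as an OCI artifact, then nudge Flux directly
# instead of waiting for the 60m interval to come around.
flux push artifact oci://ghcr.io/example/app-manifests:latest \
  --path=./deploy \
  --source="$CI_REPOSITORY_URL" \
  --revision="$CI_COMMIT_SHA"

# The OCIRepository picks up the new digest immediately, and the
# Kustomization is notified of the source change.
flux reconcile source oci flux-system
```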
So thanks for sharing @kingdonb!
You can indeed use OCIRepository with a Receiver, though I cannot find it documented anywhere; the OCIRepository kind is mentioned here in the resources target:
https://fluxcd.io/flux/components/notification/receivers/#resources
I have an example of it here, for use on GitHub (with GHCR): https://github.com/kingdon-ci/flux-docs/blob/main/kustomize-flux/flux-docs-receiver.yaml
If it's missing from the docs, we should add it
(The event to watch is `package`, and the GitHub-side configuration for the webhook is done in the Git repository that pushes to the GHCR OCI repository.)
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Receiver
metadata:
  ...
spec:
  resources:
    - apiVersion: source.toolkit.fluxcd.io/v1beta2
      kind: OCIRepository
      name: flux-docs
  secretRef:
    name: flux-docs-webhook
  type: github
  events:
    - "package"
Hope this helps!
ref: #4944
I think this change is necessary. Similar features would be: the helm-controller adopting the kustomize-controller's retryInterval feature; stopping the in-progress task; setting the task to failed. I may not understand GitOps very well, but my deployment task really does take a long time, so I need to define a long timeout. I am willing to join in anything that can push this forward.
cc @kingdonb @stefanprodan
Describe the bug
For example, you have a HelmRelease with a timeout of 60 minutes (waiting for a job that takes a long time to complete). You make a change and apply the new version of the HelmRelease while the previous installation, with its 60-minute timeout, is still running. But the Helm controller does not stop the previous installation; it waits until the timeout ends, and only then begins installing the new version of the HelmRelease. The previous version (helm-operator) interrupted the current installation and started a new one. Isn't the previous behavior better, where, when a new version of a HelmRelease appears, the controller stops installing the previous version and starts installing the new one?
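The scenario above could be set up with something like this sketch (chart, repository, and names are hypothetical; the apiVersion may differ depending on your Flux version):

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
  name: long-job
  namespace: default
spec:
  interval: 10m
  timeout: 60m   # helm-controller waits up to an hour for readiness
  chart:
    spec:
      chart: my-chart
      sourceRef:
        kind: HelmRepository
        name: my-repo
# While this 60m wait is in progress, applying a newer revision of this
# HelmRelease does not interrupt it; the new spec is only acted on after
# the wait ends.
```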
Steps to reproduce
Expected behavior
The new release is applied instantly (the wait on the previous one is cancelled, and installation of the new one begins).
Screenshots and recordings
No response
OS / Distro
N/A.
Flux version
v2.13.0
Flux check
N/A.
Git provider
No response
Container Registry provider
No response
Additional context
No response
Code of Conduct