fluxcd / flux2

Open and extensible continuous delivery solution for Kubernetes. Powered by GitOps Toolkit.
https://fluxcd.io
Apache License 2.0

The new version of HelmRelease does not trigger the update instantly, it will wait until the timeout expires. #4923

Open Subetov opened 3 months ago

Subetov commented 3 months ago

Describe the bug

For example, you have a HelmRelease with a timeout of 60 minutes (waiting for a job that takes a long time to complete). You make a change and apply a new version of the HelmRelease while the previous installation, with its 60-minute timeout, is still running. The helm-controller does not stop the previous installation; it waits until the timeout expires, and only then begins installing the new version of the HelmRelease. The previous version (helm-operator) interrupted the current installation and started the new one. Isn't the previous behavior better, where the controller stops installing the previous version and starts installing the new one as soon as a new version of the HelmRelease appears?

Steps to reproduce

Expected behavior

The new release is applied immediately: waiting for the previous one stops, and installation of the new one begins.

Screenshots and recordings

No response

OS / Distro

N/A.

Flux version

v2.13.0

Flux check

N/A.

Git provider

No response

Container Registry provider

No response

Additional context

No response

Muruganpari commented 3 months ago

Similar conversation - https://github.com/fluxcd/flux2/discussions/1000#discussioncomment-406907 Not sure the previous helm-operator supported interruption. Could you mention the API version?

Subetov commented 3 months ago

Similar conversation - #1000 (comment) Not sure the previous helm-operator supported interruption. Could you mention the API version?

I can't tell you the API version offhand, but here is the operator version.

stefanprodan commented 3 months ago

It is not possible to interrupt the reconciler in Flux v2.

Muruganpari commented 3 months ago

@stefanprodan, Can a HelmRelease with the following annotations interrupt an existing release and start a new one?

reconcile.fluxcd.io/requestedAt=$TOKEN
reconcile.fluxcd.io/forceAt=$TOKEN

https://fluxcd.io/flux/components/helm/helmreleases/#forcing-a-release
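For reference, the linked docs trigger a forced release by annotating the HelmRelease object; a sketch of that invocation (the release name `my-app` and namespace `apps` are placeholders, and this paraphrases the docs — check the page for the exact command):

```shell
# Force a one-off Helm install/upgrade of a HelmRelease.
# "my-app" and "apps" are placeholder names.
TOKEN="$(date +%s)"   # any unique value works; a timestamp is conventional
kubectl annotate --field-manager=flux-client-side-apply --overwrite \
  helmrelease/my-app -n apps \
  "reconcile.fluxcd.io/requestedAt=$TOKEN" \
  "reconcile.fluxcd.io/forceAt=$TOKEN"
```

Note that `forceAt` only takes effect when it matches the `requestedAt` token — but as Stefan says below, this forces a new release on the next reconcile; it does not interrupt one already in progress.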

kingdonb commented 3 months ago

I've felt this issue, in kustomize controller as well. It might be to our benefit to introduce a way for new reconciliations to interrupt a timeout-in-progress, but it's hard to imagine that it wouldn't also be a breaking change.

The best recommendation I can make without breaking anything is to set your timeout lower. Do you expect deployments to take a full hour? If not, then why set timeout: 60m?

Subetov commented 3 months ago

In my case, if I reduce the timeout, I get a release in the failed state. And if I remove the wait, then I can't rely on the release statuses.

It is not so important why people set these or other timeouts (there are reasons for them). What matters is that you cannot start a new installation while the helm-controller is waiting on a timeout. =(

Even with a smaller timeout, say 15 minutes: if you push an update that has an error and will never succeed, you are forced to wait those 15 minutes (or whatever the timeout is) before you can fix it. Of course, the smaller the timeout, the less painful this is, but there are cases where a long timeout is necessary, and then each attempt to fix a broken release costs a lot of time (which can be critical). The old helm-operator initiated the installation of a new Helm release immediately, regardless of timeouts.
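For context, a minimal sketch of the kind of HelmRelease being discussed (all names and the chart reference are hypothetical; the apiVersion assumes helm-controller's v2beta2 API):

```yaml
apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
  name: long-job-app         # placeholder
  namespace: apps            # placeholder
spec:
  interval: 10m
  timeout: 60m               # upgrade waits up to an hour for the long-running job
  chart:
    spec:
      chart: long-job-app    # placeholder chart name
      sourceRef:
        kind: HelmRepository
        name: example-charts # placeholder
```

With `wait` semantics enabled, a failing upgrade holds the reconciler for the full `timeout` before the next spec change can be acted on, which is the pain point described above.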

devantler commented 3 months ago

I've felt this issue, in kustomize controller as well. It might be to our benefit to introduce a way for new reconciliations to interrupt a timeout-in-progress, but it's hard to imagine that it wouldn't also be a breaking change.

The best recommendation I can make without breaking anything is to set your timeout lower. Do you expect deployments to take a full hour? If not, then why set timeout: 60m?

@kingdonb This has troubled me before. I do think some of us used a timeout of 60m because it is actually mentioned in the docs as a recommendation:

https://fluxcd.io/flux/components/kustomize/kustomizations/#working-with-kustomizations

Maybe a note can be added to inform users of the tradeoffs in setting it high vs low? :-)

kingdonb commented 3 months ago

I'm not sure we are looking at the same doc; this one says:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: webapp
  namespace: apps
spec:
  interval: 60m0s # detect drift and undo kubectl edits every hour
  wait: true # wait for all applied resources to become ready
  timeout: 3m0s # give up waiting after three minutes

Additionally it says:

  retryInterval: 2m0s # retry every two minutes on apply or waiting failures

So when the Kustomization times out at 3m or fails early due to error, it will retry after 2m (not 60m)

This is the recommended configuration: otherwise, with an interval of 60m you get a timeout of the same 60m, which leads to the undesirable state of the Kustomization being locked, waiting for the timeout, until the 60 minutes run out.

% kubectl explain kustomization.spec.timeout
KIND:     Kustomization
VERSION:  kustomize.toolkit.fluxcd.io/v1

FIELD:    timeout <string>

DESCRIPTION:
     Timeout for validation, apply and health checking operations. Defaults to
     'Interval' duration.

devantler commented 3 months ago

Whoops, you are definitely right. I had a hard time with this some time ago, probably because I was not setting the timeout to anything, and thus hit the "undesirable state of Kustomization locked", due to it taking the value of the interval as its default.

Hmm, I might not understand it fully still:

kingdonb commented 3 months ago

The paradigm is, roughly: either set the interval on your Sources low, or instrument them with a Receiver, so the Source updates within a short amount of time (the webhook receiver avoids the need to poll the source frequently). A Kustomization (or a HelmRelease) is notified by the source whenever the source content changes, so there is no need to reconcile the appliers on a schedule when the source updates, or to reconcile them frequently at all (except to correct drift that occurs in the cluster).

So your GitRepository source can have a short interval, and it doesn't cost a dry run of the Kustomization every time the Git repository reconciles without any update. Make sense?
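A sketch of that split, with hypothetical names: the Source polls often while the applier keeps its long interval (pairing with the `webapp` Kustomization shown earlier):

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: webapp         # hypothetical; matched by the Kustomization's sourceRef
  namespace: apps
spec:
  interval: 1m         # poll Git frequently; cheap compared to an apply dry run
  url: https://github.com/example/webapp   # placeholder URL
  ref:
    branch: main
```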

The Kustomization timeout should be at least long enough for a deployment to complete in normal circumstances, but not more than about 2x that long. The interval at 10m is a good default, but it can be increased to reduce the load that Kustomization dry runs place on the cluster control plane every interval. If you do that, you had better set the timeout shorter, because there's rarely any good reason to lengthen the timeout past the default 10m value.

If you're in an environment where a rollback can cause a disaster, then you should keep the timeout equal to the interval. Usually you can shorten it, and use retryInterval to keep the duration of a rollback short when a deployment has timed out. (When interval is equal to timeout, the duration of a rollback is zero, e.g. the controller just proceeds directly into another reconcile attempt.)

devantler commented 3 months ago

Thanks for the thorough info here! It makes sense :-)

I have used the Receiver for one project, but my projects rely on OCI as the source most of the time now. I am not sure Receivers support OCIRepository yet.

However, I played around with the settings a bit in my homelab, and found that using the recommended settings while triggering reconciliation with a call to flux reconcile source oci flux-system works well:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
...
spec:
  interval: 60m
  timeout: 3m
  retryInterval: 2m
...

This triggers an instant update, and fits well into workflows where reconciliation is triggered from a pipeline rather than from the interval. Before today I relied on low interval values (without setting timeout) to accomplish the same, but that often put me in a situation where either it was inconvenient to wait for the timeout, or deployments failed because the timeout was too low.
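The pipeline-triggered flow described above can be sketched as (assuming a `flux-system` OCIRepository and Kustomization, as in a default bootstrap; run from CI after the artifact push):

```shell
# After CI pushes a new artifact to the OCI registry:
flux reconcile source oci flux-system      # fetch the new artifact now
flux reconcile kustomization flux-system   # apply it without waiting for the 60m interval
```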

So thanks for sharing @kingdonb!

kingdonb commented 3 months ago

You can indeed use OCIRepository with a Receiver, though I cannot find it documented anywhere; the OCIRepository kind is mentioned here in the resources target:

https://fluxcd.io/flux/components/notification/receivers/#resources

I have an example of it here, for use on GitHub (with GHCR): https://github.com/kingdon-ci/flux-docs/blob/main/kustomize-flux/flux-docs-receiver.yaml

If it's missing from the docs, we should add it.

(The event to watch is package, and the GitHub-side configuration for the webhook is done in the Git repository that pushes to the GHCR OCI repository.)

apiVersion: notification.toolkit.fluxcd.io/v1
kind: Receiver
metadata:
  ...
spec:
  resources:
  - apiVersion: source.toolkit.fluxcd.io/v1beta2
    kind: OCIRepository
    name: flux-docs
  secretRef:
    name: flux-docs-webhook
  type: github
  events:
    - "package"

Hope this helps!

njuptlzf commented 3 months ago

ref: #4944

I think this change is necessary. Similar features would be: the helm-controller introducing the kustomize-controller's retryInterval feature; stopping the running task; setting the task to failed. I may not understand GitOps very well, but my deployment task really does take a long time, so I need to define a long timeout. I am willing to help push anything forward here.

cc @kingdonb @stefanprodan