fluxcd / flux2

Open and extensible continuous delivery solution for Kubernetes. Powered by GitOps Toolkit.
https://fluxcd.io
Apache License 2.0

Helm release installation fails with etcdserver leader changed #4804

Open DeyvsonL opened 4 months ago

DeyvsonL commented 4 months ago

Describe the bug

When my cluster starts up and installs several Helm releases at the same time, some of them frequently (about 20% of the time) fail with the error: Helm install failed for release "chart-name" with chart "chart-name@version": etcdserver: leader changed.

As far as I can tell, Helm had this issue in the past: https://github.com/helm/helm/pull/11426

The same behavior applies to other errors raised by etcd, such as "etcdserver: request timed out".

In both cases, the Helm installation continues in the background without issues and the app is installed successfully even though its Helm release is reported as failed. However, other apps that depend on that Helm release will not start, because they treat the release as failed.

Steps to reproduce

Create a new cluster. Set the cluster configuration to install Flux on cluster bootstrap and point it to an existing Git repository. Wait for all Helm releases in the repository to be installed.

Sometimes the steps above cause some Helm releases to fail with the error "etcdserver: leader changed".

Expected behavior

When Helm encounters "etcdserver: leader changed", the Helm release should retry the installation, since the error does not affect the Helm installation itself and was already addressed in the main Helm repository.

Screenshots and recordings

image

OS / Distro

Ubuntu 22.04

Flux version

v2.3.0

Flux check

► checking prerequisites
W0520 15:17:54.120603 41669 warnings.go:70] Use tokens from the TokenRequest API or manually created secret-based tokens instead of auto-generated secret-based tokens.
✔ Kubernetes 1.28.8+rke2r1 >=1.28.0-0
► checking version in cluster
✔ distribution: flux-v2.3.0
✔ bootstrapped: true
► checking controllers
✔ helm-controller: deployment ready
► fluxcd/helm-controller:v1.0.1
✔ image-automation-controller: deployment ready
► fluxcd/image-automation-controller:v0.38.0
✔ image-reflector-controller: deployment ready
► fluxcd/image-reflector-controller:v0.32.0
✔ kustomize-controller: deployment ready
► fluxcd/kustomize-controller:v1.3.0
✔ notification-controller: deployment ready
► fluxcd/notification-controller:v1.3.0
✔ source-controller: deployment ready
► fluxcd/source-controller:v1.3.0
► checking crds
✔ alerts.notification.toolkit.fluxcd.io/v1beta3
✔ buckets.source.toolkit.fluxcd.io/v1beta2
✔ gitrepositories.source.toolkit.fluxcd.io/v1
✔ helmcharts.source.toolkit.fluxcd.io/v1
✔ helmreleases.helm.toolkit.fluxcd.io/v2
✔ helmrepositories.source.toolkit.fluxcd.io/v1
✔ imagepolicies.image.toolkit.fluxcd.io/v1beta2
✔ imagerepositories.image.toolkit.fluxcd.io/v1beta2
✔ imageupdateautomations.image.toolkit.fluxcd.io/v1beta2
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1
✔ ocirepositories.source.toolkit.fluxcd.io/v1beta2
✔ providers.notification.toolkit.fluxcd.io/v1beta3
✔ receivers.notification.toolkit.fluxcd.io/v1
✔ all checks passed

Git provider

Azure DevOps

Container Registry provider

Azure container registry

Additional context

No response

souleb commented 4 months ago

We could port https://github.com/helm/helm/pull/11426 to helm-controller.

The change should be done in https://github.com/fluxcd/helm-controller/blob/f731a805b1485f622ff08a63bb6558ba08296600/internal/kube/client.go#L129.

Are you willing to contribute this change, @DeyvsonL?

Valgueiro commented 4 months ago

I can give it a try!