fluxcd / flux2

Open and extensible continuous delivery solution for Kubernetes. Powered by GitOps Toolkit.
https://fluxcd.io
Apache License 2.0

Flux stops syncing GitRepositories if a HelmRepository is invalid #2641


Frankkkkk commented 2 years ago

Describe the bug

I was editing a HelmRepository and set an invalid value, .spec.interval=1d (Go durations do not accept a "d" unit).
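For illustration, the object in Git looked roughly like this (the URL is a placeholder; the name and namespace are taken from the controller logs below). Go's duration parser only accepts units such as ms, s, m and h, so 1d fails to parse:

apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: gresearch
  namespace: flux-system
spec:
  url: https://example.com/helm-charts  # placeholder URL
  interval: 1d  # invalid: "d" is not a supported duration unit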

Once the source-controller pulled the git repo, it immediately detected the invalid value:

{"level":"info","ts":"2022-04-14T16:06:45.038Z","logger":"controller.gitrepository","msg":"artifact up-to-date with remote revision: 'main/4d01594aef5aaf995dbfb8696755ab2321596e89'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","name":"flux-system","namespace":"flux-system"}
{"level":"info","ts":"2022-04-14T16:06:46.765Z","logger":"controller.helmrepository","msg":"artifact up-to-date with remote revision: '29efdf95ddb2857d8634aa6d4901e1f180fa0cd7b8f1f5f2e069f73b8f6d84a9'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmRepository","name":"gresearch","namespace":"flux-system"}
W0414 16:07:45.408266       1 reflector.go:442] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: watch of *v1beta2.HelmRepository ended with: an error on the server ("unable to decode an event from the watch stream: unable to decode watch event: time: unknown unit \"d\" in duration \"1d\"") has prevented the request from succeeding
W0414 16:07:46.629958       1 reflector.go:324] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: failed to list *v1beta2.HelmRepository: time: unknown unit "d" in duration "1d"

As I hadn't really looked at the logs yet, I deleted the source-controller pod.

After that, I pushed another commit fixing the syntax (.spec.interval=1m), but the source-controller would not pull the repo; it still failed with the error above.

Another restart/deletion of the source-controller pod did not fix the problem:

kl -n flux-system logs -f source-controller-c65ddffbb-rkss2 
{"level":"info","ts":"2022-04-14T16:07:59.257Z","logger":"controller-runtime.metrics","msg":"Metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":"2022-04-14T16:07:59.257Z","logger":"setup","msg":"starting manager"}
{"level":"info","ts":"2022-04-14T16:07:59.258Z","msg":"Starting server","kind":"health probe","addr":"[::]:9440"}
{"level":"info","ts":"2022-04-14T16:07:59.258Z","msg":"Starting server","path":"/metrics","kind":"metrics","addr":"[::]:8080"}
W0414 16:07:59.281708       1 reflector.go:324] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: failed to list *v1beta2.HelmRepository: time: unknown unit "d" in duration "1d"
E0414 16:07:59.281739       1 reflector.go:138] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: Failed to watch *v1beta2.HelmRepository: failed to list *v1beta2.HelmRepository: time: unknown unit "d" in duration "1d"
W0414 16:08:00.131042       1 reflector.go:324] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: failed to list *v1beta2.HelmRepository: time: unknown unit "d" in duration "1d"

and indeed the gitrepo was not updated:

  - lastTransitionTime: "2022-04-14T16:07:45Z"
    message: stored artifact for revision 'main/06fd945dcc398d76b7660c160900a63342f7cc1c'
    observedGeneration: 2
    reason: Succeeded
    status: "True"
    type: Ready

I manually kubectl edited the HelmRepository and set a valid value (see the corrected fragment sketched after the output below). Immediately the GitRepository updated itself:

E0414 16:15:17.651602       1 reflector.go:138] k8s.io/client-go@v0.23.4/tools/cache/reflector.go:167: Failed to watch *v1beta2.HelmRepository: failed to list *v1beta2.HelmRepository: time: unknown unit "d" in duration "1d"

--- Edit the HR ---

I0414 16:15:52.758520       1 leaderelection.go:248] attempting to acquire leader lease flux-system/source-controller-leader-election...
I0414 16:15:52.782951       1 leaderelection.go:258] successfully acquired lease flux-system/source-controller-leader-election
{"level":"info","ts":"2022-04-14T16:15:52.783Z","logger":"controller.gitrepository","msg":"Starting EventSource","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository","source":"kind source: *v1beta2.GitRepository"}
{"level":"info","ts":"2022-04-14T16:15:52.783Z","logger":"controller.gitrepository","msg":"Starting Controller","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"GitRepository"}
kl -n flux-system get gitrepositories.source.toolkit.fluxcd.io flux-system
NAME          URL                                               AGE   READY   STATUS
flux-system   https://gitlab.gitlab/infra/k8s-deployments.git   11m   True    stored artifact for revision 'main/06fd945dcc398d76b7660c160900a63342f7cc1c'
--- Edit the HR ---
flux-system   https://gitlab.gitlab/infra/k8s-deployments.git   12m   True    stored artifact for revision 'main/b07d84adde0cb1085c6cb17cc67ef6c1b951d98f'
flux-system   https://gitlab.gitlab/infra/k8s-deployments.git   12m   True    stored artifact for revision 'main/b07d84adde0cb1085c6cb17cc67ef6c1b951d98f'
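For completeness, the valid fragment after that edit (matching the .spec.interval=1m commit mentioned earlier) would be roughly:

spec:
  interval: 1m  # valid: m (minutes) is a supported Go duration unit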

Steps to reproduce

Expected behavior

The source-controller should still pull the git repo and pick up the change, even if a HelmRepository in the cluster is invalid.

Screenshots and recordings

No response

OS / Distro

N/A

Flux version

v0.28.5

Flux check

► checking prerequisites
✔ Kubernetes 1.22.6+k3s1 >=1.20.6-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.18.2
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.22.3
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.23.2
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.22.5
✔ all checks passed

Git provider

No response

Container Registry provider

No response

Additional context

No response

melkosoft commented 1 year ago

Got the same situation: the helm repository changed and flux failed to pull the helm chart. Any changes to the flux-system repository are not synced; flux is still using the old repo files. Editing the helmrepository does not fix the issue, as the old url comes back after flux reconciles the flux-system source. How can we force flux to sync with the latest git files?

rajat-tomar commented 1 year ago

Hey! I am having the same issue. Did you find any solution for this?

kingdonb commented 1 year ago

This issue has been open for a while without being substantiated, and a lot has happened since Flux 0.28.5. Without hearing more detail about it, I suspect, @rajat-tomar, that you might not have exactly the same issue.

I was not able to reproduce the issue... (quick recording of where I attempted to reproduce the issue; YouTube will have it ready soon, it's still processing the upload now)

If you have a current version of Flux and are able to repro this issue, could you provide more information about it? Following the steps in the original post, I did not see any issue.

Frankkkkk commented 1 year ago

Hi, I'll try to reproduce this with the latest version of Flux tomorrow and keep you up to date. Thanks to you all!

duckfullstop commented 1 year ago

I'm able to reproduce this issue. I'm not sure exactly how, but you can look at the above-referenced commit and just invert the change I made.

Spewing errors rather angrily and not letting source-controller come up:

{"level":"info","ts":"2023-03-14T00:14:52.851Z","logger":"runtime","msg":"k8s.io/client-go@v0.26.2/tools/cache/reflector.go:169: failed to list *v1beta2.HelmRepository: time: unknown unit \"d\" in duration \"1d\"\n"}

hiddeco commented 1 year ago

@duckfullstop what version are you on? As the HelmRepository has a validation rule for the duration these days: https://github.com/fluxcd/source-controller/blob/main/api/v1beta2/helmrepository_types.go#L68-L71

duckfullstop commented 1 year ago

> @duckfullstop what version are you on? As the HelmRepository has a validation rule for the duration these days: https://github.com/fluxcd/source-controller/blob/main/api/v1beta2/helmrepository_types.go#L68-L71

This was on v0.41.1. It's worth noting that I updated from v0.39.0 to attempt to fix this issue, but I observed the new source-controller pod (with the new image version!) continue to fail as above.

My guess would be that the validation you mentioned occurs only on manifest apply, and if bad state is already in the cluster, it breaks everything in the process. If that can be mitigated during read, great; otherwise a patch note / warning somewhere that (favourite search engine here) can pick up would probably be a good idea.

hiddeco commented 1 year ago

Validation rules have been in place since v0.35.0: https://github.com/fluxcd/flux2/releases/tag/v0.35.0, but they indeed only happen during apply.
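For context, this validation is an OpenAPI pattern on the CRD field, which the API server enforces only at admission (create/update); an object stored before the rule existed is not re-validated, and the controller's client then fails while decoding it, as in the logs above. The relevant fragment of the generated CRD schema looks roughly like this (an approximation of what the kubebuilder marker linked above produces):

interval:
  description: Interval at which to check the URL for updates.
  pattern: ^([0-9]+(\.[0-9]+)?(ms|s|m|h))+$
  type: string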