fluxcd / flux

Successor: https://github.com/fluxcd/flux2
https://fluxcd.io
Apache License 2.0

flux stuck and does not recover #3044

Closed runningman84 closed 3 years ago

runningman84 commented 4 years ago

Describe the bug

After a few days, flux got stuck and stopped syncing. I killed the pod and everything was up and running again.

To Reproduce

I have no idea... maybe the network connection was broken for a few minutes or the file system got corrupted...

Expected behavior

flux should terminate if it cannot reach the git repo for a long time, for example 1 hour. This would allow k8s to restart the container, which gives visibility and might also solve the issue.

Logs

ts=2020-05-05T09:17:13.783823769Z caller=checkpoint.go:24 component=checkpoint msg="up to date" latest=1.19.0
ts=2020-05-05T09:25:36.751714104Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone406024532
ts=2020-05-05T10:25:36.752085036Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone406024532
ts=2020-05-05T11:25:36.752559107Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone406024532
ts=2020-05-05T12:25:36.753326379Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone406024532
ts=2020-05-05T13:25:36.754100308Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone406024532
ts=2020-05-05T14:25:36.754661578Z caller=loop.go:107 component=sync-loop err="git repo not ready: git clone --mirror: fatal: Could not read from remote repository., full output:\n Cloning into bare repository '/tmp/flux-gitclone406024532

Additional context

brenix commented 4 years ago

Just encountered this in our setup as well. DNS had been temporarily unavailable, therefore it failed to clone the repo. It seems that the readiness/liveness probe did not detect this as a failure, so the pod remained up with issues. Once the pod had been re-created, it started working again.

primeroz commented 4 years ago

possible duplicate/related to https://github.com/fluxcd/flux/issues/3014 ?

We have been having this kind of issue since upgrading to 1.19

jukvalim commented 4 years ago

We've been having this issue as well, on two clusters. One has flux 1.19.0, the other 1.20.0.

RichiCoder1 commented 4 years ago

Ran into this same issue, had to cycle the pods to get them to start syncing again.

kingdonb commented 3 years ago

I was experiencing this issue (regularly, at least once a week) on my Okteto Cloud flux deployment, but I upgraded it to v1.21.2 several days ago and haven't seen it again. It would run for days and keep trying to sync, but once the pod hit an intermittent DNS failure, it would fail to clone from then on until restarted.

I don't know of any specific changes that could have fixed it, but unless someone has a current repro with the latest version of Flux v1, I will have to close this.