git fetch --tags often throws: [rejected] TAGNAME -> TAGNAME (would clobber existing tag)

ChristianTrolleMikkelsen commented 3 years ago

Describe the bug

We have a repository which is monitored by 4 flux instances. They each monitor their own folder in the root. They each have their own tag to update.

This worked fine until we upgraded from 1.21.0 to 1.24.2, at which point we also changed the git poll interval from 1 min to 20s. Now we are seeing this warning/error:

git fetch --tags /tmp/flux-gitclone931172227 ['+refs/tags/*:refs/tags/*']: From /tmp/flux-gitclone931172227 ! [rejected] TAGNAME -> TAGNAME (would clobber existing tag)

This error comes from an instance of flux which is supposed to use TAG1, but the error message will mention TAG2, TAG3, TAG4 or a mix of them. I'm not sure, but I think this might cause a full sync; this is not yet verified, however.

I might be wrong, but normally, locally, I would fix this with a "git fetch --tags -f", since the tags are supposed to be 100% controlled by flux at any time. It could also be some sort of race condition, since we have multiple flux instances, each monitoring its own folder and each with its own tag. I'm not sure.

In the latter case, maybe flux should only fetch the tags it is configured to use, e.g. fetch only TAG1 if it is configured to use TAG1?
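A minimal Go sketch of that idea, under stated assumptions: `tagRefspec` and `fetchArgsForTag` are hypothetical helpers written for illustration, not functions from the flux codebase. They build the arguments for a `git fetch` limited to a single tag. The leading `+` in the refspec asks git to update the local tag even if it already points elsewhere, and `--no-tags` keeps git from fetching other tags along the way. Whether flux could safely fetch this way is not established here.

```go
package main

import (
	"fmt"
	"strings"
)

// tagRefspec returns a forced refspec that covers exactly one tag.
func tagRefspec(tag string) string {
	return fmt.Sprintf("+refs/tags/%s:refs/tags/%s", tag, tag)
}

// fetchArgsForTag (hypothetical helper) builds the argument list for a
// `git fetch` restricted to a single tag from the given upstream.
func fetchArgsForTag(upstream, tag string) []string {
	return []string{"fetch", "--no-tags", upstream, tagRefspec(tag)}
}

func main() {
	// The upstream path is the one from the error message above; TAG1 is
	// the placeholder tag name used in this report.
	args := fetchArgsForTag("/tmp/flux-gitclone931172227", "TAG1")
	fmt.Println("git " + strings.Join(args, " "))
}
```

With this shape, an instance configured for TAG1 would never touch TAG2, TAG3 or TAG4 locally, so a moved tag belonging to another instance could not be reported as a clobber.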

Unfortunately I'm no Go expert. It looks to me like this could be fixed here: https://github.com/fluxcd/flux/blob/master/pkg/git/operations.go#L183

Steps to reproduce

This is our config (relevant part) used with different label values, path, etc for each of our instances of flux:

- --git-url=$(GITHUB-URL)
- --git-branch=master
- --git-path=$(GITHUB-PATH)
- --git-label=$(GITHUB-TAG)
- --git-user=$(GITHUB-USER)
- --git-email=$(GITHUB-EMAIL)
- --listen-metrics=:3031
- --registry-disable-scanning
- --sync-garbage-collection=true
- --git-poll-interval=20s
- --sync-interval=720m

Expected behavior

I expect the error:

git fetch --tags /tmp/flux-gitclone931172227 ['+refs/tags/*:refs/tags/*']: From /tmp/flux-gitclone931172227 ! [rejected] TAGNAME -> TAGNAME (would clobber existing tag)

to appear every once in a while, and this might trigger a full sync, which is bad when you have thousands of files.

Kubernetes version / Distro / Cloud provider

GKE 1.20.10-gke.1600

Flux version

Flux 1.24.2, docker, no helm chart

Git provider

GitHub

Container Registry provider

No response

Additional context

No response

Maintenance Acknowledgement

[X] I am aware of Flux v1's maintenance status

Code of Conduct

[X] I agree to follow this project's Code of Conduct

kingdonb commented 3 years ago

Sorry that you have experienced some trouble! I have taken a look at your report and I have some feedback and suggestions.

20s is an extremely tight interval for syncing. I think when your sync interval is less than 30s, and especially with larger repositories, what you are likely to see are issues that come up when those requests aren't able to complete in time.

The fetch command is called in a couple of places. I think refspec is the tag (or ref) to fetch:

git fetch --tags /tmp/flux-gitclone931172227 ['+refs/tags/*:refs/tags/*']

The refspec argument here is that appended array, containing +refs/tags/*:refs/tags/*. This tells git to update all refs that are tags, much like git fetch --tags --force. It comes from here:

https://github.com/fluxcd/flux/blob/62943c19104b8874982c763b884fc65697e9579e/pkg/git/working.go#L91
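To make the parts of that refspec concrete, here is a small Go sketch that splits a refspec into its force flag, source pattern and destination pattern. `Refspec` and `parseRefspec` are hypothetical names introduced for illustration; this is not how git or flux parse refspecs, and it deliberately ignores edge cases such as a refspec with no destination.

```go
package main

import (
	"fmt"
	"strings"
)

// Refspec holds the three parts of a "[+]src:dst" fetch refspec.
type Refspec struct {
	Force bool   // true when the refspec starts with "+"
	Src   string // ref pattern on the remote side
	Dst   string // ref pattern on the local side
}

// parseRefspec is an illustrative helper, not git's own parser.
func parseRefspec(s string) (Refspec, error) {
	force := strings.HasPrefix(s, "+")
	body := strings.TrimPrefix(s, "+")
	parts := strings.SplitN(body, ":", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return Refspec{}, fmt.Errorf("refspec %q is not of the form [+]src:dst", s)
	}
	return Refspec{Force: force, Src: parts[0], Dst: parts[1]}, nil
}

func main() {
	rs, err := parseRefspec("+refs/tags/*:refs/tags/*")
	if err != nil {
		panic(err)
	}
	fmt.Printf("force=%v src=%s dst=%s\n", rs.Force, rs.Src, rs.Dst)
}
```

For the refspec quoted above, both sides are the wildcard `refs/tags/*`, which is why every tag in the repository is in scope rather than only the tag a given flux instance is configured to use.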

I think it could be more surgical as you suggested, fetching only the important tag, but I am not certain if it's safe to make this change. In general I'm very leery of making any changes that deep in the internals without knowing what the consequences may be.

As Flux v1 is committed to not making any breaking changes while we are in maintenance mode, it would have to be a compelling fix to an issue that many people are experiencing, and moreover provably safe to make this change. You mentioned that you made the sync interval shorter when you upgraded; while there were a number of changes between 1.21.0 and 1.24.2 including dependencies and some related to git syncing bugs, I don't think any would have been likely to have had an effect on this code path. Is it possible this is just an effect of shortening the interval too far? (Have you tested out reverting the setting to 60s against 1.24.2?)

We usually recommend webhooks (like flux-recv, or in Flux v2 the Notification Controller's "Receiver" CRD) for situations where developers want to see their changes reflected in the cluster immediately, or rather, at least as quickly as possible. Making the sync interval tighter puts a great deal of mostly unnecessary pressure on the git upstream to respond quickly and continuously, since most git remotes can tell you via a Webhook whenever there are changes. This is bound to fall down occasionally.

As I think you have seen here, shortening the interval this far increases the likelihood of experiencing negative side effects, which can certainly be caused by timing out. We have included a minimum interval of 30 seconds in Flux v2 for that reason: any sync interval shorter than 30s is automatically raised to 30s, rather than overworking Flux and overwhelming the git upstream or cluster control plane with repeated syncs, most of which are no-ops.

ChristianTrolleMikkelsen commented 2 years ago

Okay, so we implemented a workaround. Unfortunately, pushing to flux is not an option in our setup.

We changed flux to write sync-state to the k8s ssh secret instead of a repo tag. Then we wrote a small k8s event listener, which then updates the tag and everything works for us.

We would like to move to v2, but we couldn't find any documentation on how to get the sync-state written to a k8s secret / tags.