carsonoid commented 4 years ago

Summary

Webhooks for git changes should only trigger resync for apps when the app files have actually changed.

Motivation

Right now every time the apiserver process a webhook it will trigger a resync for every app that uses that repo if the ref has changed since the since last sync, regardless of whether or not any files that the app uses in that repo have changed. Reference

For small use-cases this is fine, but we have a large installation with a single repo containing over 600 apps. That same repo also contains the app manifests themselves What this means is that when we push changes for a single app to our repo, ~599 excessive syncs are triggered. This was slowing down our argo performance significantly for no benefit.

We created a fork of the apiserver and wrote some custom logic in to skip app syncs when they were not needed and it improved our Argo performance at our scale by an order of magnitude.

One current workaround is to put all Application manifests in a dedicated repo, but that shouldn't be required. And still has the same issue of excessive refreshes if you have a large number of Application manifests that are constantly changing (like we do).

Proposal

Most of the major webhook formats include lists of files that have changed. The apiserver should be able to intelligently trigger selective resyncs based on the list of changed files in the webhook.

However: Because kustomize can do tricky things with directory traversal we cannot simply check for changed files under the .spec.source.path by default. Because it's possible to not trigger updates properly for kustomize based apps.

I propose two new, optional annotations on argoproj.io/Application resources:

argocd.argoproj.io/refresh-on-path-updates-only: "true|false"
- If the apiserver processes an app this this annotation set to true then it will use .spec.source.path to see if any of the files under that path have been changed according to the webhook. If so, it will trigger the sync like always. if not, it will skip the refresh for the app.
argocd.argoproj.io/refresh-prefix: "PATHSTRING"
- If the apiserver processes an app this this annotation then it will us its value to see if any of the files under that path have been changed according to the webhook. If so, it will trigger the sync like always. if not, it will skip the refresh for the app.

This means that all existing installations are unchanged and anyone not doing extensive kustomize backwards directory traversal (../../../file) can simply set argocd.argoproj.io/refresh-on-path-updates-only: "true" on all their apps and get an instant performance boost from the reduced refreshes!

Alternatively, even complicated kustomize based installations could use argocd.argoproj.io/refresh-prefix to get the same benefit, albeit with more work.

carsonoid commented 4 years ago

I've submitted a PR with my proposed code change to implement this here: https://github.com/argoproj/argo-cd/pull/4273

The code there is basically a general-purpose implementation of what we are already doing in our installation.

jessesuen commented 4 years ago

@carsonoid we don't think this will achieve that much other than delay the eventual refresh that anyways happens for all apps (by default 3 minute intervals). In other words, the skipping would only take affect for the annotated app, but within a few minutes all apps get refreshed anyways.

carsonoid commented 4 years ago

@jessesuen that's fair. We had to increase that interval quite a bit internally, But when we did, paired with sync-skipping options I've outlined here, we saw a 10x increase in the time to sync apps in our large installation. We can't be the only ones to be hitting this issue either. Large installations need a way to avoid excessive, unnecessary syncs.

jessesuen commented 4 years ago

We plan on improving repo server performance by paralyzing simultaneous queries against a single repo URL. Currently there is a mutex protecting a single repoURL (per repo server). I hope with that improvement, annotations won't be necessary

jessesuen commented 4 years ago

@carsonoid to workaround your issue, have you tried bumping up the number of repo-servers replicas to accommodate the simultaneous requests for a single mono-repo?

carsonoid commented 4 years ago

Sure did! We scaled up the number of repo servers and tweaked the app-controller as suggested in the HA docs. It had almost no impact on our performance.

The only thing that actually improved performance was running a fork of the api server which did something much like the proposed feature and dramatically reduced the number of pointless refreshes being done by the controller. The controller really seemed to be the bottleneck as it was basically doing app refreshes non-stop until we fixed the webhook.

jrkt commented 4 years ago

:+1: for this feature request. We have the same problem with one repo for about 600 services in six k8s clusters and have noticed a lot of pointless refreshes of apps across all our k8s clusters when only one app changes in one cluster. This would be a HUGE benefit for us.

jannfis commented 4 years ago

This is basically the same request as #3890

carsonoid commented 4 years ago

This is basically the same request as #3890

Except I have a more in-depth proposal and actual code for the feature tested and ready to go :)

pyr0hu commented 4 years ago

Also 👍 , we are using a single repo that defines the whole infrastructure and it'd be a really good perf improvement. As the deploy process updates the infra repo with the new app image, all application will refresh and resync (if autosync enabled) if the deployment changes, even it it was only one webapp.

alexmt commented 4 years ago

Added one suggestion to PR https://github.com/argoproj/argo-cd/pull/4273#issuecomment-696440801 . it was also discussed it with @jessesuen offline. @carsonoid , let me know what you think please.

dominykas commented 3 years ago

I can see #4699 was merged and released in v1.8, however I'm a little uncertain about the behaviors.

Does this apply to webhooks only? i.e. the filtering does not apply when used with polling?
How does this affect the semantics of failed syncs - when the application source is filtered, the commit-SHA is still the last commit in the repo, not the last commit for the application or is it? Documentation says:

Automatic sync will not reattempt a sync if the previous sync attempt against the same commit-SHA and parameters had failed.

carsonoid commented 3 years ago

I'm pretty sure that the final version of this change offloads all the refresh work to happen on webhook. I think polling will still trigger refreshes on all apps for the repo, but those that have been refreshed by a webhook call will go faster. But I'm not positive. I know from personal experience that we have had better performance by turning the poll frequency down with --app-resync=1800 as the docs say. We have had great performance with these settings and a very large repo for all apps.
The commit SHA will always show the latest for the repo and has no link to the argocd.argoproj.io/manifest-generate-paths filtering. This means that failed syncs will either retrigger when the next git poll happens after any commit to the repo or when a change is pushed to the target path in the repo.

Disclaimer: I am not on the Argo-CD team. This is all from my personal experience writing and using this one feature and Argo-CD a lot.

argoproj / argo-cd

Conditionally skip app resync requests for webhooks #4272

Summary

Motivation

Proposal