argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

High CPU usage in repo server for plugin detection with >8.000 apps #15763

Open woehrl01 opened 12 months ago

woehrl01 commented 12 months ago

Describe the bug

Using Argo CD 2.8.3, we see high CPU usage in the repo server during plugin detection.

We use a huge monorepo for our applications, without any templating (just plain YAML), but plugin detection still takes a significant amount of time.

Flame graph with pixie: Screenshot 2023-09-08 at 13 47 16

Screenshot 2023-09-08 at 13 46 59

Another one on cleanup: Screenshot 2023-09-08 at 12 57 08

Slack discussion: https://cloud-native.slack.com/archives/C01TSERG0KZ/p1694514516286809?thread_ts=1694175483.721089&cid=C01TSERG0KZ

CC: @csantanapr

To Reproduce

Apply thousands of apps at the same time

Expected behavior

Apply them "fast"

Screenshots

Version

v2.8.3+77556d9

crenshaw-dev commented 12 months ago

Do you actually use any CMPs, or is all that truly completely wasted CPU?

woehrl01 commented 12 months ago

@crenshaw-dev I use a CMP, but not in that repository.

crenshaw-dev commented 12 months ago

Gotcha. We could cache the discovery result on a per-commit basis, but my guess is that you're hitting the high CPU use with new commits.

An alternative would be to explicitly set helm, kustomize, or directory in the spec.source field. That should force Argo to bypass the CMP detection phase.
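
For a plain-YAML app, a minimal sketch of such a manifest might look like the following. The app name, repo URL, path, and namespaces are hypothetical placeholders. The `include: '*'` entry is there on the assumption that an empty `directory: {}` block may be treated as unset, so a non-empty field is used to make the source type explicit.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-app                                 # hypothetical name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/org/monorepo.git   # hypothetical repo
    targetRevision: HEAD
    path: apps/example-app                          # hypothetical app folder
    # Declaring the source type explicitly; with this set, Argo CD
    # should not need to ask any CMP whether it handles this app.
    directory:
      include: '*'
  destination:
    server: https://kubernetes.default.svc
    namespace: example                              # hypothetical namespace
```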

woehrl01 commented 12 months ago

No, actually it's the same commit, but I have a monorepo, so it runs the detection 8,000 times, once for each app's root folder.

Great, I'll check the directory part.

JuozasVainauskas commented 12 months ago

We encountered the same issue on v2.7.11. After migrating plugins from argocd-cm to sidecars, CPU and memory usage skyrocketed. Consequently, the argocd-repo-server pods started to get throttled, and Argo CD slowed down and eventually got stuck. Bumping the argocd-repo-server resource requests and limits did not help, so we had to revert the changes.

As a result, we cannot use Argo CD sidecar plugins and are blocked from updating Argo CD to v2.8.

Resource usage increase after plugins migration to sidecars:

Screenshot 2023-10-03 at 13 28 01

crenshaw-dev commented 12 months ago

@woehrl01 this might also help mitigate the issue if your monorepo is large due to non-yaml resources: https://argo-cd.readthedocs.io/en/latest/operator-manual/config-management-plugins/#plugin-tar-stream-exclusions

JuozasVainauskas commented 12 months ago

> Gotcha. We could cache the discovery result on a per-commit basis, but my guess is that you're hitting the high CPU use with new commits.
>
> An alternative would be to explicitly set helm, kustomize, or directory in the spec.source field. That should force Argo to bypass the CMP detection phase.

Could you please elaborate on this solution?

woehrl01 commented 12 months ago

Thanks @crenshaw-dev. The repo consists only of YAML files, but I still use the exclusion setting to exclude the .git folder.

I also found that I have to lower the number of parallel repo actions from 50 to 5; otherwise I end up in a strange deadlock. That could also be caused by plugin detection.

crenshaw-dev commented 12 months ago

@JuozasVainauskas Argo CD only does plugin "discovery" if you haven't explicitly specified in your App manifest that you want something besides a plugin. For example:

kind: Application
spec:
  source:
    kustomize:
      images: [a=b]

For this app, Argo CD would skip plugin discovery because it automatically knows it'll be using Kustomize instead.

JuozasVainauskas commented 12 months ago

> @JuozasVainauskas Argo CD only does plugin "discovery" if you haven't explicitly specified in your App manifest that you want something besides a plugin. For example:
>
> kind: Application
> spec:
>   source:
>     kustomize:
>       images: [a=b]
>
> For this app, Argo CD would skip plugin discovery because it automatically knows it'll be using Kustomize instead.

Understood, thank you. Unfortunately, this will not help us since we use plugins by name instead of discovery.

woehrl01 commented 12 months ago

@crenshaw-dev I just deployed the fix with the directory setting across all our clusters. The CPU usage of the repo-server has not changed (but isn't an issue yet). I'll monitor it and keep you updated.

todaywasawesome commented 11 months ago

Comments from @crenshaw-dev: When you add a CMP, every app now has to query that CMP to see whether the CMP can handle it. This is by design, to keep potential issues out of the repo server. However, it creates a performance penalty if you add a single CMP for a single app, because all apps have to check against that CMP. Will review to see if we should architect this differently.
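
For context, the discovery rule lives in the sidecar's plugin configuration. The sketch below is illustrative only: the plugin name, the marker file, and the generate command are hypothetical. It assumes the `ConfigManagementPlugin` schema with a `discover.fileName` rule; apps that do not declare a source type are checked against rules like this one.

```yaml
# plugin.yaml for a sidecar CMP (name, marker file, and command are hypothetical)
apiVersion: argoproj.io/v1alpha1
kind: ConfigManagementPlugin
metadata:
  name: example-plugin
spec:
  # Discovery rule: the plugin claims an app only if a matching file
  # exists in that app's source directory.
  discover:
    fileName: "example-plugin.yaml"
  # Command the sidecar runs to produce manifests for a claimed app.
  generate:
    command: [sh, -c]
    args: ["cat *.yaml"]
```

An app can also select a plugin by name through `spec.source.plugin.name` instead of relying on discovery.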

todaywasawesome commented 11 months ago

Related proposal: https://github.com/argoproj/argo-cd/issues/15006

todaywasawesome commented 11 months ago

Another suggested stop-gap: Support a feature flag to disable discovery.

@alexmt suggests keeping it disabled by default.

JuozasVainauskas commented 11 months ago

We managed to keep CPU usage under control by setting the --parallelismlimit flag. However, after migrating the argocd-cm plugins to sidecars, CPU usage still increased significantly and Argo CD got slower. As a result, we cannot migrate our argocd-cm plugins to sidecars or upgrade our Argo CD instances to 2.8.x.

Screenshot 2023-10-06 at 13 36 56

JuozasVainauskas commented 11 months ago

Update: we have solved the performance issue by setting --plugin-tar-exclude to .git/*, and have now migrated our argocd-cm plugins to sidecars.
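
As a rough sketch, the two repo-server settings mentioned in this thread (--parallelismlimit and --plugin-tar-exclude) could be set declaratively instead of as command-line flags. This assumes the `argocd-cmd-params-cm` ConfigMap keys `reposerver.parallelism.limit` and `reposerver.plugin.tar.exclusions` are available in your Argo CD version. The parallelism value is only an illustrative number, not one reported in this thread.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  # Assumed equivalent of --parallelismlimit; "10" is an illustrative value.
  reposerver.parallelism.limit: "10"
  # Assumed equivalent of --plugin-tar-exclude; keeps the .git folder
  # out of the tar stream sent to CMP sidecars.
  reposerver.plugin.tar.exclusions: ".git/*"
```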

sidewinder12s commented 11 months ago

Potentially unrelated: I had wondered whether it would be easier to configure the plugin tar stream as an inclusion list per plugin, rather than as a global exclusion list. At least in large monorepos, it's much easier to decide what I want to send to the CMP than to work out what to exclude.

woehrl01 commented 11 months ago

@crenshaw-dev

We just redeployed about 6,000 apps today. With the fix of setting the directory explicitly and bypassing plugin detection, we now get a really awesome deployment time of about 20 minutes. CPU usage of the repo servers is also great!

Screenshot 2023-10-18 at 12 38 25

CPU usage of repo server:

Screenshot 2023-10-18 at 12 35 11

A possible further optimization would be to get rid of the multiple git operations, given that it's a monorepo and a single commit triggered the redeploy:

Screenshot 2023-10-18 at 12 43 35

sidewinder12s commented 11 months ago

I can also confirm huge perf improvements by adding:

directory:
  include: '*'

to our directory-type Argo CD apps (we only have maybe 30-50 of them). The git_ms timing in our repo-server logs went from 40s to 20s.

ctrought commented 11 months ago

Close to 70 cores peak for a repo server pod in one of our clusters.

ArgoCD 2.8.4

image

silveiralexf commented 7 months ago

Our problem seems quite similar to the ones from folks in the thread... we have a big mono repo and a high number of applications (+8k), and cloning the same repo for each app seems to be the cause of the performance problems.

When using the plugin as an initContainer in a previous version of Argo CD (2.4.18), the same plugin synchronized all apps and resources in just a few seconds (~20s). On the other hand, when using it as a sidecar, it takes around 20 minutes, uses lots of CPU, and often never fully completes...

Question: Is it possible/recommended to still use plugins as an initContainer instead of changing to the new sidecar approach? I had the impression the option was removed in v2.8+, but couldn't tell for sure from the docs so far...

Also, is there any way to avoid the multiple cloning of the same repo that we might have missed from the docs?

Thanks a ton in advance for any insights!!!

jsolana commented 5 months ago

Hi, also related: https://github.com/argoproj/argo-cd/issues/17948 and https://github.com/argoproj/argo-cd/issues/17951

kfirfer commented 3 months ago

We have had the same performance issue since we moved to a CMP plugin; specifically, we use the helmfile plugin integration. We have around 100 apps, each containing 2-3 charts, and it is around 20 times slower.

darioef commented 1 month ago

It seems to work better for me after upgrading to 2.11 and applying the new argocd.argoproj.io/manifest-generate-paths annotation to my Applications/ApplicationSets (previously this feature worked only for webhooks).

Example:

      annotations:
        argocd.argoproj.io/manifest-generate-paths: "."