Open d-wierdsma opened 1 year ago
I was just able to reproduce this issue again when attempting an upgrade to v2.6.7 from v2.6.1. This time I did not see any "failed to get git client for repo" errors; however, I rolled back within 15 minutes, so I'm guessing it just didn't have time to reach the git rate limit.
We can see from these images that CPU spikes almost immediately, causing the HPA to scale up the number of repo-servers, which compounds the issue.
As for logs, we can also see a distinct spike in log volume at this time. I'm still investigating these logs to see if there are any apparent issues, but at first glance they look like the following:
{"level":"info","msg":"manifest cache hit: \u0026ApplicationSource{RepoURL:https://chartmuseum.xxx.com,Path:,TargetRevision:0.2.4,Helm:\u0026ApplicationSourceHelm{ValueFiles:[$values/apps/test/application-values.yaml $values/apps/test/test-values.yaml $values/clusters/test/cluster-values.yaml],Parameters:[]HelmParameter{},ReleaseName:,Values:,FileParameters:[]HelmFileParameter{},Version:,PassCredentials:false,IgnoreMissingValueFiles:false,SkipCrds:false,},Kustomize:nil,Directory:nil,Plugin:nil,Chart:gitops,Ref:,}/0.2.4","time":"2023-04-06T15:17:41Z"}
{"grpc.code":"OK","grpc.method":"GenerateManifest","grpc.service":"repository.RepoServerService","grpc.start_time":"2023-04-06T15:17:41Z","grpc.time_ms":588.414,"level":"info","msg":"finished unary call with code OK","span.kind":"server","system":"grpc","time":"2023-04-06T15:17:41Z"}
{"level":"info","msg":"manifest cache miss: \u0026ApplicationSource{RepoURL:git@git.xxx.git,Path:,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Directory:nil,Plugin:nil,Chart:,Ref:values,}/0a6daabbea0494097a41c0bcbacece4cb1908631","time":"2023-04-06T15:17:41Z"}
{"grpc.code":"OK","grpc.method":"GenerateManifest","grpc.service":"repository.RepoServerService","grpc.start_time":"2023-04-06T15:17:41Z","grpc.time_ms":3.614,"level":"info","msg":"finished unary call with code OK","span.kind":"server","system":"grpc","time":"2023-04-06T15:17:41Z"}
{"level":"info","msg":"manifest cache hit: \u0026ApplicationSource{RepoURL:https://chartmuseum.xxx.com,Path:,TargetRevision:0.2.4,Helm:\u0026ApplicationSourceHelm{ValueFiles:[$values/apps/shared-services/application-values.yaml $values/apps/shared-services/shared-services-values.yaml $values/clusters/shared-services/cluster-values.yaml],Parameters:[]HelmParameter{},ReleaseName:,Values:,FileParameters:[]HelmFileParameter{},Version:,PassCredentials:false,IgnoreMissingValueFiles:false,SkipCrds:false,},Kustomize:nil,Directory:nil,Plugin:nil,Chart:gitops,Ref:,}/0.2.4","time":"2023-04-06T15:17:41Z"}
{"grpc.code":"OK","grpc.method":"GenerateManifest","grpc.service":"repository.RepoServerService","grpc.start_time":"2023-04-06T15:17:41Z","grpc.time_ms":628.879,"level":"info","msg":"finished unary call with code OK","span.kind":"server","system":"grpc","time":"2023-04-06T15:17:41Z"}
{"level":"info","msg":"manifest cache miss: \u0026ApplicationSource{RepoURL:git@git.xxx.git,Path:,TargetRevision:HEAD,Helm:nil,Kustomize:nil,Directory:nil,Plugin:nil,Chart:,Ref:values,}/61355835ddae350ebe8c19e3ed49a426c574a464","time":"2023-04-06T15:17:41Z"}
{"grpc.code":"OK","grpc.method":"GenerateManifest","grpc.service":"repository.RepoServerService","grpc.start_time":"2023-04-06T15:17:41Z","grpc.time_ms":3.632,"level":"info","msg":"finished unary call with code OK","span.kind":"server","system":"grpc","time":"2023-04-06T15:17:41Z"}
{"level":"info","msg":"manifest cache hit: \u0026ApplicationSource{RepoURL:https://chartmuseum.xxx.com,Path:,TargetRevision:0.2.4,Helm:\u0026ApplicationSourceHelm{ValueFiles:[$values/apps/dev/application-values.yaml $values/apps/dev/dev-values.yaml $values/clusters/dev/cluster-values.yaml],Parameters:[]HelmParameter{},ReleaseName:,Values:,FileParameters:[]HelmFileParameter{},Version:,PassCredentials:false,IgnoreMissingValueFiles:false,SkipCrds:false,},Kustomize:nil,Directory:nil,Plugin:nil,Chart:gitops,Ref:,}/0.2.4","time":"2023-04-06T15:17:41Z"}
{"grpc.code":"OK","grpc.method":"GenerateManifest","grpc.service":"repository.RepoServerService","grpc.start_time":"2023-04-06T15:17:41Z","grpc.time_ms":635.541,"level":"info","msg":"finished unary call with code OK","span.kind":"server","system":"grpc","time":"2023-04-06T15:17:41Z"}
I've also just verified that our GitLab instance has no authenticated API request rate limits set. As we are using SSH creds, I assume this is how Repo Server is making requests.
The interesting part of this to me is that Repo Server seems to be pulling all repos on startup, even though we have disabled automatic sync and only intend to trigger syncs from webhooks. It's entirely possible I'm misunderstanding the Repo Server startup process, though.
I experienced a similar issue: ArgoCD multi-source Applications stayed in an Unknown state until manually refreshed. I also noticed some GitHub rate limiting during this issue.
I turned on ApplicationSet and Application Controller Debug logs and started to see that there were a ton of reconciliation loops being created by the Application Controller due to Orphaned resources.
I had set the orphanedResources field on all my ArgoCD Projects, which made my applications attempt to claim ownership of all orphaned resources within the namespace that each application is deployed to.
spec:
  description: Argocd Project
  orphanedResources:
    warn: false
Here is the difference in reconciliation and git ls-remote calls.
There are still some reconciliation loops in place that I need to investigate, but it's significantly better now.
Ref: https://github.com/argoproj/argo-cd/issues/8100#issuecomment-1076067184
Hello, Any updates on this? We've tried to upgrade from version 2.6.2 to 2.6.11 and Git Requests and Reconciliation start increasing immediately. After rolling it back it decreases significantly.
@crenshaw-dev Did you see something similar? I saw you asked us to create a separate issue. Could you please help us with it?
I've noticed the issue starts at version 2.6.3 and an endless loop of reconciliation happening to applications that have recurse: true
Same Here!
Is everyone here using ApplicationSets? I suspect the issue might be related to the ApplicationSet controller failing to normalize the App spec before applying it. The Application controller and the ApplicationSet controller end up fighting over the correct App manifest, resulting in constant reconciliation.
I've merged a fix: https://github.com/argoproj/argo-cd/pull/14481
Yes, we're using ApplicationSets, as the others mentioned in #14712. I've tried version 2.7.10 with no success.
Relevant comments to this. Adding them here for reference: https://github.com/argoproj/argo-cd/issues/14712#issuecomment-1662320951, https://github.com/argoproj/argo-cd/issues/14712#issuecomment-1662346485, https://github.com/argoproj/argo-cd/issues/14712#issuecomment-1663420058, https://github.com/argoproj/argo-cd/issues/14712#issuecomment-1663813660, https://github.com/argoproj/argo-cd/issues/14712#issuecomment-1663834340, https://github.com/argoproj/argo-cd/issues/14712#issuecomment-1663840412
@stafot do we know for certain yet that the appset controller is involved at all in the high request count in your env? What happens if you scale down the controller for a few minutes?
I'm a little suspicious that multi-source apps might be to blame in your case: https://github.com/argoproj/argo-cd/issues/14725
@crenshaw-dev We did the test, after upgrading Argocd and scaling down the appset controller to zero the reconciliation activity kept increasing.
@crenshaw-dev Do you think the case: #14725 is related? I was reading the recent messages that seem very similar to our problem, we're using the app-of-apps pattern with multi-source apps.
@crenshaw-dev I am wondering if there will be any action on this. It has been happening for several months, it has been mentioned by several people, and there seems to be no real activity on this issue.
It's a pity that we cannot even upgrade to the latest versions of ArgoCD to catch up on security fixes and the latest improvements. Is there any other workaround?
@nromriell We're following your amazing work on this issue, where two parts have already been merged. We believe our issue is related and we want to share some results after upgrading to 2.9.5. We had been stuck on version 2.6.2 since the bug was introduced, so we upgraded from it. Before your changes, reconciliation and git requests increased non-stop; now they are high but stable, as in the graphs below:
We've ~150 apps with multi-source.
Do you believe this is expected until the third and fourth parts of the fix are merged?
Thanks.
Hi @andrleite, as of the last state I would expect the checkouts at least to be lower.
My changes are primarily around fixing the number of git requests between cache invalidations. Looking at the graphs you shared here, it looks like your cache is nearly constantly invalidated, which looks like the primary issue and is likely why you aren't seeing the behavior you'd expect. Have you tried setting timeout.reconciliation to something very high like 24 hours rather than 0 to compare?
My time has been pretty limited lately so I haven't been able to continue to make improvements here but should at least be able to look at opening up the remaining two PRs here shortly. I think as is though those probably won't fix what you're seeing since they rely on the items being cached, it would just reduce the call count per cycle.
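For anyone wanting to try the timeout.reconciliation comparison suggested above, here is a minimal sketch that builds a merge patch for the argocd-cm ConfigMap and prints the matching kubectl command. It assumes Argo CD runs in the `argocd` namespace and that `24h` is accepted as the duration value; both are assumptions to check against your installation, the helper names are illustrative, and the affected components may need a restart before the new value takes effect.

```python
import json
import shlex


def reconciliation_patch(timeout: str) -> str:
    """Return a JSON merge patch that sets timeout.reconciliation in argocd-cm."""
    return json.dumps({"data": {"timeout.reconciliation": timeout}})


def kubectl_command(timeout: str, namespace: str = "argocd") -> str:
    """Build a kubectl command line that applies the patch (namespace is an assumption)."""
    patch = reconciliation_patch(timeout)
    return " ".join([
        "kubectl", "patch", "configmap", "argocd-cm",
        "-n", shlex.quote(namespace),
        "--type", "merge",
        "-p", shlex.quote(patch),  # quote the JSON so the shell passes it as one argument
    ])


# Print the command instead of running it, so it can be reviewed first.
print(kubectl_command("24h"))
```

Printing rather than executing keeps the sketch side-effect free; the output can be pasted into a shell once the namespace and value look right.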
Describe the bug
After upgrading to ArgoCD version 2.6.4 from 2.6.2 we experienced an issue where Repo Server was unable to resolve a git client and forced many apps into an Unknown state. When I manually refreshed the apps they were able to resolve the git client without issues; however, they would then refresh automatically on their own, leading once again to an Unknown state even though they were now Healthy. As you can see from the screenshots, the repoServer kept attempting connections to the git client in large numbers, as seen in the ls-remote dashboard panel screenshot.
Our current ArgoCD cluster deploys to 5 clusters in total using an App of Apps generator method; we currently have ~120 applications managed by this centralized ArgoCD cluster.
Our current setup for applications is roughly as follows: Per team and environment we have an App of Apps that creates another set of App of Apps for each application that we would like to deploy, this sub App of Apps will then deploy the application (for example prometheus) to the appropriate external clusters.
We are also utilizing the new Multi-source applications to use values files contained in our private git repositories with a mix of private and public helm charts.
To Reproduce
We have disabled the automatic refresh on apps in favour of git webhooks that refresh apps when there are changes to the repos
Have ~100 applications on a HA ArgoCD setup, with the following relevant settings:
I assume that the repoServer reached a rate-limit built into our internal gitlab instance and kept sending requests after getting failures.
Expected behavior
I expect Repo Server to eventually fail on calls to the git service and not keep sending requests when there are no changes and the application is healthy.
Screenshots
https://user-images.githubusercontent.com/13317139/225333581-0e118090-e311-457a-987e-0a9861860129.png https://user-images.githubusercontent.com/13317139/225334095-3786b36b-87ab-45b4-a572-18593fa371ee.png
Version
Unable to determine the exact SHA as it took out our git version, but it was v2.6.4 as of approx. Tue Mar 14 12:03:49 2023 -0400
Logs