Same issue here!
We are running argocd-server v2.10.9
also... been going on for weeks. The only fix we've found is to restart the argocd-application-controller pods each time our developers run into the issue.
We are configured for HA, are using annotations for paths, and are using a 20 min jitter... due to deploying all apps from one 'deployment' repo, which is essentially our overlay repo. We've also tried increasing application processors and using the 'round-robin' sharding algorithm.
Annotation added to app manifests
# resolves to the spec.source.path directory
argocd.argoproj.io/manifest-generate-paths: .
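For context, this is roughly how the annotation sits on our Application manifests (a minimal sketch; the app name, repo URL, and path below are placeholders, not our real values):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-app            # placeholder name
  namespace: argocd
  annotations:
    # resolves to the spec.source.path directory
    argocd.argoproj.io/manifest-generate-paths: .
spec:
  project: default
  source:
    repoURL: https://example.com/deployment-repo.git   # our single overlay/"deployment" repo
    targetRevision: main
    path: overlays/example-app
  destination:
    server: https://kubernetes.default.svc
    namespace: example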
What was the latest version of Argo CD you used that did not have this issue? What Kubernetes version is the cluster on?
We were using ArgoCD version 2.6.6 + k8s version 1.25 for about a year before this issue started occurring. The first occurrence of this issue did happen 2-3 weeks after a k8s upgrade to 1.29. Since then, we have upgraded ArgoCD to v2.10.9 as noted in the issue.
Current k8s version: 1.29
Issue occurred again this morning right on schedule. This time 3 out of the 5 clusters "successfully synced":
time="2024-06-01T10:50:25Z" level=info msg="Start syncing cluster" server="https://ObfuscatedClusterA"
time="2024-06-01T10:50:53Z" level=info msg="Cluster successfully synced" server="https://ObfuscatedClusterA"
time="2024-06-01T10:54:58Z" level=info msg="Start syncing cluster" server="https://ObfuscatedClusterB"
time="2024-06-01T11:03:31Z" level=info msg="Start syncing cluster" server="https://ObfuscatedClusterC"
time="2024-06-01T11:03:32Z" level=info msg="Cluster successfully synced" server="https://ObfuscatedClusterC"
time="2024-06-01T11:08:41Z" level=info msg="Start syncing cluster" server="https://ObfuscatedClusterD"
time="2024-06-01T11:09:38Z" level=info msg="Cluster successfully synced" server="https://ObfuscatedClusterD"
time="2024-06-01T11:11:49Z" level=info msg="Start syncing cluster" server="https://ObfuscatedClusterE"
Are you on 1.29.4 or higher? "Kube-apiserver: fixes a 1.27+ regression in watch stability by serving watch requests without a resourceVersion from the watch cache by default, as in <1.27 (disabling the change in #115096 by default). This mitigates the impact of an etcd watch bug (https://github.com/etcd-io/etcd/pull/17555). If the 1.27 change in #115096 to serve these requests from underlying storage is still desired despite the impact on watch stability, it can be re-enabled with a WatchFromStorageWithoutResourceVersion feature gate. (https://github.com/kubernetes/kubernetes/pull/123973, @serathius) [SIG API Machinery]" is a possibly related fix that is only available in >= 1.29.4.
@tooptoop4 I mistakenly stated we were on 1.29. We are actually on 1.28.9. I'll see if we can get an upgrade to 1.29.4 (or higher) scheduled so we can test the option you posted.
We upgraded to 1.29.4 and I immediately learned that most API server feature gates are not exposed on EKS (there are only three enabled by default on 1.29), so it looks like this option is off the table for now.
Hello, I see similar behaviour in the GUI with Red Hat OpenShift GitOps version 1.12.3, which runs ArgoCD v2.10.10+9b3d0c0. In the app-controller logs, the pod shows only sync messages and then goes into OOM and CrashLoopBackOff. In my case, increasing the memory limit of the app-controller pod from 2 GB to 3 GB fixed the issue. Do you see any crashes or OOMs of the app-controller pod?
Thanks @sockejr. We have Argo CD deployed to a dedicated 4 node cluster with plenty of CPU/mem available. We currently have no limits on our Argo CD pods. I see no OOM in the logs and all Argo CD pods show healthy and are in a running state.
I'm actually experiencing the issue as I type this. One of our three app controller pods shows it's using 2 cores, while the others are using less than 0.5. The cluster syncs started around an hour ago and 3/5 have completed. Assuming this is like every other occurrence, the remaining two will never complete.
Here's something that stands out to me: the temporary fix for this issue is to cycle the app controller and repo server pods. Cycling these pods kicks off the same "cluster syncs" as above, but when cycling the pods, the syncs always succeed. Maybe they succeed because they run on a "fresh" container, i.e. with no existing cache on disk?
Maybe it's a variant of the kubectl/fork issue from https://github.com/argoproj/argo-cd/issues/5505. Setting or experimenting with different values of --kubectl-parallelism-limit=10 might be worth a try: https://github.com/argoproj/argo-cd/issues/5505#issuecomment-783502260
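If anyone else wants to try that, here is a minimal sketch of how we would set it through argocd-cmd-params-cm rather than editing the container args directly; the key name controller.kubectl.parallelism.limit is what I believe maps to the --kubectl-parallelism-limit flag, so double-check it against your Argo CD version:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # assumed to map to --kubectl-parallelism-limit on the application controller
  controller.kubectl.parallelism.limit: "10"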
Thanks @sockejr, we looked into your suggestion and even set the kubectl-parallelism-limit, but it did not affect the issue.
Issue still occurring on a daily basis (sometimes twice); the only solution we've found is to restart argocd-application-controller... any news?
I have the same issue here. One of our application-controller pods gets killed with an OOMKill, and the only solution that works is to restart the application-controller.
Uggghhhh.....
I'm trying to figure out if the issue we are having (similar to @klinux) is the same thing everybody else here is experiencing:
Without any (apparent) increase in workload, the application-controller-0 pod (always the -0 pod) starts OOM-ing; however, recreating all the pods does not help. We had this issue a few months ago, and we had to bump the memory limit to 3.5Gi, and now again, where we had to bump to 5Gi.
The interesting thing is that memory usage peaks at about 2Gi, but we still need 5Gi for the pod to start and not immediately get OOM-killed:
There are of course peaks in memory usage too short to be picked up by metrics-server, so it sure feels like some kind of memory leak.
We "only" have about 230 apps (created by 20 appsets), and we have manifest-generate-paths set to a reasonably fine granularity.
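For reference, the bump we ended up with looks roughly like this as a patch on the application-controller StatefulSet (the requests value is illustrative, based on the ~2Gi steady-state usage we observe; only the 5Gi limit is what we actually needed):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  template:
    spec:
      containers:
        - name: argocd-application-controller
          resources:
            requests:
              memory: 2Gi   # roughly the steady-state usage we observe
            limits:
              memory: 5Gi   # needed to survive the startup/relist spikes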
Hi @wanddynosios, my problem here is that the project that manages the application has orphaned resource monitoring enabled, and the cluster that the application-controller was monitoring has a lot of orphaned resources. After disabling orphaned resource monitoring in the project, our Argo CD stopped eating memory.
@wanddynosios You can usually tell if a pod has been OOM killed by looking at the container status in the Pod manifest. This prometheus alert is also useful.
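If it helps, a minimal sketch of that kind of alert as a PrometheusRule; it assumes kube-state-metrics is installed, and the rule name, labels, and pod regex are illustrative rather than taken from a specific upstream chart:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argocd-oomkill-alerts
  namespace: argocd
spec:
  groups:
    - name: argocd.oom
      rules:
        - alert: ArgoCDControllerOOMKilled
          # fires when the last terminated reason of a controller container was OOMKilled
          expr: |
            max by (namespace, pod, container) (
              kube_pod_container_status_last_terminated_reason{reason="OOMKilled", pod=~"argocd-application-controller.*"}
            ) == 1
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.pod }} was OOMKilled"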
When the pods start, the application-controller will initiate watches with the configured cluster, and from my understanding this, together with the initial processing of all applications, is what consumes a lot of initial memory, requiring the limit to be much higher than the average memory consumption. Every 24h (I think), or sporadically when a watch fails, the controller will "relist"/reconnect the watch, causing some memory spikes too.
manifest-generate-paths should not affect the controller memory much.
Some tips to reduce the memory:
- Watch fewer resources (resource exclusions/inclusions)
- Watch fewer namespaces
- Shard the clusters across more controller replicas
And if that is not possible, then increase the memory limit. These are based on my observations, but you can also do Go profiling on the pod to find out more about what is consuming the memory.
Thank you both for the pointers. We don't have orphaned resources monitoring enabled, and we already have the OOM-killed alert.
I will definitely look into the tips to reduce memory: though I don't think we can watch fewer namespaces, we might have some resources which don't have to be watched, and we can look into sharding (see the sketch below). I'll report back :-)
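For the sharding option, this is the starting point I have in mind, as a patch sketch on the application-controller StatefulSet; my understanding from the HA docs is that ARGOCD_CONTROLLER_REPLICAS has to match the replica count, and the sharding algorithm variable is optional:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  replicas: 3                      # one shard per replica
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            - name: ARGOCD_CONTROLLER_REPLICAS
              value: "3"           # must match spec.replicas
            - name: ARGOCD_CONTROLLER_SHARDING_ALGORITHM
              value: round-robin   # optional; the default is the legacy hash-based algorithm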
For CPU issues, I suggest you join Slack and post your ArgoCD metrics dashboard, similar to this post: https://cloud-native.slack.com/archives/C01TSERG0KZ/p1719410085483709?thread_ts=1719236455.982699&cid=C01TSERG0KZ
You can also look at the history for other investigations. Multiple things can cause an application to hang, but most of the time it is reflected by a workqueue with too many items or the controller CPU hitting the ceiling.
Of course, if you don't see any logs in the controller, it may be in a broken state due to a deadlock, which can also be seen with the CPU being near 0.
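To make the workqueue symptom easier to spot, here is a sketch of a Prometheus rule on the controller's workqueue metrics; the queue name app_reconciliation_queue and the threshold are assumptions, so check the label values your controller actually exposes:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argocd-controller-queue-alerts
  namespace: argocd
spec:
  groups:
    - name: argocd.workqueue
      rules:
        - alert: ArgoCDReconciliationQueueStuck
          # a queue that stays deep for a long time usually means the controller is hung or starved
          expr: workqueue_depth{name="app_reconciliation_queue"} > 100
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "Argo CD reconciliation queue has been deep for 30m on {{ $labels.pod }}"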
@wanddynosios you can set the env variable ARGOCD_CLUSTER_CACHE_LIST_SEMAPHORE, which defaults to 50, to a lower number. This would increase the cluster sync time at startup, but also consume less memory.
From the code:
// EnvClusterCacheListSemaphore is the env variable to control size of the list semaphore
// This is used to limit the number of concurrent memory consuming operations on the
// k8s list queries results across all clusters to avoid memory spikes during cache initialization.
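A minimal sketch of setting it on the application-controller StatefulSet (the value 20 is just an example trade-off between startup sync time and peak memory):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            # limits concurrent k8s list operations during cluster cache initialization (default 50)
            - name: ARGOCD_CLUSTER_CACHE_LIST_SEMAPHORE
              value: "20"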
So I did some playing around in our non-productive environment (~130 apps). I definitely did encounter the core of this issue, where syncs never finish unless you cycle all application-controller pods (and sometimes repo-server, as noted above).
To summarize my findings regarding resource consumption: for now, our previous setup of fewer controllers with a larger memory buffer offers the most stability for the money. But this queueing issue and the high resource consumption don't appear to be directly connected.
We are sharing our case here, which may help some others.
We had a similar problem in our OpenShift 4.12 environments using ArgoCD 2.10 and below. These environments were all affected by this instability.
In the argocd-application-controller statefulset logs we could see:
time="2024-06-11T15:55:27Z" level=info msg="Start syncing cluster" server="https://kubernetes.default.svc"
Usually after about 3 minutes and 20 seconds we get the following message:
time="2024-06-11T15:58:47Z" level=info msg="Cluster successfully synchronised" server="https://kubernetes.default.svc"
It has happened very often (probably once a day) that the process of refreshing the cache with the cluster gets "stuck" or hangs, and then we see in the web UI that all the applications are running in a continuous "refresh" state, as described in the main topic here.
These clusters have a very large number of objects, but only a few of them are actually managed by ArgoCD. The usual CACHE INFO in ArgoCD looked like this: APIs COUNT: over 200 (the same number in all clusters); RESOURCES COUNT: from 15000 to 22000; APPLICATIONS COUNT: about 100 in all clusters.
As already mentioned by @agaudreault, we chose the option to watch fewer resources (https://argo-cd.readthedocs.io/en/stable/operator-manual/declarative-setup/#resource-exclusioninclusion).
We implemented the resourceInclusions/resourceExclusions logic to significantly reduce the COUNT of APIs to watch, and also the COUNT of RESOURCES, as described here for the Openshift GitOps ArgoCD operator: https://docs.openshift.com/gitops/1.12/argocd_instance/argo-cd-cr-component-properties.html
For example, we use the following, which covers the helm charts/objects we deploy through ArgoCD:
resourceInclusions: |
  - apiGroups:
    - "*"
    clusters:
    - "*"
    kinds:
    - "AppProject"
    - "Application"
    - "ApplicationSet"
    - "ArgoCD"
    - "ConfigMap"
    - "ControllerRevision"
    - "Deployment"
    - "EndpointSlice"
    - "Endpoints"
    - "HorizontalPodAutoscaler"
    - "Ingress"
    - "Pod"
    - "ReplicaSet"
    - "Role"
    - "RoleBinding"
    - "Route"
    - "Secret"
    - "Service"
    - "ServiceAccount"
    - "StatefulSet"
The change had a significant impact on the cluster synchronisation process. We went from over 3 minutes to about 45 seconds in all clusters, regardless of the RESOURCES COUNT. I assume it's related to the number of APIs to watch.
I can observe the same issue in openshift-gitops-operator 1.11.0 which uses ArgoCD v2.9.2+c5ea5c4
The reconciliation queue just gets stuck and is never processed. We have implemented an automatic restart every 3h to make sure that the environment eventually gets reconciled.
After the restart it works for some time. We observe it a lot during the creation of many new applications, but that is usually when we check ArgoCD, so it probably starts much earlier. We are not using any watch exclusions; around 250 applications managing around 4 external clusters at the moment.
We see the same error in the logs
W0703 20:48:32.266283 1 warnings.go:70] apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+
time="2024-07-03T20:50:09Z" level=error msg="Failed to cache app resources: error getting resource tree: failed to get app hosts: error synchronizing cache state : failed to sync cluster https://api.c52.obfuscated.ocp.obfuscated.acme.com:6443: failed to load initial state of resource ImageStream.image.openshift.io: the server was unable to return a response in the time allotted, but may still be processing the request" application=argocd-mgmt/c52-machine-config-chrony dedup_ms=0 diff_ms=0 git_ms=73518 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0
time="2024-07-03T20:50:09Z" level=error msg="Failed to cache app resources: error getting managed resources: error getting cluster cache: error synchronizing cache state : failed to sync cluster https://api.c52.obfuscated.ocp.obfuscated.acme.com:6443: failed to load initial state of resource ImageStream.image.openshift.io: the server was unable to return a response in the time allotted, but may still be processing the request" application=argocd-mgmt/c52-argocd-managed-ocp dedup_ms=181361 diff_ms=0 git_ms=1234969 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0
time="2024-07-03T20:50:09Z" level=error msg="Failed to cache app resources: error getting resource tree: failed to get app hosts: error synchronizing cache state : failed to sync cluster https://api.c52.obfuscated.ocp.obfuscated.acme.com:6443: failed to load initial state of resource ImageStream.image.openshift.io: the server was unable to return a response in the time allotted, but may still be processing the request" application=argocd-mgmt/c52-image dedup_ms=0 diff_ms=0 git_ms=73913 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0
time="2024-07-03T20:50:10Z" level=error msg="Failed to cache app resources: error getting managed resources: error getting cluster cache: error synchronizing cache state : failed to sync cluster https://api.c52.obfuscated.ocp.obfuscated.acme.com:6443: failed to load initial state of resource ImageStream.image.openshift.io: the server was unable to return a response in the time allotted, but may still be processing the request" application=argocd-mgmt/c52-monitoring dedup_ms=182086 diff_ms=12 git_ms=1099531 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0
time="2024-07-03T20:50:12Z" level=error msg="Failed to cache app resources: error getting resource tree: failed to get app hosts: error synchronizing cache state : failed to sync cluster https://api.c52.obfuscated.ocp.obfuscated.acme.com:6443: failed to load initial state of resource ImageStream.image.openshift.io: the server was unable to return a response in the time allotted, but may still be processing the request" application=argocd-mgmt/c52-ipam-controller dedup_ms=0 diff_ms=0 git_ms=0 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0
E0703 20:50:49.184759 1 retrywatcher.go:130] "Watch failed" err="context canceled"
[the line above repeats roughly 28 more times at 20:50:49 with slightly different microsecond timestamps]
E0703 20:50:49.185028 1 retrywatcher.go:130] "Watch failed" err="Get \"https://api.c52.obfuscated.ocp.obfuscated.acme.com:6443/apis/user.openshift.io/v1/users?allowWatchBookmarks=true&resourceVersion=165834346&watch=true\": context canceled"
@jbartyze-rh assuming the CPU and memory are fine at that moment, and that there are no deadlocks (this can be seen with pprof), did you check the Kubernetes API client metrics? There are some env variables that can be adjusted to reduce burst and throttling: https://github.com/argoproj/argo-cd/pull/8404.
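For reference, a sketch of how those client settings could be raised on the application controller as a StatefulSet patch; the variable names below (ARGOCD_K8S_CLIENT_QPS, ARGOCD_K8S_CLIENT_BURST) and the values are my recollection of what the linked PR introduced, so treat them as assumptions and verify them against your Argo CD version:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
  namespace: argocd
spec:
  template:
    spec:
      containers:
        - name: argocd-application-controller
          env:
            # assumed env vars from the linked PR; verify the names and defaults for your version
            - name: ARGOCD_K8S_CLIENT_QPS
              value: "100"
            - name: ARGOCD_K8S_CLIENT_BURST
              value: "200"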
Those env variables look promising! @agaudreault do we have some rule-of-thumb recommendations for tuning those?
Missed the pprof part - sadly this is a highly regulated environment where I do not have the flexibility to access such tools.
Also, it looks like this RH bug https://issues.redhat.com/browse/GITOPS-4757 is correlated to this particular issue. I cannot share the content, but it was not possible to replicate the bug with ArgoCD 2.11. Once the new openshift-gitops-operator 1.13 is released, I will validate whether the controller hang is still happening. There is some information (let's treat it with a grain of salt) that the problem could be correlated to git timeouts experienced by ArgoCD, and that the git cache introduced in 2.11 could help.
Some metric dumps here.
Currently, as we can see, the controller is hung.
API activity looks like it goes down when the controller hangs.
There is a huge number of goroutines launched in one controller vs. the other ones.
When the hanging starts we also see a spike in git requests (ls-remote).
At the etcd level on that hub we see disk sync slowness introduced 10 minutes before, which could have an effect on API responses on that particular hub cluster.
At the API level on that hub:
I was able to find some controllers that were in a deadlock state on Kubernetes API changes. Here are the relevant logs from the pprof goroutine capture. I am not sure if it will fully fix all the issues in this thread, but it definitely causes the controller to hang.
I will be working on a fix for this specific scenario.
Summary:
- loadInitialState(..., false) is called on watch event. At this moment, it does not hold the cache lock
  - listResources, which use a semaphore (default of 50)
- startMissingWatches
  - loadInitialState(..., true) is called on watch event. At this moment, it does hold the cache lock
    - listResources, which use a semaphore (default of 50)

// NO LOCK. HAS SEMAPHORE LOCK. WAITING FOR LOCK...... 50 routines, and semaphore lock is 50.....
50 @ 0x43e64e 0x44faf8 0x44facf 0x46dce5 0x47d13d 0x47e611 0x47e5f2 0x1f1b37c 0x1f1c0e7 0x1f1b6a2 0x1f1bd33 0x1f1c996 0x1b7ab1b 0x1335f53 0x133496c 0x13366e7 0x1336569 0x1335fb3 0x1335ecb 0x1b7a916 0x1f1c785 0x471e61
# 0x46dce4 sync.runtime_SemacquireMutex+0x24 /usr/local/go/src/runtime/sema.go:77
# 0x47d13c sync.(*Mutex).lockSlow+0x15c /usr/local/go/src/sync/mutex.go:171
# 0x47e610 sync.(*Mutex).Lock+0x30 /usr/local/go/src/sync/mutex.go:90
# 0x47e5f1 sync.(*RWMutex).Lock+0x11 /usr/local/go/src/sync/rwmutex.go:147
# 0x1f1b37b github.com/argoproj/gitops-engine/pkg/cache.runSynced+0x3b /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.1-0.20240416142647-fbecbb86e412/pkg/cache/cluster.go:524
# 0x1f1c0e6 github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).loadInitialState.func1+0x306 /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.1-0.20240416142647-fbecbb86e412/pkg/cache/cluster.go:586
# 0x1f1b6a1 github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).listResources+0x201 /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.1-0.20240416142647-fbecbb86e412/pkg/cache/cluster.go:567
# 0x1f1bd32 github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).loadInitialState+0x92 /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.1-0.20240416142647-fbecbb86e412/pkg/cache/cluster.go:571
# 0x1f1c995 github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).watchEvents.func1+0x155 /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.1-0.20240416142647-fbecbb86e412/pkg/cache/cluster.go:607
# 0x1b7ab1a github.com/argoproj/gitops-engine/pkg/utils/kube.RetryUntilSucceed.func1+0xfa /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.1-0.20240416142647-fbecbb86e412/pkg/utils/kube/kube.go:411
# 0x1335f52 k8s.io/apimachinery/pkg/util/wait.PollImmediateUntil.ConditionFunc.WithContext.func1+0x12 /go/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:222
# 0x133496b k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext+0x4b /go/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:235
# 0x13366e6 k8s.io/apimachinery/pkg/util/wait.WaitForWithContext+0x106 /go/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:662
# 0x1336568 k8s.io/apimachinery/pkg/util/wait.poll+0x88 /go/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:596
# 0x1335fb2 k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext+0x52 /go/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:547
# 0x1335eca k8s.io/apimachinery/pkg/util/wait.PollImmediateUntil+0x6a /go/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:538
# 0x1b7a915 github.com/argoproj/gitops-engine/pkg/utils/kube.RetryUntilSucceed+0x115 /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.1-0.20240416142647-fbecbb86e412/pkg/utils/kube/kube.go:409
# 0x1f1c784 github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).watchEvents+0x244 /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.1-0.20240416142647-fbecbb86e412/pkg/cache/cluster.go:598
// HAS WRITE LOCK. WAITING FOR SEMAPHORE LIST
1 @ 0x43e64e 0x44eac5 0x1d77e45 0x1f1b526 0x1f1bd33 0x1f1ae30 0x1f1e4ca 0x1f1acf0 0x1f1e0f7 0x1f1b3c6 0x1f1d219 0x1b7ab1b 0x1335f53 0x133496c 0x13366e7 0x1336569 0x1335fb3 0x1335ecb 0x1b7a916 0x1f1c785 0x471e61
# 0x1d77e44 golang.org/x/sync/semaphore.(*Weighted).Acquire+0x324 /go/pkg/mod/golang.org/x/sync@v0.3.0/semaphore/semaphore.go:60
# 0x1f1b525 github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).listResources+0x85 /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.1-0.20240416142647-fbecbb86e412/pkg/cache/cluster.go:531
# 0x1f1bd32 github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).loadInitialState+0x92 /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.1-0.20240416142647-fbecbb86e412/pkg/cache/cluster.go:571
# 0x1f1ae2f github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).startMissingWatches.func1+0xcf /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.1-0.20240416142647-fbecbb86e412/pkg/cache/cluster.go:494
# 0x1f1e4c9 github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).processApi+0xc9 /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.1-0.20240416142647-fbecbb86e412/pkg/cache/cluster.go:736
# 0x1f1acef github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).startMissingWatches+0x36f /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.1-0.20240416142647-fbecbb86e412/pkg/cache/cluster.go:493
# 0x1f1e0f6 github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).watchEvents.func1.4+0x16 /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.1-0.20240416142647-fbecbb86e412/pkg/cache/cluster.go:701
# 0x1f1b3c5 github.com/argoproj/gitops-engine/pkg/cache.runSynced+0x85 /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.1-0.20240416142647-fbecbb86e412/pkg/cache/cluster.go:526
# 0x1f1d218 github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).watchEvents.func1+0x9d8 /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.1-0.20240416142647-fbecbb86e412/pkg/cache/cluster.go:700
# 0x1b7ab1a github.com/argoproj/gitops-engine/pkg/utils/kube.RetryUntilSucceed.func1+0xfa /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.1-0.20240416142647-fbecbb86e412/pkg/utils/kube/kube.go:411
# 0x1335f52 k8s.io/apimachinery/pkg/util/wait.PollImmediateUntil.ConditionFunc.WithContext.func1+0x12 /go/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:222
# 0x133496b k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext+0x4b /go/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:235
# 0x13366e6 k8s.io/apimachinery/pkg/util/wait.WaitForWithContext+0x106 /go/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:662
# 0x1336568 k8s.io/apimachinery/pkg/util/wait.poll+0x88 /go/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:596
# 0x1335fb2 k8s.io/apimachinery/pkg/util/wait.PollImmediateUntilWithContext+0x52 /go/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:547
# 0x1335eca k8s.io/apimachinery/pkg/util/wait.PollImmediateUntil+0x6a /go/pkg/mod/k8s.io/apimachinery@v0.26.11/pkg/util/wait/wait.go:538
# 0x1b7a915 github.com/argoproj/gitops-engine/pkg/utils/kube.RetryUntilSucceed+0x115 /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.1-0.20240416142647-fbecbb86e412/pkg/utils/kube/kube.go:409
# 0x1f1c784 github.com/argoproj/gitops-engine/pkg/cache.(*clusterCache).watchEvents+0x244 /go/pkg/mod/github.com/argoproj/gitops-engine@v0.7.1-0.20240416142647-fbecbb86e412/pkg/cache/cluster.go:598
any idea which argo version introduced this bug?
@agaudreault I had opened an issue for that specific problem and submitted a PR for it here: https://github.com/argoproj/argo-cd/issues/18902
Checklist:
- I've pasted the output of argocd version.

Describe the bug
In the logs, I see all of our (five) clusters start to sync around the same time. This is denoted by the following log message:
After a few minutes, some of the cluster syncs complete, while other clusters never complete. The following logs show this behavior:
Note: For the clusters that don't sync, there is never an additional log message that states the sync failed. I've watched the logs for up to three days.
Shortly after this occurs, the Argo CD UI shows applications in a perpetual syncing ("Refreshing") state. See screenshot #1. This state continues until the application-controller and repo-server are restarted.

In addition to the behavior above, I also see this new message in the logs:
Note: This log message never occurs unless we're experiencing this issue and goes away after cycling the application-controller and repo-server.
To Reproduce
Wait for the clusters to automatically sync, which appears to happen once a day (perhaps this is related to the generated manifests cache invalidation that defaults to 24h).
Expected behavior
"Cluster successfully synced"
message for each cluster.Screenshots
1
Version
Logs
Other notes
We are using AWS EKS but are not using AWS VPC CNI network policies.