@Funk66 We are experiencing a similar situation, but on ArgoCD v1.8.4+28aea3d. Something strange we noticed is that we currently encounter this issue only on EKS clusters. We have identical clusters in Azure and the problem does not occur there.
Mind me asking if your ArgoCD setup is in AWS? Did the issue happen around the 4th of January? At least, that's what happened with our identical environments.
@patrickjahns, yes, this is on AWS. It started in December, on the day we upgraded to v2.2.1, as explained in the description. I have no reason to think that this is related to the underlying infrastructure. If you have any indication to the contrary, please let me know and I'll try reaching out to the AWS support team.
I ran into the same problem, but on version v1.7.14+92b0237.
@Funk66 It might be a red herring, but let me briefly explain what led me down this path:
We have several Kubernetes environments in AWS and Azure, with ArgoCD installed locally in each. Of all our clusters, three are EKS clusters in the same region, running 1.18-eks.8 and 1.19-eks.6. We are seeing the issue on those three clusters, and it started to surface on the same day (4 January) at around the same time (within half an hour of each other).
We increased the logging verbosity to debug/trace but haven't found any further indicators so far, so this is really mind-boggling right now.
@fatalc Any chance this is happening in EKS? If not, at least I can be a bit more sure that EKS is not the right direction to investigate ;-)
@patrickjahns, did the issue by any chance start after an application controller pod restart? We're on EKS 1.20 and see this happening on every cluster in every region. The only change around the time it started was the ArgoCD upgrade, which is why I'm inclined to think that this problem is caused by ArgoCD being unable to properly keep track of the apps it has already refreshed. That said, I haven't taken the time to look into the code, so that's just an uninformed guess.
We didn't perform any operations on the controllers. By chance, all three controllers must have been restarted around the same time (same day, within 1 hour of each other).
We are seeing this on our k3s cluster (v1.22.4+k3s1) with ArgoCD v2.1.8. CPU usage is generally high too.
Further digging in our environments revealed that the external-secrets controller was constantly updating the status field of ExternalSecrets resources. In our case this was triggered by expired certificates (mTLS authentication of external-secrets) that we hadn't caught.
We've resolved the underlying certificate issues and the reconciliation loop stopped. We also noticed in the ArgoCD documentation that one can stop status changes from triggering reconciliation loops:
data:
resource.compareoptions: |
# disables status field diffing in specified resource types
# 'crd' - CustomResourceDefinition-s (default)
# 'all' - all resources
# 'none' - disabled
ignoreResourceStatusField: all
https://argo-cd.readthedocs.io/en/stable/user-guide/diffing/#system-level-configuration
Maybe this is something people can try to see if that is the trigger in their environments. Besides that, another option would be to iterate over the resources and watch for changes. I haven't found a nice command to watch all resources (i.e. something along the lines of kubectl watch *) yet - if anyone has an idea, it would be highly appreciated.
Something like https://github.com/corneliusweig/ketall/issues/29 would be good for catching this, I suppose.
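For what it's worth, a rough sketch of the kind of loop we had in mind (not an official kubectl feature, and it opens one watch per resource kind, so it can be heavy on large clusters):
# Watch every namespaced, watchable resource kind and prefix each line with the kind,
# to spot objects that are updated continuously.
for kind in $(kubectl api-resources --verbs=list,watch --namespaced -o name); do
  kubectl get "$kind" --all-namespaces --watch --no-headers 2>/dev/null \
    | sed "s/^/[$kind] /" &
done
wait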
Within the team we also discussed how we could have caught these changes more easily, and we came to the conclusion that it would be great if ArgoCD logging (DEBUG/TRACE mode) could include more information on which changes/events triggered the reconciliation.
Maybe this is something the ArgoCD maintainers would consider (cc @alexmt - pinging you since this was added to a milestone for investigation).
Within the team we also discussed how we could have caught these changes more easily, and we came to the conclusion that it would be great if ArgoCD logging (DEBUG/TRACE mode) could include more information on which changes/events triggered the reconciliation.
I agree, this information would be really useful. We've had reconciliation loop bugs in the past where it wasn't clear which resource(s) actually triggered the reconciliation, and it took tremendous effort to troubleshoot.
The issue about changing secrets was mentioned in #6108. I have checked all resources being tracked by the corresponding applications and none of them seems to change, or at least not at that rate. The ignoreResourceStatusField parameter didn't help in my case. I will have to dig deeper to ferret out what is going on. I agree that more comprehensive logging would make this much easier.
So I've finally taken some time to have another look at this and here's what I found. First, I can confirm that the issue started with v2.2.0. Reverting the application-controller image to an earlier version makes the problem go away. Furthermore, I think the issue was introduced with commit 05935a9d, where an 'if' statement to exclude orphaned resources was removed.
The problem itself is that ArgoCD detects changes to ConfigMaps used for leader election purposes. These can easily be identified with kubectl get cm -A -w, since the leader election process requires updating the ConfigMap every few seconds. Now, even though these resources are listed in spec.orphanedResources.ignore of the AppProject manifest, the ApplicationController.handleObjectUpdated method flags them as being managed by every App in that namespace, hence calling requestAppRefresh for each one of them roughly every second.
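For reference, this is roughly what such an ignore list looks like in the AppProject (the name pattern is a placeholder; the actual leader-election ConfigMap names vary per controller). Despite being listed here, they still trigger refreshes:
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: default
  namespace: argocd
spec:
  orphanedResources:
    warn: false
    ignore:
      # placeholder pattern matching the leader-election ConfigMaps
      - kind: ConfigMap
        name: "*-leader-election"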
I could submit a PR reverting the conflicting change, but I'd appreciate having other opinions on how to better fix this.
Running ArgoCD 2.1.3 in EKS and having problems with high CPU usage and throttling of the application controller as well. So I don't think 2.2 is the only culprit.
For what it's worth, I tried the solution suggested by @patrickjahns above and our ArgoCD went from consuming ~1000-1500m to ~20m CPU.
I.e. setting this in argocd-cm and restarting the argocd-application-controller deployment:
data:
resource.compareoptions: |
ignoreResourceStatusField: all
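For reference, the restart can be done with something like the following (assuming the default argocd namespace; note that in newer installs the controller runs as a StatefulSet rather than a Deployment):
kubectl -n argocd rollout restart deployment argocd-application-controller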
Running ArgoCD 2.2.5 in EKS 1.21.
I'm also hit by the high CPU caused by the reconciliation loop. Thanks to @Funk66 I verified that it is caused by the leader election ConfigMaps. Is there any workaround available or a fix in progress? The problem exists for me in Argo CD 2.3.1 and 2.3.2 with the following ConfigMaps:
FYI: if you remove spec.orphanedResources completely from your AppProject, the reconciliation loop and high CPU stop. I had it set to warn: false to be able to see orphaned resources in the web UI:
spec:
description: Argocd Project
orphanedResources:
warn: false
Removing it led to a complete stop of the reconciliation loop and a significant drop in CPU.
Using the command suggested by @Funk66 I was also able to see that I have several ConfigMaps that keep popping up in the list, and one of them is in a namespace we see many reconciliations for.
Is there a workaround?
Remove orphanedResources from the spec - even if it is empty, the issue is still ongoing as long as it is present. See https://github.com/argoproj/argo-cd/issues/8100#issuecomment-1076067184. Tested with version v2.3.3.
@Vladyslav-Miletskyi thanks! That did the trick. We were having the exact same problem and now the load is normal.
Is there something other than a debug log that we could use to detect this in a production deployment? Enabling debug logging in production is not an option for us.
I am mainly looking for a way to find resources that are continuously regenerated.
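Something along the following lines might work (a rough sketch, not an ArgoCD feature): snapshot the resourceVersion of everything twice and diff, since anything whose resourceVersion changed in between is being rewritten:
# Take two snapshots of resourceVersions a minute apart and diff them;
# entries that only appear in the second snapshot changed (or appeared) in the meantime.
snapshot() {
  for kind in $(kubectl api-resources --verbs=list --namespaced -o name); do
    kubectl get "$kind" --all-namespaces \
      -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name} {.metadata.resourceVersion}{"\n"}{end}' 2>/dev/null \
      | sed "s|^|$kind |"
  done
}
snapshot | sort > /tmp/rv-before
sleep 60
snapshot | sort > /tmp/rv-after
diff /tmp/rv-before /tmp/rv-after | grep '^>'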
Disabling orphanedResources didn't do the trick for me. I am observing around 2k/min of "Refreshing app status (controller refresh requested)" messages in the logs with only 170 apps. ArgoCD v2.4.11.
The issue is still present in v2.5.1, and orphanedResources is not in the spec.
We are having the same issue with Keda ScaledObjects. Keda appears to update the status.lastActiveTime field every few seconds, which in turn appears to trigger a reconciliation. Setting ignoreResourceStatusField to crd or all doesn't appear to make a difference.
Is there any way to ignore reconciliation for specific resources or fields?
We are having the same issue with Keda ScaledObjects. Keda appears to update the status.lastActiveTime field every few seconds, which in turn appears to trigger a reconciliation. Setting ignoreResourceStatusField to crd or all doesn't appear to make a difference. Is there any way to ignore reconciliation for specific resources or fields?
#8100, #8914 and #6108 all appear to be pretty similar and I can't see a workaround in any of those, so I would appreciate it if anyone can suggest one!
In case it helps anyone else, increasing the ScaledObject pollingInterval made a massive difference to the ArgoCD CPU usage.
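Roughly what that change looks like (names and trigger are illustrative; pollingInterval defaults to 30 seconds):
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sample-app
spec:
  scaleTargetRef:
    name: sample-app
  pollingInterval: 300   # seconds; raised from the 30s default to cut status churn
  triggers:
    - type: cpu
      metricType: Utilization
      metadata:
        value: "80"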
I've still been seeing this a lot on v2.6.2 with two different MetalLB deployments. It constantly loops over them, and orphanedResources is not in the project spec for default.
In v2.6.1 with ignoreAggregatedRoles: true, ignoreResourceStatusField: all, timeout.reconciliation: 300s, and an increased polling interval for Keda, the issue is still present. The application controller (4 replicas) is using 16 CPUs with ~280 applications.
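For context, this is roughly how those settings are combined in our argocd-cm (assuming the default argocd namespace):
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.compareoptions: |
    ignoreAggregatedRoles: true
    ignoreResourceStatusField: all
  timeout.reconciliation: 300s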
ArgoCD version:
{
"Version": "v2.5.7+e0ee345",
"BuildDate": "2023-01-18T02:23:39Z",
"GitCommit": "e0ee3458d0921ad636c5977d96873d18590ecf1a",
"GitTreeState": "clean",
"GoVersion": "go1.18.10",
"Compiler": "gc",
"Platform": "linux/amd64",
"KustomizeVersion": "v4.5.7 2022-08-02T16:35:54Z",
"HelmVersion": "v3.10.3+g835b733",
"KubectlVersion": "v0.24.2",
"JsonnetVersion": "v0.18.0"
}
we even bumped timeout.reconciliation from 30m to 2h, but that didn't help.
we ran into this issue when using custom plugins for our applications:
plugin:
  env: []
  name: custom-plugin
repoURL: ssh://git@<your-repo-server>/argo/deploy-sample-app.git
targetRevision: main
and noticed the following logs in the application controller:
{"application":"argocd/deploy-sample-“app,”level":"info","msg":"Refreshing app status (spec.source differs), level (3)","time":"2023-03-02T06:16:35Z"}
With multiple test environments configured to use ArgoCD and hundreds of Argo apps per environment, this crashed our Git servers every couple of days.
So we had to add the following dummy variable to stop the constant refreshing of the app:
plugin:
  env:
    - name: DUMMY_VAR_TO_STOP_ARGO_REFRESH
      value: "true"
I'm also seeing this issue with AzureKeyVaultSecret
[argocd-application-controller-8] time="2023-03-10T00:39:15Z" level=debug msg="Refreshing app argocd/application for change in cluster of object namespace/avk of type spv.no/v1/AzureKeyVaultSecret"
this then triggers a level (1) refresh that takes a long time:
[argocd-application-controller-8] time="2023-03-10T00:39:14Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application= argocd/application
This behavior can be configured via ignoreResourceUpdates to resolve the issue.
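For example, something along these lines in argocd-cm (the group/kind and ignored path are illustrative, based on the AzureKeyVaultSecret case above; depending on the version, the feature may need to be enabled explicitly):
data:
  resource.ignoreResourceUpdatesEnabled: "true"
  resource.customizations.ignoreResourceUpdates.spv.no_AzureKeyVaultSecret: |
    jsonPointers:
      - /status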
@Funk66 did you submit a PR for https://github.com/argoproj/argo-cd/issues/8100#issuecomment-1033514595?
I tried implementing a fix but couldn't make it work fully. I may try again in the coming weeks, if nobody else does.
Describe the bug
Upon upgrading from v2.1.7 to v2.2.1, the argocd application controller started performing continuous reconciliations for every app (about one per second, which is as much as CPU capacity allows). Issues #3262 and #6108 sound similar but didn't help. I haven't been able to figure out the reason why a refresh keeps being requested. The log below shows the block that keeps repeating for each app every second.
Expected behavior
The number of reconciliations should be two orders of magnitude lower.
Version
Logs