argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

Reconciliation loop #8100

Closed Funk66 closed 1 year ago

Funk66 commented 2 years ago


Describe the bug

Upon upgrading from v2.1.7 to v2.2.1, the argocd application controller started performing continuous reconciliations for every app (about one per second, which is as much as CPU capacity allows). Issues #3262 and #6108 sound similar but didn't help. I haven't been able to figure out the reason why a refresh keeps being requested. The log below shows the block that keeps repeating for each app every second.

Expected behavior

The number of reconciliations should be two orders of magnitude lower.

Version

v2.2.2+03b17e0

Logs

time="2022-01-05T12:34:33Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=kube-proxy
time="2022-01-05T12:34:33Z" level=info msg="Ignore status for CustomResourceDefinitions"
time="2022-01-05T12:34:33Z" level=info msg="Ignore '/spec/preserveUnknownFields' for CustomResourceDefinitions"
time="2022-01-05T12:34:33Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: kube-system)" application=kube-proxy
time="2022-01-05T12:34:33Z" level=info msg="getRepoObjs stats" application=prometheus-adapter build_options_ms=0 helm_ms=0 plugins_ms=0 repo_ms=0 time_ms=178 unmarshal_ms=163 version_ms=14
time="2022-01-05T12:34:33Z" level=info msg="Skipping auto-sync: application status is Synced" application=monitoring-common
time="2022-01-05T12:34:33Z" level=info msg="No status changes. Skipping patch" application=monitoring-common
time="2022-01-05T12:34:33Z" level=info msg="Reconciliation completed" application=monitoring-common dedup_ms=0 dest-name= dest-namespace=services dest-server="https://kubernetes.default.svc" diff_ms=0 fields.level=1 git_ms=391 health_ms=0 live_ms=119 settings_ms=0 sync_ms=0 time_ms=1098
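To quantify how often each app is refreshed, the log block above can be tallied per application. The following is a minimal sketch, assuming the controller writes logfmt-style text lines like the ones shown; the function names `parse_logfmt` and `count_refreshes` are hypothetical helpers, not part of Argo CD.

```python
import shlex
from collections import Counter

# Message prefix taken from the log lines quoted in this issue.
REFRESH_PREFIX = "Refreshing app status"


def parse_logfmt(line):
    """Parse one logfmt line (key=value pairs, values optionally double-quoted)."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def count_refreshes(lines):
    """Count 'Refreshing app status' messages per application."""
    counts = Counter()
    for line in lines:
        try:
            fields = parse_logfmt(line)
        except ValueError:
            # Skip lines with unbalanced quotes (e.g. truncated output).
            continue
        if fields.get("msg", "").startswith(REFRESH_PREFIX):
            counts[fields.get("application", "<unknown>")] += 1
    return counts
```

Feeding a saved log into it, for example `count_refreshes(open("controller.log")).most_common(10)`, lists the apps that are refreshed most often (the file name is a placeholder).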
patrickjahns commented 2 years ago

@Funk66 We are experiencing a similar situation, but on ArgoCD v1.8.4+28aea3d. Something strange we noticed is that we currently encounter this issue only on EKS clusters. We have identical clusters in Azure and the problem does not occur there.

Mind me asking if your ArgoCD setup is in AWS? Did the issue start around the 4th of January? At least that is what happened with our identical environments.

Funk66 commented 2 years ago

@patrickjahns, yes, this is on AWS. It started in December, on the day we upgraded to v2.2.1, as explained in the description. I have no reason to think that this is related to the underlying infrastructure. If you have any indication to the contrary, please let me know and I'll try reaching out to the AWS support team.

cnfatal commented 2 years ago

I ran into the same problem, but on version v1.7.14+92b0237.

patrickjahns commented 2 years ago

@Funk66 It might be a red herring, but I can briefly explain what led me to ask:

We have several Kubernetes environments in AWS and Azure, with ArgoCD installed locally. Of all these clusters, 3 are EKS clusters in the same region, running versions 1.18-eks.8 and 1.19-eks.6. We are seeing the issue on 3 of our clusters from the same region, and it started to surface on the same day (4 January) at around the same time (half an hour apart).

We increased the verbosity of logging to debug/trace but haven't found any further indicators so far. So this is really mind boggling right now

@fatalc Any chance this is happening in EKS? If not, at least I am a bit more sure that EKS is not the right direction to investigate ;-)

Funk66 commented 2 years ago

@patrickjahns, did the issue by any chance start after an application controller pod restart? We're on EKS 1.20 and see this happening on every cluster in every region. The only change around the time it started was the ArgoCD upgrade, which is why I'm inclined to think that this problem is caused by ArgoCD being unable to properly keep track of the apps it has already refreshed. That said, I haven't taken the time to look into the code, so that's just an uninformed guess.

patrickjahns commented 2 years ago

We didn't perform any operations on the controllers. By chance, all three controllers must have been restarted around the same time (same day, within 1 hour of each other).

MrSaints commented 2 years ago

We are seeing this on our k3s cluster (v1.22.4+k3s1) with ArgoCD v2.1.8. CPU generally high too.

patrickjahns commented 2 years ago

Further digging in our environments revealed that the external-secrets controller was constantly updating externalsecrets resources (status field). In our environments this was triggered by expired certificates (mTLS authentication of external-secrets), which we hadn't caught.

We've resolved the underlying issues with the certificates and the reconciliation loop stopped. In the ArgoCD documentation we also noticed that one can prevent status changes from triggering reconciliation loops:

data:
  resource.compareoptions: |
    # disables status field diffing in specified resource types
    # 'crd' - CustomResourceDefinition-s (default)
    # 'all' - all resources
    # 'none' - disabled
    ignoreResourceStatusField: all

https://argo-cd.readthedocs.io/en/stable/user-guide/diffing/#system-level-configuration

Maybe this is something people can try to see whether it is the trigger in their environments. Besides that, another option would be to iterate over the resources and watch for changes. I haven't found a nice command to watch all resources yet (i.e. something along the lines of kubectl watch *); if anyone has an idea, it would be highly appreciated.

Something like https://github.com/corneliusweig/ketall/issues/29 would be good to catch I suppose.

In the team we also discussed how we could have caught the changes more easily, and we came to the conclusion that it would be great if ArgoCD's logging (in DEBUG/TRACE mode) included more information on which changes/events triggered the reconciliation.

Maybe this is something that ArgoCD maintainers would consider (cc @alexmt (pinging you since it was added to a milestone for investigation))

jannfis commented 2 years ago

In the team we also discussed how we could have caught the changes more easily, and we came to the conclusion that it would be great if ArgoCD's logging (in DEBUG/TRACE mode) included more information on which changes/events triggered the reconciliation.

I agree, this information would be really useful. We had reconciliation loop bugs in the past where it wasn't clear which resource(s) actually triggered the reconciliation, and they took tremendous effort to troubleshoot.

Funk66 commented 2 years ago

The issue about changing secrets was mentioned in #6108. I have checked all resources being tracked by the corresponding applications and none of them seems to change, or at least not at that rate. The ignoreResourceStatusField parameter didn't help in my case. I will have to dig deeper to ferret out what is going on. I agree that more comprehensive logging would make this much easier.

Funk66 commented 2 years ago

So I've finally taken some time to have another look at this and here's what I found. First, I can confirm that the issue started with v2.2.0. Reverting the application-controller image to an earlier version makes the problem go away. Furthermore, I think the issue was introduced with commit 05935a9d, where an 'if' statement to exclude orphaned resources was removed. The problem itself is that ArgoCD detects changes to config-maps used for leader election purposes. These can be easily identified with kubectl get cm -A -w, since the leader election process requires updating the config-map every few seconds. Now, even though these resources are listed in spec.orphanedResources.ignore of the AppProject manifest, the ApplicationController.handleObjectUpdated method flags them as being managed by every App in that namespace, hence calling requestAppRefresh for each one of them roughly every second. I could submit a PR reverting the conflicting change, but I'd appreciate having other opinions on how to better fix this.
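To see which config-maps are being rewritten most often, the output of kubectl get cm -A -w can be tallied. This is a rough sketch, assuming the usual column layout (NAMESPACE, NAME, DATA, AGE) and that each update prints another row for the same object; `count_watch_rows` and `frequently_updated` are hypothetical helper names.

```python
from collections import Counter


def count_watch_rows(lines):
    """Count how many rows each (namespace, name) pair produced in the watch output."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        # Skip blank lines and the header row.
        if len(parts) < 2 or parts[0] == "NAMESPACE":
            continue
        counts[(parts[0], parts[1])] += 1
    return counts


def frequently_updated(lines, min_rows=3):
    """Return (namespace, name, rows) for objects seen at least min_rows times."""
    return [
        (namespace, name, rows)
        for (namespace, name), rows in count_watch_rows(lines).most_common()
        if rows >= min_rows
    ]
```

Letting the watch run for a minute, saving its output to a file, and passing the lines to `frequently_updated` should surface leader-election config-maps, since they reappear every few seconds.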

nilsbillo commented 2 years ago

We are running ArgoCD 2.1.3 in EKS and also have problems with high CPU usage and throttling of the application controller, so I don't think 2.2 is the only issue.

albgus commented 2 years ago

For what it's worth I tried the solution suggested by @patrickjahns above and our ArgoCD went from consuming ~1000-1500m to ~ 20m CPU.

i.e. setting this in argocd-cm and restarting the argocd-application-controller deployment:

data:
  resource.compareoptions: |
    ignoreResourceStatusField: all

Running ArgoCD 2.2.5 in EKS 1.21.

pyromaniac3010 commented 2 years ago

I'm also hit by the high cpu caused by reconciliation loop. Thanks to @Funk66 I verified that it is caused by the leader election configmaps. Is there any workaround available or a fix in progress? The problem exists for me in Argo CD 2.3.1 and 2.3.2 with the following configmaps:

pyromaniac3010 commented 2 years ago

FYI: If you remove spec.orphanedResources completely from your "kind: AppProject" the reconciliation loop and high cpu stops. I had it set to warn: false to be able to see orphaned resources in the web ui:

spec:
  description: Argocd Project
  orphanedResources:
    warn: false

Removing it led to a complete stop of the reconciliation loop and a significant drop in CPU usage.

ybialik commented 2 years ago

Using the command suggested by @Funk66, I could also see several config-maps that keep popping up in the list, and one of them is in a namespace we see many reconciliations for.

Is there a workaround?

Vladyslav-Miletskyi commented 2 years ago
  1. Delete orphanedResources from the spec (the issue persists even if the field is present but empty). https://github.com/argoproj/argo-cd/issues/8100#issuecomment-1076067184
  2. Restart application controller(-s)
  3. Enjoy

Tested with version v2.3.3

bakkerpeter commented 2 years ago

@Vladyslav-Miletskyi thanks! That did the trick. We were having the exact same problem and now the load is normal.

agaudreault commented 2 years ago

Is there something other than a debug log that we could use to detect this in a production deployment? Enabling debug logging in production is not possible for us.

I am mainly looking for a way to find resources that are continuously regenerated.

prein commented 1 year ago

Disabling orphanedResources didn't do the trick for me. I am observing around 2k "Refreshing app status (controller refresh requested)" log entries per minute with only 170 apps. ArgoCD v2.4.11

roeizavida commented 1 year ago

The issue is still present in v2.5.1 and the orphanedResources is not in spec.

jamesalucas commented 1 year ago

We are having the same issue with Keda ScaledObjects. Keda appears to update the status.lastActiveTime field every few seconds which in turn appears to trigger a reconciliation. Setting ignoreResourceStatusField to crd or all doesn't appear to make a difference. Is there any way to ignore reconciliation on specific resources or fields?

#8100, #8914 and #6108 all appear to be pretty similar and I can't see a workaround in any of those so would appreciate it if anyone can suggest one!

jamesalucas commented 1 year ago

In case it helps anyone else, increasing the ScaledObject pollingInterval made a massive difference to the ArgoCD CPU usage.

BongoEADGC6 commented 1 year ago

I've been seeing this a lot still on v2.6.2 with two different metallb deployments. Constantly loops over them and the orphanedResources is not in the project spec for default.

roeizavida commented 1 year ago

In v2.6.1 with ignoreAggregatedRoles: true, ignoreResourceStatusField: all, timeout.reconciliation: 300s and increased polling interval for Keda, the issue is still present. The application controller (4 replicas) is using 16 CPUs with ~280 applications.

neiljain commented 1 year ago

ArgoCD version:

{
    "Version": "v2.5.7+e0ee345",
    "BuildDate": "2023-01-18T02:23:39Z",
    "GitCommit": "e0ee3458d0921ad636c5977d96873d18590ecf1a",
    "GitTreeState": "clean",
    "GoVersion": "go1.18.10",
    "Compiler": "gc",
    "Platform": "linux/amd64",
    "KustomizeVersion": "v4.5.7 2022-08-02T16:35:54Z",
    "HelmVersion": "v3.10.3+g835b733",
    "KubectlVersion": "v0.24.2",
    "JsonnetVersion": "v0.18.0"
}

We even bumped timeout.reconciliation from 30m to 2h, but that didn't help.

We ran into this issue when using custom plugins for our applications:

      plugin:
        env: []
        name: custom-plugin
      repoURL: ssh://git@<your-repo-server>/argo/deploy-sample-app.git
      targetRevision: main

and noticed the following logs in the application controller: {"application":"argocd/deploy-sample-app","level":"info","msg":"Refreshing app status (spec.source differs), level (3)","time":"2023-03-02T06:16:35Z"}

With multiple test environments configured to use ArgoCD and hundreds of Argo apps per environment, this crashed our Git servers every couple of days.

So we had to add the following dummy variable to stop the constant refresh of the app:

      plugin:
        env:
        - name: DUMMY_VAR_TO_STOP_ARGO_REFRESH
          value: "true"
nferro commented 1 year ago

I'm also seeing this issue with AzureKeyVaultSecret

[argocd-application-controller-8] time="2023-03-10T00:39:15Z" level=debug msg="Refreshing app argocd/application for change in cluster of object namespace/avk of type spv.no/v1/AzureKeyVaultSecret"

this then triggers a level (1) refresh that takes a long time:

[argocd-application-controller-8] time="2023-03-10T00:39:14Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application= argocd/application
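Debug lines of the first kind name the object whose change triggered the refresh, so tallying them shows which resources drive the loop. A small sketch, assuming debug logging is enabled and the message wording matches the line quoted above; `count_triggers` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern derived from the debug message quoted above; the wording may differ between versions.
TRIGGER_RE = re.compile(
    r'Refreshing app (?P<app>\S+) for change in cluster of object '
    r'(?P<obj>\S+) of type (?P<kind>[^\s"]+)'
)


def count_triggers(lines):
    """Count refresh triggers per (resource type, object) pair."""
    counts = Counter()
    for line in lines:
        match = TRIGGER_RE.search(line)
        if match:
            counts[(match["kind"], match["obj"])] += 1
    return counts
```

Calling `count_triggers(lines).most_common(10)` on captured controller logs gives the objects that request refreshes most often.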
agaudreault commented 1 year ago

The behavior can be configured in ignoreResourceUpdates to resolve this issue.
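The idea behind ignoring resource updates can be illustrated as follows: strip the fields that are known to churn, then compare what is left. This is a conceptual sketch only, not Argo CD's implementation; the JSON-pointer-style paths and all function names are assumptions made for illustration, and the status.lastActiveTime field is the Keda example mentioned earlier in this thread.

```python
import copy
import hashlib
import json


def remove_pointer(obj, pointer):
    """Delete the value at a simple '/a/b/c' path if present (dict keys only, no escapes)."""
    keys = [key for key in pointer.split("/") if key]
    node = obj
    for key in keys[:-1]:
        if not isinstance(node, dict) or key not in node:
            return
        node = node[key]
    if keys and isinstance(node, dict):
        node.pop(keys[-1], None)


def fingerprint(resource, ignored_pointers):
    """Hash a resource after dropping the ignored fields from a copy of it."""
    trimmed = copy.deepcopy(resource)
    for pointer in ignored_pointers:
        remove_pointer(trimmed, pointer)
    encoded = json.dumps(trimmed, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def is_meaningful_update(old, new, ignored_pointers):
    """True if old and new differ in any field that is not ignored."""
    return fingerprint(old, ignored_pointers) != fingerprint(new, ignored_pointers)
```

With "/status/lastActiveTime" in the ignore list, an update that only touches that timestamp compares as unchanged, while a change to spec would still count as a real update.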

tooptoop4 commented 5 months ago

@Funk66 did you submit a PR for https://github.com/argoproj/argo-cd/issues/8100#issuecomment-1033514595 ?

Funk66 commented 5 months ago

I tried implementing a fix but couldn't make it work fully. I may try again in the coming weeks, if nobody else does.