argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

Sync loop for Helm Applications that are using post-delete hooks #17117

Open ZF-fredericvanlinthoudt opened 8 months ago

ZF-fredericvanlinthoudt commented 8 months ago

Describe the bug

Since we updated to ArgoCD v2.10.0, we have been facing a constant refresh/sync issue with Applications that have a Helm template as source and use "post-delete" hooks in Helm. This is probably related to the new feature that added support for post-delete hooks. The application diff (see screenshot below) shows that it wants to remove the two post-delete-finalizer.argocd.argoproj.io finalizers from the Application. This change gets synced, but almost instantaneously the Application goes out-of-sync again with the same diff, and the same process repeats over and over. On our production ArgoCD instance, with more than 1200 applications, this causes ArgoCD to freeze and stop syncing any other applications (those other applications' syncs are just stuck in "waiting to start").

To Reproduce

https://REDACTED.git is a placeholder for a GIT repository that contains directories with Applications

Expected behavior

Applications that use post-delete Helm hooks should be synced successfully in one go and should not constantly be synced over and over again when auto-sync is enabled.

Screenshots

image

Version

argocd: v2.10.0+2175939.dirty
BuildDate: 2024-02-06T15:31:31Z
GitCommit: 2175939ed6156ddd743e60f427f7f48118c971bf
GitTreeState: dirty
GoVersion: go1.21.6
Compiler: gc
Platform: linux/amd64
argocd-server: v2.10.0+2175939

Logs

No relevant logs found.

pohldk commented 8 months ago

We also experienced this, and since we have Argo CD installed via Helm we had fun trying to roll back 😅

tcpecheanu commented 8 months ago

On our production ArgoCD, with 1000+ applications, the sync and refresh buttons completely freeze the UI after updating to v2.10.0. We noticed that the application controller used twice as much memory and CPU, but we also didn't find any relevant logs. We had to roll back to v2.9.5.

AnubhavSabharwa commented 8 months ago

Sharding is not working in 2.10.0 the way it did in previous versions. If you remove the env variable ARGOCD_CONTROLLER_REPLICAS and restart the controller, you will see that sync and refresh start working again.

Skoucail commented 7 months ago

We experience the same sync loop issue with version 2.10.5. image

Has anyone found a solution for this? Is it an option to add the 2 finalizers to the Application in git? Or would that break an initial deploy?

joebowbeer commented 5 months ago

Fixed by #18003 ?

ricardojdsilva87 commented 5 months ago

Hello, we also started seeing several Applications on ArgoCD being constantly out of sync with those 2 finalizers as the diff. This started to happen after upgrading from version 2.9.6 to v2.11.0. After the upgrade to v2.11.0 we saw every metric going up (memory usage, CPU usage and also the queue times, which had been zero). The upgrade occurred around 9AM today, May 20th. image image image

After installing v2.9.6 everything went back to normal again. Please ignore the gap between ~17:35 and ~18:00; we had an issue with the metrics collection. image image image image

It can be clearly seen that there is a spike in every metric of the application controller (CPU, RAM, Kubernetes executions) and a drop after reverting to v2.9.6. We saw an immediate increase in the queue time after the upgrade, while it remains at zero after reverting the version.

At the moment we have these metrics only for v2.9.6 and v2.11.0. For some reason our metrics agent is unable to gather any information with other versions. We will check what can be done and test with other versions to see if this issue with the finalizers persists.

Thanks!

UPDATE: Hello, just to add more information regarding the issue. It seems that v2.9.15 works like v2.9.6, while trying out v2.10.10 caused the issues mentioned above, so it must be something introduced in v2.10.x. As soon as this version is installed, we start seeing the queue increasing and the apps entering a sync loop. Thanks for the support.

mmalyska commented 5 months ago

I'm on version v2.11.2+25f7504 and experience the same problems. I'm stuck in infinite loops if selfHeal is on. image

antonio-tolentino commented 4 months ago

I've installed the version below and I am facing the same issue:

{
  "Version": "v2.11.3+3f344d5",
  "BuildDate": "2024-06-06T08:42:00Z",
  "GitCommit": "3f344d54a4e0bbbb4313e1c19cfe1e544b162598",
  "GitTreeState": "clean",
  "GoVersion": "go1.21.9",
  "Compiler": "gc",
  "Platform": "linux/amd64",
  "KustomizeVersion": "v5.2.1 2023-10-19T20:13:51Z",
  "HelmVersion": "v3.14.4+g81c902a",
  "KubectlVersion": "v0.26.11",
  "JsonnetVersion": "v0.20.0"
}

argocd_issue

didlawowo commented 2 months ago

Got the same with the NVIDIA GPU Operator, and disabling self-heal doesn't change anything.

ricardojdsilva87 commented 3 weeks ago

The same is still happening in the latest version v2.12.4: image

gadiener commented 2 weeks ago

We are also experiencing this. Is there a workaround?

igorivan commented 2 weeks ago

We're experiencing the same issue with the Falcon sensor, as mentioned in the previous comment. Could you please advise?

wikka commented 2 weeks ago

Also got the same issue. Any tips on how to circumvent it?

lorenzboguhn commented 2 weeks ago

Hey, I found a possible mitigation in issue #17433, which is probably a duplicate of this one. TL;DR: just add the following to the argocd-cm ConfigMap to ignore these differences in Argo CD Applications (source comment):

resource.customizations.ignoreDifferences.argoproj.io_Application: |
  jqPathExpressions:
    - .metadata.finalizers[]? | select(. == "post-delete-finalizer.argocd.argoproj.io" or . == "post-delete-finalizer.argocd.argoproj.io/cleanup")
    - if (.metadata.finalizers | length) == 0 then .metadata.finalizers else empty end
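
For context, here is a minimal sketch of where that key could sit in the full ConfigMap. The `argocd` namespace and the labels are assumptions based on a default install, not details from this thread, so adjust them to your setup.

```yaml
# Sketch only: namespace and labels assume a default Argo CD install.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd  # assumption: default install namespace
  labels:
    app.kubernetes.io/name: argocd-cm
    app.kubernetes.io/part-of: argocd
data:
  # Ignore the post-delete finalizers when diffing Application resources.
  resource.customizations.ignoreDifferences.argoproj.io_Application: |
    jqPathExpressions:
      - .metadata.finalizers[]? | select(. == "post-delete-finalizer.argocd.argoproj.io" or . == "post-delete-finalizer.argocd.argoproj.io/cleanup")
      - if (.metadata.finalizers | length) == 0 then .metadata.finalizers else empty end
```

If Argo CD is installed via a Helm chart, the same key would more likely be set through the chart's values than by editing the ConfigMap directly.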

ricardojdsilva87 commented 2 weeks ago

Hello, indeed the mentioned snippet stops the post-delete hooks from being considered as a diff. After enabling this setting, the resource usage of the controller is not as high as mentioned before. image

But the queue still increases: image

We are using the ArgoCD Datadog integration, so these metrics are reported directly by the ArgoCD pods. Some metrics that increased a lot and might be related are these: image

They seem to be related to the repository server. Could this also be related to the queue increasing? This might be another issue unrelated to the post-delete hook that just happens after upgrading to a release > 2.10.x. The server-side diff feature was added in this release, but as far as I know it is disabled by default and is only enabled by setting controller.diff.server.side in the ConfigMap (documentation).

I'll post here if I can find anything else new