argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

Argo CD app sync is taking a long time #20786

Open shnigam2 opened 1 week ago

shnigam2 commented 1 week ago


Describe the bug

When we are simply syncing any app, it takes a long time: more than 20 minutes for sure, even if there is no change.

To Reproduce

Simply trigger the sync of any app (see the CLI sketch at the end of this report); it takes up to 1-1.5 hours.

Expected behavior

In other environments the same apps sync in no more than 2 minutes.

Screenshots

(screenshot attached in the original issue)

Version

```shell
argocd: v2.5.1+504da42
  BuildDate: 2022-11-01T21:14:30Z
  GitCommit: 504da424c2c9bb91d7fb2ebf3ae72162e7a5a5be
  GitTreeState: clean
  GoVersion: go1.18.8
  Compiler: gc
  Platform: linux/amd64
```
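A minimal reproduction sketch via the CLI (`my-app` is a placeholder name; `time` just measures the wall-clock duration of the sync):

```shell
# Any application in this environment reproduces the slowness.
time argocd app sync my-app
```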
andrii-korotkov-verkada commented 1 week ago

What's your argocd-server version? What kind of sync are you talking about? If there are a lot of steps, for example Rollouts with long canaries, sync is expected to take a while. Does it get stuck?

jimmy-daruwala commented 1 week ago

Hello Andrii,

Our argocd version is v2.5.1+504da42. Correct, there seem to be quite a few steps, including canary ones, but this worked fine before. Currently it takes 10-15+ minutes and mostly gets stuck, and all applications end up in "Unknown" status. Let us know if you need to check any additional logs, or we can set up a call.

PS: our argocd setup below:

```
jimmy_daruwala@M-TFX2T4PV62 ~ % k get pods -A | grep argo
argocd   argocd-application-controller-0                     1/1   Running     0   16h
argocd   argocd-application-controller-1                     1/1   Running     0   16h
argocd   argocd-applicationset-controller-67fd897584-jhb7j   1/1   Running     0   16h
argocd   argocd-notifications-controller-d547c8d76-tzw27     1/1   Running     0   16h
argocd   argocd-redis-6cd966fffc-mcg9b                       1/1   Running     0   16h
argocd   argocd-repo-server-55f974b986-bnz9s                 1/1   Running     0   16h
argocd   argocd-repo-server-55f974b986-fb62h                 1/1   Running     0   16h
argocd   argocd-repo-server-55f974b986-qbmx6                 1/1   Running     0   16h
argocd   argocd-repo-server-55f974b986-qfbtf                 1/1   Running     0   16h
argocd   argocd-server-574dd6b597-4xbq9                      1/1   Running     0   16h
argocd   argocd-server-574dd6b597-54rkc                      1/1   Running     0   16h
argocd   argocd-server-574dd6b597-px464                      1/1   Running     0   16h
argocd   argocd-server-574dd6b597-vkbrk                      1/1   Running     0   16h
argocd   container-secret-sync-28859175-vcdw5                0/1   Completed   0   40m
argocd   container-secret-sync-28859190-j6nrc                0/1   Completed   0   25m
argocd   container-secret-sync-28859205-bz2jr                0/1   Completed   0   10m
```

andrii-korotkov-verkada commented 1 week ago

That might be the CLI version. Do you have the version output for argocd-server? Is it also that old? If so, please try upgrading.
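For reference, the CLI distinguishes the two (a quick sketch; the second command needs a logged-in CLI session so it can reach the server):

```shell
# Client (local binary) version only:
argocd version --client

# Client and server versions together:
argocd version
```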

jimmy-daruwala commented 1 week ago

Sure Andrii, PS the argocd-server version below:

```
jimmy_daruwala@M-TFX2T4PV62 ~ % argocd version --client
2024/11/13 22:04:52 maxprocs: Leaving GOMAXPROCS=10: CPU quota undefined
argocd: v2.13.0+347f221
  BuildDate: 2024-11-04T15:30:50Z
  GitCommit: 347f221adba5599ef4d5f12ee572b2c17d01db4d
  GitTreeState: clean
  GoVersion: go1.23.2
  Compiler: gc
  Platform: darwin/arm64
```

andrii-korotkov-verkada commented 1 week ago

I see. Can you share the resource YAMLs you are syncing, please? In particular, any Rollouts with long canaries.

ibrarahmed1124 commented 6 days ago

Hi Andrii. We are using argocd version 2.5.1. Please let me know which resource YAML files are required.

andrii-korotkov-verkada commented 6 days ago

The YAMLs which define the application's resources.

jimmy-daruwala commented 5 days ago

When trying to reproduce this issue, only thing I could find in the argocd-server pods logs was this error:

```
level=error msg="finished unary call with code Unknown" error="error getting cached app state: error getting application by query: application refresh deadline exceeded" grpc.code=Unknown grpc.method=ManagedResources grpc.service=application.ApplicationService grpc.start_time="2024-11-20T04:58:28Z" grpc.time_ms=60000.188 span.kind=server system=grpc

2024-11-20T00:00:17-05:00 time="2024-11-20T05:00:17Z" level=info msg="received unary call /application.ApplicationService/ResourceTree" grpc.method=ResourceTree grpc.request.claims="{\"aud\":\"argocd\",\"email\":\"Karthikeyan_Sekar@mckinsey.com\",\"exp\":1732079857,\"iat\":1732078657,\"iss\":\"https://prod-login-con01.intranet.mckinsey.com/auth/idp/k8sIdp\",\"jti\":\"T9qDumUGC2KvnFvEx6LnTw\",\"name\":\" Sekar\",\"nbf\":1732078537,\"nonce\":\"4ba21011-3b96-406d-9d05-2f9773942d5e\",\"preferred_username\":\"x-48-xx-48-xufx-54-xgyrgngx-56-xnkx-50-xvbx-51-xx-53-xx-54-x\",\"sub\":\"00uf6gyrgnG8nK2Vb356\"}" grpc.request.content="applicationName:\"my-app-converge-12126\" appNamespace:\"argocd\" " grpc.service=application.ApplicationService grpc.start_time="2024-11-20T05:00:17Z" span.kind=server system=grpc

2024-11-20T00:00:17-05:00 time="2024-11-20T05:00:17Z" level=info msg="Requested app 'my-app-converge-12126' refresh"
```

Not sure if it's related.

andrii-korotkov-verkada commented 5 days ago

Can you try opening the browser's developer tools and debug what's stalling the page? One case where I saw this was when an app had around 5k old jobs. I manually cleaned those up and things started to load well again.
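A cautious cleanup sketch (the namespace, app name, and label are assumptions based on Argo CD's default instance label; review the list before deleting anything):

```shell
# Count jobs that Argo CD associates with the app:
kubectl get jobs -n my-namespace -l app.kubernetes.io/instance=my-app --no-headers | wc -l

# Delete only the completed ones (the status.successful field selector is
# supported for Jobs on reasonably recent Kubernetes versions):
kubectl delete jobs -n my-namespace -l app.kubernetes.io/instance=my-app \
  --field-selector status.successful=1
```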

andrii-korotkov-verkada commented 5 days ago

> How can we check on old (stale or stuck) jobs that do need cleaning up?

You can use kubectl and query by the label of resources belonging to the app. Something like:

```shell
kubectl get jobs -l app.kubernetes.io/instance=my-app
```

But I'm not sure what exact label or annotation you have for tracking which resources belong to the app.
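For completeness, a sketch covering both tracking styles (the app name is taken from this thread; the second command assumes Argo CD's annotation-based resource tracking, whose tracking-id starts with the application name):

```shell
# Default label-based tracking:
kubectl get jobs -A -l app.kubernetes.io/instance=my-app-converge-12126

# Annotation-based tracking (kubectl can't select on annotations, so filter with jq):
kubectl get jobs -A -o json | jq -r '
  .items[]
  | select((.metadata.annotations["argocd.argoproj.io/tracking-id"] // "")
           | startswith("my-app-converge-12126:"))
  | "\(.metadata.namespace)/\(.metadata.name)"'
```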

jimmy-daruwala commented 5 days ago

Posting the same above comments without Screenshots.

Hi Andrii, the application team unfortunately cannot allow us to post the YAML files here in an open forum. To put the issues forward again:

Issue 1 - The app takes a long time to sync, especially "my-app-converge-12126".

Issue 2 - When we select the "my-app-converge-12126" application in Argo, the whole Argo UI gets glitchy, and after a few seconds it shows a "Page Unresponsive" message. This issue is pretty consistent, but for all other apps the Argo UI runs fine and we do not see the "Page Unresponsive" error.

Another note: it is happening only on this cluster. Resource usage for all the nodes as well as the pods in question looks normal (for both the argocd and ns-converge-12126-prod namespaces), which tells us this is likely not a result of resource overload.

As also discussed, we found nothing valuable in Developer mode when reproducing the issue.

I will check on the jobs as you requested above.

jimmy-daruwala commented 5 days ago

I actually tried finding all the jobs in all namespaces in the cluster and only got this:

```
jimmy_daruwala@M-TFX2T4PV62 .aws % kubectl get jobs -A
NAMESPACE                   NAME                              STATUS     COMPLETIONS   DURATION   AGE
argocd                      container-secret-sync-28868790    Complete   1/1           8s         40m
argocd                      container-secret-sync-28868805    Complete   1/1           7s         25m
argocd                      container-secret-sync-28868820    Complete   1/1           8s         10m
argocd                      splunk-sync                       Complete   1/1           7s         96d
namespace-operator-system   snow-registration-28785600        Complete   1/1           8s         57d
namespace-operator-system   snow-registration-28787040        Complete   1/1           7s         56d
namespace-operator-system   snow-registration-28788480        Complete   1/1           9s         55d
namespace-operator-system   snow-registration-28867680        Failed     0/1           19h        19h
openunison                  check-certs-openunison-28864920   Complete   1/1           10s        2d17h
openunison                  check-certs-openunison-28866360   Complete   1/1           11s        41h
openunison                  check-certs-openunison-28867800   Complete   1/1           10s        17h
```

jimmy-daruwala commented 4 days ago

Any next steps or other suggestions regarding this? We're happy to set up a call between us and the app team as well.

andrii-korotkov-verkada commented 4 days ago

Can you enable debug logs and see how long the various steps take, e.g. by searching for "Reconciliation completed"?
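One way to do that (a sketch assuming the standard argocd-cmd-params-cm ConfigMap; the controller needs a restart to pick up the change, and with multiple replicas you may want to check each pod's logs):

```shell
# Raise the application controller's log level to debug:
kubectl -n argocd patch configmap argocd-cmd-params-cm \
  --type merge -p '{"data":{"controller.log.level":"debug"}}'
kubectl -n argocd rollout restart statefulset argocd-application-controller

# Then look for reconciliation timings:
kubectl -n argocd logs statefulset/argocd-application-controller --tail=5000 \
  | grep "Reconciliation completed"
```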