argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

ArgoCD Application Stuck In Syncing/Terminating State #8113 #11088

Open goodosoft opened 2 years ago

goodosoft commented 2 years ago

Discussed in https://github.com/argoproj/argo-cd/discussions/8116

Originally posted by **utkarsh-devops** January 8, 2022 Hello ArgoCDians, :wave: We’re facing a weird issue in production where one of the applications is stuck in a terminating and sync state. **`Version: 2.1.7`** **`Background:`** We manually terminated the sync of an application and now that the application is stuck in the terminating sync state and the disable auto-sync button also not working. **`Note:`** Creating a new application with similar configs is working but we want to investigate why the current application is stuck. **`Steps performed:`** * We tried disabling the auto-sync from the UI and CLI * We tried terminating the application * We tried syncing the app `Commands tried` ``` argocd app terminate-op APPNAME argocd app sync APPNAME argocd app sync APPNAME --force --prune ``` Screenshot 2022-01-07 at 7 29 50 PM ![148561751-eea24402-dd45-4d3d-be53-df8f44bd1c3b](https://user-images.githubusercontent.com/10195013/148581338-ba2438e6-17e6-49ea-b9b7-9a5cf10ac7a9.png) ![148561877-138bd03d-9fd6-4dea-991c-113a9ac37bd1](https://user-images.githubusercontent.com/10195013/148581346-f5f0f96b-1d41-426a-ae36-08fbb2b9053c.png) **Some Logs (not sure if these are relevant) :** ``` time="2022-01-07T16:19:36Z" level=debug msg="Failed to apply normalization: error in remove for path: '/spec/preserveUnknownFields': Unable to remove nonexistent key: preserveUnknownFields: missing value" time="2022-01-07T16:19:36Z" level=debug msg="Failed to apply normalization: error in remove for path: '/status': Unable to remove nonexistent key: status: missing value" time="2022-01-07T16:19:36Z" level=debug msg="Failed to apply normalization: error in remove for path: '/spec/preserveUnknownFields': Unable to remove nonexistent key: preserveUnknownFields: missing value" time="2022-01-07T16:19:36Z" level=debug msg="Failed to apply normalization: error in remove for path: '/status': Unable to remove nonexistent key: status: missing value" time="2022-01-07T16:19:36Z" level=debug msg="Failed to apply normalization: error in remove for path: '/spec/preserveUnknownFields': Unable to remove nonexistent key: preserveUnknownFields: missing value" time="2022-01-07T16:19:36Z" level=debug msg="Failed to apply normalization: error in remove for path: '/spec/preserveUnknownFields': Unable to remove nonexistent key: preserveUnknownFields: missing value" time="2022-01-07T16:19:36Z" level=debug msg="Failed to apply normalization: error in remove for path: '/status': Unable to remove nonexistent key: status: missing value" time="2022-01-07T16:19:36Z" level=debug msg="Failed to apply normalization: error in remove for path: '/spec/preserveUnknownFields': Unable to remove nonexistent key: preserveUnknownFields: missing value" time="2022-01-07T16:19:36Z" level=debug msg="patch: {\"status\":{\"reconciledAt\":\"2022-01-07T16:19:36Z\"}}" application=argocd time="2022-01-07T16:19:36Z" level=info msg="Failed to Update application operation state: etcdserver: request is too large, retrying in 1s"time="2022-01-07T16:19:36Z" level=info msg="Update successful" application=argocd time="2022-01-07T16:19:36Z" level=info msg="Reconciliation completed" application=argocd dedup_ms=0 dest-name= dest-namespace=argocd dest-server="https://417901E660A9365B4057207C70C682EE.gr7.ap-south-1.eks.amazonaws.com" diff_ms=174 fields.level=2 git_ms=13 health_ms=3 live_ms=1 settings_ms=0 sync_ms=0 time_ms=249 time="2022-01-07T16:19:36Z" level=info msg="Refreshing app status (controller refresh requested), level (1)" application=argocd ```
christophercutajar commented 2 years ago

We're seeing the same issue, with the UI showing the app as Syncing.


The logs show that no operation is in progress:

```
argo-cd-argocd-server-59747649ff-wxlxj server time="2022-11-18T15:11:17Z" level=info msg="finished unary call with code InvalidArgument" error="rpc error: code = InvalidArgument desc = Unable to terminate operation. No operation is in progress" grpc.code=InvalidArgument grpc.method=TerminateOperation grpc.service=application.ApplicationService grpc.start_time="2022-11-18T15:11:17Z" grpc.time_ms=18.361 span.kind=server system=grpc
```

When I try to terminate manually via the Terminate button, nothing happens and I get the error shown in the log above.
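One way to confirm whether the UI badge and the live object disagree is to read the operation fields straight off the Application resource. A rough sketch, assuming the app is named `APPNAME` in the `argocd` namespace:

```
# Hedged sketch: check whether the cluster actually records an in-flight operation.
# If ".operation" is empty and ".status.operationState.phase" is terminal
# (Succeeded/Failed/Error), the "Syncing" badge in the UI is stale.
kubectl -n argocd get application APPNAME -o jsonpath='{.operation}{"\n"}'
kubectl -n argocd get application APPNAME -o jsonpath='{.status.operationState.phase}{"\n"}'
```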


Experiencing this on ArgoCD 2.5

```
{
    "Version": "v2.5.0+b895da4",
    "BuildDate": "2022-10-25T14:40:01Z",
    "GitCommit": "b895da457791d56f01522796a8c3cd0f583d5d91",
    "GitTreeState": "clean",
    "GoVersion": "go1.18.7",
    "Compiler": "gc",
    "Platform": "linux/amd64",
    "KustomizeVersion": "v4.5.7 2022-08-02T16:35:54Z",
    "HelmVersion": "v3.10.1+g9f88ccb",
    "KubectlVersion": "v0.24.2",
    "JsonnetVersion": "v0.18.0"
}
```

Pressing Sync once again triggers another sync operation; if, as the UI claims, a sync operation were already in progress, that second sync should have been denied.

yoadduani commented 1 year ago

We are facing this issue with v2.1.2. We tried restarting all the pods, which didn't help. We also tried upgrading Argo CD to 2.6.7, which didn't help either. Any solution?

Update: in our case we deleted the app non-cascading: `argocd app delete APPNAME --cascade=false`

sosoriov commented 1 year ago

We are having the same issue and nothing helps. Any ideas?

crenshaw-dev commented 1 year ago

@sosoriov does this happen to be a particularly large app? Like > 1000 resources?
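A quick way to gauge an app's size is to count the resources Argo CD tracks for it. A hedged sketch; `APPNAME` is a placeholder and the count assumes `argocd app resources` prints one resource per line after a header row:

```
# Rough sketch: count the resources Argo CD tracks for the app.
# Skip the header line; APPNAME is a placeholder.
argocd app resources APPNAME | tail -n +2 | wc -l
```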

sosoriov commented 1 year ago

@crenshaw-dev at this point the entire cluster is stuck; most of the apps stay in "refreshing" mode, so I haven't been able to pinpoint exactly which application is causing the issue. I have a bunch of apps, some of them from ApplicationSets, around 100 apps in total, but none of them is really big.

I also noticed that the application controller hits 100% CPU. It seems to be limited to 1 CPU somehow, but that limit is defined nowhere; even if I increase the resources in the YAML manifest, it isn't taken into account.
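To see what limits the controller pod actually ended up with after admission, regardless of what the manifest says, something along these lines can help. The label selector is an assumption based on the default install manifests:

```
# Hedged sketch: read the limits that were actually admitted onto the pod.
# The label selector matches the default install; adjust if yours differs.
kubectl -n argocd get pod \
  -l app.kubernetes.io/name=argocd-application-controller \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources}{"\n"}{end}'
```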

crenshaw-dev commented 1 year ago

Ah okay, if you're hitting CPU limits on the controller, that would explain the behavior. You might have a default being enforced.

sosoriov commented 1 year ago

Thanks for your answer.

It's nowhere in the YAML. I'm using Argo v2.5.5. Do you know if it might be enforced somehow via code? @crenshaw-dev

crenshaw-dev commented 1 year ago

It would be enforced by something outside Argo CD, such as a mutating webhook.
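A couple of hedged checks for defaults being injected from outside Argo CD, such as a namespace-level LimitRange or an admission webhook rewriting pod specs (sketch only, assumes the `argocd` namespace):

```
# Sketch: look for anything that could inject a default 1-CPU limit.
kubectl -n argocd get limitrange -o yaml        # namespace-level default limits
kubectl get mutatingwebhookconfigurations       # cluster-wide admission mutators
```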

sosoriov commented 1 year ago

@crenshaw-dev It seems it's not a CPU-related issue after all. Additionally, I found that this is the API call that is taking "forever" to respond:

https://argocdxxxx/api/v1/stream/applications/my-app-name/resource-tree?appNamespace=argocd

and in some of my applications the `?appNamespace=` parameter appears empty.

Any ideas? Thanks in advance.

sosoriov commented 1 year ago

Hey, I found my issue. In the end it was nothing related to Argo CD: the kube-apiserver was overloaded with thousands of requests generated by cert-manager, and of course Argo CD was trying to sync all of that :S

crenshaw-dev commented 1 year ago

For others, I think the issue(s) may be related to this: https://github.com/argoproj/argo-cd/issues/14224#issuecomment-1636337124

Psalmz777 commented 1 year ago

@sosoriov I'm having an issue similar to yours. None of the applications' pods are showing in Argo CD. I tried deleting and recreating, but it's not working, and all applications are always showing as refreshing. I noticed you mentioned you found what was causing this. Would you mind sharing how you arrived at that and how you fixed it?

adim commented 10 months ago

Same issue, any way to solve it?

sosoriov commented 10 months ago

@Psalmz777 @adim in my case it was an application that was "renewing" a certificate every X seconds, creating thousands of CertificateRequest resources in the cluster. So basically, the issue was a huge number of resources that Argo was trying to sync. I suggest you check your Kubernetes API and see whether anything is creating lots of resources; see the sketch below for one way to do that.
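One hedged way to spot this from the API side is to count objects of a suspect kind and look for outliers; CertificateRequests are used here only because they were the culprit in this case, and the command assumes cert-manager's CRDs are installed:

```
# Rough sketch: count cert-manager CertificateRequests across all namespaces,
# then repeat for any other kind you suspect of being mass-created.
kubectl get certificaterequests.cert-manager.io -A --no-headers | wc -l
```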

dee-kryvenko commented 8 months ago

Ran into this as well. In my case I'm not doing anything crazy with super-apps; this is just a single-source Helm app using the Jenkins Helm chart. The problem is that its values file is too big (JCasC that defines jobDSL...). It was stuck in sync, nothing helped, and the error was:

```
error patching application with operation state: etcdserver: request is too large
```

I had to patch the application manually and remove history from the status field.
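Since the error was the Application object itself exceeding etcd's request size limit, the manual fix amounted to shrinking the object. A hedged approximation of that cleanup (assuming the Application CRD, like the stock one, exposes status to plain patches; `APPNAME` is a placeholder):

```
# Hedged sketch of the manual cleanup described above: empty the sync history kept
# under .status.history so the object fits within etcd's request size limit.
# Assumes status is patchable directly (no status subresource on the stock CRD).
kubectl -n argocd patch application APPNAME --type json \
  -p '[{"op": "replace", "path": "/status/history", "value": []}]'
```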

tooptoop4 commented 7 months ago

@dee-kryvenko how do you stop history being populated in the status field going forward?

dee-kryvenko commented 7 months ago

> @dee-kryvenko how do you stop history being populated in the status field going forward?

I don't; I'm not sure there is a way. But I limited how many revisions it retains to 3 (the default is 10), and for me that was good enough. YMMV, of course, as it all comes down to the size of your CasC and jobDSL config.
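The knob being referred to is the Application's `spec.revisionHistoryLimit` field (default 10). A hedged example of setting it to 3 on an existing app, with `APPNAME` as a placeholder:

```
# Sketch: cap retained sync history at 3 revisions instead of the default 10.
kubectl -n argocd patch application APPNAME --type merge \
  -p '{"spec":{"revisionHistoryLimit":3}}'
```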