argoproj / argo-cd

Sync operation getting stuck with high number of resources in Application #14224

Open bartoszbryk opened 1 year ago

bartoszbryk commented 1 year ago

Describe the bug

ArgoCD doesn't finish syncing the application; the sync appears to get stuck when there is a high number of resources (around 4000) in the Application. The sync also cannot be terminated in this state; only deleting the application-controller pod helps. However, all the resources in the application appear to be synced, and the logs don't indicate any reason for the sync being stuck.

To Reproduce

Create an application with a high number of resources (in our case, ~4000 Kafka topics and users) and try to sync it automatically.
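
For context, a minimal sketch of one way to generate a reproduction repo with that many resources. The script name, namespace, and Strimzi cluster label are hypothetical; it assumes the Strimzi KafkaTopic CRD (kafka.strimzi.io/v1beta2) is installed on the destination cluster and that the Application points at the generated directory.

    # gen_topics.py -- hypothetical helper: write ~4000 KafkaTopic manifests
    # into ./manifests so an Argo CD Application pointed at that path can sync them.
    import os

    NAMESPACE = "appname"         # placeholder namespace from the logs above
    CLUSTER_LABEL = "my-cluster"  # hypothetical Strimzi cluster name
    COUNT = 4000

    os.makedirs("manifests", exist_ok=True)
    for i in range(COUNT):
        manifest = "\n".join([
            "apiVersion: kafka.strimzi.io/v1beta2",
            "kind: KafkaTopic",
            "metadata:",
            f"  name: operator-topic-{i}",
            f"  namespace: {NAMESPACE}",
            "  labels:",
            f"    strimzi.io/cluster: {CLUSTER_LABEL}",
            "spec:",
            "  partitions: 1",
            "  replicas: 1",
        ]) + "\n"
        with open(f"manifests/operator-topic-{i}.yaml", "w") as f:
            f.write(manifest)
    print(f"wrote {COUNT} KafkaTopic manifests to ./manifests")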

Expected behavior

The sync finishes successfully

Version

argocd: v2.6.1+3f143c9.dirty
  BuildDate: 2023-02-08T22:21:48Z
  GitCommit: 3f143c9307f99a61bf7049a2b1c7194699a7c21b
  GitTreeState: dirty
  GoVersion: go1.19.5
  Compiler: gc
  Platform: darwin/arm64
argocd-server: v2.6.3+e05298b
  BuildDate: 2023-02-27T14:40:19Z
  GitCommit: e05298b9c6ab8610104271fa8491f019fee3c587
  GitTreeState: clean
  GoVersion: go1.18.10
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v4.5.7 2022-08-02T16:35:54Z
  Helm Version: v3.10.3+g835b733
  Kubectl Version: v0.24.2
  Jsonnet Version: v0.19.1

Logs

time="2023-06-27T11:53:46Z" level=info msg="Refreshing app status (comparison expired, requesting refresh. reconciledAt: 2023-06-27 11:46:28 +0000 UTC, expiry: 6m0s), level (2)" application=argocd/appname
time="2023-06-27T11:53:46Z" level=info msg="Comparing app state (cluster: <url>, namespace: appname)" application=argocd/appname
time="2023-06-27T11:53:46Z" level=info msg="Resuming in-progress operation. phase: Running, message: " application=argocd/appname
time="2023-06-27T11:53:46Z" level=info msg="Comparing app state (cluster: <url>, namespace: appname)" application=argocd/appname
time="2023-06-27T11:54:08Z" level=info msg="getRepoObjs stats" application=argocd/appname build_options_ms=0 helm_ms=21842 plugins_ms=0 repo_ms=0 time_ms=21962 unmarshal_ms=120 version_ms=0
time="2023-06-27T11:54:11Z" level=info msg="getRepoObjs stats" application=argocd/appname build_options_ms=0 helm_ms=22072 plugins_ms=0 repo_ms=0 time_ms=24971 unmarshal_ms=2898 version_ms=0
time="2023-06-27T11:54:14Z" level=info msg=Syncing application=argocd/appname skipHooks=false started=false syncId=00002-Jebev
time="2023-06-27T11:54:15Z" level=info msg="Tasks (dry-run)" application=argocd/appname syncId=00002-Jebev tasks="[Sync/0 resource kafka.strimzi.io/KafkaUser:appname/argouser obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-0 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-10 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-100 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1000 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1001 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1002 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1003 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1004 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1005 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1006 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1007 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1008 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1009 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-101 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1010 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1011 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1012 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1013 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1014 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1015 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1016 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1017 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1018 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1019 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-102 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1020 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1021 obj->obj (,,), Sync/0 resource kafka.strimzi.io/KafkaTopic:appname/operator-topic-1022 obj->obj (,,), Sync/0 resource (....)
time="2023-06-27T12:09:26Z" level=info msg="Skipping auto-sync: another operation is in progress" application=argocd/appname
time="2023-06-27T12:09:26Z" level=info msg="Updated sync status: OutOfSync -> Synced" application=appname dest-namespace=appname dest-server="<url>" reason=ResourceUpdated type=Normal
time="2023-06-27T12:09:26Z" level=info msg="Updated health status: Progressing -> Healthy" application=appname dest-namespace=appname dest-server="<url>" reason=ResourceUpdated type=Normal
time="2023-06-27T12:09:27Z" level=info msg="Update successful" application=argocd/appname
time="2023-06-27T12:09:27Z" level=info msg="Reconciliation completed" application=argocd/appname dedup_ms=36 dest-name= dest-namespace=appname dest-server="<url>" diff_ms=2478 fields.level=1 git_ms=134 health_ms=1745 live_ms=197 settings_ms=0 sync_ms=21 time_ms=8011
blakepettersson commented 1 year ago

There were some deadlock fixes that came in 2.6.8; could you try that and see if the issue persists?

deadlysyn commented 1 year ago

Running Argo CD + Rollouts + Image Updater, all on the latest release, doing blue/green with pre- and post-analysis. We routinely see similar behavior in a large, though significantly smaller, app (a few hundred resources). Providing context in case it helps, since this is the closest bug report I've found to what we see:

    "Version": "v2.7.6+00c914a.dirty",
    "BuildDate": "2023-06-20T20:51:13Z",
    "GitCommit": "00c914a948d9e8ad99be8bd82a368fbdeba12f88",
    "GitTreeState": "dirty",
    "GoVersion": "go1.19.10",
    "Compiler": "gc",
    "Platform": "linux/amd64",
    "KustomizeVersion": "v5.0.1 2023-03-14T01:32:48Z",
    "HelmVersion": "v3.11.2+g912ebc1",
    "KubectlVersion": "v0.24.2",
    "JsonnetVersion": "v0.19.1"

We are still root-causing, but the main log we see during this "stuck" state is "another operation is in progress", which brought me here.

bartoszbryk commented 1 year ago

Updating to 2.6.8 didn't help. @deadlysyn we are observing the same log in our case: "another operation is in progress".

crenshaw-dev commented 1 year ago

@deadlysyn I think your issue may be different from @bartoszbryk's. It's just a theory, but I have recent experience that makes me think so.

@bartoszbryk I think what's happening is that, when Argo CD is trying to sync your ~4000 resources, it's trying to patch status.operationState with the current state of the sync operation. But because the resource has gotten so big, the Kubernetes API is rejecting the patch. You can validate this theory by searching your k8s API logs for patches against that resource. You should see error responses.
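
One way to sanity-check this theory, beyond grepping the API server logs, is to measure how big the Application object actually is compared to etcd's default request limit of roughly 1.5 MiB. A minimal sketch using the Kubernetes Python client; the application name and namespace are placeholders:

    # check_app_size.py -- rough sketch: report the serialized size of the
    # Application object and of status.operationState, to see whether the
    # object is bumping into etcd's ~1.5 MiB default request limit.
    import json
    from kubernetes import client, config

    APP_NAME = "appname"   # placeholder from the logs above
    ARGOCD_NS = "argocd"   # namespace the Application lives in

    config.load_kube_config()
    api = client.CustomObjectsApi()
    app = api.get_namespaced_custom_object(
        group="argoproj.io", version="v1alpha1",
        namespace=ARGOCD_NS, plural="applications", name=APP_NAME,
    )

    total = len(json.dumps(app).encode())
    op_state = len(json.dumps(app.get("status", {}).get("operationState", {})).encode())
    print(f"whole Application object: {total / 1024:.0f} KiB")
    print(f"status.operationState:    {op_state / 1024:.0f} KiB")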

You will not see corresponding Argo CD error logs because, currently, we don't log errors encountered when updating the operation state; we just retry indefinitely. I've put up a PR to fix this. Incidentally, the retries spam the k8s API with requests that are doomed to fail. Intuit's k8s team was the first to notice the issue, due to an elevated number of error responses.

We bought ourselves a little time by setting controller.resource.health.persist to "false". This offloaded some of the .status field to Redis, which got us back under the k8s resource size limit.
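
A hedged sketch of applying that workaround with the Kubernetes Python client, assuming the setting is read from the standard argocd-cmd-params-cm ConfigMap in the argocd namespace (equivalent to editing the ConfigMap with kubectl; the application controller has to be restarted to pick the change up):

    # set_health_persist.py -- sketch of the workaround above: set
    # controller.resource.health.persist to "false" so resource health is kept
    # in Redis rather than persisted into the Application's .status field.
    from kubernetes import client, config

    ARGOCD_NS = "argocd"  # assumed Argo CD install namespace

    config.load_kube_config()
    client.CoreV1Api().patch_namespaced_config_map(
        name="argocd-cmd-params-cm",
        namespace=ARGOCD_NS,
        body={"data": {"controller.resource.health.persist": "false"}},
    )
    print("patched argocd-cmd-params-cm; restart the application controller to apply")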

But we quickly hit the limit again as the number of resources increased. We ended up splitting the app into two apps to get back under the limit. But it's just a band-aid.

I've opened an issue to brainstorm ways to get Argo CD to gracefully handle large apps. I've scheduled time at the next SIG Scalability meeting to discuss as well.

Please let me know if this theory matches what you're seeing. I'd love to help work out a solution.

PavelPikat commented 9 months ago

@crenshaw-dev Were there any findings/actions from the SIG meeting?

crenshaw-dev commented 9 months ago

@PavelPikat only that the idea of compressing the status seems like a reasonable way to counteract the problem.
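
As a rough illustration of why compression looks attractive: the per-resource entries in status.operationState are highly repetitive, so they shrink dramatically under gzip. A synthetic sketch (the payload shape and numbers are made up, not taken from a real Application):

    # compress_status_demo.py -- synthetic demo: gzip a fake operationState-like
    # payload with 4000 near-identical resource entries and compare sizes.
    import gzip
    import json

    resources = [
        {
            "group": "kafka.strimzi.io",
            "kind": "KafkaTopic",
            "namespace": "appname",
            "name": f"operator-topic-{i}",
            "status": "Synced",
            "message": "kafkatopic.kafka.strimzi.io configured",
        }
        for i in range(4000)
    ]
    raw = json.dumps({"syncResult": {"resources": resources}}).encode()
    packed = gzip.compress(raw)
    print(f"raw:     {len(raw) / 1024:.0f} KiB")
    print(f"gzipped: {len(packed) / 1024:.0f} KiB")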