Open · bartoszbryk opened this issue 1 year ago
There were some deadlock fixes that went into 2.6.8; could you try that version and see if the issue persists?
Running Argo CD + Rollouts + Image Updater, all on the latest release, doing blue/green with pre- and post-analysis. We routinely see similar behavior in a large app, though one significantly smaller than described here (a few hundred resources). Providing context in case it helps, since this is the closest bug report I've found to what we see:
"Version": "v2.7.6+00c914a.dirty",
"BuildDate": "2023-06-20T20:51:13Z",
"GitCommit": "00c914a948d9e8ad99be8bd82a368fbdeba12f88",
"GitTreeState": "dirty",
"GoVersion": "go1.19.10",
"Compiler": "gc",
"Platform": "linux/amd64",
"KustomizeVersion": "v5.0.1 2023-03-14T01:32:48Z",
"HelmVersion": "v3.11.2+g912ebc1",
"KubectlVersion": "v0.24.2",
"JsonnetVersion": "v0.19.1"
We are still root-causing, but the main log we see during this "stuck" state is "another operation is in progress", which is what brought me here.
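For reference, one way to filter the controller logs for that message; this assumes a stock install with the application controller StatefulSet in the argocd namespace, so adjust the names if your setup differs:

```bash
# Tail the application controller logs and filter for the message above.
kubectl -n argocd logs statefulset/argocd-application-controller --tail=1000 \
  | grep "another operation is in progress"
```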
Updating to 2.6.8 didn't help.
@deadlysyn we are observing the same logs in our case: "another operation is in progress".
@deadlysyn I think your issue may be different from @bartoszbryk's. It's just a theory, but I have recent experience that makes me think this.
@bartoszbryk I think what's happening is that, when Argo CD tries to sync your ~4000 resources, it also tries to patch status.operationState with the current state of the sync operation. But because the Application resource has gotten so big, the Kubernetes API is rejecting the patch. You can validate this theory by searching your k8s API logs for patches against that resource; you should see error responses.
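If you don't have easy access to the API server logs, a rough sanity check is to see how big the Application object itself has gotten. This is just a sketch, assuming Applications live in the argocd namespace; etcd's default request size limit is around 1.5 MiB:

```bash
# Approximate size, in bytes, of the Application object as stored in the API.
# Replace <your-app> with your Application name and compare against etcd's
# ~1.5 MiB default limit.
kubectl -n argocd get application <your-app> -o json | wc -c
```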
You will not see corresponding Argo CD error logs because, currently, we don't log errors encountered when updating the operation state; we just retry indefinitely. I've put up a PR to fix this. Incidentally, the retries spam the k8s API with requests that are doomed to fail. Intuit's k8s team was the first to notice the issue, due to an elevated number of error responses.
We bought ourselves a little time by setting controller.resource.health.persist to "false". This offloaded some of the .status field to Redis, which got us back under the k8s resource size limit.
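For anyone trying the same mitigation, a minimal sketch of applying that setting, assuming the standard argocd-cmd-params-cm ConfigMap in the argocd namespace (verify the key against the docs for your Argo CD version):

```bash
# Set controller.resource.health.persist=false in the controller parameters ConfigMap.
kubectl -n argocd patch configmap argocd-cmd-params-cm \
  --type merge -p '{"data":{"controller.resource.health.persist":"false"}}'

# The application controller only reads this at startup, so restart it to pick up the change.
kubectl -n argocd rollout restart statefulset argocd-application-controller
```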
But we quickly hit the limit again as the number of resources increased. We ended up splitting the app into two apps to get back under the limit. But it's just a band-aid.
I've opened an issue to brainstorm ways to get Argo CD to gracefully handle large apps. I've scheduled time at the next SIG Scalability meeting to discuss as well.
Please let me know if this theory matches what you're seeing. I'd love to help work out a solution.
@crenshaw-dev Were there any findings/actions from the SIG meeting?
@PavelPikat only that the idea of compressing the status seems like a reasonable way to counteract the problem.
Any updates or solutions to this issue? We are facing it too.
Checklist:
- argocd version

Describe the bug
Argo CD doesn't finish syncing the application; the sync seems to get stuck when the Application contains a higher number of resources (around 4000). The sync also cannot be terminated in this state; only deleting the application-controller pod helps. However, all the resources in the application appear to be synced, and the log doesn't indicate any reason for the sync being stuck.
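For completeness, the workaround referenced above looks roughly like this in a default install, where the application controller runs as a StatefulSet in the argocd namespace (the label selector is an assumption; verify it against your manifests):

```bash
# Delete the application controller pod(s) so the StatefulSet recreates them,
# which unsticks the sync in our case. Label assumes a stock install.
kubectl -n argocd delete pod -l app.kubernetes.io/name=argocd-application-controller
```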
To Reproduce
Create an application with a high number of resources (in our case, 4000 Kafka topics and users) and try to sync it automatically. A hedged sketch of such an application is shown below.
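The repo URL, path, names, and namespaces in this sketch are placeholders; the repository behind it would contain the ~4000 Kafka topic/user manifests mentioned above:

```bash
kubectl apply -n argocd -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kafka-resources            # placeholder name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://example.com/placeholder/kafka-config.git  # placeholder repo
    targetRevision: HEAD
    path: topics                   # placeholder path with the ~4000 manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: kafka               # placeholder destination namespace
  syncPolicy:
    automated: {}                  # automatic sync, as in the reproduction step
EOF
```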
Expected behavior
The sync finishes successfully
Screenshots
Version
Logs