andrii-korotkov-verkada opened this issue 3 months ago
This issue was also sometimes observed: duplicate applies can increase the time spent, see https://github.com/argoproj/argo-cd/issues/19458.
I removed the overtuned controller kubectl parallelism limit, cleaned up the repo server parallelism limit, and restarted the controller and repo server. It didn't fix the issue, though it seems to happen a bit less often (I haven't observed it for long enough to be sure).
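For reference, these are the knobs I mean; a rough sketch of the relevant `argocd-cmd-params-cm` entries (key names as I understand them from the docs, so double-check against your Argo CD version):

```yaml
# Sketch of the parallelism knobs referenced above; key names assumed from
# the argocd-cmd-params-cm documentation, values are illustrative.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  # Concurrent kubectl fork/execs allowed in the application controller.
  controller.kubectl.parallelism.limit: "20"
  # Concurrent manifest generation requests allowed in the repo server.
  reposerver.parallelism.limit: "10"
```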
~~I've noted that the gaps between a Deployment apply dry run and the next resource's apply dry run align closely with whole numbers of seconds, almost down to the millisecond: I observed almost exactly 8s, 9s, 10s and 11s gaps. There are also gaps for other resources, but generally at most 1s, again almost exact to the millisecond.~~

Never mind the above: the original timestamps only have second-level precision, and the gaps are probably an artifact of how Datadog ingests logs.
Note that the object JSON gets printed, so according to my reading of the code this can mean that kubectl actually ran successfully rather quickly, but I'll double-check.
I've profiled the app controller while the longest-running processor time was several minutes, and it seems like the diff takes relatively little CPU, so I assume an I/O or lock bottleneck.

Here's the profiling for `CompareAppState`:
Looks like `/home/ubuntu/.kube/cache` wasn't working properly for discovery in particular, since in our setup the root filesystem is set to read-only. I've mounted empty dirs for both that path and `/tmp` and am monitoring the state.

I discovered this after getting `failed to write cache to` logs with `gloglevel` set to 4.
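The workaround looks roughly like this; a sketch of the application controller pod spec fragment (volume names are arbitrary, paths match what the controller was trying to write to in our setup):

```yaml
# Sketch: provide writable scratch space when the root filesystem is
# read-only, so kubectl discovery/HTTP caches can actually be written.
spec:
  containers:
    - name: argocd-application-controller
      volumeMounts:
        - name: kube-cache
          mountPath: /home/ubuntu/.kube/cache
        - name: tmp
          mountPath: /tmp
  volumes:
    - name: kube-cache
      emptyDir: {}
    - name: tmp
      emptyDir: {}
```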
That didn't solve the problem with slow diffing, unfortunately.
`gloglevel` equal to 6 is the magic number to get more insights. I've started getting lines that confirm it's the API call doing the dry run apply that is slow, for `Deployment`s in this case, taking 3-5 seconds, while for other resources it can be around 10ms.

Log lines look like:
`PATCH https://<ip>:443/apis/apps/v1/namespaces/default/deployments/<name>?dryRun=All&fieldManager=argocd-controller&force=true 200 OK in 5069 milliseconds`
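In case someone wants to reproduce this, I got the extra verbosity by passing the flag to the controller container, roughly like below (a sketch; the exact command/args layout depends on how Argo CD is installed, check `--help` for your version):

```yaml
# Sketch: add --gloglevel=6 to the application controller so client-go logs
# per-request API timings like the PATCH line above.
spec:
  containers:
    - name: argocd-application-controller
      args:
        - --gloglevel=6
```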
I've checked the k8s logs. It's due to our validating webhook. Since it's not an Argo CD issue, I'll close this.
But one thing I've noticed is that there can be 3 calls per `Deployment`:
PATCH https://<ip>:443/apis/apps/v1/namespaces/default/deployments/<name>?dryRun=All&fieldManager=argocd-controller&force=true 200 OK in A milliseconds
PATCH https://<ip>:443/apis/apps/v1/namespaces/default/deployments/<name>?dryRun=All&fieldManager=argocd-controller 200 OK in B milliseconds
PATCH https://<ip>:443/apis/apps/v1/namespaces/default/deployments/<name>?dryRun=All&fieldManager=argocd-controller&force=true 200 OK in C milliseconds
Note the `force=true` in the first one, its absence in the second one, and its presence again in the third one. I don't quite understand why that's needed. Does anyone know?

I've confirmed that the logs are for the same resource, and timestamp-wise they happen practically sequentially (e.g. one every ~10ms if each call takes ~10ms).
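For anyone who ends up here with the same symptom: the latency comes from the admission webhook that the API server calls on every (dry-run) apply, so the relevant knobs are on the webhook configuration itself. A sketch with made-up names and values, just to show the fields that matter:

```yaml
# Sketch of the ValidatingWebhookConfiguration fields that bound how much
# latency a webhook adds to each apply. Names and values are illustrative.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: example-deployment-policy
webhooks:
  - name: deployments.example.com
    # The API server waits up to this long for the webhook; a slow webhook
    # can add up to this much to every PATCH shown above.
    timeoutSeconds: 10
    # With Ignore, a timed-out webhook doesn't fail the apply, but the
    # latency is still paid.
    failurePolicy: Ignore
    # sideEffects: None webhooks are still invoked for dryRun=All requests.
    sideEffects: None
    admissionReviewVersions: ["v1"]
    clientConfig:
      service:
        name: example-webhook
        namespace: example
        path: /validate
    rules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
```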
Checklist:

- I've pasted the output of `argocd version`.

Describe the bug
The diffing step (the `diff_ms` checkpoint) can sometimes start taking several minutes on medium-size clusters and/or namespaces (thousands to tens of thousands of resources). It may resolve on its own (like after 24 hours in one case), but often does not. It happens on some clusters but not on other similar ones.

When looking at the "Applying resource ..." logs in dry run mode, they typically come in batches during a short period of time, but some resource kinds may have gaps after them (i.e. 10 seconds without a log about a resource apply for an application). Once it happened for `ScaledObject`, another time for `Deployment`.

It can happen regardless of diff cache usage, though the diff cache makes it less likely since fewer resources get dry-run applied.
I've checked pending and running kubectl commands for the app controller, but there's no strong correlation there. The K8s client QPS variables have been bumped from the defaults, and bumping them further doesn't help.
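For completeness, the QPS bump I mentioned is done via environment variables on the controller; a sketch (variable names are the ones I believe Argo CD reads, and the values are just what we experimented with):

```yaml
# Sketch: raise the client-side k8s rate limits for the application
# controller. Env var names assumed from the Argo CD docs.
spec:
  containers:
    - name: argocd-application-controller
      env:
        - name: ARGOCD_K8S_CLIENT_QPS
          value: "100"
        - name: ARGOCD_K8S_CLIENT_BURST
          value: "200"
```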
The setup includes webhooks, a larger `timeout.reconciliation` (30m instead of the default 3m), and some jitter for the clusters where it keeps happening. I've tried to bump the jitter and it seems to help a bit, but slow diffs still happen. Reducing the reconciliation timeout doesn't seem like an option: when I tried that some time ago, the load and queueing increased.

CPU and memory for the controller are actually overprovisioned by a lot, so that shouldn't be an issue.
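For reference, the reconciliation timeout and jitter mentioned above live in `argocd-cm`; a sketch (the jitter key name is as I understand it from the docs and may not exist in older versions, and the jitter value here is illustrative):

```yaml
# Sketch of the argocd-cm settings described above: reconcile every 30m
# instead of the default 3m, with jitter to spread refreshes out.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  timeout.reconciliation: "30m"
  # Key name assumed; value illustrative.
  timeout.reconciliation.jitter: "5m"
```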
To Reproduce
I don't have a consistent repro, but it just keeps happening on some of the clusters.
Expected behavior
There are no gaps between dry run resource applies, and diffing is relatively quick, ideally below ~30s.
Screenshots
Version
A custom build from master on 2024/08/02.
Logs