argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

The application controller may get slow on diffing (3-5+ min) when doing server side diff #19463

Open · andrii-korotkov-verkada opened this issue 3 months ago

andrii-korotkov-verkada commented 3 months ago

Describe the bug

The diffing step (the diff_ms checkpoint) can sometimes take several minutes on medium-size clusters and/or namespaces (thousands to tens of thousands of resources). It may resolve on its own (e.g., after 24 hours in one case), but often does not. It happens on some clusters but not on other, similar ones.

When looking at the "Applying resource ..." logs in dry-run mode, they typically come in batches over a short period of time, but some resource kinds may be followed by gaps (i.e., 10 seconds without a resource-apply log for an application). Once it happened for a ScaledObject, another time for a Deployment.
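Gaps like this can be spotted mechanically rather than by eyeballing timestamps. A minimal sketch, assuming a simplified hypothetical log shape of `<epoch-seconds> <kind/name> Applying resource` (the real controller log format differs, so the field extraction would need adapting):

```shell
# Print the gap between consecutive dry-run "Applying resource" log lines.
# The sample input below is fabricated for illustration.
awk '/Applying resource/ {
  if (prev != "") print $2, "gap:", ($1 - prev) "s"
  prev = $1
}' <<'EOF'
1723190000 Deployment/app-1 Applying resource
1723190001 Service/app-1 Applying resource
1723190011 ScaledObject/app-1 Applying resource
EOF
# Output:
#   Service/app-1 gap: 1s
#   ScaledObject/app-1 gap: 10s
```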

It can happen regardless of diff cache usage, though the diff cache makes it less likely, since fewer resources need a dry run.

I've checked pending and running kubectl commands for the app controller, but there's no strong correlation there. The K8s QPS variables have been bumped from the defaults, and bumping them further doesn't help.
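For reference, the "K8s QPS variables" are set as environment variables on the application controller. A sketch of the kind of bump described, assuming the `ARGOCD_K8S_CLIENT_QPS`/`ARGOCD_K8S_CLIENT_BURST` knobs; the values below are illustrative, not a recommendation:

```yaml
# Fragment of the argocd-application-controller container spec (illustrative values).
env:
  - name: ARGOCD_K8S_CLIENT_QPS
    value: "100"
  - name: ARGOCD_K8S_CLIENT_BURST
    value: "200"
```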

The setup includes webhooks, a larger timeout.reconciliation (30m instead of the default 3m), and some jitter for the clusters where it keeps happening. I've tried bumping the jitter, and it seems to help a bit, but slow diffs still happen. Reducing it doesn't seem like an option: when I tried that some time ago, load and queuing increased.
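For context, both knobs live in the argocd-cm ConfigMap. A sketch mirroring the setup described above (the jitter value is illustrative, since the report doesn't state it):

```yaml
# Fragment of the argocd-cm ConfigMap.
# timeout.reconciliation sets the periodic refresh interval;
# timeout.reconciliation.jitter spreads refreshes out to avoid spikes.
data:
  timeout.reconciliation: 30m
  timeout.reconciliation.jitter: 5m
```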

CPU and memory for the controller are heavily overprovisioned, so resources shouldn't be the issue.

To Reproduce

I don't have a consistent repro, but it just keeps happening on some of the clusters.

Expected behavior

There are no gaps between dry-run resource applies, and diffing is relatively quick, ideally below 30s.

Screenshots

Version

A custom build from master on 2024/08/02.

Logs

andrii-korotkov-verkada commented 3 months ago

A related issue was also sometimes observed: duplicate applies can increase the time spent, see https://github.com/argoproj/argo-cd/issues/19458.

andrii-korotkov-verkada commented 3 months ago

I removed the overtuned controller kubectl parallelism limit, cleaned up the repo-server parallelism limit, and restarted the controller and repo server. It didn't fix the issue, though it seems to happen a bit less often (I haven't observed for long enough).

~~I've noted that the gaps between a Deployment dry-run apply and the next resource's dry-run apply align closely with whole numbers of seconds, almost down to the millisecond: I observed almost exactly 8s, 9s, 10s, and 11s gaps. There are gaps for other resources too, generally at most 1s, but also almost exact to the millisecond.~~

Never mind the above: the original timestamps seem to have only second-level precision, and the exact gaps are probably an artifact of how Datadog ingests logs.

Note that the object JSON gets printed, so according to my reading of the code, kubectl may actually have run successfully rather quickly, but I'll double-check.

andrii-korotkov-verkada commented 3 months ago

I've profiled the app controller while the longest-running processor time was in the several-minutes range, and it seems diff takes relatively little CPU, so I assume an I/O or lock bottleneck.

Screenshot 2024-08-09 at 9 46 38 PM

andrii-korotkov-verkada commented 3 months ago

Here's profiling for CompareAppState

Screenshot 2024-08-09 at 10 12 15 PM

andrii-korotkov-verkada commented 3 months ago

Looks like /home/ubuntu/.kube/cache wasn't working properly, for discovery in particular, since in our setup the root filesystem is read-only. I've mounted emptyDirs for both that path and /tmp and am monitoring the state.

I discovered this after getting "failed to write cache to" errors in the logs with gloglevel set to 4.

andrii-korotkov-verkada commented 3 months ago

That didn't solve the problem with slow diffing unfortunately.

andrii-korotkov-verkada commented 3 months ago

gloglevel equal to 6 is the magic number for getting more insight. I've started getting lines that confirm it's the API call for the dry-run apply that is slow, for Deployments in this case: 3-5 seconds, while for other resources it can be around 10ms.

Log lines are like

```
PATCH https://<ip>:443/apis/apps/v1/namespaces/default/deployments/<name>?dryRun=All&fieldManager=argocd-controller&force=true 200 OK in 5069 milliseconds
```
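Round-trip lines like this can be filtered for slow calls mechanically. A sketch, assuming klog round-trip lines of the shape `... 200 OK in <N> milliseconds` and an arbitrary 1000ms threshold (the sample URLs below are fabricated):

```shell
# Print dry-run apply round trips slower than 1000 ms from controller logs.
# The latency is the second-to-last whitespace-separated field.
awk '/dryRun=All/ && /milliseconds/ {
  ms = $(NF - 1)
  if (ms + 0 > 1000) print ms " ms:", $2
}' <<'EOF'
PATCH https://10.0.0.1:443/apis/apps/v1/namespaces/default/deployments/web?dryRun=All&fieldManager=argocd-controller&force=true 200 OK in 5069 milliseconds
PATCH https://10.0.0.1:443/api/v1/namespaces/default/configmaps/web?dryRun=All&fieldManager=argocd-controller&force=true 200 OK in 12 milliseconds
EOF
```

Only the 5069ms Deployment call survives the filter here, matching the pattern described above where Deployments are the slow kind.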

andrii-korotkov-verkada commented 3 months ago

I've checked the k8s logs. It's due to our validating webhook.

andrii-korotkov-verkada commented 3 months ago

Since it's not an Argo CD issue, I'll close this.

andrii-korotkov-verkada commented 3 months ago

But one thing I've noticed is that there can be 3 calls per Deployment:

```
PATCH https://<ip>:443/apis/apps/v1/namespaces/default/deployments/<name>?dryRun=All&fieldManager=argocd-controller&force=true 200 OK in A milliseconds
PATCH https://<ip>:443/apis/apps/v1/namespaces/default/deployments/<name>?dryRun=All&fieldManager=argocd-controller 200 OK in B milliseconds
PATCH https://<ip>:443/apis/apps/v1/namespaces/default/deployments/<name>?dryRun=All&fieldManager=argocd-controller&force=true 200 OK in C milliseconds
```

Note the force=true in the first one, its absence in the second, and its presence again in the third. I don't quite understand why it's needed. Does anyone know?

andrii-korotkov-verkada commented 2 months ago

I've confirmed that the logs are for the same resource and that, timestamp-wise, the calls happen practically sequentially (e.g., every ~10ms when each step takes ~10ms).