mpelekh opened 1 month ago
The issue
In large clusters where Argo CD monitors numerous resources, processing the watch events becomes significantly slow. In our case (~400k total Kubernetes resources in the cluster, including ~76k Pods and ~52k ReplicaSets), it takes around 10 minutes. As a result, the Argo CD UI displays outdated information, which impacts features that rely on sync waves, such as PruneLast. Eventually, the sheer volume of events from the cluster overwhelmed the system, causing Argo CD to stall completely.
To address this, we disabled tracking of Pods and ReplicaSets, although this compromises one of the main benefits of the Argo CD UI. We also filtered out irrelevant events and tried to optimize various application controller settings. However, vertical scaling of the application controller had no effect, and horizontal scaling is not an option for a single cluster due to sharding limitations.
Issue causes
During the investigation, we found that the problem lies in the following:
Patched v2.10.9
v2.10.9 was patched with the following commits.
The fix suggested in https://github.com/argoproj/gitops-engine/issues/602 to optimize lock usage did not improve the situation in large clusters.
Avoid resource lock contention by utilizing a channel
Since we still had significant lock contention in massive clusters and the approaches above didn't resolve the issue, another approach has been considered. It is part of this PR.
As long as each goroutine must acquire the write lock itself, we can't handle more than one event at a time. What if we introduce a channel to which all received events are sent, and a single goroutine is responsible for processing events from that channel in batches? The locking then moves from the individual watch goroutines into the one goroutine that consumes the channel. With only one place acquiring the write lock, the lock contention is gone.
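A minimal sketch of the idea in Go, with hypothetical names (event, clusterCache, eventCh, and so on are illustrations, not the PR's actual identifiers): watch handlers only send to a channel, and a single consumer goroutine drains whatever is queued into a batch and applies it under one write lock.

```go
package cache

import "sync"

// event stands in for a watch event; the real code processes Kubernetes
// watch events (the names here are hypothetical).
type event struct{ key, payload string }

type clusterCache struct {
	mu        sync.RWMutex
	resources map[string]string
	eventCh   chan event
}

// onEvent is called from many watch goroutines. Instead of each one
// taking the write lock, they all just send to the channel.
func (c *clusterCache) onEvent(ev event) {
	c.eventCh <- ev
}

// processEvents is the single consumer: it drains everything already
// queued into a batch, then applies the whole batch under one write
// lock. This is the only place the write lock is acquired.
func (c *clusterCache) processEvents() {
	for ev := range c.eventCh {
		batch := []event{ev}
	drain:
		for {
			select {
			case next := <-c.eventCh:
				batch = append(batch, next)
			default:
				break drain
			}
		}
		c.mu.Lock()
		for _, e := range batch {
			c.resources[e.key] = e.payload
		}
		c.mu.Unlock()
	}
}
```

Because the consumer is the only writer, readers contend with one goroutine's short, batched critical section instead of with thousands of per-event lock acquisitions.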
Conclusions
The fix shows significant performance improvements. We left Nodes, ReplicaSets, and Pods enabled on large clusters, and the Argo CD UI works smoothly. The original issue has been resolved: users can manage Pods and ReplicaSets on large clusters.
Your analysis is excruciatingly thorough, I love it! I've posted it to SIG Scalability, and we'll start analyzing ASAP. Please be patient, it'll take us a while to give it a really thorough review.
@mpelekh would you be interested in joining a SIG Scalability meeting to talk through the changes?
Could you open an Argo CD PR pointing to this commit so that we can run all Argo's tests?
@mpelekh would you be interested in joining a SIG Scalability meeting to talk through the changes?
@crenshaw-dev Yes, I’d be happy to join the SIG Scalability meeting to discuss the changes. Please let me know the time and details or if there’s anything specific I should prepare in advance.
Great! The event is on the Argoproj calendar, and we coordinate in CNCF Slack. The next meeting is two Wednesdays from now at 8am Eastern time.
No need to prepare anything really, just be prepared to answer questions about the PR. :-)
Could you open an Argo CD PR pointing to this commit so that we can run all Argo's tests?
@crenshaw-dev Sure. Here it is - https://github.com/argoproj/argo-cd/pull/20329.
A couple things from the contributors meeting last week:
1) We should probably make this configurable via a flag from Argo CD; the more I think about it, the more I think we should have a quick opt-out option.
2) If feasible, we should have batches processed on a ticker or capped at some max slice size; that'll help manage particularly high-churn spikes (see the sketch after this list).
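A rough sketch of point 2, reusing the hypothetical types from the earlier sketch (the knob names and values here are assumptions; the real flags and defaults would be decided in Argo CD): flush a batch when it reaches a maximum size or when a ticker fires, whichever comes first.

```go
package cache

import "time"

// Hypothetical knobs, not actual Argo CD flags.
const (
	maxBatchSize  = 1000
	flushInterval = 100 * time.Millisecond
)

// processEventsBatched flushes a batch when it reaches maxBatchSize or
// when the ticker fires, whichever comes first.
func (c *clusterCache) processEventsBatched() {
	ticker := time.NewTicker(flushInterval)
	defer ticker.Stop()

	batch := make([]event, 0, maxBatchSize)
	flush := func() {
		if len(batch) == 0 {
			return
		}
		c.mu.Lock()
		for _, e := range batch {
			c.resources[e.key] = e.payload
		}
		c.mu.Unlock()
		batch = batch[:0]
	}

	for {
		select {
		case ev, ok := <-c.eventCh:
			if !ok {
				flush()
				return
			}
			batch = append(batch, ev)
			if len(batch) >= maxBatchSize {
				flush()
			}
		case <-ticker.C:
			flush()
		}
	}
}
```

The size cap bounds memory use and write-lock hold time during churn spikes, while the ticker bounds how stale the cache can get when events trickle in slowly.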
Do we have a definitive answer yet on whether sync status and operation status are currently updated atomically, or just very quickly? Because if we're losing atomicity, that could be a big problem. If we're just slowing down something that used to be fast, I think that's relatively okay.
I provided the details in this comment - https://github.com/argoproj/argo-cd/pull/20329#issuecomment-2460145267
tl;dr The sync and operation statuses are not updated atomically; they are just updated very quickly.
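To make the distinction concrete, a hypothetical illustration (not Argo CD's actual code): if the two statuses are written under separate lock acquisitions, each write is fast, but a reader can briefly observe one updated without the other.

```go
package cache

import "sync"

// appStatus is a hypothetical stand-in for the two related status fields.
type appStatus struct {
	mu         sync.Mutex
	syncStatus string
	opState    string
}

// set updates the two fields under separate lock acquisitions: each
// write is fast, but the pair is not atomic.
func (s *appStatus) set(newSync, newOp string) {
	s.mu.Lock()
	s.syncStatus = newSync
	s.mu.Unlock()

	// a concurrent reader taking the lock here observes the new sync
	// status paired with the old operation state

	s.mu.Lock()
	s.opState = newOp
	s.mu.Unlock()
}
```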
Thanks for the review @andrii-korotkov-verkada. I addressed the comments in this commit - https://github.com/argoproj/gitops-engine/pull/629/commits/7a53ecab1d0eab6197dd11221f7c0b9ed3bed738
I am going to `git rebase -i --autosquash` it before merge.
The problem statement is in https://github.com/argoproj/argo-cd/issues/8172#issuecomment-2277585238
The IterateHierarchyV2 fix significantly improved performance, getting us ~90% of the way there, but on huge clusters we still see significant lock contention.
The fix in this pull request approaches the problem differently: it avoids lock contention by utilizing a channel to process events from the cluster.
More details are in the comments.