argoproj / gitops-engine

Democratizing GitOps
https://pkg.go.dev/github.com/argoproj/gitops-engine?tab=subdirectories
Apache License 2.0

fix: avoid resources lock contention utilizing channel #629

Open mpelekh opened 1 month ago

mpelekh commented 1 month ago

Problem statement is in https://github.com/argoproj/argo-cd/issues/8172#issuecomment-2277585238

The IterateHierarchyV2 change significantly improved performance, getting us ~90% of the way there. But on huge clusters, we still have significant lock contention.

The fix in this pull request approaches the problem differently: it avoids lock contention by using a channel to process events from the cluster.

More details are in the comments.

mpelekh commented 1 month ago

The issue

In large clusters where Argo CD monitors numerous resources, the processing of watches becomes significantly slow; in our case (total k8s resources in the cluster: ~400k, Pods: ~76k, ReplicaSets: ~52k), it takes around 10 minutes. As a result, the Argo CD UI displays outdated information, which impacts several features that rely on sync waves, such as PruneLast. Eventually, the sheer volume of events from the cluster overwhelmed the system, causing Argo CD to stall completely.

To address this, we disabled the tracking of Pods and ReplicaSets, although this compromises one of the main benefits of the Argo CD UI. We also filtered out irrelevant events and tried to optimize various settings in the application controller. However, vertical scaling of the application controller had no effect, and horizontal scaling is not an option for a single cluster due to sharding limitations.

Issue causes

During the investigation, it was found that the problem comes down to the following:

Patched v2.10.9

v2.10.9 was patched with the following commits.

Though the patches significantly improve performance, Argo CD still cannot handle the load from large clusters.

The screenshots below show one of the largest clusters, running the v2.10.9 build patched with the above commits:

- until 12:50, Pods and ReplicaSets are excluded from tracking
- from 12:50 to 13:34, Pods and ReplicaSets are tracked
- after 13:34, Pods and ReplicaSets are excluded from tracking again

As can be seen, once Pods and ReplicaSets are tracked, the cluster event count falls close to zero, and reconciliation time increases drastically.

![Screenshot 2024-08-09 at 20 40 44](https://github.com/user-attachments/assets/00751664-755c-411d-a108-b432de59844c)
![Screenshot 2024-08-09 at 20 51 00](https://github.com/user-attachments/assets/6ffa97fd-9c66-4282-9e59-b1a88c993567)

Number of Pods in the cluster: ~76k
Number of ReplicaSets in the cluster: ~52k

A more detailed comparison of the different patched versions is in this comment: https://github.com/argoproj/argo-cd/issues/8172#issuecomment-2278663389
The potential reason is lock contention. A few more metrics were added, and it turned out that when the number of events is significant, it can sometimes take ~5 minutes to acquire the lock, which delays reconciliation: https://github.com/mpelekh/gitops-engine/commit/560ef00bcce9201083200f906f15bf1716fbfcc0#diff-9c9e197d543705f08c9b1bc2dc404a55506cfc2935a988e6007d248257aadb1aR1372

![Screenshot 2024-08-09 at 21 11 33](https://github.com/user-attachments/assets/86255c3d-f8e4-45b1-8898-87f42a3c8b68)
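
As an illustration of how such a measurement could be taken, here is a minimal, hypothetical sketch that times how long callers wait on a shared write lock. The `timedMutex` wrapper, its one-second threshold, and the use of plain logging are assumptions made for the example; this is not the instrumentation from the linked commit.

```go
package main

import (
	"log"
	"sync"
	"time"
)

// timedMutex is a hypothetical wrapper that records how long each caller
// blocks before acquiring the write lock.
type timedMutex struct {
	mu sync.RWMutex
}

func (t *timedMutex) Lock() {
	start := time.Now()
	t.mu.Lock()
	if wait := time.Since(start); wait > time.Second {
		// The real investigation exported this as a metric; logging keeps
		// the sketch self-contained.
		log.Printf("waited %s to acquire cluster cache write lock", wait)
	}
}

func (t *timedMutex) Unlock() { t.mu.Unlock() }

func main() {
	var m timedMutex
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			m.Lock()
			time.Sleep(200 * time.Millisecond) // simulate per-event processing under the lock
			m.Unlock()
		}()
	}
	wg.Wait()
}
```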

The fix suggested in https://github.com/argoproj/gitops-engine/issues/602 to optimize the lock usage did not resolve the issue in large clusters.

Avoid resources lock contention utilizing channel

Since we still have significant lock contention in massive clusters and the approaches above did not resolve the issue, another approach was considered; it is part of this PR.

When every goroutine must acquire the write lock itself, we cannot handle more than one event at a time. What if we introduce a channel to which all received events are sent, and make a single goroutine responsible for processing events from that channel in batches? That way, the locking moves out of the individual watcher goroutines and into the one goroutine that processes events from the channel. With only one place acquiring the write lock, the lock contention goes away.
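
A minimal sketch of the idea, under the assumption of a drastically simplified cache keyed by resource name; the `event`, `cache`, and `processEvents` names are illustrative only and are not the actual types or functions introduced by this PR.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// event stands in for a watch event from the cluster.
type event struct{ key string }

type cache struct {
	mu        sync.RWMutex
	resources map[string]event
	eventCh   chan event
}

func newCache() *cache {
	c := &cache{
		resources: map[string]event{},
		eventCh:   make(chan event, 1024), // buffered so watchers never block on the lock
	}
	go c.processEvents()
	return c
}

// processEvents is the single consumer: it drains whatever is currently
// queued into a batch and applies the whole batch under one write lock,
// so the lock is taken in exactly one goroutine.
func (c *cache) processEvents() {
	for ev := range c.eventCh {
		batch := []event{ev}
		// Drain anything else already queued so the whole burst is applied
		// under a single lock acquisition.
		for drained := false; !drained; {
			select {
			case next := <-c.eventCh:
				batch = append(batch, next)
			default:
				drained = true
			}
		}
		c.mu.Lock()
		for _, e := range batch {
			c.resources[e.key] = e
		}
		c.mu.Unlock()
	}
}

// size shows a reader taking the read lock, as UI/reconciliation paths would.
func (c *cache) size() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return len(c.resources)
}

func main() {
	c := newCache()
	// Many watcher goroutines just enqueue; none of them contend on the lock.
	for i := 0; i < 100; i++ {
		go func(i int) { c.eventCh <- event{key: fmt.Sprintf("pod-%d", i)} }(i)
	}
	time.Sleep(time.Second)
	fmt.Println("cached resources:", c.size())
}
```

With this shape, watcher goroutines only ever enqueue, and the write lock is acquired in exactly one goroutine, once per batch rather than once per event.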

The fix results

![2a758ca7-9f86-4c98-b0c7-4bf81e6b91e7](https://github.com/user-attachments/assets/22a0c868-61fd-4c04-988b-93923ce3a1c0)

As can be seen from the metrics, once the fixed version was deployed and Nodes, ReplicaSets, and Pods were enabled for tracking, the number of cluster events stayed stable and did not drop.

Conclusions

The fix shows significant performance improvements. We left Nodes, ReplicaSets, and Pods enabled on large clusters, and the Argo CD UI is working smoothly. The original issue has been resolved: users can manage Pods and ReplicaSets on large clusters.

crenshaw-dev commented 1 month ago

Your analysis is excruciatingly thorough, I love it! I've posted it to SIG Scalability, and we'll start analyzing ASAP. Please be patient, it'll take us a while to give it a really thorough review.

crenshaw-dev commented 1 month ago

@mpelekh would you be interested in joining a SIG Scalability meeting to talk through the changes?

crenshaw-dev commented 1 month ago

Could you open an Argo CD PR pointing to this commit so that we can run all Argo's tests?

mpelekh commented 1 month ago

> @mpelekh would you be interested in joining a SIG Scalability meeting to talk through the changes?

@crenshaw-dev Yes, I’d be happy to join the SIG Scalability meeting to discuss the changes. Please let me know the time and details or if there’s anything specific I should prepare in advance.

crenshaw-dev commented 1 month ago

Great! The event is on the Argoproj calendar, and we coordinate in CNCF Slack. The next meeting is two Wednesdays from now at 8am eastern time.

No need to prepare anything really, just be prepared to answer questions about the PR. :-)

mpelekh commented 1 month ago

> Could you open an Argo CD PR pointing to this commit so that we can run all Argo's tests?

@crenshaw-dev Sure. Here it is - https://github.com/argoproj/argo-cd/pull/20329.

crenshaw-dev commented 4 weeks ago

A couple things from the contributors meeting last week:

1) we should probably make this configurable via a flag from Argo CD; the more I think about it, the more I think we should have a quick opt-out option
2) if feasible, we should have batches process on a ticker or some max slice size; that'll help manage particularly high-churn spikes
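
A minimal, hypothetical sketch of what the second point could look like; `flushOnTickerOrSize`, `maxBatch`, and `interval` are made-up names for this example and are not part of this PR.

```go
package main

import (
	"fmt"
	"time"
)

// flushOnTickerOrSize accumulates events and flushes the batch either when
// the ticker fires or when the batch reaches maxBatch, whichever comes first.
func flushOnTickerOrSize(events <-chan string, maxBatch int, interval time.Duration, apply func([]string)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	batch := make([]string, 0, maxBatch)
	flush := func() {
		if len(batch) == 0 {
			return
		}
		apply(batch)
		batch = batch[:0]
	}

	for {
		select {
		case ev, ok := <-events:
			if !ok {
				flush()
				return
			}
			batch = append(batch, ev)
			if len(batch) >= maxBatch {
				flush() // cap batch size so high-churn spikes are applied incrementally
			}
		case <-ticker.C:
			flush() // time-based flush keeps latency bounded when traffic is low
		}
	}
}

func main() {
	events := make(chan string)
	go func() {
		for i := 0; i < 25; i++ {
			events <- fmt.Sprintf("event-%d", i)
		}
		close(events)
	}()
	flushOnTickerOrSize(events, 10, 100*time.Millisecond, func(b []string) {
		fmt.Printf("flushed %d events\n", len(b))
	})
}
```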

mpelekh commented 2 weeks ago

> Do we have a definitive answer yet for whether sync status/operation status are currently atomically updated vs. just very-quickly updated? Because if we're losing atomicity, that could be a big problem. If we're just slowing something down that used to be fast, I think that's relatively okay.

I provided the details in this comment - https://github.com/argoproj/argo-cd/pull/20329#issuecomment-2460145267

tl;dr The sync and operation statuses are not atomically updated; they are just updated very quickly.

sonarcloud[bot] commented 1 week ago

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
2.6% Duplication on New Code

See analysis details on SonarQube Cloud

mpelekh commented 1 week ago

Thanks for the review @andrii-korotkov-verkada. I addressed the comments in this commit - https://github.com/argoproj/gitops-engine/pull/629/commits/7a53ecab1d0eab6197dd11221f7c0b9ed3bed738

I am going to `rebase -i --autosquash` it before merge.