argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

Optimize processing managed resources in getResourceTree, ideally from quadratic to linear #18929

Open andrii-korotkov-verkada opened 5 days ago

andrii-korotkov-verkada commented 5 days ago

Summary

getResourceTree loops over each managed resource and calls IterateHierarchy on the state cache to visit its children. That in turn calls IterateHierarchy in gitops-engine, which loops over all resources in the namespace (https://github.com/argoproj/gitops-engine/blob/master/pkg/cache/cluster.go#L973) and calls iterateChildren, which contains a similar loop (https://github.com/argoproj/gitops-engine/blob/master/pkg/cache/resource.go#L89). This is presumably because we only track the child -> parent relationship, while efficient traversal needs the opposite direction. Overall, this can result in quadratic execution time, O(tree_size * namespace_resources_count). If we pre-construct a graph with parent -> child edges, this can be reduced to linear, O(namespace_resources_count).
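To make the cost concrete, here is a deliberately simplified model of the scan-per-node pattern described above. It is not the gitops-engine code: `Resource`, `childrenByScan` and `visitTree` are hypothetical names, and each resource is assumed to have at most one owner.

```go
package main

import "fmt"

// Resource is a hypothetical, simplified cached resource that only records
// the child -> parent direction (OwnerUID is empty when there is no owner).
type Resource struct {
	UID      string
	OwnerUID string
}

// childrenByScan finds the direct children of parent by scanning every
// resource in the namespace. It also reports how many resources it examined.
func childrenByScan(parent string, ns []Resource) (kids []string, examined int) {
	for _, r := range ns {
		examined++
		if r.OwnerUID == parent {
			kids = append(kids, r.UID)
		}
	}
	return kids, examined
}

// visitTree walks the hierarchy below root, rescanning the whole namespace at
// every visited node, so one tree costs tree_size * len(ns) examinations.
func visitTree(root string, ns []Resource) (visited, examined int) {
	kids, n := childrenByScan(root, ns)
	visited, examined = 1, n
	for _, k := range kids {
		v, e := visitTree(k, ns)
		visited += v
		examined += e
	}
	return visited, examined
}

func main() {
	// 500 managed resources in one namespace, none of which has children.
	ns := make([]Resource, 500)
	for i := range ns {
		ns[i] = Resource{UID: fmt.Sprintf("res-%d", i)}
	}
	total := 0
	for _, r := range ns {
		_, e := visitTree(r.UID, ns)
		total += e
	}
	fmt.Println(total) // 250000
}
```

Even with no children at all, every managed resource pays for a full namespace scan, which is where the quadratic factor comes from in this model.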

Motivation

getResourceTree seems to be the slowest part of reconciliation for large apps, e.g. see the timing data from a build with https://github.com/argoproj/argo-cd/pull/18926:

[Screenshot 2024-07-03 at 7 16 57 PM] [Screenshot 2024-07-03 at 7 17 09 PM]

Getting the resource tree took almost 4 minutes and accounted for almost all of the reconciliation time. This is an app with ~500 resources which don't even have children, and it's not on the biggest cluster (staging). On the biggest cluster (one of prod), reconciling that app can take 30-60 min even with no changes. The algorithm doesn't have to be quadratic and can be improved to linear.

Proposal

Pre-construct a parent -> child graph from the namespace resources and do a linear DFS from each managed resource, never visiting the same vertex twice. gitops-engine changes may not even be needed, but may be good to have anyway. Overall, this would reduce the time complexity from quadratic to linear.
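A minimal sketch of that idea, assuming owners can be identified by UID alone. The `Resource` type and the `buildChildren` and `traverse` helpers are hypothetical and are not existing argo-cd or gitops-engine APIs.

```go
package main

import "fmt"

// Resource is a hypothetical, simplified stand-in for a cached cluster
// resource: it only knows its own UID and the UIDs of its owners.
type Resource struct {
	UID       string
	OwnerUIDs []string
}

// buildChildren inverts the child -> parent references into a
// parent -> children adjacency map in a single pass over the namespace.
func buildChildren(nsResources []Resource) map[string][]string {
	children := make(map[string][]string, len(nsResources))
	for _, r := range nsResources {
		for _, owner := range r.OwnerUIDs {
			children[owner] = append(children[owner], r.UID)
		}
	}
	return children
}

// traverse runs an iterative DFS from every root and returns each reachable
// UID exactly once, in visit order. The shared visited set is what keeps the
// total work linear even when roots overlap.
func traverse(roots []string, children map[string][]string) []string {
	visited := make(map[string]bool)
	var order []string
	stack := make([]string, 0, len(roots))
	for i := len(roots) - 1; i >= 0; i-- {
		stack = append(stack, roots[i])
	}
	for len(stack) > 0 {
		uid := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if visited[uid] {
			continue
		}
		visited[uid] = true
		order = append(order, uid)
		kids := children[uid]
		for i := len(kids) - 1; i >= 0; i-- {
			stack = append(stack, kids[i])
		}
	}
	return order
}

func main() {
	ns := []Resource{
		{UID: "deploy"},
		{UID: "rs", OwnerUIDs: []string{"deploy"}},
		{UID: "pod-a", OwnerUIDs: []string{"rs"}},
		{UID: "pod-b", OwnerUIDs: []string{"rs"}},
		{UID: "configmap"},
	}
	// "rs" is listed as a root too, but is only visited once.
	fmt.Println(traverse([]string{"deploy", "configmap", "rs"}, buildChildren(ns)))
	// Prints: [deploy rs pod-a pod-b configmap]
}
```

Building the map is one pass over the namespace, and the DFS touches each vertex and edge at most once, so the total is linear in the number of namespace resources plus owner references.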

A sub-optimal fix can be to configure some resource groups and kinds as not having any children, so hierarchy iteration can be skipped for them.
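That fallback could look roughly like the sketch below. The `GroupKind` type, the `leafKinds` set and `needsHierarchyWalk` are hypothetical, not an existing argo-cd setting, and the kinds listed are only illustrative.

```go
package main

import "fmt"

// GroupKind identifies a resource type by API group and kind.
type GroupKind struct{ Group, Kind string }

// leafKinds is a hypothetical, user-configured set of types that are assumed
// never to own other resources. The entries here are examples only.
var leafKinds = map[GroupKind]bool{
	{Group: "", Kind: "ConfigMap"}: true,
	{Group: "", Kind: "Secret"}:    true,
}

// needsHierarchyWalk reports whether children should be looked up for gk.
func needsHierarchyWalk(gk GroupKind, leaf map[GroupKind]bool) bool {
	return !leaf[gk]
}

func main() {
	managed := []GroupKind{
		{Group: "apps", Kind: "Deployment"},
		{Group: "", Kind: "ConfigMap"},
		{Group: "", Kind: "Secret"},
	}
	for _, gk := range managed {
		fmt.Printf("%s walk=%v\n", gk.Kind, needsHierarchyWalk(gk, leafKinds))
	}
}
```

This only removes work for the configured types. Any resource that still needs a walk pays the full namespace scan, which is why it is a stopgap rather than a fix.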

andrii-korotkov-verkada commented 5 days ago

I'll try to implement this.

andrii-korotkov-verkada commented 5 days ago

isParentOf can't be used directly to construct the graph, since calling it for every pair of resources would itself be quadratic. It also does some backfill rather than just a UID-based check (https://github.com/argoproj/gitops-engine/blob/master/pkg/cache/resource.go#L36-L52), so we need maps of nodes keyed by UID and by kind + API version + name.
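A rough sketch of such a two-map index, assuming all nodes are in one namespace and that an owner reference with an empty UID should fall back to a kind + API version + name lookup. All types and functions here are hypothetical, and the real isParentOf logic may differ in its details.

```go
package main

import "fmt"

// OwnerRef is a simplified owner reference. UID may be empty, which is the
// case the name-based fallback is meant to cover.
type OwnerRef struct {
	APIVersion, Kind, Name, UID string
}

// Node is a hypothetical, simplified cached resource within one namespace.
type Node struct {
	UID, APIVersion, Kind, Name string
	Owners                      []OwnerRef
}

// nameKey identifies a node by API version + kind + name.
type nameKey struct{ APIVersion, Kind, Name string }

// Index holds both lookup maps, built in one pass over the namespace.
type Index struct {
	byUID  map[string]*Node
	byName map[nameKey]*Node
}

func NewIndex(nodes []*Node) *Index {
	ix := &Index{
		byUID:  make(map[string]*Node, len(nodes)),
		byName: make(map[nameKey]*Node, len(nodes)),
	}
	for _, n := range nodes {
		ix.byUID[n.UID] = n
		ix.byName[nameKey{n.APIVersion, n.Kind, n.Name}] = n
	}
	return ix
}

// Resolve returns the node an owner reference points to, or nil if unknown.
// It prefers the UID and only falls back to the name key when UID is empty.
func (ix *Index) Resolve(ref OwnerRef) *Node {
	if ref.UID != "" {
		return ix.byUID[ref.UID]
	}
	return ix.byName[nameKey{ref.APIVersion, ref.Kind, ref.Name}]
}

// ChildrenByParent builds parent UID -> child UIDs with constant-time owner
// lookups, instead of comparing every pair of nodes.
func ChildrenByParent(nodes []*Node, ix *Index) map[string][]string {
	children := make(map[string][]string)
	for _, n := range nodes {
		for _, ref := range n.Owners {
			if parent := ix.Resolve(ref); parent != nil {
				children[parent.UID] = append(children[parent.UID], n.UID)
			}
		}
	}
	return children
}

func main() {
	nodes := []*Node{
		{UID: "u1", APIVersion: "apps/v1", Kind: "Deployment", Name: "web"},
		{UID: "u2", APIVersion: "apps/v1", Kind: "ReplicaSet", Name: "web-1",
			Owners: []OwnerRef{{APIVersion: "apps/v1", Kind: "Deployment", Name: "web", UID: "u1"}}},
		// This owner reference has no UID, so it resolves via the name key.
		{UID: "u3", APIVersion: "v1", Kind: "Pod", Name: "web-1-x",
			Owners: []OwnerRef{{APIVersion: "apps/v1", Kind: "ReplicaSet", Name: "web-1"}}},
	}
	fmt.Println(ChildrenByParent(nodes, NewIndex(nodes)))
	// Prints: map[u1:[u2] u2:[u3]]
}
```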

andrii-korotkov-verkada commented 5 days ago

Here's a gitops-engine PR: https://github.com/argoproj/gitops-engine/pull/601. It's going to be the main PR with nearly all the logic; the argo-cd PR would just call a new API.

andrii-korotkov-verkada commented 1 day ago

Looks really good on a live cluster. ~300ms instead of ~4m to process managed resources for the same application! [Screenshot 2024-07-07 at 4 29 56 PM]