argoproj / gitops-engine

Democratizing GitOps
https://pkg.go.dev/github.com/argoproj/gitops-engine?tab=subdirectories
Apache License 2.0
1.67k stars 251 forks source link

feat: More optimal IterateHierarchyV2 and iterateChildrenV2 [#600] #601

Closed andrii-korotkov-verkada closed 1 month ago

andrii-korotkov-verkada commented 2 months ago

Closes #600

The existing (effectively v1) implementations are suboptimal since they don't construct a graph before the iteration. They search for children by looking at all namespace resources and checking isParentOf, which can give O(tree_size * namespace_resources_count) time complexity. The v2 algorithms construct the graph and have O(namespace_resources_count) time complexity. See more details in the linked issues.

andrii-korotkov-verkada commented 2 months ago

Testing with ArgoCD https://github.com/argoproj/argo-cd/pull/18972

andrii-korotkov-verkada commented 2 months ago

Looks really good on live cluster. ~300ms instead of almost ~4m for the same application! Screenshot 2024-07-07 at 4 29 56 PM

andrii-korotkov-verkada commented 2 months ago

Here are some perf views of the system collected following https://github.com/argoproj/argo-cd/issues/13534#issuecomment-1562184630.

The build is from master on 2024/07/07 including https://github.com/argoproj/argo-cd/pull/18972, https://github.com/argoproj/argo-cd/pull/18694, https://github.com/argoproj/gitops-engine/pull/601, https://github.com/argoproj/gitops-engine/pull/603.

Screenshot 2024-07-09 at 3 47 18 PM Screenshot 2024-07-09 at 3 48 02 PM Screenshot 2024-07-09 at 3 48 39 PM Screenshot 2024-07-09 at 4 21 08 PM Screenshot 2024-07-09 at 9 35 32 PM Screenshot 2024-07-09 at 9 35 20 PM

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 82.79570% with 16 lines in your changes missing coverage. Please review.

Project coverage is 58.38%. Comparing base (fa0e8d6) to head (905c87e). Report is 3 commits behind head on master.

Files Patch % Lines
pkg/cache/cluster.go 86.11% 6 Missing and 4 partials :warning:
pkg/cache/resource.go 71.42% 3 Missing and 3 partials :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #601 +/- ## ========================================== + Coverage 55.91% 58.38% +2.46% ========================================== Files 42 42 Lines 4900 5008 +108 ========================================== + Hits 2740 2924 +184 + Misses 1937 1840 -97 - Partials 223 244 +21 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

crenshaw-dev commented 1 month ago

Could you take a look at this PR? https://github.com/andrii-korotkov-verkada/gitops-engine/pull/1

I'll rebase, the force-push messed up the diff.

crenshaw-dev commented 1 month ago

Pushed.

andrii-korotkov-verkada commented 1 month ago

@crenshaw-dev, looks good! Thanks for improving the code. I've merged the changes.

crenshaw-dev commented 1 month ago

@andrii-korotkov-verkada thanks! I may have a few more as I continue digging through the code. Bear with me. I plan to stick with it this week until we get it merged, as long as you have time to keep working on it!

crenshaw-dev commented 1 month ago

More fun: https://github.com/andrii-korotkov-verkada/gitops-engine/pull/2

andrii-korotkov-verkada commented 1 month ago

Sounds good! Thanks for the help. I'd be pretty active on this.

crenshaw-dev commented 1 month ago

Probably the last graph building optimization: https://github.com/andrii-korotkov-verkada/gitops-engine/pull/3

crenshaw-dev commented 1 month ago

@andrii-korotkov-verkada up for considering this one? https://github.com/andrii-korotkov-verkada/gitops-engine/pull/4

I think we do risk missing parent relationships due to Resources lacking the correct namespace field. But so far unit tests, Argo CD e2e tests, and running in an Intuit system looks okay.

I think it would be worth the risk for cutting the memory use in half and saving ~30% execution time.

andrii-korotkov-verkada commented 1 month ago

In general it'd make sense to me since all other places use group. I'll merge it.

crenshaw-dev commented 1 month ago

Test results from Intuit:

Before, IterateHierarchy was taking about 10% of the application controller's time while refreshing apps. Now it takes around 1%. So roughly 6s before, now less than 1s out of a 60s profile.

Heap use has gone up by about 2x, but it wasn't memory-heavy before, and it hasn't really increased in a problematic way.

Steady-state CPU and memory is basically the same as before.

I used log metrics to measure 95th percentile reconciliation, getting the resource tree, and setting managed resources.

Reconciliation time is about the same, maaaaybe 25% faster. Getting the resource tree is about the same amount of time. Setting the managed resources now takes ~one fifth the time it did before, down to 200ms from 1000ms.

I think the vast majority of that improvement doesn't actually come from the graph pre-build optimization, but instead comes from avoiding iterating over all the managed resources (an optimization which doesn't actually depend on pre-building the graph).

In summary: my test showed no performance regressions, no functionality regressions, and maybe a slight performance improvement related to the new algorithm. I suspect the relatively small improvement is because we have relatively few resources per-namespace. Bigger namespaces will have a bigger CPU win and (probably) a higher memory cost.

andrii-korotkov-verkada commented 1 month ago

Thanks for all the testing and support! Yeah, I guess we just have quite large default namespace (which isn't best practice, but here we are), hence I saw larger wins. Longest running processor in the largest cluster went down from 30-60min to 1-2min.