argoproj / gitops-engine

Democratizing GitOps
https://pkg.go.dev/github.com/argoproj/gitops-engine?tab=subdirectories
Apache License 2.0

chore: Optimize usage of locking in the cluster [#602] #603

Closed. andrii-korotkov-verkada closed this 4 months ago

andrii-korotkov-verkada commented 4 months ago

Closes #602. Use a read lock for getting handlers. Refactor resource updates in the cluster cache to acquire the write lock only for the update itself, not while calling handlers.
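
For illustration, a minimal sketch of the locking pattern described above, with hypothetical types rather than the actual clusterCache code: handlers are read under a read lock, and the write lock is held only for the map update itself, not while the handlers run.

```go
// A minimal sketch of the pattern described above, with hypothetical types
// (not the actual clusterCache code): handlers are read under a read lock,
// and the write lock is held only for the map update itself, not while
// the handlers are running.
package cache

import "sync"

type Resource struct {
	Key  string
	Data string
}

type ResourceUpdatedHandler func(newRes, oldRes *Resource)

type cache struct {
	lock      sync.RWMutex
	resources map[string]*Resource
	handlers  []ResourceUpdatedHandler
}

// getHandlers only needs a read lock, since it does not mutate any state.
func (c *cache) getHandlers() []ResourceUpdatedHandler {
	c.lock.RLock()
	defer c.lock.RUnlock()
	return append([]ResourceUpdatedHandler(nil), c.handlers...)
}

// onResourceUpdated holds the write lock only while the map is mutated;
// the handlers are invoked after the write lock has been released.
func (c *cache) onResourceUpdated(newRes *Resource) {
	c.lock.Lock()
	oldRes := c.resources[newRes.Key]
	c.resources[newRes.Key] = newRes
	c.lock.Unlock()

	for _, h := range c.getHandlers() {
		h(newRes, oldRes)
	}
}
```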

andrii-korotkov-verkada commented 4 months ago

Not 100% sure it's safe, since resources and the namespace index can be updated by other threads while handlers are running. Maybe I'll add additional read locking.

andrii-korotkov-verkada commented 4 months ago

Seems like watching of events is done separately for each API resource type. The only place where the returned namespace resources seem to be used in practice in argo-cd is getAppRecursive. But the returned list shouldn't change even if some resources are being deleted; at most some fields would get modified.

Even with a read lock it's theoretically possible there's some race between releasing the write lock and acquiring the read lock. Not sure how much of a problem that would be though; thinking this through.

andrii-korotkov-verkada commented 4 months ago

At least there shouldn't be a race for the same key, so the worst case is that while namespace resources are being updated, a mix of fresher and out-of-date resources gets used. Out-of-date is not a problem, since that would essentially happen anyway with a write lock held for the whole duration of the update and handler calls, since new updates wouldn't be processed yet. Fresher is probably okay too, since out-of-date objects that got removed shouldn't be referenced by fresher objects. So I think it's okay for the purposes of getAppRecursive at least. I could be content with using the read lock the way it's used.

andrii-korotkov-verkada commented 4 months ago

Actually, never mind, the namespace resources view would always be the same or fresher, since removal of objects happens on the same referenced map. So acquiring a read lock before calling the resource-updated handlers seems good.
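
A rough sketch of this variant, again with hypothetical types rather than the actual gitops-engine code: the write lock covers only the namespace index update, and a read lock is re-acquired while the handlers run, so they see a same-or-fresher view of the namespace resources.

```go
// A rough sketch of the variant discussed above, with hypothetical types
// (not the actual gitops-engine code): the write lock covers only the
// namespace index update, and a read lock is re-acquired while the
// handlers run, so they see a same-or-fresher view of the namespace
// resources.
package cache

import "sync"

type nsResource struct {
	Key       string
	Namespace string
}

type nsHandler func(newRes, oldRes *nsResource, namespaceResources map[string]*nsResource)

type nsCache struct {
	lock     sync.RWMutex
	nsIndex  map[string]map[string]*nsResource // namespace -> key -> resource
	handlers []nsHandler
}

func (c *nsCache) onResourceUpdated(newRes *nsResource) {
	// Mutate the index under the write lock only.
	c.lock.Lock()
	ns := c.nsIndex[newRes.Namespace]
	if ns == nil {
		ns = map[string]*nsResource{}
		c.nsIndex[newRes.Namespace] = ns
	}
	oldRes := ns[newRes.Key]
	ns[newRes.Key] = newRes
	c.lock.Unlock()

	// Another writer may slip in between Unlock and RLock, but per the
	// reasoning above that only makes the handlers' view fresher.
	c.lock.RLock()
	defer c.lock.RUnlock()
	for _, h := range c.handlers {
		h(newRes, oldRes, c.nsIndex[newRes.Namespace])
	}
}
```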

andrii-korotkov-verkada commented 4 months ago

Testing with ArgoCD https://github.com/argoproj/argo-cd/pull/18972

andrii-korotkov-verkada commented 4 months ago

Seems like this works in conjunction with https://github.com/argoproj/argo-cd/pull/18694. I got an almost instant app refresh request caused by a config map update.

[Screenshots: 2024-07-07 at 4:47:23 PM and 4:47:30 PM]

andrii-korotkov-verkada commented 4 months ago

Don't merge this yet; I'm investigating a potential bug where some updates may not be picked up by a refresh, and only a hard refresh works sometimes.

andrii-korotkov-verkada commented 4 months ago

I've checked the code but can't find where the mentioned issue may come from. Namespace resources are only used to determine the app, and only for resource updates coming through watch, not for the app refresh itself.

andrii-korotkov-verkada commented 4 months ago

The issue has apparently been incorrect usage of the argocd.argoproj.io/manifest-generate-paths annotation.

andrii-korotkov-verkada commented 4 months ago

In addition, it seems like app of apps in particular relies on managed apps having an explicit app.kubernetes.io/instance annotation; maybe regular apps rely on this too.

andrii-korotkov-verkada commented 4 months ago

Also, it sounds like you can't ignore the last-applied-configuration annotation in the app of apps, since it's used to compute the diff and is used after normalization.

andrii-korotkov-verkada commented 4 months ago

Updated the PR

andrii-korotkov-verkada commented 4 months ago

Here are some perf views of the system collected following https://github.com/argoproj/argo-cd/issues/13534#issuecomment-1562184630.

The build is from master on 2024/07/07 including https://github.com/argoproj/argo-cd/pull/18972, https://github.com/argoproj/argo-cd/pull/18694, https://github.com/argoproj/gitops-engine/pull/601, https://github.com/argoproj/gitops-engine/pull/603.

[Screenshots: perf views captured 2024-07-09]

crenshaw-dev commented 4 months ago

@andrii-korotkov-verkada I won't have time soon to review/test this. We can try to get help from other maintainers.

andrii-korotkov-verkada commented 4 months ago

I've added a benchmark with some goroutines processing updates. Seems like the new locking doesn't benefit us and even shows some regression with fewer updates or cheaper updates. I wonder if the wins I saw came from holding the lock for less time while iterating the hierarchy more efficiently.

New BenchmarkProcessEvents-10              1    3399116791 ns/op    128337808 B/op   1931210 allocs/op
Old BenchmarkProcessEvents-10              1    3241700500 ns/op    128167704 B/op   1930942 allocs/op
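
For context, a rough sketch of the shape of such a throughput benchmark, using a hypothetical minimal cache rather than the benchmark added in this PR: several goroutines feed updates concurrently, so the cost of the locking strategy dominates the measurement.

```go
// A rough sketch of a concurrent-updates benchmark (not the benchmark
// added in this PR): a minimal mutex-protected cache stands in for
// clusterCache, and several goroutines feed updates concurrently.
package cache

import (
	"fmt"
	"sync"
	"testing"
)

type benchResource struct {
	key  string
	data int
}

type benchCache struct {
	lock      sync.RWMutex
	resources map[string]*benchResource
}

func (c *benchCache) onResourceUpdated(r *benchResource) {
	c.lock.Lock()
	defer c.lock.Unlock()
	c.resources[r.key] = r
}

func BenchmarkProcessEvents(b *testing.B) {
	const workers = 8
	const updatesPerWorker = 1000

	c := &benchCache{resources: map[string]*benchResource{}}

	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		var wg sync.WaitGroup
		for w := 0; w < workers; w++ {
			wg.Add(1)
			go func(w int) {
				defer wg.Done()
				for j := 0; j < updatesPerWorker; j++ {
					c.onResourceUpdated(&benchResource{
						key:  fmt.Sprintf("res-%d-%d", w, j),
						data: j,
					})
				}
			}(w)
		}
		wg.Wait()
	}
}
```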

andrii-korotkov-verkada commented 4 months ago

One thing it doesn't measure is the worst-case latency of waiting for some updates to be processed; it only shows throughput.

andrii-korotkov-verkada commented 4 months ago

Testing without the locking PR shows nearly no waiting time for the config map, at least in staging. I might just close the PR, since I haven't really found a benchmark configuration where acquiring a read lock after the write lock beats holding the write lock for the whole duration, but there are configurations that show a consistent regression.