andrii-korotkov-verkada closed this pull request 4 months ago.
Not 100% sure it's safe, since the resources and namespace index can be updated by other threads while handlers are running. Maybe I'll add additional read locking.
Seems like watching of events is done for each API resource type separately. The only place where the returned namespace resources seem to be used in practice in argo-cd is getAppRecursive. But the returned list shouldn't change even if some resources are being deleted; at most some fields would get modified.
Even with a read lock it's theoretically possible there's some race between releasing the write lock and acquiring the read lock. Not sure how much of a problem that would be though; thinking this through.
At least there shouldn't be a race for the same key, so the worst case is that while namespace resources are being updated, handlers see a mix of fresher and out-of-date resources. Out-of-date is not a problem, since that would essentially happen anyway with a write lock held for the whole duration of the update and handler calls, because new updates wouldn't be processed yet. Fresher is probably okay too, since out-of-date objects that got removed shouldn't be referenced by fresher objects. So I think it's okay for the purposes of getAppRecursive at least. I could be content with using the read lock the way it's used now.
Actually, never mind: the namespace resources view would always be the same or fresher, since removal of objects happens on the same referenced map. So yeah, acquiring a read lock before calling the resource-updated handlers seems good.
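To make that concrete, here is a minimal sketch of the locking pattern I have in mind; the type names (clusterCache, resource, onUpdatedHandler, onResourceUpdated) and the simplified fields are assumptions for illustration, not the actual gitops-engine code:

```go
package cache

import "sync"

// resourceKey and resource are simplified stand-ins for the real cache types.
type resourceKey struct {
	Namespace, Kind, Name string
}

type resource struct{}

// onUpdatedHandler receives the new and old versions of a resource plus the
// current view of the resources in the same namespace.
type onUpdatedHandler func(newRes, oldRes *resource, namespaceResources map[resourceKey]*resource)

type clusterCache struct {
	lock           sync.RWMutex
	resources      map[resourceKey]*resource
	nsIndex        map[string]map[resourceKey]*resource
	updateHandlers []onUpdatedHandler
}

func (c *clusterCache) onResourceUpdated(key resourceKey, newRes *resource) {
	// Hold the write lock only for the duration of the map update itself.
	c.lock.Lock()
	oldRes := c.resources[key]
	if newRes != nil {
		c.resources[key] = newRes
		if c.nsIndex[key.Namespace] == nil {
			c.nsIndex[key.Namespace] = map[resourceKey]*resource{}
		}
		c.nsIndex[key.Namespace][key] = newRes
	} else {
		delete(c.resources, key)
		delete(c.nsIndex[key.Namespace], key)
	}
	c.lock.Unlock()

	// Re-acquire only a read lock before calling the update handlers; by the
	// reasoning above, the namespace view seen here is the same or fresher.
	c.lock.RLock()
	defer c.lock.RUnlock()
	for _, h := range c.updateHandlers {
		h(newRes, oldRes, c.nsIndex[key.Namespace])
	}
}
```

The point of this shape is that the write lock is held only for the map mutation itself, while handlers run under a read lock, so other readers and other handler invocations aren't blocked behind a long-running handler.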
Testing with ArgoCD https://github.com/argoproj/argo-cd/pull/18972
Seems like this works in conjunction with https://github.com/argoproj/argo-cd/pull/18694. I got an almost instant app refresh request caused by a config map update.
Don't merge this yet, I'm investigating a potential bug where some updates may not be picked up by a refresh and sometimes only a hard refresh works.
I've checked the code but can't find where the mentioned issue could come from. Namespace resources are only used to determine the app, and only for resource updates coming through the watch, not for the app refresh itself.
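For context, "determining the app" from the namespace resources looks roughly like the following; this is an illustrative stand-in with made-up types and a single owner reference per resource, not the actual getAppRecursive code:

```go
package main

import "fmt"

type resourceKey struct{ Namespace, Kind, Name string }

// resource is a toy stand-in for a cached cluster resource.
type resource struct {
	key      resourceKey
	appName  string       // e.g. derived from the app.kubernetes.io/instance annotation
	ownerKey *resourceKey // single owner for simplicity; real resources can have several
}

// getAppName walks owner references through the namespace resources map until it
// finds a resource that carries an application name (with a cycle guard).
func getAppName(res *resource, nsResources map[resourceKey]*resource) string {
	visited := map[resourceKey]bool{}
	for cur := res; cur != nil && !visited[cur.key]; cur = ownerOf(cur, nsResources) {
		visited[cur.key] = true
		if cur.appName != "" {
			return cur.appName
		}
	}
	return ""
}

func ownerOf(res *resource, nsResources map[resourceKey]*resource) *resource {
	if res.ownerKey == nil {
		return nil
	}
	return nsResources[*res.ownerKey]
}

func main() {
	deployKey := resourceKey{Namespace: "default", Kind: "Deployment", Name: "guestbook"}
	rsKey := resourceKey{Namespace: "default", Kind: "ReplicaSet", Name: "guestbook-abc"}
	ns := map[resourceKey]*resource{
		deployKey: {key: deployKey, appName: "guestbook"},
		rsKey:     {key: rsKey, ownerKey: &deployKey},
	}
	// A ReplicaSet update resolves to the "guestbook" app via its Deployment owner.
	fmt.Println(getAppName(ns[rsKey], ns))
}
```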
The issue was apparently incorrect usage of argocd.argoproj.io/manifest-generate-paths=.
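For illustration, the kind of check this annotation drives could be sketched as below; the helper names are hypothetical and this isn't the actual argo-cd implementation, but the idea is that an app only refreshes on a commit when a changed file falls under one of the annotated paths (paths starting with "." being relative to the app's spec.source.path, multiple paths separated by ";"):

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// refreshPaths expands the annotation value into repo-relative path prefixes.
func refreshPaths(annotation, appSourcePath string) []string {
	var out []string
	for _, p := range strings.Split(annotation, ";") {
		p = strings.TrimSpace(p)
		if p == "" {
			continue
		}
		if strings.HasPrefix(p, ".") {
			p = path.Join(appSourcePath, p)
		}
		out = append(out, path.Clean(p))
	}
	return out
}

// changeTouchesApp reports whether any changed file falls under one of the
// annotated paths, i.e. whether the commit should trigger a refresh of this app.
func changeTouchesApp(changedFiles []string, annotation, appSourcePath string) bool {
	prefixes := refreshPaths(annotation, appSourcePath)
	for _, f := range changedFiles {
		f = path.Clean(f)
		for _, p := range prefixes {
			if f == p || strings.HasPrefix(f, p+"/") {
				return true
			}
		}
	}
	return false
}

func main() {
	// With the annotation set to ".", only changes under the app's own source path count.
	fmt.Println(changeTouchesApp([]string{"apps/guestbook/deploy.yaml"}, ".", "apps/guestbook")) // true
	fmt.Println(changeTouchesApp([]string{"apps/other/deploy.yaml"}, ".", "apps/guestbook"))     // false
}
```

If the annotation points at the wrong path, ordinary refreshes can skip commits that actually affect the app, which would match the symptom above.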
In addition, it seems like app of apps in particular relies on managed apps having an explicit app.kubernetes.io/instance annotation; maybe regular apps do this too.
Also, it sounds like you can't ignore the last-applied-configuration annotation in the app of apps, since it's used to compute the diff and is used after normalization.
Updated the PR
Here are some perf views of the system collected following https://github.com/argoproj/argo-cd/issues/13534#issuecomment-1562184630.
The build is from master on 2024/07/07 including https://github.com/argoproj/argo-cd/pull/18972, https://github.com/argoproj/argo-cd/pull/18694, https://github.com/argoproj/gitops-engine/pull/601, https://github.com/argoproj/gitops-engine/pull/603.
@andrii-korotkov-verkada I won't have time soon to review/test this. We can try to get help from other maintainers.
I've added a benchmark with some goroutines processing updates. Seems like the new locking doesn't benefit us and even shows some regression when using fewer updates or cheaper updates. I wonder if the wins I saw were from holding the lock for less time while iterating the hierarchy more efficiently.
New BenchmarkProcessEvents-10 1 3399116791 ns/op 128337808 B/op 1931210 allocs/op
Old BenchmarkProcessEvents-10 1 3241700500 ns/op 128167704 B/op 1930942 allocs/op
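For reference, the benchmark has roughly the following shape; this is a made-up sketch reusing the clusterCache types from the sketch earlier in the thread, not the exact benchmark added to the PR, and the workload parameters (10 workers, 10000 updates each) are invented:

```go
package cache

import (
	"fmt"
	"sync"
	"testing"
)

func BenchmarkProcessEvents(b *testing.B) {
	const workers = 10
	const updatesPerWorker = 10000

	for i := 0; i < b.N; i++ {
		c := &clusterCache{
			resources: map[resourceKey]*resource{},
			nsIndex:   map[string]map[resourceKey]*resource{},
			updateHandlers: []onUpdatedHandler{
				func(newRes, oldRes *resource, ns map[resourceKey]*resource) {
					// Stand-in for real handler work, e.g. resolving the owning app.
					_ = len(ns)
				},
			},
		}
		var wg sync.WaitGroup
		for w := 0; w < workers; w++ {
			wg.Add(1)
			go func(w int) {
				defer wg.Done()
				for u := 0; u < updatesPerWorker; u++ {
					key := resourceKey{Namespace: "default", Kind: "ConfigMap", Name: fmt.Sprintf("cm-%d-%d", w, u)}
					c.onResourceUpdated(key, &resource{})
				}
			}(w)
		}
		wg.Wait()
	}
}
```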
One thing it doesn't measure is the worst-case latency of waiting for an update to be processed; it only shows throughput.
Testing without the locking PR shows nearly no waiting time, at least for the config map in staging. I might just close the PR, since I haven't really found a benchmark configuration where acquiring a read lock after the write lock is better than holding the write lock for the whole duration, but there are configurations that show a consistent regression.
Closes #602. Use a read lock for getting handlers. Refactor updating resources in the cluster cache to only acquire the lock for the update itself, not for calling handlers.