argoproj / gitops-engine

Democratizing GitOps
https://pkg.go.dev/github.com/argoproj/gitops-engine?tab=subdirectories
Apache License 2.0

fix: deadlock on start missing watches #604

Closed agaudreault closed 1 month ago

agaudreault commented 1 month ago

Caused by https://github.com/argoproj/gitops-engine/pull/532

Fixes https://github.com/argoproj/argo-cd/issues/18467
Fixes https://github.com/argoproj/argo-cd/issues/18902
Fixes https://github.com/argoproj/argo-cd/issues/16950

jdroot commented 1 month ago

Your initial comment has the same issue linked twice.

This looks good to me and I believe it would fix the issue. One concern I still have about startMissingWatches, though, is that it takes a lock and then makes network calls to Kubernetes. That is typically not a great pattern, but I don't understand the code well enough to feel comfortable re-architecting the lock usage. I was curious whether you have any thoughts on that, since I think you know the code much better.

agaudreault commented 1 month ago

> This looks good to me and I believe it would fix the issue. One concern I still have about startMissingWatches, though, is that it takes a lock and then makes network calls to Kubernetes. That is typically not a great pattern, but I don't understand the code well enough to feel comfortable re-architecting the lock usage. I was curious whether you have any thoughts on that, since I think you know the code much better.

I took a quick look at it and expected it to be similar to the sync() method, which also holds a lock. I also don't know enough about the code to decide whether it is necessary. This code has been there for quite some time. Maybe @alexmt has a few pointers to give you if you decide to take a stab at it.
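
For reference, the usual remedy for that pattern is to split the critical section: read the needed state under the lock, release it for the slow network I/O, then re-acquire it to apply the result. Below is a minimal Go sketch of that shape, using hypothetical names (cache, watched, listKind) rather than the actual gitops-engine types:

```go
// Hypothetical sketch, not the gitops-engine API: avoid holding a mutex
// across network calls by snapshotting state under the lock, doing the
// slow call unlocked, then re-locking to record the outcome.
package main

import "sync"

type cache struct {
	mu      sync.Mutex
	watched map[string]bool // kind -> already watched?
}

func (c *cache) startMissingWatchesSketch(allKinds []string, listKind func(string) error) error {
	// Phase 1: decide what is missing while holding the lock.
	c.mu.Lock()
	var missing []string
	for _, k := range allKinds {
		if !c.watched[k] {
			missing = append(missing, k)
		}
	}
	c.mu.Unlock()

	// Phase 2: slow network calls with the lock released, so other
	// goroutines can still read and write the cache.
	for _, k := range missing {
		if err := listKind(k); err != nil {
			return err
		}
	}

	// Phase 3: re-acquire the lock to record the new watches. Another
	// goroutine may have raced us in between; real code must re-check.
	c.mu.Lock()
	for _, k := range missing {
		c.watched[k] = true
	}
	c.mu.Unlock()
	return nil
}

func main() {
	c := &cache{watched: map[string]bool{}}
	_ = c.startMissingWatchesSketch([]string{"Foo", "Bar"}, func(string) error { return nil })
}
```

The trade-off is that phase 3 must tolerate concurrent changes made while the lock was released, which is likely why the current code keeps the whole operation under one lock.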

crenshaw-dev commented 1 month ago

Okay, so putting it into plain English.

The cluster cache is being rebuilt. A bunch of goroutines have been kicked off to pull initial resource lists. Each of those acquires a slot on the list semaphore (max 50). So those 50 routines are churning along, listing resources.
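
(For context, a minimal sketch of that bounded-listing pattern using golang.org/x/sync/semaphore; the limit of 50 matches the description above, but listWithLimit and the fake list body are illustrative, not the actual gitops-engine code.)

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/semaphore"
)

var listSem = semaphore.NewWeighted(50) // at most 50 concurrent LIST calls

func listWithLimit(ctx context.Context, kind string) error {
	if err := listSem.Acquire(ctx, 1); err != nil {
		return err // context cancelled while waiting for a free slot
	}
	defer listSem.Release(1)
	fmt.Println("listing", kind) // stand-in for the real Kubernetes LIST request
	return nil
}

func main() {
	_ = listWithLimit(context.Background(), "pods")
}
```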

Lo and behold! A CRD update event arrives. We take out the cluster lock and run startMissingWatches to make sure the new CRD's resources are being watched appropriately.

As part of startMissingWatches, we loop through any currently-unwatched APIs and load their initial state.

Loading the initial state involves listing resources. Listing resources requires acquiring the list semaphore. So we're stuck waiting for those other 50 lists to finish.

The other 50 lists finish, and they're ready to write their resource state to the cache. But the cache lock has been taken out to handle startMissingWatches.

So the initial cache loading is blocked waiting on the CRD event to be handled, and the CRD event handler is stuck waiting for the initial cache loading to free up the list semaphore. Deadlock.
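
The whole cycle fits in a small, deterministic toy reproduction. This is a sketch of the lock-ordering pattern, not the actual cache code: cacheLock stands in for the cluster cache lock, and the semaphore capacity is 1 instead of 50 so a single goroutine is enough to close the loop. Run it and the Go runtime reports "fatal error: all goroutines are asleep - deadlock!".

```go
package main

import (
	"context"
	"sync"

	"golang.org/x/sync/semaphore"
)

func main() {
	var cacheLock sync.Mutex
	listSem := semaphore.NewWeighted(1) // capacity 1 instead of 50, so one goroutine suffices

	// Event-handler side: startMissingWatches takes the cluster cache lock first.
	cacheLock.Lock()

	semHeld := make(chan struct{})
	go func() {
		// Initial-load side: hold a semaphore slot while listing resources...
		_ = listSem.Acquire(context.Background(), 1)
		close(semHeld)
		// ...then block trying to write the results under the cache lock,
		// which the event handler is still holding.
		cacheLock.Lock()
		listSem.Release(1)
		cacheLock.Unlock()
	}()

	<-semHeld
	// Event-handler side again: with the cache lock still held, listing the
	// new CRD's resources needs a semaphore slot. The cycle is closed.
	_ = listSem.Acquire(context.Background(), 1)
}
```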

crenshaw-dev commented 1 month ago

@agaudreault I think we should include this in 2.12, but probably not cherry-pick it back further. The memory regression is a bigger concern than deadlocks imo for already-stable releases.