argoproj / gitops-engine

Democratizing GitOps
https://pkg.go.dev/github.com/argoproj/gitops-engine?tab=subdirectories
Apache License 2.0

fix: deadlock on start missing watches #604

Closed agaudreault closed 1 month ago

agaudreault commented 1 month ago

Caused by https://github.com/argoproj/gitops-engine/pull/532

Fixes https://github.com/argoproj/argo-cd/issues/18467
Fixes https://github.com/argoproj/argo-cd/issues/18902
Fixes https://github.com/argoproj/argo-cd/issues/16950

jdroot commented 1 month ago

Your initial comment has the same issue linked twice.

This looks good to me and I believe it would fix the issue. One concern I still have about startMissingWatches, though, is that it takes a lock and then makes network calls to Kubernetes. That is typically not a great pattern, but I don't understand the code well enough to feel comfortable re-architecting the lock usage. I was curious whether you have any thoughts on that, since I think you know the code much better.

agaudreault commented 1 month ago

> This looks good to me and I believe it would fix the issue. One concern I still have about startMissingWatches, though, is that it takes a lock and then makes network calls to Kubernetes. That is typically not a great pattern, but I don't understand the code well enough to feel comfortable re-architecting the lock usage. I was curious whether you have any thoughts on that, since I think you know the code much better.

I took a quick look at it and expected it to be similar to the sync() method, which also holds a lock. I also don't know enough about the code to decide whether it is necessary. This code has been there for quite some time. Maybe @alexmt has a few pointers to give you if you decide to take a stab at it.
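
For reference, the usual remedy for that pattern is to split the critical section: read the needed state under the lock, release it for the slow network I/O, then re-acquire it to apply the result. Below is a minimal Go sketch of that shape, using hypothetical names (cache, watched, listKind) rather than the actual gitops-engine types:

```go
// Hypothetical sketch, not the gitops-engine API: avoid holding a mutex
// across network calls by snapshotting state under the lock, doing the
// slow call unlocked, then re-locking to record the outcome.
package main

import "sync"

type cache struct {
	mu      sync.Mutex
	watched map[string]bool // kind -> already watched?
}

func (c *cache) startMissingWatchesSketch(allKinds []string, listKind func(string) error) error {
	// Phase 1: decide what is missing while holding the lock.
	c.mu.Lock()
	var missing []string
	for _, k := range allKinds {
		if !c.watched[k] {
			missing = append(missing, k)
		}
	}
	c.mu.Unlock()

	// Phase 2: slow network calls with the lock released, so other
	// goroutines can still read and write the cache.
	for _, k := range missing {
		if err := listKind(k); err != nil {
			return err
		}
	}

	// Phase 3: re-acquire the lock to record the new watches. Another
	// goroutine may have raced us in between; real code must re-check.
	c.mu.Lock()
	for _, k := range missing {
		c.watched[k] = true
	}
	c.mu.Unlock()
	return nil
}

func main() {
	c := &cache{watched: map[string]bool{}}
	_ = c.startMissingWatchesSketch([]string{"Foo", "Bar"}, func(string) error { return nil })
}
```

The trade-off is that phase 3 must tolerate concurrent changes made while the lock was released, which is likely why the current code keeps the whole operation under one lock.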

crenshaw-dev commented 1 month ago

Okay, so putting it into plain English.

The cluster cache is being rebuilt. A bunch of goroutines have been kicked off to pull initial resource lists. Each of those acquires a slot on the list semaphore (max 50). So those 50 routines are churning along, listing resources.
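
(For context, a minimal sketch of that bounded-listing pattern using golang.org/x/sync/semaphore; the limit of 50 matches the description above, but listWithLimit and the fake list body are illustrative, not the actual gitops-engine code.)

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/semaphore"
)

var listSem = semaphore.NewWeighted(50) // at most 50 concurrent LIST calls

func listWithLimit(ctx context.Context, kind string) error {
	if err := listSem.Acquire(ctx, 1); err != nil {
		return err // context cancelled while waiting for a free slot
	}
	defer listSem.Release(1)
	fmt.Println("listing", kind) // stand-in for the real Kubernetes LIST request
	return nil
}

func main() {
	_ = listWithLimit(context.Background(), "pods")
}
```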

Lo and behold! A CRD update event arrives. We take out the cluster lock and run startMissingWatches to make sure the new CRD's resources are being watched appropriately.

As part of startMissingWatches, we loop through any currently-unwatched APIs and load their initial state.

Loading the initial state involves listing resources. Listing resources requires acquiring the list semaphore. So we're stuck waiting for those other 50 lists to finish.

The other 50 lists finish, and they're ready to write their resource state to the cache. But the cache lock has been taken out to handle startMissingWatches.

So the initial cache loading is blocked waiting on the CRD event to be handled, and the CRD event handler is stuck waiting for the initial cache loading to free up the list semaphore. Deadlock.
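
The whole cycle fits in a small, deterministic toy reproduction. This is a sketch of the lock-ordering pattern, not the actual cache code: cacheLock stands in for the cluster cache lock, and the semaphore capacity is 1 instead of 50 so a single goroutine is enough to close the loop. Run it and the Go runtime reports "fatal error: all goroutines are asleep - deadlock!".

```go
package main

import (
	"context"
	"sync"

	"golang.org/x/sync/semaphore"
)

func main() {
	var cacheLock sync.Mutex
	listSem := semaphore.NewWeighted(1) // capacity 1 instead of 50, so one goroutine suffices

	// Event-handler side: startMissingWatches takes the cluster cache lock first.
	cacheLock.Lock()

	semHeld := make(chan struct{})
	go func() {
		// Initial-load side: hold a semaphore slot while listing resources...
		_ = listSem.Acquire(context.Background(), 1)
		close(semHeld)
		// ...then block trying to write the results under the cache lock,
		// which the event handler is still holding.
		cacheLock.Lock()
		listSem.Release(1)
		cacheLock.Unlock()
	}()

	<-semHeld
	// Event-handler side again: with the cache lock still held, listing the
	// new CRD's resources needs a semaphore slot. The cycle is closed.
	_ = listSem.Acquire(context.Background(), 1)
}
```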

crenshaw-dev commented 1 month ago

@agaudreault I think we should include this in 2.12, but probably not cherry-pick it back further. The memory regression is a bigger concern than deadlocks imo for already-stable releases.