cockroachdb / cockroach

CockroachDB - the open source, cloud-native distributed SQL database.
https://www.cockroachlabs.com
Other
29.52k stars 3.7k forks source link

sql: likely incorrect locking pattern in ConcurrentBufferGuard (contention mgt), possibly leading to rare deadlocks #83080

Open knz opened 2 years ago

knz commented 2 years ago

Describe the problem

In sql/contention/contentionutils/concurrent_buffer_guard.go we see the following pattern:

func (c *ConcurrentBufferGuard) syncRLocked() {
  // We upgrade the read-lock to a write-lock, then when we are done flushing,
  // the lock is downgraded to a read-lock.
  c.flushSyncLock.RUnlock()
  defer c.flushSyncLock.RLock()
  c.flushSyncLock.Lock()
  defer c.flushSyncLock.Unlock()
  c.syncLocked()
}

The call to RUnlock() is followed by Lock(). In between the two the mutex is not held. This syncRLocked() function is called from AtomicWrite, which (as the name implies) is meant to do its operation atomically. Because of the pattern in syncRLocked(), the atomicity is lost.

The result is that contention information may be lost, or the execution could deadlock.

Expected behavior

The logic in ConcurrentBufferGuard should be rewritten to avoid this loss of atomicity, possibly using a channel instead.

Jira issue: CRDB-16850

ajwerner commented 1 year ago

I read the code and I think think it's more or less sound. I don't think it can deadlock. All of the RLock callers will either terminate or unlock promptly on a Wait call. The atomicity here is not important. I don't think there's a bug here.

knz commented 1 year ago

The atomicity here is not important.

Can you say more? Why? If it's not important, why do we care about acquiring a read lock at all?

ajwerner commented 1 year ago

We're playing this fun game here where the writes are being performed concurrently by using an RLock and an atomic add to pick an index. This works because that atomic add is monotonic. The flush operation needs to be operated under full mutual exclusion, and the Lock gives us that because the other writers will use RLock. As long as a thread has the Lock it knows that it is safe to flush. The race in question is whether two goroutines might Flush and by the time the first goroutine flushes, the second one doesn't need to. If we had atomicity, then, I suppose, you might avoid an extra flush. This data structure has some extra synchronization by using that atomic: only the goroutine which increments the buffer size to a specific number will attempt to perform the flush. The only case where the race of two flushes can occur is if the ForceSync method is called, at which point, I don't think it matters if there's one extra flush.

knz commented 1 year ago

Ok that is interesting.

Why is this explanation not included in the source? Especially on the struct description, and the particular method at hand here.

ajwerner commented 1 year ago

@matthewtodd can I ask you to sprinkle commentary on how the ConcurrentBufferGuard works and why the re-locking is safe as the resolution to this issue?

matthewtodd commented 1 year ago

Yes, I'll add commentary then resolve this issue when done. Thanks, all.