couchbaselabs / Couchbase.Extensions

Extensions for Couchbase for .Net Core
Apache License 2.0
27 stars 14 forks source link

Couchbase.Extensions.Locks creating the same mutex twice #76

Closed RichCerone closed 4 years ago

RichCerone commented 4 years ago

Hi, I have been using the Couchbase.Extensions.Lock library for a while now and quite recently I have had a case where two processes were able to enter the critical section at the same time. My question is how resilient is the Couchbase.Extensions.Locks library? I have cases where I have over 200 processes trying to create a single mutex within Couchbase. In the past, when we had less processes running, we did not see this issue. I cannot share my code unfortunately, but I can share two logs which is where I saw the mutexes created/released:

mutex1: { "logType": "RepositoryService", "logLevel": "Info", "docDescription": "Mutex cluster-upgrade-mutex created.", "logEntryTimeStamp": "2020-06-10 12:45:43.749", "docCreated": "2020-06-10T12:45:43.7532289Z", "docUpdated": "2020-06-10T12:45:43.7532289Z", "docType": "logentry" } mutext2: { "logType": "RepositoryService", "logLevel": "Info", "docDescription": "Mutex cluster-upgrade-mutex created.", "logEntryTimeStamp": "2020-06-10 12:45:44.752", "docCreated": "2020-06-10T12:45:44.7683052Z", "docUpdated": "2020-06-10T12:45:44.7683052Z", "docType": "logentry" }

Notice that the two mutexes were created very close together. I also have when they ended:

mutex1: { "logType": "RepositoryService", "logLevel": "Info", "docDescription": "Releasing mutex cluster-upgrade-mutex.", "logEntryTimeStamp": "2020-06-10 12:45:45.763", "docCreated": "2020-06-10T12:45:45.7783962Z", "docUpdated": "2020-06-10T12:45:45.7783962Z", "docType": "logentry" } mutex2: { "logType": "RepositoryService", "logLevel": "Info", "docDescription": "Releasing mutex cluster-upgrade-mutex.", "logEntryTimeStamp": "2020-06-10 12:45:45.786", "docCreated": "2020-06-10T12:45:45.7891277Z", "docUpdated": "2020-06-10T12:45:45.7891277Z", "docType": "logentry" }

Notice that mutex2 was created before mutex1 was released.

My question is can this library handle hundreds of requests at once and can it handle it at the millisecond level? We have plans to extend the infrastructure that calls this mutex meaning we will have more processes and requests being made at once against creating this mutex and thus using this library.

brantburnett commented 4 years ago

How certain are you of the time synchronization between these disparate machines? Is it possible that they really didn't overlap but it just appears so due to clock sync?

The effectiveness of this library is primarily based on the effectiveness of Couchbase's atomic document operations. Each lock is represented by a single document, which therefore resides on a single data node and supports atomic operations. Taking out a lock is an insert operation, which will fail if the doc already exists.

The only time I'm aware of where there is risk is during node failover or rebalance operations. In that case, it's possible that the lock gets "lost" if it wasn't streamed to the backup node yet. Of course, it's always possible there's a bug somewhere in the library or in Couchbase Server, but not that I'm aware of.

RichCerone commented 4 years ago

I talked to my database team and they said rebalances on our cluster typically would only happen once a month. A node fail over would also be something we'd be made aware of. Is there any other possibility for a mutex to be created twice? While the logs may be a result of clock sync, the way our code works is that you need to create a mutex before updating a certain document's field to a specific status and we are seeing that document being updated more than once. Typically, the document would be updated and not retrieved again because its status would be altered where it would no be queried again within the critical section.

brantburnett commented 4 years ago

@ingenthr @jeffrymorris Are you guys aware of any issues with Insert or Upsert with CAS that may cause inconsistent behavior around these lock documents?

Code in question: https://github.com/couchbaselabs/Couchbase.Extensions/blob/master/src/Couchbase.Extensions.Locks/Internal/CouchbaseMutex.cs#L49

jeffrymorris commented 4 years ago

@brantburnett - not that I am aware of.

RichCerone commented 4 years ago

I am going to close this issue for now. We may have the found the problem on our end where an application is manipulating a document value inappropriately causing 2 processes to pick up the same document twice in 2 different mutexes.