apache / accumulo

Apache Accumulo
https://accumulo.apache.org
Apache License 2.0
1.06k stars 445 forks source link

Lock acquisition order may cause deadlock #3759

Open keith-turner opened 1 year ago

keith-turner commented 1 year ago

When closing a tablet two different locks are acquired in the following order

  1. Tablet.completeClose acquires the tablet lock
  2. During complete close a minor compaction is started which acquires the lock lock here.
  3. Then the tablet lock is acquired here again.

Tablet close acquire the tablet lock and then the log lock. Other operations like a write to the tablet acquire the locks in reverse order (log lock 1st and tablet lock 2nd)which has the potential to lead to dead lock. However it may be that state set during closing the tablet prevents these operation from running, so deadlock can not happen. This needs further investigation.

This issue was uncovered by changes #3747 which caused test to break as reported in #3757.

keith-turner commented 1 year ago

See this comment in slack for a stack trace where changes in #3747 are causing test to fail.

dlmarion commented 1 month ago

Re-opening due to possible issue found in #4751 . See https://github.com/apache/accumulo/pull/4751#issuecomment-2250181639.

keith-turner commented 1 month ago

Re-opening due to possible issue found in https://github.com/apache/accumulo/pull/4751 . See https://github.com/apache/accumulo/pull/4751#issuecomment-2250181639.

I plan to revert the commit related to #4751 and move this issue to 4.0.0 milestone for now because I can not see how to fix this w/o a major refactoring of the tablet close code (the comments on #4751 provide some insight into one refactoring that is needed, but that may be the tip of an ice berg). A major refactoring of the code could introduce new bugs and this bug probably has a low chance of happening, so it does not seem worth the risk to me in 2.1.X at the moment

The reason I am thinking moving this to 4.0.0 is because a lot of code was removed from the tablet server in 4.0.0 which makes a major refactoring of the Tablet code easier to do. The issue #3670 is already open for 4.0.0 and this bug could possibly be fixed as part of that cleanup or it could be done after that cleanup and would be much easier to do after a cleanup like #3670.

If we figure out a way to fix this in 2.1.X w/o rewriting large parts of the Tablet synchronization code, then this issue could be moved back to a 2.1.X milestone.

This bug involves two different kinds of locks, one is a Java lock object and the other is Java object monitor lock. I was curious if jstack could detect deadlocks between these different kinds of locks. Did a bit of searching and did not find anything definitive. So wrote the following test program and for this jstack did detect the deadlock. So its good to know if it does happen that jstack can detect it (at least w/ jdk 17).

public class TestLocks {
    public static void main(String[] args) throws Exception {
        ExecutorService executorService = Executors.newFixedThreadPool(2);

        var logLock = new ReentrantLock();
        var tablet = new HashMap<String,String>();

        var future1 = executorService.submit(()-> {
            // This simulates how Tablet close may acquire locks.
            synchronized (tablet) {
                Thread.sleep(5000);
                logLock.lock();
                System.out.println("1");
            }
            return null;
        });

        var future2 = executorService.submit(()-> {
            //This simulates how other code in Tablet outside of close will acquire locks
            logLock.lock();
            Thread.sleep(5000);
            synchronized (tablet){
                System.out.println("2");
            }
            return null;
        });

        future1.get();
        future2.get();
    }
}

The following is the output of running jstack against the above program using OpenJDK17, so at least that version will detect deadlock between different lock types.

Found one Java-level deadlock:
=============================
"pool-1-thread-1":
  waiting for ownable synchronizer 0x000000062ae3eed0, (a java.util.concurrent.locks.ReentrantLock$NonfairSync),
  which is held by "pool-1-thread-2"

"pool-1-thread-2":
  waiting to lock monitor 0x00007562c4000fe0 (object 0x000000062ae3efd8, a java.util.HashMap),
  which is held by "pool-1-thread-1"

Java stack information for the threads listed above:
===================================================
"pool-1-thread-1":
    at jdk.internal.misc.Unsafe.park(java.base@17.0.11/Native Method)
    - parking to wait for  <0x000000062ae3eed0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(java.base@17.0.11/LockSupport.java:211)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.11/AbstractQueuedSynchronizer.java:715)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.11/AbstractQueuedSynchronizer.java:938)
    at java.util.concurrent.locks.ReentrantLock$Sync.lock(java.base@17.0.11/ReentrantLock.java:153)
    at java.util.concurrent.locks.ReentrantLock.lock(java.base@17.0.11/ReentrantLock.java:322)
    at org.apache.accumulo.server.TestLocks.lambda$main$0(TestLocks.java:19)
    - locked <0x000000062ae3efd8> (a java.util.HashMap)
    at org.apache.accumulo.server.TestLocks$$Lambda$14/0x0000756324001208.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(java.base@17.0.11/FutureTask.java:264)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.11/ThreadPoolExecutor.java:1136)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.11/ThreadPoolExecutor.java:635)
    at java.lang.Thread.run(java.base@17.0.11/Thread.java:840)
"pool-1-thread-2":
    at org.apache.accumulo.server.TestLocks.lambda$main$1(TestLocks.java:31)
    - waiting to lock <0x000000062ae3efd8> (a java.util.HashMap)
    at org.apache.accumulo.server.TestLocks$$Lambda$16/0x0000756324001430.call(Unknown Source)
    at java.util.concurrent.FutureTask.run(java.base@17.0.11/FutureTask.java:264)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.11/ThreadPoolExecutor.java:1136)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.11/ThreadPoolExecutor.java:635)
    at java.lang.Thread.run(java.base@17.0.11/Thread.java:840)

Found 1 deadlock.
keith-turner commented 1 month ago

Reverted the changes from #4751 in 7efee59b45ccfd5010a45e30b148f6031b368417 and merged those changes up.