Closed lhotari closed 1 year ago
I it seems to be a leak in the "mutex" lock / unlock in MockZooKeeper
the thread dump contains this type of stacks too:
"metadata-store-2857-1" #4674 prio=5 os_prio=0 cpu=17.87ms elapsed=2969.28s tid=0x00007f2ac238b800 nid=0x5935 waiting on condition [0x00007f2a69fe1000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.13/Native Method)
- parking to wait for <0x00000000c592e390> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(java.base@11.0.13/LockSupport.java:194)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.13/AbstractQueuedSynchronizer.java:885)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.base@11.0.13/AbstractQueuedSynchronizer.java:917)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@11.0.13/AbstractQueuedSynchronizer.java:1240)
at java.util.concurrent.locks.ReentrantLock.lock(java.base@11.0.13/ReentrantLock.java:267)
at org.apache.zookeeper.MockZooKeeper.delete(MockZooKeeper.java:825)
at org.apache.zookeeper.MockZooKeeper.multi(MockZooKeeper.java:957)
at org.apache.zookeeper.MockZooKeeper.multi(MockZooKeeper.java:936)
at org.apache.pulsar.metadata.impl.ZKMetadataStore.batchOperation(ZKMetadataStore.java:148)
at org.apache.pulsar.metadata.impl.batching.AbstractBatchedMetadataStore.flush(AbstractBatchedMetadataStore.java:111)
at org.apache.pulsar.metadata.impl.batching.AbstractBatchedMetadataStore$$Lambda$214/0x00000001003de440.run(Unknown Source)
at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.13/Executors.java:515)
at java.util.concurrent.FutureTask.runAndReset(java.base@11.0.13/FutureTask.java:305)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base@11.0.13/ScheduledThreadPoolExecutor.java:305)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.13/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.13/ThreadPoolExecutor.java:628)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(java.base@11.0.13/Thread.java:829)
Locked ownable synchronizers:
- <0x00000000c57f9248> (a java.util.concurrent.ThreadPoolExecutor$Worker)
"metadata-store-zk-session-watcher-2858-1" #4675 prio=5 os_prio=0 cpu=0.15ms elapsed=2969.28s tid=0x00007f2ac2254800 nid=0x5936 waiting on condition [0x00007f2a5eae8000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.13/Native Method)
- parking to wait for <0x00000000c592e390> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(java.base@11.0.13/LockSupport.java:194)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.13/AbstractQueuedSynchronizer.java:885)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.base@11.0.13/AbstractQueuedSynchronizer.java:917)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@11.0.13/AbstractQueuedSynchronizer.java:1240)
at java.util.concurrent.locks.ReentrantLock.lock(java.base@11.0.13/ReentrantLock.java:267)
at org.apache.zookeeper.MockZooKeeper.lambda$exists$15(MockZooKeeper.java:644)
at org.apache.zookeeper.MockZooKeeper$$Lambda$325/0x0000000100500040.run(Unknown Source)
at com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:321)
at org.apache.zookeeper.MockZooKeeper.exists(MockZooKeeper.java:643)
at org.apache.pulsar.metadata.impl.ZKSessionWatcher.checkConnectionStatus(ZKSessionWatcher.java:89)
- locked <0x00000000c592e450> (a org.apache.pulsar.metadata.impl.ZKSessionWatcher)
at org.apache.pulsar.metadata.impl.ZKSessionWatcher$$Lambda$216/0x00000001003ddc40.run(Unknown Source)
at org.apache.pulsar.common.util.Runnables$CatchingAndLoggingRunnable.run(Runnables.java:53)
at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.13/Executors.java:515)
at java.util.concurrent.FutureTask.runAndReset(java.base@11.0.13/FutureTask.java:305)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base@11.0.13/ScheduledThreadPoolExecutor.java:305)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.13/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.13/ThreadPoolExecutor.java:628)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(java.base@11.0.13/Thread.java:829)
I'll work on a fix.
This was most likely a bug in my PR #13066 . I have addressed the lock leakage problem in the PR to prevent such issues in the future.
Recent failure in https://github.com/apache/pulsar/runs/5036754890?check_suite_focus=true#step:10:35
"main" #1 prio=5 os_prio=0 cpu=216695.77ms elapsed=6825.27s tid=0x00007f6c84028000 nid=0xc19 waiting on condition [0x00007f6c891cf000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@11.0.13/Native Method)
- parking to wait for <0x00000000ec09b138> (a java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.park(java.base@11.0.13/LockSupport.java:194)
at java.util.concurrent.CompletableFuture$Signaller.block(java.base@11.0.13/CompletableFuture.java:1796)
at java.util.concurrent.ForkJoinPool.managedBlock(java.base@11.0.13/ForkJoinPool.java:3128)
at java.util.concurrent.CompletableFuture.waitingGet(java.base@11.0.13/CompletableFuture.java:1823)
at java.util.concurrent.CompletableFuture.join(java.base@11.0.13/CompletableFuture.java:2043)
at org.apache.pulsar.metadata.coordination.impl.LockManagerImpl.close(LockManagerImpl.java:163)
at org.apache.pulsar.broker.loadbalance.impl.SimpleLoadManagerImpl.stop(SimpleLoadManagerImpl.java:1447)
at org.apache.pulsar.broker.PulsarService.closeAsync(PulsarService.java:436)
at org.apache.pulsar.broker.PulsarService.close(PulsarService.java:367)
at org.apache.pulsar.client.api.ClientDeduplicationFailureTest.shutdown(ClientDeduplicationFailureTest.java:133)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(java.base@11.0.13/Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(java.base@11.0.13/NativeMethodAccessorImpl.java:62)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11.0.13/DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(java.base@11.0.13/Method.java:566)
at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:132)
at org.testng.internal.MethodInvocationHelper.invokeMethodConsideringTimeout(MethodInvocationHelper.java:61)
at org.testng.internal.ConfigInvoker.invokeConfigurationMethod(ConfigInvoker.java:366)
at org.testng.internal.ConfigInvoker.invokeConfigurations(ConfigInvoker.java:320)
at org.testng.internal.TestInvoker.runConfigMethods(TestInvoker.java:701)
at org.testng.internal.TestInvoker.runAfterGroupsConfigurations(TestInvoker.java:677)
at org.testng.internal.TestInvoker.invokeMethod(TestInvoker.java:661)
at org.testng.internal.TestInvoker.invokeTestMethod(TestInvoker.java:174)
at org.testng.internal.MethodRunner.runInSequence(MethodRunner.java:46)
at org.testng.internal.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:822)
at org.testng.internal.TestInvoker.invokeTestMethods(TestInvoker.java:147)
at org.testng.internal.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:146)
at org.testng.internal.TestMethodWorker.run(TestMethodWorker.java:128)
at org.testng.TestRunner$$Lambda$216/0x00000001003d4040.accept(Unknown Source)
at java.util.ArrayList.forEach(java.base@11.0.13/ArrayList.java:1541)
at org.testng.TestRunner.privateRun(TestRunner.java:764)
at org.testng.TestRunner.run(TestRunner.java:585)
at org.testng.SuiteRunner.runTest(SuiteRunner.java:384)
at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:378)
at org.testng.SuiteRunner.privateRun(SuiteRunner.java:337)
at org.testng.SuiteRunner.run(SuiteRunner.java:286)
at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:53)
at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:96)
at org.testng.TestNG.runSuitesSequentially(TestNG.java:1218)
at org.testng.TestNG.runSuitesLocally(TestNG.java:1140)
at org.testng.TestNG.runSuites(TestNG.java:1069)
at org.testng.TestNG.run(TestNG.java:1037)
at org.apache.maven.surefire.testng.TestNGExecutor.run(TestNGExecutor.java:135)
at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeSingleClass(TestNGDirectoryTestSuite.java:112)
at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeLazy(TestNGDirectoryTestSuite.java:123)
at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.execute(TestNGDirectoryTestSuite.java:90)
at org.apache.maven.surefire.testng.TestNGProvider.invoke(TestNGProvider.java:146)
at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
jstack.review shows a possible deadlock: https://jstack.review/?https://gist.github.com/lhotari/f2fa48e5c0800018032e460f75f1a8b4#tda_1_dump
The issue had no activity for 30 days, mark with Stale label.
The issue had no activity for 30 days, mark with Stale label.
Another one in PulsarFunctionTlsTest.tearDown
"main" #1 prio=5 os_prio=0 cpu=43749.23ms elapsed=3506.78s tid=0x00007f965c026e90 nid=0xac2 waiting on condition [0x00007f9660e26000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@17.0.4.1/Native Method)
- parking to wait for <0x0000100032218608> (a java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.park(java.base@17.0.4.1/LockSupport.java:211)
at java.util.concurrent.CompletableFuture$Signaller.block(java.base@17.0.4.1/CompletableFuture.java:1864)
at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@17.0.4.1/ForkJoinPool.java:3463)
at java.util.concurrent.ForkJoinPool.managedBlock(java.base@17.0.4.1/ForkJoinPool.java:3434)
at java.util.concurrent.CompletableFuture.waitingGet(java.base@17.0.4.1/CompletableFuture.java:1898)
at java.util.concurrent.CompletableFuture.join(java.base@17.0.4.1/CompletableFuture.java:2117)
at org.apache.pulsar.metadata.coordination.impl.LockManagerImpl.close(LockManagerImpl.java:163)
at org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl.stop(ModularLoadManagerImpl.java:983)
at org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerWrapper.stop(ModularLoadManagerWrapper.java:118)
at org.apache.pulsar.functions.worker.PulsarFunctionTlsTest.tearDown(PulsarFunctionTlsTest.java:193)
https://github.com/apache/pulsar/actions/runs/3044816917/jobs/4905919291#step:11:528
again
"main" #1 prio=5 os_prio=0 cpu=42816.91ms elapsed=3505.95s tid=0x00007f3190026f50 nid=0xad8 waiting on condition [0x00007f3194df1000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@17.0.4.1/Native Method)
- parking to wait for <0x0000100025e10828> (a java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.park(java.base@17.0.4.1/LockSupport.java:211)
at java.util.concurrent.CompletableFuture$Signaller.block(java.base@17.0.4.1/CompletableFuture.java:1864)
at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@17.0.4.1/ForkJoinPool.java:3463)
at java.util.concurrent.ForkJoinPool.managedBlock(java.base@17.0.4.1/ForkJoinPool.java:3434)
at java.util.concurrent.CompletableFuture.waitingGet(java.base@17.0.4.1/CompletableFuture.java:1898)
at java.util.concurrent.CompletableFuture.join(java.base@17.0.4.1/CompletableFuture.java:2117)
at org.apache.pulsar.metadata.coordination.impl.LockManagerImpl.close(LockManagerImpl.java:163)
at org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl.stop(ModularLoadManagerImpl.java:9[81](https://github.com/apache/pulsar/actions/runs/3110223307/jobs/5041516576#step:11:82))
at org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerWrapper.stop(ModularLoadManagerWrapper.java:118)
at org.apache.pulsar.functions.worker.PulsarFunctionTlsTest.tearDown(PulsarFunctionTlsTest.java:193)
https://github.com/apache/pulsar/actions/runs/3110223307/jobs/5041516576#step:11:79 in @codelipenghui PR build for #17804 . @mattisonchao do you have a chance to take a look?
again
"main" #1 prio=5 os_prio=0 cpu=94687.75ms elapsed=3512.39s tid=0x00007efee002aba0 nid=0xbc8 waiting on condition [0x00007efee61ec000]
java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@17.0.5/Native Method)
- parking to wait for <0x0000100034016560> (a java.util.concurrent.CompletableFuture$Signaller)
at java.util.concurrent.locks.LockSupport.park(java.base@17.0.5/LockSupport.java:211)
at java.util.concurrent.CompletableFuture$Signaller.block(java.base@17.0.5/CompletableFuture.java:1864)
at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@17.0.5/ForkJoinPool.java:3463)
at java.util.concurrent.ForkJoinPool.managedBlock(java.base@17.0.5/ForkJoinPool.java:3434)
at java.util.concurrent.CompletableFuture.waitingGet(java.base@17.0.5/CompletableFuture.java:1898)
at java.util.concurrent.CompletableFuture.join(java.base@17.0.5/CompletableFuture.java:2[117](https://github.com/lhotari/pulsar/actions/runs/3739152576/jobs/6346246451#step:11:118))
at org.apache.pulsar.metadata.coordination.impl.LockManagerImpl.close(LockManagerImpl.java:163)
at org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl.stop(ModularLoadManagerImpl.java:982)
at org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerWrapper.stop(ModularLoadManagerWrapper.java:[118](https://github.com/lhotari/pulsar/actions/runs/3739152576/jobs/6346246451#step:11:119))
at org.apache.pulsar.functions.worker.PulsarFunctionTlsTest.tearDown(PulsarFunctionTlsTest.java:194)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(java.base@17.0.5/Native Method)
at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(java.base@17.0.5/NativeMethodAccessorImpl.java:77)
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@17.0.5/DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(java.base@17.0.5/Method.java:568)
at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:139)
at org.testng.internal.invokers.MethodInvocationHelper.invokeMethodConsideringTimeout(MethodInvocationHelper.java:69)
at org.testng.internal.invokers.ConfigInvoker.invokeConfigurationMethod(ConfigInvoker.java:361)
at org.testng.internal.invokers.ConfigInvoker.invokeConfigurations(ConfigInvoker.java:296)
at org.testng.internal.invokers.TestInvoker.runConfigMethods(TestInvoker.java:823)
at org.testng.internal.invokers.TestInvoker.runAfterConfigurations(TestInvoker.java:792)
at org.testng.internal.invokers.TestInvoker.invokeMethod(TestInvoker.java:768)
at org.testng.internal.invokers.TestInvoker.invokeTestMethod(TestInvoker.java:221)
at org.testng.internal.invokers.MethodRunner.runInSequence(MethodRunner.java:50)
at org.testng.internal.invokers.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:969)
at org.testng.internal.invokers.TestInvoker.invokeTestMethods(TestInvoker.java:194)
at org.testng.internal.invokers.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:148)
at org.testng.internal.invokers.TestMethodWorker.run(TestMethodWorker.java:128)
at org.testng.TestRunner$$Lambda$245/0x0000000800dc16d0.accept(Unknown Source)
at java.util.ArrayList.forEach(java.base@17.0.5/ArrayList.java:1511)
at org.testng.TestRunner.privateRun(TestRunner.java:829)
at org.testng.TestRunner.run(TestRunner.java:602)
at org.testng.SuiteRunner.runTest(SuiteRunner.java:437)
at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:431)
at org.testng.SuiteRunner.privateRun(SuiteRunner.java:391)
at org.testng.SuiteRunner.run(SuiteRunner.java:330)
at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:95)
at org.testng.TestNG.runSuitesSequentially(TestNG.java:1256)
at org.testng.TestNG.runSuitesLocally(TestNG.java:1176)
at org.testng.TestNG.runSuites(TestNG.java:1099)
at org.testng.TestNG.run(TestNG.java:1067)
at org.apache.maven.surefire.testng.TestNGExecutor.run(TestNGExecutor.java:135)
at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeSingleClass(TestNGDirectoryTestSuite.java:112)
at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeLazy(TestNGDirectoryTestSuite.java:[123](https://github.com/lhotari/pulsar/actions/runs/3739152576/jobs/6346246451#step:11:124))
at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.execute(TestNGDirectoryTestSuite.java:90)
at org.apache.maven.surefire.testng.TestNGProvider.invoke(TestNGProvider.java:146)
at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:[126](https://github.com/lhotari/pulsar/actions/runs/3739152576/jobs/6346246451#step:11:127))
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
in https://github.com/lhotari/pulsar/actions/runs/3739152576/jobs/6346246451#step:11:117
Fixed by #19055
Describe the bug
LockManagerImpl.close sporadically hangs and causes the test run to timeout eventually.
Here's an example thread stack trace:
Full thread dump in jstack.review analyzer