JetBrains / lincheck

Framework for testing concurrent data structures
Mozilla Public License 2.0
545 stars 31 forks source link

Model checking deadlock #251

Closed durban closed 5 months ago

durban commented 6 months ago

I think I found a bug, which I can't minimize, but can always reproduce. To reproduce it, these steps are necessary (git and sbt are needed):

git clone https://github.com/durban/choam.git
cd choam
git checkout 998d514d
sbt "stressLinchk/testOnly dev.tauri.choam.data.TtrieModelTest"

Unfortunately there are a lot of things in that repo, and I haven't succeeded in reproducing the issue with a small test. The test in question is here: https://github.com/durban/choam/blob/998d514d88d4296db8f616a8ff68844c9874ca16/stress/stress-linchk/src/test/scala/dev/tauri/choam/data/TtrieModelTest.scala.

With this test, I get the following output:

==> X dev.tauri.choam.data.TtrieModelTest.Model checking test of Ttrie  64.609s org.jetbrains.kotlinx.lincheck.LincheckAssertionError: 
= The execution has hung, see the thread dump =
| ------------------------------------ |
|     Thread 1     |     Thread 2      |
| ------------------------------------ |
| insert(s, qVm)   |                   |
| ------------------------------------ |
| insert(eB, Fkaa) | insert(hQ, tQmWT) |
| ------------------------------------ |

Thread-0:
        jdk.internal.misc.Unsafe.park(Native Method)
        java.util.concurrent.locks.LockSupport.park(LockSupport.java:341)
        java.lang.Thread.run(Thread.java:840)
Thread-1:
        dev.tauri.choam.internal.mcas.emcas.EmcasThreadContextBase.recordCommitPlain(EmcasThreadContextBase.java:71)
        dev.tauri.choam.internal.mcas.emcas.EmcasThreadContext.recordCommit(EmcasThreadContext.scala:150)
        dev.tauri.choam.core.Rxn$InterpreterState.loop(Rxn.scala:1269)
        dev.tauri.choam.core.Rxn$InterpreterState.interpretWithContext(Rxn.scala:1550)
        dev.tauri.choam.core.Rxn$InterpreterState.interpret(Rxn.scala:1543)
        dev.tauri.choam.core.Rxn.unsafePerform(Rxn.scala:260)
        dev.tauri.choam.data.TtrieModelTest$TestState.insert(TtrieModelTest.scala:60)
        java.lang.Thread.run(Thread.java:840)

    at org.jetbrains.kotlinx.lincheck.LinChecker.check(LinChecker.kt:38)
    at org.jetbrains.kotlinx.lincheck.LinChecker$Companion.check(LinChecker.kt:144)
    at org.jetbrains.kotlinx.lincheck.LinChecker.check(LinChecker.kt)
    at dev.tauri.choam.data.TtrieModelTest.$anonfun$new$1(TtrieModelTest.scala:41)
    at dev.tauri.choam.BaseLinchkSpec.$anonfun$test$2(BaseLinchkSpec.scala:41)

With some debugging, I could confirm, that the test times out here: https://github.com/JetBrains/lincheck/blob/lincheck-2.23/src/jvm/main/org/jetbrains/kotlinx/lincheck/runner/ParallelThreadsRunner.kt#L288. However, increasing the timeout doesn't solve the issue. The 2 threads in question just seem to wait indefinitely: one of them (Thread-0 above) is blocked in LockSupport.park, the other (Thread-1 above) is spinning in ManagedStrategy.awaitTurn (ManagedStrategy.kt:406). The latter one is not visible above, but I've attached a full thread dump (threaddump.txt), here is the relevant part:

"FixedActiveThreadsExecutor@4577519-1" #25 prio=5 os_prio=0 cpu=8093.98ms elapsed=9.96s tid=0x00007f37d4523410 nid=0x3e45 runnable  [0x00007f38128f6000]
   java.lang.Thread.State: RUNNABLE
        at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy.awaitTurn(ManagedStrategy.kt:406)
        at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy.switchCurrentThread(ManagedStrategy.kt:416)
        at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy.switchCurrentThread$default(ManagedStrategy.kt:413)
        at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy.newSwitchPointRegular(ManagedStrategy.kt:325)
        at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy.newSwitchPoint(ManagedStrategy.kt:303)
        at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy.beforeSharedVariableRead$lincheck(ManagedStrategy.kt:489)
        at dev.tauri.choam.internal.mcas.emcas.GlobalContextBase.getCommitTs(GlobalContextBase.java:43)
        at dev.tauri.choam.internal.mcas.emcas.Emcas.retrieveFreshTs(Emcas.scala:676)
        at dev.tauri.choam.internal.mcas.emcas.Emcas.MCAS(Emcas.scala:611)
        at dev.tauri.choam.internal.mcas.emcas.Emcas.tryPerformDebug(Emcas.scala:710)
        at dev.tauri.choam.internal.mcas.emcas.Emcas.tryPerformInternal(Emcas.scala:704)
        at dev.tauri.choam.internal.mcas.emcas.EmcasThreadContext.tryPerformInternal(EmcasThreadContext.scala:119)
        at dev.tauri.choam.internal.mcas.Mcas$ThreadContext.tryPerform(Mcas.scala:169)
        at dev.tauri.choam.internal.mcas.Mcas$ThreadContext.tryPerform$(Mcas.scala:158)
        at dev.tauri.choam.internal.mcas.emcas.EmcasThreadContext.tryPerform(EmcasThreadContext.scala:26)
        at dev.tauri.choam.core.Rxn$InterpreterState.casLoop(Rxn.scala:1237)
        at dev.tauri.choam.core.Rxn$InterpreterState.loop(Rxn.scala:1265)
        at dev.tauri.choam.core.Rxn$InterpreterState.interpretWithContext(Rxn.scala:1550)
        at dev.tauri.choam.core.Rxn$InterpreterState.interpret(Rxn.scala:1543)
        at dev.tauri.choam.core.Rxn.unsafePerform(Rxn.scala:260)
        at dev.tauri.choam.data.TtrieModelTest$TestState.insert(TtrieModelTest.scala:60)
        at org.jetbrains.kotlinx.lincheck.runner.TestThreadExecution9.run(Unknown Source)
        at org.jetbrains.kotlinx.lincheck.runner.FixedActiveThreadsExecutor.testThreadRunnable$lambda-8(FixedActiveThreadsExecutor.kt:151)
        at org.jetbrains.kotlinx.lincheck.runner.FixedActiveThreadsExecutor$$Lambda$142/0x00007f381829d360.run(Unknown Source)
        at java.lang.Thread.run(java.base@17.0.9/Thread.java:840)

   Locked ownable synchronizers:
        - None

This might be a red herring, but looking at a heap dump, I see an org.jetbrains.kotlinx.lincheck.runner.FixedActiveThreadsExecutor with its results array having an IllegalStateException (with message Trying to switch the execution to thread 0,but only the following threads are eligible to switch: [1]) at index 0. Maybe this exception is thrown (probably here), and somehow this makes the two threads lock up? Although I'm not that familiar with the code...

ben-manes commented 6 months ago

It sounds like https://github.com/JetBrains/lincheck/issues/194? I pin to the older version because of this.

eupp commented 6 months ago

Hi @durban! Thank you for the bug report and the detailed investigation.

I was trying to reproduce the bug on the develop branch of the Lincheck, but it looks like the bug is already fixed on this branch. I did some bisecting and found out that the following commit fixes the bug (this line to be precise).

We will prepare a new release encompassing the latest changes from the develop branch, including the commit that fixes the bug you reported.

CC @ndkoval

durban commented 6 months ago

Sounds good, thank you!

eupp commented 5 months ago

Hi @durban ! Could you please re-check if the 2.24 release fixes the issue for you?

durban commented 5 months ago

Yes, 2.24 seems to work fine. Thanks!