JetBrains / lincheck

Framework for testing concurrent data structures
Mozilla Public License 2.0
545 stars 31 forks source link

`IndexOutOfBoundsException` from `ManagedStrategy.getConcurrentActorCausesBlocking` #275

Closed durban closed 3 months ago

durban commented 4 months ago

I think an IndexOutOfBoundsException might be somehow thrown by Lincheck. I see in the code, that stacktraces are filtered; maybe that hides the exact source. EDIT: It is indeed thrown by Linchek, see below.

Steps to reproduce (unfortunately I don't know how to minimize):

git clone https://github.com/durban/choam.git
cd choam
git checkout 703b8199
sbt "stressLinchk/testOnly *.TtrieModelTest"

The IndexOutOfBoundsException looks like this:

The number next to an exception name helps you find its stack trace provided after the interleaving section
---
Exception stack traces:
#1: java.lang.IndexOutOfBoundsException: Index -1 out of bounds for length 3
        at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
        at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
        at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266)
        at java.base/java.util.Objects.checkIndex(Objects.java:361)
        at java.base/java.util.ArrayList.get(ArrayList.java:427)

Now, this doesn't make a lot of sense, I suspect due to the stacktrace filtering. To the best of my knowledge, my code doesn't use ArrayList anywhere.

For reference, more output:

dev.tauri.choam.data.TtrieModelTest:
==> i dev.tauri.choam.data.TtrieModelTest.Model checking Ttrie.apply ignored 0.0s
==> X dev.tauri.choam.data.TtrieModelTest.Model checking Ttrie.skipListBased  30.698s java.lang.IllegalStateException: Non-determinism found. Probably caused by non-deterministic code (WeakHashMap, Object.hashCode, etc).
== Reporting the first execution without execution trace ==
= Invalid execution results =
| --------------------------------------------------------------------------------------------- |
|                     Thread 1                     |                  Thread 2                  |
| --------------------------------------------------------------------------------------------- |
| insertIfAbsent(Ve, hF1iwXTfVYhRYXG): None        |                                            |
| removeKey(ez): false                             |                                            |
| --------------------------------------------------------------------------------------------- |
| removeKey(zE): false                             | removeKey(2W8): false [3,0]                |
| lookup(q3_): None [1,0]                          | lookup(Pvx): None [3,1]                    |
| lookup(hDjR): IndexOutOfBoundsException #1 [2,0] | insert(S3Qwv, t4m5fQnVDGfvpd8): None [3,2] |
| --------------------------------------------------------------------------------------------- |
| insert(7bSl3, OOfbSFKGKWr0c x): None             |                                            |
| --------------------------------------------------------------------------------------------- |

---
All operations above the horizontal line | ----- | happen before those below the line
---
Values in "[..]" brackets indicate the number of completed operations
in each of the parallel threads seen at the beginning of the current operation
---
The number next to an exception name helps you find its stack trace provided after the interleaving section
---
Exception stack traces:
#1: java.lang.IndexOutOfBoundsException: Index -1 out of bounds for length 3
        at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
        at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
        at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266)
        at java.base/java.util.Objects.checkIndex(Objects.java:361)
        at java.base/java.util.ArrayList.get(ArrayList.java:427)

== Reporting the second execution ==
= Invalid execution results =
| -------------------------------------------------------------------------------------- |
|                 Thread 1                  |                  Thread 2                  |
| -------------------------------------------------------------------------------------- |
| insertIfAbsent(Ve, hF1iwXTfVYhRYXG): None |                                            |
| removeKey(ez): false                      |                                            |
| -------------------------------------------------------------------------------------- |
| removeKey(zE): false                      | removeKey(2W8): false [3,0]                |
| lookup(q3_): None [1,0]                   | lookup(Pvx): None [3,1]                    |
| lookup(hDjR): None [2,0]                  | insert(S3Qwv, t4m5fQnVDGfvpd8): None [3,2] |
| -------------------------------------------------------------------------------------- |
| insert(7bSl3, OOfbSFKGKWr0c x): None      |                                            |
| -------------------------------------------------------------------------------------- |

---
All operations above the horizontal line | ----- | happen before those below the line
---
Values in "[..]" brackets indicate the number of completed operations
in each of the parallel threads seen at the beginning of the current operation
---

The following interleaving leads to the error:
| -------------------------------------------------------------------------------- |
|                 Thread 1                  |               Thread 2               |
| -------------------------------------------------------------------------------- |
| insertIfAbsent(Ve, hF1iwXTfVYhRYXG): None |                                      |
| removeKey(ez): false                      |                                      |
| -------------------------------------------------------------------------------- |
| removeKey(zE): false                      |                                      |
| lookup(q3_): None                         |                                      |
| lookup(hDjR): None                        |                                      |
|                                           | removeKey(2W8): false                |
|                                           | lookup(Pvx): None                    |
|                                           | insert(S3Qwv, t4m5fQnVDGfvpd8): None |
| -------------------------------------------------------------------------------- |
| insert(7bSl3, OOfbSFKGKWr0c x): None      |                                      |
| -------------------------------------------------------------------------------- |

Detailed trace:
...
durban commented 4 months ago

I've locally modified Lincheck to not filter stacktraces. This way I've got the full stacktrace of the IndexOutOfBoundsException:

The number next to an exception name helps you find its stack trace provided after the interleaving section
---
Exception stack traces:
#1: java.lang.IndexOutOfBoundsException: Index -1 out of bounds for length 3
        at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
        at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
        at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266)
        at java.base/java.util.Objects.checkIndex(Objects.java:361)
        at java.base/java.util.ArrayList.get(ArrayList.java:427)
        at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy.getConcurrentActorCausesBlocking(ManagedStrategy.kt:263)
        at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy.failIfObstructionFreedomIsRequired(ManagedStrategy.kt:250)
        at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy.newSwitchPointRegular(ManagedStrategy.kt:327)
        at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy.newSwitchPoint(ManagedStrategy.kt:312)
        at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy.beforeAtomicMethodCall$lincheck(ManagedStrategy.kt:537)
        at org.jetbrains.kotlinx.lincheck.tran$f*rmed.java.util.concurrent.atomic.AtomicReference.getAcquire(AtomicReference.java:334)
        at dev.tauri.choam.internal.skiplist.SkipListMap.walkRight(SkipListMap.scala:104)
        at dev.tauri.choam.internal.skiplist.SkipListMap.findPredecessor(SkipListMap.scala:808)
        at dev.tauri.choam.internal.skiplist.SkipListMap.doRemove(SkipListMap.scala:602)
        at dev.tauri.choam.internal.skiplist.SkipListMap.remove(SkipListMap.scala:176)
        at dev.tauri.choam.data.Ttrie.unsafeDelRef(Ttrie.scala:150)
        at dev.tauri.choam.data.Ttrie.$anonfun$cleanupLater$2(Ttrie.scala:160)
        at dev.tauri.choam.data.Ttrie.$anonfun$cleanupLater$2$adapted(Ttrie.scala:160)
        at dev.tauri.choam.core.Rxn$InterpreterState.loop(Rxn.scala:1400)
        at dev.tauri.choam.core.Rxn$InterpreterState.interpretWithContext(Rxn.scala:1552)
        at dev.tauri.choam.core.Rxn$InterpreterState.interpret(Rxn.scala:1545)
        at dev.tauri.choam.core.Rxn.unsafePerform(Rxn.scala:260)
        at dev.tauri.choam.data.TtrieModelTest$AbstractTestState.lookup(TtrieModelTest.scala:71)
        at org.jetbrains.kotlinx.lincheck.runner.TestThreadExecution21.run(Unknown Source)
        at org.jetbrains.kotlinx.lincheck.runner.FixedActiveThreadsExecutor.testThreadRunnable$lambda$8(FixedActiveThreadsExecutor.kt:151)
        at java.base/java.lang.Thread.run(Thread.java:840)

So it is indeed coming from Lincheck itself.

ndkoval commented 4 months ago

Hi @durban! Could you please check whether the issue still occurs with Lincheck 2.26?

durban commented 4 months ago

I've tried 2.26, and it fails the same way. The (unfiltered) stacktrace is this:

The number next to an exception name helps you find its stack trace provided after the interleaving section
---
Exception stack traces:
#1: java.lang.IndexOutOfBoundsException: Index -1 out of bounds for length 3
        at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
        at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
        at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266)
        at java.base/java.util.Objects.checkIndex(Objects.java:361)
        at java.base/java.util.ArrayList.get(ArrayList.java:427)
        at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy.getConcurrentActorCausesBlocking(ManagedStrategy.kt:267)
        at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy.failIfObstructionFreedomIsRequired(ManagedStrategy.kt:254)
        at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy.newSwitchPointRegular(ManagedStrategy.kt:330)
        at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy.newSwitchPoint(ManagedStrategy.kt:315)
        at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy.beforeAtomicMethodCall$lincheck(ManagedStrategy.kt:538)
        at org.jetbrains.kotlinx.lincheck.tran$f*rmed.java.util.concurrent.atomic.AtomicReference.getAcquire(AtomicReference.java:334)
        at dev.tauri.choam.internal.skiplist.SkipListMap.walkRight(SkipListMap.scala:104)
        at dev.tauri.choam.internal.skiplist.SkipListMap.findPredecessor(SkipListMap.scala:808)
        at dev.tauri.choam.internal.skiplist.SkipListMap.doRemove(SkipListMap.scala:602)
        at dev.tauri.choam.internal.skiplist.SkipListMap.remove(SkipListMap.scala:176)
        at dev.tauri.choam.data.Ttrie.unsafeDelRef(Ttrie.scala:150)
        at dev.tauri.choam.data.Ttrie.$anonfun$cleanupLater$2(Ttrie.scala:160)
        at dev.tauri.choam.data.Ttrie.$anonfun$cleanupLater$2$adapted(Ttrie.scala:160)
        at dev.tauri.choam.core.Rxn$InterpreterState.loop(Rxn.scala:1400)
        at dev.tauri.choam.core.Rxn$InterpreterState.interpretWithContext(Rxn.scala:1552)
        at dev.tauri.choam.core.Rxn$InterpreterState.interpret(Rxn.scala:1545)
        at dev.tauri.choam.core.Rxn.unsafePerform(Rxn.scala:260)
        at dev.tauri.choam.data.TtrieModelTest$AbstractTestState.lookup(TtrieModelTest.scala:71)
        at org.jetbrains.kotlinx.lincheck.runner.TestThreadExecution21.run(Unknown Source)
        at org.jetbrains.kotlinx.lincheck.runner.FixedActiveThreadsExecutor.testThreadRunnable$lambda$8(FixedActiveThreadsExecutor.kt:151)
        at java.base/java.lang.Thread.run(Thread.java:840)

So essentially the same as before, although some line numbers changed.

eupp commented 4 months ago

Hi @durban !

PR #277 should fix the IndexOutOfBoundsException. But even with this fix, the test still fails (I attached the log: output.txt).

As you might see, the log contains the following line:

java.lang.IllegalStateException: Non-determinism found. Probably caused by non-deterministic code (WeakHashMap, Object.hashCode, etc).

I suspect it might be caused by the usage of WeakReference-s (there are plenty of them in the trace).

Nevertheless, it can be seen, that the first issue detected by Lincheck is an active spin lock:

= The algorithm should be non-blocking, but an active lock is detected =

This could be a sign of an actual live-lock, or a false positive bug report.

No matter what, you can effectively disable live-lock detection by setting larger value for parameter hangingDetectionThreshold in ModelCheckingOptions. Setting value 200 is enough to make the test pass:

def defaultModelCheckingOptions(): ModelCheckingOptions = {
    ....
    increaseTimeout(new ModelCheckingOptions())
          .addGuarantee(assumedAtomic)
          .checkObstructionFreedom(true)
          // this is the "fast" configuration from the Lincheck paper:
          .iterations(30) // scenarios
          .threads(2)
          .actorsPerThread(3) // operations per thread
          .invocationsPerIteration(1000) // run each scenario this much time (FIXME)
          // end of "fast" configuration
          .actorsBefore(2) // so that we don't work with empty data structures
          .actorsAfter(1) // to have a chance of detecting inconsistent state left
          .hangingDetectionThreshold(200)
}

If you think this live-lock is false positive, please report back so we can investigate further. The hack with disabling lock detection should make the test green for now.

eupp commented 4 months ago

For weak references support, we added a new tracking issue #279

durban commented 4 months ago

Thanks for looking into this! I've looked at the output.txt you've attached, and some things are not clear to me:

The Non-determinism found: I'm definitely using weakrefs, but they shouldn't be externally visible. I thought that model checking inspects the results of the @Operations. Does it also check the internals of a data structure? Also, if you look at the original output above, it contains the operation results, e.g., removeKey(ez): false. In the output.txt, there are no results, e.g., lookup(zE). Is this normal?

There is also this part in the log:

java.lang.IllegalArgumentException: Internal error
    at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy$ReplayModeLoopDetectorHelper.onNextExecution(ManagedStrategy.kt:1197)
    at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy$LoopDetector.onNextExecutionPoint(ManagedStrategy.kt:1051)
    at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy$LoopDetector.onThreadFinish(ManagedStrategy.kt:1010)
    at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategy.onFinish(ManagedStrategy.kt:379)
    at org.jetbrains.kotlinx.lincheck.strategy.managed.ManagedStrategyRunner.onFinish(ManagedStrategy.kt:1251)
    at org.jetbrains.kotlinx.lincheck.runner.TestThreadExecution26.run(Unknown Source)
    at org.jetbrains.kotlinx.lincheck.runner.FixedActiveThreadsExecutor.testThreadRunnable$lambda$8(FixedActiveThreadsExecutor.kt:151)
    at java.base/java.lang.Thread.run(Thread.java:829)

This looks like it shouldn't happen, regardless of the behavior of the tested code. Could this be the reason for the Non-determinism found also?

Finally about the active lock is detected: there should be no spinlocks, but I'll look into it. For the hangingDetectionThreshold, what does the value 200 mean?

eupp commented 4 months ago

The Non-determinism found: I'm definitely using weakrefs, but they shouldn't be externally visible. I thought that model checking inspects the results of the @Operations. Does it also check the internals of a data structure?

Model checking mode generally relies on the fact that it can deterministically reproduce any execution of the program. It stores some data structure internally, that allows to do that, and the bytecode transformations ensure that we track all the events in the program that may cause non-determinism (like calls to Random, thread switches, etc). But for now we do not support WeakReference-s. Their methods are non-deterministic too (because their results depend on GC). We will fix this problem later under #279.

Also, if you look at the original output above, it contains the operation results, e.g., removeKey(ez): false. In the output.txt, there are no results, e.g., lookup(zE). Is this normal?

What happens with your test, is that the model checker discovers what it thinks is an active lock. In general, when the MC discovers some bug, it attempts to run the same interleaving leading to a bug yet another time -- but this time collecting the program trace on the road. Because of non-determinism it cannot replay the same execution, and thus the second time it obtains different results.

For the hangingDetectionThreshold, what does the value 200 mean?

At least for now, the active lock detector suspects that program entered active lock whenever same operation hits specific code location N times (it usually happens when the code spins in a loop). The hangingDetectionThreshold sets this N parameter.

durban commented 4 months ago

Thanks, that makes a lot of things much more clear to me. I'll try to look into what could be this spinlock (or spinlock-like thing).

eupp commented 4 months ago

Hi @durban !

For your test, Lincheck actually reported false positive active lock due to a bug. PR #281 fixes this bug. Thanks to @ndkoval for spotting the root cause of the bug.

durban commented 4 months ago

Thanks, that's good to hear! I'll recheck when the fix is released.

eupp commented 3 months ago

The new release lincheck-2.27 should fix this issue. @durban please re-open it if the problem persists.