JetBrains / lincheck

Framework for testing concurrent data structures
Mozilla Public License 2.0

model checking hangs in 2.16 on jdk 19 #130

Closed: ben-manes closed this issue 4 months ago

ben-manes commented 1 year ago

After upgrading from 2.15, a previously passing test hangs and times out on CI. I was able to reproduce this locally on JDK 19 (JDK 11 and 17 both passed). Using jps to find the pid and jstack to capture a stack trace, I could see that it was always doing work within Lincheck methods. Sorry that I cannot provide more insight.
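If jstack is not at hand, a rough in-process analog of that check is sketched below. It assumes you can run code inside the hung JVM (for example from a watchdog thread). `StackScan` and `threadsIn` are hypothetical names, not part of Lincheck or the JDK tools; the sketch only relies on the standard `Thread.getAllStackTraces()` API, which covers platform threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public final class StackScan {
    private StackScan() {}

    /**
     * Returns the names of live threads whose current stack contains at least one
     * frame from a class whose name starts with classPrefix, roughly like grepping
     * jstack output for a package name.
     */
    public static List<String> threadsIn(String classPrefix) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            for (StackTraceElement frame : entry.getValue()) {
                if (frame.getClassName().startsWith(classPrefix)) {
                    names.add(entry.getKey().getName());
                    break; // one matching frame is enough for this thread
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // In the reported hang, the interesting prefix would be Lincheck's package.
        System.out.println(threadsIn("org.jetbrains.kotlinx.lincheck"));
    }
}
```

Unlike jstack, this only sees the JVM it runs in, so it is mainly useful as a diagnostic hook inside a test harness.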

JAVA_VERSION=19 ./gradlew caffeine:lincheckTest

https://github.com/ben-manes/caffeine/actions/runs/3528817208/jobs/5919342189

ben-manes commented 1 year ago

Interestingly, this passed on JDK 20-ea.

https://github.com/ben-manes/caffeine/actions/runs/3529461192/jobs/5920532925

ben-manes commented 1 year ago

I profiled it and found that it throws NoClassDefFoundError and, I presume, retries indefinitely. The cause appears to be a failed transformation:

Could not initialize class org.jetbrains.kotlinx.lincheck.tran$f*rmed.java.util.concurrent.ForkJoinPool

void java.lang.Error.<init>(String)
void java.lang.LinkageError.<init>(String)
void java.lang.NoClassDefFoundError.<init>(String)
void com.github.benmanes.caffeine.cache.BoundedLocalCache.performCleanUp(Runnable)
void com.github.benmanes.caffeine.cache.BoundedLocalCache$PerformCleanupTask.run()
void com.github.benmanes.caffeine.lincheck.AbstractLincheckCacheTest$$Lambda$3254+0x00000008029f7140.74472028.execute(Runnable)

If I remove the offending code, the test passes. The code in question is an optimization that more aggressively resubmits onto the common pool when work remains, rather than waiting for the next triggering action. I don't know why other JDK versions don't suffer from this problem, or why it only breaks on JDK 19 with 2.16.

/**
  * Performs the maintenance work, blocking until the lock is acquired.
  *
  * @param task an additional pending task to run, or {@code null} if not present
  */
void performCleanUp(@Nullable Runnable task) {
  evictionLock.lock();
  try {
    maintenance(task);
  } finally {
    evictionLock.unlock();
  }
  // Work remains after this pass: if running on the common pool, eagerly resubmit the drain
  if ((drainStatusOpaque() == REQUIRED) && (executor == ForkJoinPool.commonPool())) {
    scheduleDrainBuffers();
  }
}
ben-manes commented 1 year ago

Oh, I missed this exception thrown by the static initializer:

Exception java.lang.IllegalAccessError: class org.jetbrains.kotlinx.lincheck.tran$f*rmed.java.util.concurrent.ForkJoinPool (in unnamed module @0x1add17cc) cannot access class jdk.internal.vm.SharedThreadContainer (in module java.base) because module java.base does not export jdk.internal.vm to unnamed module @0x1add17cc [in thread "FixedActiveThreadsExecutor@849727439-1"] 1

I can make the test fail eagerly by adding ForkJoinPool.commonPool() to the test class's constructor. The resulting errors led me to open jdk.internal.vm and jdk.internal.access, after which the tests pass.
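A minimal sketch of that fail-fast trick, assuming a hypothetical test class named `EagerPoolCheck` (the actual Caffeine test class is not shown in this thread): touching `ForkJoinPool.commonPool()` in the constructor forces the class to initialize when the test instance is created, so an initialization failure is thrown right away instead of later inside a background maintenance task. The `initFailure` helper is an extra hypothetical convenience built on the standard `Class.forName(String, boolean, ClassLoader)` overload.

```java
import java.util.concurrent.ForkJoinPool;

/** Hypothetical test class; only the constructor matters for the fail-fast check. */
public class EagerPoolCheck {
    private final ForkJoinPool pool;

    public EagerPoolCheck() {
        // Forces ForkJoinPool's static initialization now. If that initialization
        // fails, the Error propagates out of this constructor immediately.
        this.pool = ForkJoinPool.commonPool();
    }

    public ForkJoinPool pool() {
        return pool;
    }

    /** Returns the failure from loading and initializing the named class, or null on success. */
    public static Throwable initFailure(String className) {
        try {
            Class.forName(className, true, EagerPoolCheck.class.getClassLoader());
            return null;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers ExceptionInInitializerError, NoClassDefFoundError,
            // and IllegalAccessError raised during initialization.
            return e;
        }
    }
}
```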

alefedor commented 1 year ago

Hi @ben-manes !

So, in short, it seems the error is caused by lazy static initialization in concurrent threads, leading to a deadlock. Interesting. This behaviour is likely due to cyclic class dependencies, though it is not yet clear why exactly this happens.

ben-manes commented 1 year ago

Hi @alefedor

In summary, ForkJoinPool now imports classes from jdk.internal.access and jdk.internal.vm, likely because of virtual threads. When ASM rewrites ForkJoinPool into a shaded copy, the module system denies that copy access to these classes, so initialization of the common pool fails. Because the common pool is a static instance, this happens during class initialization: the entire ForkJoinPool class fails to initialize and a LinkageError is thrown. Unfortunately, Lincheck blindly catches every Throwable, whereas Error types should always be propagated and never handled directly. As a result, the test retries forever without reporting what went wrong.
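This failure mode can be reproduced without Lincheck. The sketch below uses a hypothetical `Fragile` class whose static initializer throws an `IllegalAccessError`, standing in for the transformed ForkJoinPool. Under the JLS class initialization procedure, the first access rethrows that `Error` unchanged (only non-`Error` exceptions are wrapped in `ExceptionInInitializerError`), and the class is then marked erroneous, so every later access fails with `NoClassDefFoundError`. A loop that swallows the first failure and retries therefore only ever sees `NoClassDefFoundError` afterwards, which matches the profile. On newer JDKs the later `NoClassDefFoundError` also appears to carry the original failure as its cause, which would explain the `Exception ... [in thread ...]` text quoted earlier.

```java
public final class InitFailureDemo {
    /** Stand-in for a class whose static initialization fails. */
    static final class Fragile {
        static final Object INSTANCE = create();

        private static Object create() {
            // Simulated failure; in the report the module system raised an
            // IllegalAccessError during ForkJoinPool's static initialization.
            throw new IllegalAccessError("simulated: package not exported");
        }

        static void touch() {}
    }

    /** Triggers Fragile's initialization and returns the LinkageError it produced, if any. */
    public static Throwable access() {
        try {
            Fragile.touch(); // a static method call triggers class initialization
            return null;
        } catch (LinkageError e) {
            return e;
        }
    }

    public static void main(String[] args) {
        System.out.println(access()); // first access: the original IllegalAccessError
        System.out.println(access()); // later accesses: NoClassDefFoundError
    }
}
```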

The important fix is to always rethrow java.lang.Error so that the test terminates early and reports the failure, since this type indicates a serious problem that a reasonable application should not try to catch. Then, if future runtimes require additional modules, users can quickly diagnose the problem and make the necessary changes.
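As an illustration of that policy (not Lincheck's actual code, which is not shown in this thread), here is a hypothetical retry helper that retries ordinary exceptions but rethrows any `java.lang.Error` immediately. `RetryPolicy` and `callWithRetries` are made-up names.

```java
import java.util.concurrent.Callable;

public final class RetryPolicy {
    private RetryPolicy() {}

    /**
     * Runs the task up to maxAttempts times, retrying on ordinary exceptions.
     * An Error is never retried: it is rethrown at once so the caller sees the
     * real failure instead of looping forever.
     */
    public static <T> T callWithRetries(Callable<T> task, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Error e) {
                throw e; // e.g. NoClassDefFoundError: fail fast, do not retry
            } catch (Exception e) {
                last = e; // possibly transient: remember it and try again
            }
        }
        throw last; // all attempts failed with ordinary exceptions
    }
}
```

Simply not catching `Error` would have the same effect; the explicit `catch (Error e) { throw e; }` documents the intent where a handler might otherwise be widened to `Throwable`.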

On the user side, the change is to add --add-opens directives for jdk.internal.vm and jdk.internal.access. Since ForkJoinPool is in your list of transformed classes, it seems this should be added to your README's Java 9+ instructions. That is less pressing, as it is merely outdated documentation.
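To confirm that the directives actually reach the test JVM, a small probe can be run with the same JVM arguments as the test task. The flags take the standard form `--add-opens java.base/jdk.internal.vm=ALL-UNNAMED` and `--add-opens java.base/jdk.internal.access=ALL-UNNAMED`. `AddOpensCheck` is a hypothetical helper built on the standard `Module.isExported` and `Module.isOpen` methods; it assumes the probe runs from the class path, so it lives in an unnamed module like Lincheck's transformed classes. An open package is also treated as exported at run time, so `exported=true` is the value relevant to the `IllegalAccessError` reported above.

```java
import java.util.List;

public final class AddOpensCheck {
    private AddOpensCheck() {}

    /** True if java.base exports pkg (at run time) to the module containing this class. */
    public static boolean isExportedToUs(String pkg) {
        return Object.class.getModule().isExported(pkg, AddOpensCheck.class.getModule());
    }

    /** True if java.base opens pkg for deep reflection to the module containing this class. */
    public static boolean isOpenToUs(String pkg) {
        return Object.class.getModule().isOpen(pkg, AddOpensCheck.class.getModule());
    }

    public static void main(String[] args) {
        // Packages named in the thread as required by ForkJoinPool on JDK 19.
        for (String pkg : List.of("jdk.internal.vm", "jdk.internal.access")) {
            System.out.printf("%s exported=%b open=%b%n", pkg, isExportedToUs(pkg), isOpenToUs(pkg));
        }
    }
}
```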

ndkoval commented 8 months ago

The problem should be resolved with the changes for #136. I've also created an issue about propagating Errors (#258).

ndkoval commented 4 months ago

Hi, @ben-manes! It has taken a while to address the issue, but the recent 2.30 release should've fixed it. Could you please check?

ben-manes commented 4 months ago

Oh yes, this is fixed. I am on 2.29 for now because I ran into a bug where Lincheck terminates itself when it considers the execution hung:

Gradle suite > Gradle test > com.github.benmanes.caffeine.lincheck.CaffeineLincheckTest$BoundedLincheckTest > modelCheckingTest FAILED
    org.jetbrains.kotlinx.lincheck.LincheckAssertionError:
    = The execution has hung, see the thread dump =

I reported it in https://github.com/JetBrains/lincheck/issues/311, but my simplified reproduction was misleading, so I'll update that issue's title.