eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.28k stars 721 forks source link

CRIU cmdLineTester_criu_nonPortableRestore_5_FAILED j9mm.107 ASSERTION FAILED at omr/gc/base/ParallelDispatcher.cpp:176: ((false && ((worker_status_reserved == _statusTable[workerID]) || ((0 == _threadsToReserve) && (worker_status_dying == _statusTable[workerID]))))) #17463

Closed JasonFengJ9 closed 1 year ago

JasonFengJ9 commented 1 year ago

Failure link

From an internal build(ubu20s390x-svl-rt1-1):

java version "11.0.20" 2023-07-18
IBM Semeru Runtime Certified Edition 11.0.20+2 (build 11.0.20+2)
Eclipse OpenJ9 VM 11.0.20+2 (build master-1b94cba4c, JRE 11 Linux s390x-64-Bit Compressed References 20230526_565 (JIT enabled, AOT enabled)
OpenJ9   - 1b94cba4c
OMR      - 779c51b95
JCL      - b111539de8 based on jdk-11.0.20+2)

Rerun in Grinder - Change TARGET to run only the failed test targets.

Optional info

Failure output (captured from console output)

[2023-05-27T02:24:15.520Z] variation: -Xgcpolicy:gencon -Xgcthreads64 -XX:CheckpointGCThreads=1
[2023-05-27T02:24:15.520Z] JVM_OPTIONS:  -Xgcpolicy:gencon -Xgcthreads64 -XX:CheckpointGCThreads=1 

[2023-05-27T02:24:30.509Z] Testing: Create CRIU checkpoint image and restore three times - testSystemNanoTimeJitPostCheckpointCompile
[2023-05-27T02:24:30.509Z] Test start time: 2023/05/27 02:24:29 Coordinated Universal Time
[2023-05-27T02:24:30.509Z] Running command: bash /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_testList_1/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu/criuScript.sh /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_testList_1/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_testList_1/openjdkbinary/j2sdk-image/bin/java " -Xgcpolicy:gencon -Xgcthreads64 -XX:CheckpointGCThreads=1 " "org.openj9.criu.TimeChangeTest testSystemNanoTimeJitPostCheckpointCompile" 3 3 false
[2023-05-27T02:24:30.509Z] Time spent starting: 1 milliseconds
[2023-05-27T02:24:38.647Z] Time spent executing: 7510 milliseconds
[2023-05-27T02:24:38.647Z] Test result: FAILED
[2023-05-27T02:24:38.647Z] Output from test:
[2023-05-27T02:24:38.647Z]  [OUT] start running script
[2023-05-27T02:24:38.647Z]  [OUT] export GLIBC_TUNABLES=glibc.cpu.hwcaps=-XSAVEC,-XSAVE,-AVX2,-ERMS,-AVX,-AVX_Fast_Unaligned_Load
[2023-05-27T02:24:38.647Z]  [OUT] export LD_BIND_NOT=on
[2023-05-27T02:24:38.647Z]  [OUT] System.nanoTime() before CRIU checkpoint: 1685154149521110440
[2023-05-27T02:24:38.647Z]  [OUT] Performing CRIUSupport.checkpointJVM(), current thread name: main, Sat May 27 02:24:29 UTC 2023, System.currentTimeMillis(): 1685154269816, System.nanoTime(): 1685154149557836678
[2023-05-27T02:24:38.647Z]  [OUT] 02:24:30.229 0x225fe00    j9mm.107    *   ** ASSERTION FAILED ** at ../../../../../../omr/gc/base/ParallelDispatcher.cpp:176: ((false && ((worker_status_reserved == _statusTable[workerID]) || ((0 == _threadsToReserve) && (worker_status_dying == _statusTable[workerID])))))
[2023-05-27T02:24:38.647Z]  [OUT] JVMDUMP039I Processing dump event "traceassert", detail "" at 2023/05/27 02:24:30 - please wait.
[2023-05-27T02:24:38.647Z]  [OUT] JVMDUMP032I JVM requested System dump using '/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_testList_1/aqa-tests/TKG/output_16851539523513/cmdLineTester_criu_nonPortableRestore_5/core.20230527.022430.16004.0001.dmp' in response to an event
[2023-05-27T02:24:38.647Z]  [OUT] JVMDUMP010I System dump written to /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_testList_1/aqa-tests/TKG/output_16851539523513/cmdLineTester_criu_nonPortableRestore_5/core.20230527.022430.16004.0001.dmp
[2023-05-27T02:24:38.647Z]  [OUT] JVMDUMP032I JVM requested Java dump using '/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_testList_1/aqa-tests/TKG/output_16851539523513/cmdLineTester_criu_nonPortableRestore_5/javacore.20230527.022430.16004.0002.txt' in response to an event
[2023-05-27T02:24:38.647Z]  [OUT] JVMDUMP012E Error in Java dump: /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_testList_1/aqa-tests/TKG/output_16851539523513/cmdLineTester_criu_nonPortableRestore_5/javacore.20230527.022430.16004.0002.txt
[2023-05-27T02:24:38.647Z]  [OUT] JVMDUMP032I JVM requested Snap dump using '/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_testList_1/aqa-tests/TKG/output_16851539523513/cmdLineTester_criu_nonPortableRestore_5/Snap.20230527.022430.16004.0003.trc' in response to an event
[2023-05-27T02:24:38.647Z]  [OUT] JVMDUMP010I Snap dump written to /home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_s390x_linux_testList_1/aqa-tests/TKG/output_16851539523513/cmdLineTester_criu_nonPortableRestore_5/Snap.20230527.022430.16004.0003.trc
[2023-05-27T02:24:38.647Z]  [OUT] JVMDUMP013I Processed dump event "traceassert", detail "".
[2023-05-27T02:24:38.647Z]  [OUT] Removed testOutput file
[2023-05-27T02:24:38.647Z]  [OUT] finished script
[2023-05-27T02:24:38.647Z]  [ERR] Error (criu/protobuf.c:72): Unexpected EOF on (empty-image)
[2023-05-27T02:24:38.647Z]  [ERR] Error (criu/protobuf.c:72): Unexpected EOF on (empty-image)
[2023-05-27T02:24:38.647Z]  [ERR] Error (criu/protobuf.c:72): Unexpected EOF on (empty-image)
[2023-05-27T02:24:38.647Z] >> Success condition was not found: [Output match: Killed]
[2023-05-27T02:24:38.647Z] >> Required condition was found: [Output match: System.nanoTime() before CRIU checkpoint:]
[2023-05-27T02:24:38.647Z] >> Success condition was not found: [Output match: PASSED: System.nanoTime() after CRIU restore:]
[2023-05-27T02:24:38.647Z] >> Failure condition was not found: [Output match: CRIU is not enabled]
[2023-05-27T02:24:38.647Z] >> Failure condition was not found: [Output match: Operation not permitted]
[2023-05-27T02:24:38.647Z] >> Failure condition was not found: [Output match: FAILED: System.nanoTime() after CRIU restore:]
[2023-05-27T02:24:38.647Z] >> Success condition was not found: [Output match: Thread pid mismatch]
[2023-05-27T02:24:38.647Z] >> Success condition was not found: [Output match: do not match expected]
[2023-05-27T02:24:38.647Z] >> Success condition was not found: [Output match: Unable to create a thread:]
[2023-05-27T02:24:38.647Z] >> Failure condition was not found: [Output match: Could not dump the JVM processes, err=-70]

[2023-05-27T02:27:15.669Z] ---TEST RESULTS---
[2023-05-27T02:27:15.669Z] Number of PASSED tests: 40 out of 41
[2023-05-27T02:27:15.669Z] Number of FAILED tests: 1 out of 41
[2023-05-27T02:27:15.669Z] 
[2023-05-27T02:27:15.669Z] ---SUMMARY OF FAILED TESTS---
[2023-05-27T02:27:15.669Z] Create CRIU checkpoint image and restore three times - testSystemNanoTimeJitPostCheckpointCompile
[2023-05-27T02:27:15.669Z] -----------------------------
[2023-05-27T02:27:15.669Z] 
[2023-05-27T02:27:15.669Z] -----------------------------------
[2023-05-27T02:27:15.669Z] cmdLineTester_criu_nonPortableRestore_5_FAILED

50x internal grinder - 1/50 failed

FYI @tajila

dmitripivkine commented 1 year ago

@RSalman Please take a look.

RSalman commented 1 year ago

There's a GC triggered while in the middle of shutting down GC threads during checkpoint. Dispatcher is attempting to use threads that are being shutdown. The code was written with assumption that a GC wouldn't be triggered during checkpoint. There was a suspension this could occur while discussing https://github.com/eclipse-openj9/openj9/pull/16653 (had discussion offline with @amicic @dmitripivkine). A potential workaround is to force single threaded task dispatch during thread pool contract, similar to what happens when there's GC during VM shutdown.

02:24:30.223895439 *0x0000000000000000 j9mm.51             Event       SystemGC end: newspace=124632/2359296 oldspace=5897000/6291456 loa=315392/315392
02:24:30.223898584 *0x0000000002092500 j9vm.372            Entry      >Releasing exclusive VM Access
02:24:30.223899749  0x0000000002092500 j9vm.375            Event       Exclusive VM Access queue is empty, resetting exclusive access state and notifying all halted threads. Changing exclusiveAccessState to J9_XACCESS_NONE.
02:24:30.223935814  0x0000000002092500 j9vm.376            Exit       <Released exclusive VM Access
02:24:30.224487049  0x0000000002092500 j9mm.771            Entry      >contractThreadPool Entry: gcThreadCount: 64, requested newThreadCount: 1
02:24:30.224487397  0x0000000002092500 j9mm.772            Event       Attempt to shutdown GC threads
02:24:30.228115489 *0x0000000002220400 j9vm.544            Event       JNIinv DetachCurrentThread
02:24:30.228123206 *0x000000000222c100 j9vm.544            Event       JNIinv DetachCurrentThread
02:24:30.228126662 *0x0000000002230c00 j9vm.544            Event       JNIinv DetachCurrentThread
02:24:30.228129556 *0x0000000002233200 j9vm.544            Event       JNIinv DetachCurrentThread
02:24:30.228133709 *0x0000000002262400 j9vm.544            Event       JNIinv DetachCurrentThread
02:24:30.228143352 *0x0000000002229b00 j9vm.544            Event       JNIinv DetachCurrentThread
02:24:30.228159806 *0x00000000021f3700 j9vm.544            Event       JNIinv DetachCurrentThread
02:24:30.228175144 *0x00000000021ea000 j9vm.544            Event       JNIinv DetachCurrentThread
02:24:30.228175692 *0x0000000002258d00 j9vm.544            Event       JNIinv DetachCurrentThread
02:24:30.228180082 *0x00000000021f1200 j9vm.544            Event       JNIinv DetachCurrentThread
02:24:30.228184674 *0x0000000002208b00 j9vm.544            Event       JNIinv DetachCurrentThread
02:24:30.228188448 *0x00000000021f8300 j9vm.544            Event       JNIinv DetachCurrentThread
02:24:30.229465075 *0x000000000225fe00 j9mm.107            Assert     * ** ASSERTION FAILED ** at ../../../../../../omr/gc/base/ParallelDispatcher.cpp:176: ((false && ((worker_status_reserved == _statusTable[workerID]) || ((0 == _threadsToReserve) && (worker_status_dying == _statusTable[workerID])))))
> !mm_paralleldispatcher 0x000003FF8005F5C0
MM_ParallelDispatcher at 0x3ff8005f5c0 {
  Fields for MM_Base:
  Fields for MM_BaseVirtual:
    0x8: const U8* _typeId = !j9x 0x000003FF8447F3EA // "MM_ParallelDispatcher"
  Fields for MM_ParallelDispatcher:
    0x10: class MM_Task* _task = !mm_parallelscavengetask 0x000003FF7FFFCD10
...
    0x20: U64 _threadShutdownCount = 0x0000000000000020 (32)
    0x28: struct J9Thread** _threadTable = !j9x 0x000003FF8005F6B0
    0x30: U64* _statusTable = !j9x 0x000003FF8005F900
    0x38: void** _taskTable = !j9x 0x000003FF8005FB50
... 0x58: bool _workerThreadsReservedForGC = true
    0x59: bool _inShutdown = true
    0x60: U64 _threadCountMaximum = 0x0000000000000040 (64)
    0x68: U64 _threadCount = 0x0000000000000040 (64)
    0x70: U64 _activeThreadCount = 0x0000000000000040 (64)
    0x78: U64 _threadsToReserve = 0x000000000000003F (63)
..  0x98: U64 _poolMaxCapacity = 0x0000000000000040 (64)
}
RSalman commented 1 year ago

I suppose JNIinv DetachCurrentThread is triggering a GC?

RSalman commented 1 year ago

@JasonFengJ9 any idea how to avoid these during grinder runs:

10:30:32   [OUT] Exception in thread "main" org.eclipse.openj9.criu.SystemCheckpointException: The JVM attempted to load libcriu.so but was unable to: 1
10:30:32   [OUT]    at openj9.criu/org.eclipse.openj9.criu.CRIUSupport.checkpointJVMImpl(Native Method)
10:30:32   [OUT]    at openj9.criu/org.eclipse.openj9.criu.CRIUSupport.checkpointJVM(CRIUSupport.java:658)
10:30:32   [OUT]    at org.openj9.criu.CRIUTestUtils.checkPointJVM(CRIUTestUtils.java:77)
10:30:32   [OUT]    at org.openj9.criu.CRIUTestUtils.checkPointJVM(CRIUTestUtils.java:65)
10:30:32   [OUT]    at org.openj9.criu.TimeChangeTest.testSystemNanoTimeJitPreCheckpointCompile(TimeChangeTest.java:106)
10:30:32   [OUT]    at org.openj9.criu.TimeChangeTest.main(TimeChangeTest.java:52)
JasonFengJ9 commented 1 year ago

any idea how to avoid these during grinder runs: The JVM attempted to load libcriu.so but was unable to: 1

This error is a known issue (I think it is machine setup related)

My workaround is to re-launch with more machines, or at the same machine where the failure was reported initinally.

tajila commented 1 year ago

@JasonFengJ9 any idea how to avoid these during grinder runs:

should only happen when the machine doesnt have criu installed

tajila commented 1 year ago

I suppose JNIinv DetachCurrentThread is triggering a GC?

Possibly, I've always wondered whether GC threads need to have java objects. I think thats the issue. Anything with a java object needs to run the thread cleanup code. As soon as you run java code there is always a chance a GC can be triggered.

RSalman commented 1 year ago

@JasonFengJ9 could you please try out this fix and see if the issue shows up: https://github.com/eclipse/omr/pull/7012. Seems like you had some more luck with the grinder, I'm not getting any reproducibility.

RSalman commented 1 year ago

Fairly confident this change fixes the issue. I've ran some grinders, no issues with the fix... but also no issue without the fix. So I can't defiantly say this fixes the issue.

JasonFengJ9 commented 1 year ago

could you please try out the fix and see if the issue shows up: https://github.com/eclipse/omr/pull/7012. Seems like you had some more luck with the grinder, I'm not getting any reproducibility.

Missed this notification, will give it a try.

JasonFengJ9 commented 1 year ago

Not able to reproduce the failure 200x grinder with https://github.com/eclipse/omr/pull/7012, and 200x grinder using a recent nightly build In addition, this assertion didn't appear in recent nightly builds either.

@RSalman I think https://github.com/eclipse/omr/pull/7012 can be merged. This issue could be re-opened if the failure occurs again.