Closed pshipton closed 8 months ago
@TobiAjila pls help direct this.
https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.functional_aarch64_linux_OMR_testList_1/433
cmdLineTester_criu_nonPortableRestore_5 -Xgcpolicy:optthruput
Testing: Create and Restore Criu Checkpoint Image once - MethodTypeDeadlockTest
Test start time: 2024/01/24 05:58:21 Eastern Standard Time
Running command: bash /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_OMR_testList_1/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu/criuScript.sh /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_OMR_testList_1/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_OMR_testList_1/jdkbinary/j2sdk-image/bin/java " -Xgcpolicy:optthruput -XX:+ThrowOnDelayedCheckpointOperation -XX:sleepMillisecondsForNotCheckpointSafe=20 -Xtrace:print=j9vm.731 --add-opens java.base/jdk.internal.misc=ALL-UNNAMED --add-exports java.base/openj9.internal.criu=ALL-UNNAMED" org.openj9.criu.DeadlockTest MethodTypeDeadlockTest 1
Time spent starting: 4 milliseconds
Time spent executing: 3499 milliseconds
Test result: FAILED
Output from test:
[OUT] start running script
[OUT] export GLIBC_TUNABLES=glibc.cpu.hwcaps=-XSAVEC,-XSAVE,-AVX2,-ERMS,-AVX,-AVX_Fast_Unaligned_Load
[OUT] export LD_BIND_NOT=on
[OUT] /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_OMR_testList_1/jdkbinary/j2sdk-image/bin/java -XX:+EnableCRIUSupport -Xgcpolicy:optthruput -XX:+ThrowOnDelayedCheckpointOperation -XX:sleepMillisecondsForNotCheckpointSafe=20 -Xtrace:print=j9vm.731 --add-opens java.base/jdk.internal.misc=ALL-UNNAMED --add-exports java.base/openj9.internal.criu=ALL-UNNAMED -cp /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_OMR_testList_1/aqa-tests/TKG/../../jvmtest/functional/cmdLineTests/criu/criu.jar org.openj9.criu.DeadlockTest MethodTypeDeadlockTest 1
[OUT] Pre-checkpoint
[OUT] Unhandled exception
[OUT] Type=Segmentation error vmState=0x00000000
[OUT] J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
[OUT] Handler1=0000FFFF9D782F44 Handler2=0000FFFF9D6E7B60 InaccessibleAddress=0000412474736580
[OUT] R0=0000000089383CC0 R1=0000000000000008 R2=0000000000000000 R3=0000000000000004
[OUT] R4=0000FFFF4E87DAA0 R5=0000000000000000 R6=0000FFFF9D091B08 R7=FFFFFF80FFFFFFE0
[OUT] R8=0000FFFF4E87DC80 R9=0000FFFF4E87DC80 R10=0000FFFF4E87DC60 R11=0000000000000011
[OUT] R12=0000FFFF4E87DC80 R13=0000FFFF4E87DC80 R14=0000000000048108 R15=00000000890B06E8
[OUT] R16=0000FFFF97F700B0 R17=0000FFFF9DDD5480 R18=0000FFFF4E87E300 R19=0000000000208468
[OUT] R20=0000FFFF9D91FB40 R21=0000000000000000 R22=0000FFFF9D973810 R23=0000FFFF4E87DC80
[OUT] R24=0000FFFF9D08F000 R25=00000000FFFFFFE0 R26=0000412474736554 R27=0000000000000004
[OUT] R28=0000FFFF9DCC1F90 R29=0000FFFF4E87D870 R30=0000FFFF9D051258 R31=0000FFFF4E87D870
[OUT] PC=0000FFFF9D04F2EC SP=0000FFFF4E87D870 PSTATE=0000000020001000
[OUT] V0 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[OUT] V1 007200650073002e (f: 7536686.000000, d: 1.602190e-306)
[OUT] V2 004100540045004d (f: 4522061.000000, d: 1.891455e-307)
[OUT] V3 000000490000002d (f: 45.000000, d: 1.549057e-312)
[OUT] V4 9728f76acff8b19e (f: 3489182208.000000, d: -4.174938e-197)
[OUT] V5 00000000a64fe1c0 (f: 2790253056.000000, d: 1.378568e-314)
[OUT] V6 94446f0194446f01 (f: 2487512832.000000, d: -4.855772e-211)
[OUT] V7 000e1781000e1781 (f: 923521.000000, d: 1.959708e-308)
[OUT] V8 000003c10000745f (f: 29791.000000, d: 2.039238e-311)
[OUT] V9 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[OUT] V10 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[OUT] V11 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[OUT] V12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[OUT] V13 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[OUT] V14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[OUT] V15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[OUT] V16 0000000044000000 (f: 1140850688.000000, d: 5.636551e-315)
[OUT] V17 0000000400000000 (f: 0.000000, d: 8.487983e-314)
[OUT] V18 0000000000000400 (f: 1024.000000, d: 5.059232e-321)
[OUT] V19 3f9eb851eb851eb8 (f: 3951369984.000000, d: 3.000000e-02)
[OUT] V20 3fb1eb851eb851ec (f: 515396064.000000, d: 7.000000e-02)
[OUT] V21 0000000000000008 (f: 8.000000, d: 3.952525e-323)
[OUT] V22 3f0000003f800000 (f: 1065353216.000000, d: 3.051759e-05)
[OUT] V23 3fc999999999999a (f: 2576980480.000000, d: 2.000000e-01)
[OUT] V24 3fd6666666666666 (f: 1717986944.000000, d: 3.500000e-01)
[OUT] V25 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[OUT] V26 0000000000000001 (f: 1.000000, d: 4.940656e-324)
[OUT] V27 000000000000000a (f: 10.000000, d: 4.940656e-323)
[OUT] V28 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[OUT] V29 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[OUT] V30 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[OUT] V31 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[OUT] Module=/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_OMR_testList_1/jdkbinary/j2sdk-image/lib/default/libj9trc29.so
[OUT] Module_base_address=0000FFFF9D040000
[OUT] Target=2_90_20240124_448 (Linux 4.18.0-348.7.1.el8_5.aarch64)
[OUT] CPU=aarch64 (8 logical CPUs) (0x1dc1b0000 RAM)
[OUT] ----------- Stack Backtrace -----------
[OUT] traceV+0xdc (0x0000FFFF9D04F2EC [libj9trc29.so+0xf2ec])
[OUT] doTracePoint+0x768 (0x0000FFFF9D051258 [libj9trc29.so+0x11258])
[OUT] omrTrace+0x6c (0x0000FFFF9D0517BC [libj9trc29.so+0x117bc])
[OUT] walkStackFrames+0x94 (0x0000FFFF9D7C3014 [libj9vm29.so+0x73014])
[OUT] Java_java_security_AccessController_getAccSnapshot+0x140 (0x0000FFFF97EE8DF0 [libjclse29.so+0x18df0])
[OUT] (0x0000FFFF7CFC34D0 [<unknown>+0x0])
[OUT] ---------------------------------------
[OUT] JVMDUMP039I Processing dump event "gpf", detail "" at 2024/01/24 05:58:21 - please wait.
[OUT] JVMDUMP032I JVM requested System dump using '/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_OMR_testList_1/aqa-tests/TKG/output_17060928875291/cmdLineTester_criu_nonPortableRestore_5/core.20240124.055821.2624260.0001.dmp' in response to an event
[OUT] JVMDUMP010I System dump written to /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_OMR_testList_1/aqa-tests/TKG/output_17060928875291/cmdLineTester_criu_nonPortableRestore_5/core.20240124.055821.2624260.0001.dmp
[OUT] JVMDUMP032I JVM requested Java dump using '/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_OMR_testList_1/aqa-tests/TKG/output_17060928875291/cmdLineTester_criu_nonPortableRestore_5/javacore.20240124.055821.2624260.0002.txt' in response to an event
[OUT] Performing CRIUSupport.checkpointJVM(), current thread name: main, Wed Jan 24 05:58:21 EST 2024, System.currentTimeMillis(): 1706093901922, System.nanoTime(): 1801422638095851
[OUT] JVMDUMP010I Java dump written to /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_OMR_testList_1/aqa-tests/TKG/output_17060928875291/cmdLineTester_criu_nonPortableRestore_5/javacore.20240124.055821.2624260.0002.txt
[OUT] JVMDUMP032I JVM requested Snap dump using '/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_OMR_testList_1/aqa-tests/TKG/output_17060928875291/cmdLineTester_criu_nonPortableRestore_5/Snap.20240124.055821.2624260.0003.trc' in response to an event
[OUT] JVMDUMP010I Snap dump written to /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_OMR_testList_1/aqa-tests/TKG/output_17060928875291/cmdLineTester_criu_nonPortableRestore_5/Snap.20240124.055821.2624260.0003.trc
[OUT] JVMDUMP032I JVM requested JIT dump using '/home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_OMR_testList_1/aqa-tests/TKG/output_17060928875291/cmdLineTester_criu_nonPortableRestore_5/jitdump.20240124.055821.2624260.0004.dmp' in response to an event
[OUT] JVMDUMP051I JIT dump occurred in 'Thread-5' thread 0x0000000000208400
[OUT] JVMDUMP010I JIT dump written to /home/jenkins/workspace/Test_openjdk17_j9_sanity.functional_aarch64_linux_OMR_testList_1/aqa-tests/TKG/output_17060928875291/cmdLineTester_criu_nonPortableRestore_5/jitdump.20240124.055821.2624260.0004.dmp
[OUT] JVMDUMP013I Processed dump event "gpf", detail "".
[OUT] Error (criu/protobuf.c:72): Unexpected EOF on (empty-image)
[OUT] Removed test output files
[OUT] finished script
>> Success condition was not found: [Output match: User requested Java dump using]
>> Success condition was not found: [Output match: TEST PASSED]
>> Failure condition was not found: [Output match: TEST FAILED]
>> Required condition was found: [Output match: Pre-checkpoint]
>> Success condition was not found: [Output match: Checkpoint blocked because thread]
>> Required condition was not found: [Output match: Killed]
>> Failure condition was not found: [Output match: CRIU is not enabled]
>> Failure condition was not found: [Output match: Operation not permitted]
>> Success condition was not found: [Output match: Thread pid mismatch]
>> Success condition was not found: [Output match: do not match expected]
>> Success condition was not found: [Output match: Unable to create a thread:]
>> Failure condition was not found: [Output match: TEST FAILED]
>> Failure condition was not found: [Output match: Could not dump the JVM processes, err=-70]
This may be a regression.
@ThanHenderson can you please take a look at this
I reproduced a segmentation fault for cmdLineTester_criu_nonPortableRestore_8
on my personal x86-64_linux
fyre dev machine. It is very intermittent with an observed 2/1100
failures. These failures had different stack traces than what is reported here.
For cmdLineTester_criu_nonPortableRestore_5
, I tried running 1000 iterations twice, both times an iteration hung before completing all iterations.
I'll investigate to see if this is indeed a recent regression or just such an intermittent problem that it hadn't shown up for a while.
This is an issue with the tracing mechanism, and doesn't show up when using -Xtrace:none
(passed >10,000 iterations successfully). I'm unsure exactly under which context it is happening, but I've narrowed it down to that. I'll have more insight later today.
This is what a normal UtThreadData
looks like
UtThreadData at 0x7fffcc004eb0 {
Fields for UtThreadData:
0x0: struct UtDataHeader header = !utdataheader 0x00007FFFCC004EB0
0x10: const void* id = !j9x 0x000000000001E600
0x18: const void* synonym1 = !j9x 0x00007FFFF00D2E90
0x20: const void* synonym2 = !j9x 0x000000000001F020
0x28: const U8* name = !j9x 0x00007FFFCC004F50 // "(unnamed thread)"
0x30: U8 currentOutputMask = 0x02 (2)
0x38: struct UtTraceBuffer* trcBuf = !uttracebuffer 0x0000000000000000
0x40: void* external = !j9x 0x0000000000000000
0x48: I32 suspendResume = 0x00000000 (0)
0x4c: I32 recursion = 0x00000001 (1)
0x50: I32 indent = 0x00000000 (0)
}
But in the failing cases the UtThreadData
is initialized as the first argument here https://github.com/eclipse-openj9/openj9/blob/5d03ee9d07d21881fae750fa5c69565325ffa364/runtime/rastrace/trclog.c#L2140
produces garbage data, similar to what is seen below, which causes problems at various points downstream in the tracing pipeline
UtThreadData at 0x7ffde81b0 {
Fields for UtThreadData:
0x0: struct UtDataHeader header = !utdataheader 0x00000007FFDE81B0
0x10: const void* id = !j9x 0x07130003000A1700
0x18: const void* synonym1 = !j9x 0x0100011500071400
0x20: const void* synonym2 = !j9x 0x0600014901000178
0x28: const U8* name = !j9x 0x00013E74696E693C
0x30: U8 currentOutputMask = 0x03 (3)
0x38: struct UtTraceBuffer* trcBuf = !uttracebuffer 0x694C0F000165646F
0x40: void* external = !j9x 0x7265626D754E656E
0x48: I32 suspendResume = 0x6C626154 (1818386772)
0x4c: I32 recursion = 0x12000165 (301990245)
0x50: I32 indent = 0x61636F4C (1633906508)
}
After a long period of testing and a pseudo-binary search of commits not included in 0.43 (which doesn't crash), I've identified https://github.com/eclipse-openj9/openj9/pull/18745 as the commit that introduces this regression.
After a long period of testing and a pseudo-binary search of commits not included in 0.43 (which doesn't crash), I've identified https://github.com/eclipse-openj9/openj9/pull/18745 as the commit that introduces this regression.
I will have a look it.
After a long period of testing and a pseudo-binary search of commits not included in 0.43 (which doesn't crash), I've identified #18745 as the commit that introduces this regression.
This is actually not the case. After that commit, they just seem to happen more frequently. Running the commit before for ~50,000 iterations I observed the failure 4 times.
The MethodTypeDeadlockTest
crash might be related to the two UT_MODULE_LOADED(J9_UTINTERFACE_FROM_VM(vm))
calls within j9vm
module.
One is https://github.com/eclipse-openj9/openj9/blob/43f37078c4fc4229e6a42e19893ffc080c61f90a/runtime/vm/jvminit.c#L2878 The other is https://github.com/eclipse-openj9/openj9/blob/43f37078c4fc4229e6a42e19893ffc080c61f90a/runtime/vm/CRIUHelpers.cpp#L903 which should be removed at
I was able to reproduce the crash in a fyre x64 Linux image ~10/1000 runs, removing the UT_MODULE_LOADED()
call within CRIUHelpers.cpp
hasn't crashed in 2000+ runs.
@ThanHenderson could you try a build in your test environment?
https://openj9-jenkins.osuosl.org/job/Test_openjdk11_j9_sanity.functional_aarch64_linux_Nightly_testList_0/588/ cmdLineTester_criu_nonPortableRestore_8
-Xgcpolicy:gencon -Xgcthreads1
https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk11_j9_sanity.functional_aarch64_linux_Nightly_testList_0/588/functional_test_output.tar.gz