Open Mesbah-Alam opened 3 years ago
Started a 5x grinder at internal Jenkins, Grinder/14023/
@zl-wang fyi
interesting ... crash in replaceTrampoline
@zl-wang : Please note, the problem was reproduced 3/5 times in hyc-runtimes-jenkins/Grinder/14023 .
@zl-wang : Just an FYI..
Another JIT crash is observed while running the same test in Aarch64 as well :
https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder_SM/43/tapResults/
Core file: https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder_SM/43/artifact/system_test_output.tar.gz
Please advise if this requires a different issue.
Error details :
LT stderr Unhandled exception
LT stderr Type=Segmentation error vmState=0x00000000
LT stderr J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000001
LT stderr Handler1=0000FFFF9AE9C8C0 Handler2=0000FFFF9AE143F0 InaccessibleAddress=0000000000000020
LT stderr R0=0000FFFF76FC64D0 R1=0000FFFF76FC64E0 R2=0000FFFF700A1860 R3=0000FFFF70170BC0
LT stderr R4=0000FFFF81AA6D7C R5=0000000000000000 R6=0000000000000000 R7=0000FFFF9AB077E0
LT stderr R8=0000FFFF75859F30 R9=FFFFFFFFFFFFFFB0 R10=0000FFFF81AA6D7C R11=0000FFFF76ED5E5C
LT stderr R12=7420303D65746975 R13=0000000000000038 R14=000000000000003F R15=0000FFFF4F5F5520
LT stderr R16=0000FFFF9ADEA118 R17=0000FFFF9BCEB428 R18=0000000000000001 R19=0000FFFF76FC64D0
LT stderr R20=0000FFFF700A1860 R21=0000000000000000 R22=0000FFFF81AA6D7C R23=0000000000000000
LT stderr R24=0000FFFF81AA6D80 R25=0000000000000000 R26=000000068065C8E8 R27=0000000000000000
LT stderr R28=0000000000000009 R29=0000000000000000 R30=0000FFFF9A86E254 R31=0000FFFF4F5F5230
LT stderr PC=0000FFFF9A86E258 SP=0000FFFF4F5F5230 PSTATE=0000000020000000
LT stderr V0 0000000000000005 (f: 5.000000, d: 2.470328e-323)
LT stderr V1 000000003f800000 (f: 1065353216.000000, d: 5.263544e-315)
LT stderr V2 401ffd8ad3e009d4 (f: 3554675200.000000, d: 7.997600e+00)
LT stderr V3 3fe62e42fee00000 (f: 4276092928.000000, d: 6.931472e-01)
LT stderr V4 3d60d81cd340e21c (f: 3544244736.000000, d: 4.787407e-13)
LT stderr V5 3ea737781f0259fa (f: 520247808.000000, d: 6.919109e-07)
LT stderr V6 40162e42fee00000 (f: 4276092928.000000, d: 5.545177e+00)
LT stderr V7 3fe5555555555593 (f: 1431655808.000000, d: 6.666667e-01)
LT stderr V8 0000000000000000 (f: 0.000000, d: 0.000000e+00)
LT stderr V9 3f50624dd2f1a9fc (f: 3539053056.000000, d: 1.000000e-03)
LT stderr V10 0000000000000000 (f: 0.000000, d: 0.000000e+00)
LT stderr V11 0000000000000000 (f: 0.000000, d: 0.000000e+00)
LT stderr V12 0000000000000000 (f: 0.000000, d: 0.000000e+00)
LT stderr V13 0000000000000000 (f: 0.000000, d: 0.000000e+00)
LT stderr V14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
LT stderr V15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
LT stderr V16 3fd999999997fa04 (f: 2576873984.000000, d: 4.000000e-01)
LT stderr V17 3fd2492494229359 (f: 2485293824.000000, d: 2.857143e-01)
LT stderr V18 3fcc71c51d8e78af (f: 495876256.000000, d: 2.222220e-01)
LT stderr V19 3fc7466496cb03de (f: 2529887232.000000, d: 1.818357e-01)
LT stderr V20 0000ffff4f5f53d0 (f: 1331647488.000000, d: 1.390657e-309)
LT stderr V21 0000ffff4f5f53c0 (f: 1331647488.000000, d: 1.390657e-309)
LT stderr V22 0000ffff4f5f53b0 (f: 1331647360.000000, d: 1.390657e-309)
LT stderr V23 00000000000000f0 (f: 240.000000, d: 1.185758e-321)
LT stderr V24 0000000000000000 (f: 0.000000, d: 0.000000e+00)
LT stderr V25 0000000000000000 (f: 0.000000, d: 0.000000e+00)
LT stderr V26 0000000000000000 (f: 0.000000, d: 0.000000e+00)
LT stderr V27 0000000000000000 (f: 0.000000, d: 0.000000e+00)
LT stderr V28 3ff0000000000000 (f: 0.000000, d: 1.000000e+00)
LT stderr V29 0000000000000000 (f: 0.000000, d: 0.000000e+00)
LT stderr V30 31f54fdd7f73bf3b (f: 2138291968.000000, d: 4.940656e-68)
LT stderr V31 000000007f800000 (f: 2139095040.000000, d: 1.056853e-314)
LT stderr Module=/home/jenkins/workspace/Grinder_SM/openjdkbinary/j2sdk-image/lib/default/libj9jit29.so
LT stderr Module_base_address=0000FFFF9A161000
LT stderr Target=2_90_20210302_677 (Linux 4.15.0-34-generic)
LT stderr CPU=aarch64 (64 logical CPUs) (0x1f699a1000 RAM)
LT stderr ----------- Stack Backtrace -----------
LT stderr (0x0000FFFF9AE11048 [libj9prt29.so+0x26048])
LT stderr ---------------------------------------
LT stderr JVMDUMP039I Processing dump event "gpf", detail "" at 2021/03/04 02:10:45 - please wait.
LT stderr JVMDUMP032I JVM requested System dump using '/home/jenkins/workspace/Grinder_SM/openjdk-tests/TKG/output_16148166009214/system_custom_0/20210304-001000-MixedLoadTest/results/core.20210304.021045.8957.0001.dmp' in response to an event
LT 02:10:47.353 - Thread completed. Suite=0 thread=1
LT 02:10:51.124 - Completed 100.0%. Number of tests started=1028103 (+1546)
LT 02:10:52.629 - Thread completed. Suite=0 thread=9
LT stderr JVMDUMP010I System dump written to /home/jenkins/workspace/Grinder_SM/openjdk-tests/TKG/output_16148166009214/system_custom_0/20210304-001000-MixedLoadTest/results/core.20210304.021045.8957.0001.dmp
LT stderr JVMDUMP032I JVM requested Java dump using '/home/jenkins/workspace/Grinder_SM/openjdk-tests/TKG/output_16148166009214/system_custom_0/20210304-001000-MixedLoadTest/results/javacore.20210304.021045.8957.0002.txt' in response to an event
STF 02:10:55.692 - Found dump at: /home/jenkins/workspace/Grinder_SM/openjdk-tests/TKG/output_16148166009214/system_custom_0/20210304-001000-MixedLoadTest/results/javacore.20210304.021045.8957.0002.txt
STF 02:10:55.694 - Found dump at: /home/jenkins/workspace/Grinder_SM/openjdk-tests/TKG/output_16148166009214/system_custom_0/20210304-001000-MixedLoadTest/results/core.20210304.021045.8957.0001.dmp
LT stderr javacore file generated - /home/jenkins/workspace/Grinder_SM/openjdk-tests/TKG/output_16148166009214/system_custom_0/20210304-001000-MixedLoadTest/results/javacore.20210304.021045.8957.0002.txt
LT stderr core file generated - /home/jenkins/workspace/Grinder_SM/openjdk-tests/TKG/output_16148166009214/system_custom_0/20210304-001000-MixedLoadTest/results/core.20210304.021045.8957.0001.dmp
LT 02:11:03.601 - Thread completed. Suite=0 thread=4
LT 02:11:08.099 - Thread completed. Suite=0 thread=3
LT stderr JVMDUMP010I Java dump written to /home/jenkins/workspace/Grinder_SM/openjdk-tests/TKG/output_16148166009214/system_custom_0/20210304-001000-MixedLoadTest/results/javacore.20210304.021045.8957.0002.txt
LT stderr JVMDUMP032I JVM requested Snap dump using '/home/jenkins/workspace/Grinder_SM/openjdk-tests/TKG/output_16148166009214/system_custom_0/20210304-001000-MixedLoadTest/results/Snap.20210304.021045.8957.0003.trc' in response to an event
LT stderr JVMDUMP010I Snap dump written to /home/jenkins/workspace/Grinder_SM/openjdk-tests/TKG/output_16148166009214/system_custom_0/20210304-001000-MixedLoadTest/results/Snap.20210304.021045.8957.0003.trc
LT stderr JVMDUMP032I JVM requested JIT dump using '/home/jenkins/workspace/Grinder_SM/openjdk-tests/TKG/output_16148166009214/system_custom_0/20210304-001000-MixedLoadTest/results/jitdump.20210304.021045.8957.0004.dmp' in response to an event
LT stderr JVMDUMP013I Processed dump event "gpf", detail "".
STF 02:11:09.232 - Found dump at: /home/jenkins/workspace/Grinder_SM/openjdk-tests/TKG/output_16148166009214/system_custom_0/20210304-001000-MixedLoadTest/results/Snap.20210304.021045.8957.0003.trc
Separate issue please ... I didn't see this one's back trace, and thus not sure if it crashed at similar place.
Core: https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder_SM/38/artifact/system_test_output.tar.gz
There's no corefile in here. From the test output I see: https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder_SM/38/consoleText
[2021-03-03T19:01:59.906Z] LT stderr JVMDUMP032I JVM requested System dump using '/home/jenkins/workspace/Grinder_SM/openjdk-tests/TKG/output_16147980324343/system_custom_0/20210303-130033-MixedLoadTest/results/core.20210303.130158.30147044.0001.dmp' in response to an event
[2021-03-03T19:01:59.906Z] LT stderr JVMDUMP012E Error in System dump: cannot find core file: "A file or directory in the path name does not exist.". check "ulimit -Hc" is set high enough
So it seems like there's a problem with the machine config that is preventing the generation of a corefile.
@IBMJimmyk :
You are right, the core file was missing in that zip.
Could you please visit the internal Grinder/14023/console ?
If you scroll down to the bottom, you will see a link that ends with Grinder/14023/system_test_output.tar.gz
, that will contain the core files of this defect too.
The problem is occuring due to a crash inside of OMR::CodeCache::replaceTrampoline
. Near the start of replaceTrampoline
is a lookup into _resolvedMethodHT
:
https://github.com/eclipse/omr/blob/165a4a81709504806f14bd8e540d97962cafe5f5/compiler/runtime/OMRCodeCache.cpp#L614
The method can't be found in _resolvedMethodHT
so NULL is returned and stored to entry
. A bit later on there are load/stores after dereferencing entry
which causes the crash.
The method is supposed to get added to _resolvedMethodHT
from inside J9::CodeCache::resolveHashEntry
:
https://github.com/eclipse/openj9/blob/e749042001fcb7e9b20f74d20e7f85a12538d476/runtime/compiler/runtime/J9CodeCache.cpp#L299
But, this never happens. This reason is that J9::CodeCache::adjustTrampolineReservation
doesn't call resolveHashEntry
:
https://github.com/eclipse/openj9/blob/e749042001fcb7e9b20f74d20e7f85a12538d476/runtime/compiler/runtime/J9CodeCache.cpp#L569-L587
If you look at those lines resolveHashEntry
is called if the method is found in _unresolvedMethodHT
but is not inside _resolvedMethodHT
. As it turns out, the lookup into _unresolvedMethodHT
also can't find the entry. But this turns out to be a mistake due to a slight difference in how the method was added to the hashtable and how the lookup was performed.
The lookup inside adjustTrampolineReservation
uses the the Constant Pool pointer cp
and the Constant Pool Index cpIndex
to do the look up. cpIndex
has the type int32_t
but it was orginally passed in as an unsigned 18-bit value. That's well within the positive range of an int32_t
. So here, cpIndex
is positive.
Example values:
cp = 0x1814EC0
cpIndex = 0x20000 // decimal value of 131072
key = 0x3029d800000 // generated by running a hash function on the cp and cpIndex
The problem is that when the entry was added to _unresolvedMethodHT
, a negative cpIndex
was used.
The cpIndex
to create the entry is read from the symref here:
https://github.com/eclipse/openj9/blob/e749042001fcb7e9b20f74d20e7f85a12538d476/runtime/compiler/env/VMJ9.cpp#L5405
The cpIndex
is stored as an signed 18-bit bit field. 0x20000
in binary is 10 0000 0000 0000 0000
. This means that 0x20000
is interpreted as -131072 and sign extended when cast to a I_32
.
The cp
and cpIndex
are passed to J9::CodeCache::reserveUnresolvedTrampoline
to create the entry in _unresolvedMethodHT
:
https://github.com/eclipse/openj9/blob/e749042001fcb7e9b20f74d20e7f85a12538d476/runtime/compiler/runtime/J9CodeCache.cpp#L670
Example values:
cp = 0x1814EC0
cpIndex = 0xFFFE0000 // decimal value of -131072
key = 0xfffffcfd62800000 // generated by running a hash function on the cp and cpIndex
The key is different. Due the original 18-bit cpIndex
being sign extended when used to add the entry and being zero extended when used to look up the entry, the entry can not be found.
I am currently looking at making a fix where the sign on cpIndex
is consistent for all access (add, delete, lookup) to the hashtables.
This is a common issue so it should be a problem on X and Z as well not just Power and Arm.
After going through the codebase, I suspect the fix can be limited to just one location: https://github.com/eclipse/openj9/blob/e749042001fcb7e9b20f74d20e7f85a12538d476/runtime/compiler/env/VMJ9.cpp#L5405
This location was also mentioned in the previous comment. The line can be changed to:
I_32 cpIndex = symRef->getCPIndexForVM();
This will mask out the sign extension. From what I can tell, all other accesses to the hashtable assume an unsigned cpIndex
. This change will bring this location in line with the others.
I have a test build with the proposed fix and I am currently grinding the test to verify.
nice finding ...
Good work diagnosing this problem @IBMJimmyk !
I've opened a PR that hopefully fixes the issue here: https://github.com/eclipse/openj9/pull/12415
The fix is pretty much the one I listed earlier. At first I thought there were some issues with how relocations are handled that could lead to a similar problem. But, it turns out that the problems with the relocations are a separate problem so they shouldn't hold up fixing this one.
Test
MixedLoadTest -test-args="timeLimit=30m" Variation: Mode554 JVM_OPTIONS: --add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.io=ALL-UNNAMED -Xcompressedrefs -Xgcpolicy:balanced -XXgc:tarokEnableExpensiveAssertions,fvtest_tarokPGCRotateCollectors -Xjit:count=0,optlevel=hot,gcOnResolve,rtResolve
Java Version
Error details copied from console output
To reproduce
Run grinder with following settings: