eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.27k stars 720 forks source link

JDK19 AIX serviceability_jvmti_j9_0_FAILED serviceability/jvmti/FieldAccessWatch/FieldAccessWatch.java Segmentation error vmState=0x00000000 #16750

Closed JasonFengJ9 closed 1 year ago

JasonFengJ9 commented 1 year ago

Failure link

From an internal build(paix817):

openjdk version "19.0.2" 2023-01-17
IBM Semeru Runtime Open Edition 19.0.2+7 (build 19.0.2+7)
Eclipse OpenJ9 VM 19.0.2+7 (build master-042864b42, JRE 19 AIX ppc64-64-Bit Compressed References 20230117_46 (JIT enabled, AOT enabled)
OpenJ9   - 042864b42
OMR      - b86560818
JCL      - edc950934d based on jdk-19.0.2+7)

Rerun in Grinder - Change TARGET to run only the failed test targets.

Optional info

Failure output (captured from console output)

[2023-02-18T19:05:43.974Z] variation: Mode150
[2023-02-18T19:05:44.314Z] JVM_OPTIONS:  -XX:+UseCompressedOops 

[2023-02-18T19:07:16.581Z] TEST: serviceability/jvmti/FieldAccessWatch/FieldAccessWatch.java

[2023-02-18T19:07:16.583Z] STDERR:
[2023-02-18T19:07:16.583Z] Unhandled exception
[2023-02-18T19:07:16.583Z] Type=Segmentation error vmState=0x00000000
[2023-02-18T19:07:16.583Z] J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000032
[2023-02-18T19:07:16.583Z] Handler1=09001000A0C4FEE0 Handler2=09001000A0C69278
[2023-02-18T19:07:16.583Z] R0=0000000000000001 R1=0000010022FD7E50 R2=09001000A0C53C60 R3=FFE5259000000000
[2023-02-18T19:07:16.583Z] R4=000000003020C4F0 R5=09001000A0C4EF08 R6=0000010022D906C8 R7=000000003020C500
[2023-02-18T19:07:16.583Z] R8=0000000000000001 R9=0000000000000038 R10=0000010022D906C8 R11=0000000000000000
[2023-02-18T19:07:16.583Z] R12=FFFFFFFFFFED3F54 R13=0000010022FE4800 R14=0005F4FE223A3665 R15=09001000A08B7748
[2023-02-18T19:07:16.583Z] R16=00000100103FF9F8 R17=000000003020C4F0 R18=000000003020C500 R19=000000003020C4F8
[2023-02-18T19:07:16.583Z] R20=000001001012D5D0 R21=000000000000004C R22=0000010022D906C8 R23=0000010022FD8328
[2023-02-18T19:07:16.583Z] R24=0000000000000038 R25=FFFFFFFFFFFFECF0 R26=0000010022D907E0 R27=0000000030206700
[2023-02-18T19:07:16.583Z] R28=09001000A07DF9A8 R29=0000010022FD8300 R30=0000000000000000 R31=0000000030206700
[2023-02-18T19:07:16.583Z] IAR=0900000004438A2C LR=0900000006442358 MSR=A00000000200D032 CTR=09000000044389E0
[2023-02-18T19:07:16.583Z] CR=4448282300000004 FPSCR=8202000000000000 XER=0000000482020000
[2023-02-18T19:07:16.583Z] FPR0 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-02-18T19:07:16.583Z] FPR1 c3e0000000000000 (f: 0.000000, d: -9.223372e+18)
[2023-02-18T19:07:16.583Z] FPR2 41cdcd6500000000 (f: 0.000000, d: 1.000000e+09)
[2023-02-18T19:07:16.583Z] FPR3 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-02-18T19:07:16.583Z] FPR4 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-02-18T19:07:16.583Z] FPR5 c3e0000000000000 (f: 0.000000, d: -9.223372e+18)
[2023-02-18T19:07:16.583Z] FPR6 4119aee400000000 (f: 0.000000, d: 4.207930e+05)
[2023-02-18T19:07:16.583Z] FPR7 412e848000000000 (f: 0.000000, d: 1.000000e+06)
[2023-02-18T19:07:16.583Z] FPR8 4026000000000000 (f: 0.000000, d: 1.100000e+01)
[2023-02-18T19:07:16.583Z] FPR9 4530000000000000 (f: 0.000000, d: 1.934281e+25)
[2023-02-18T19:07:16.583Z] FPR10 412e848000000000 (f: 0.000000, d: 1.000000e+06)
[2023-02-18T19:07:16.583Z] FPR11 43300000000f4240 (f: 1000000.000000, d: 4.503600e+15)
[2023-02-18T19:07:16.583Z] FPR12 4530000000000000 (f: 0.000000, d: 1.934281e+25)
[2023-02-18T19:07:16.583Z] FPR13 09001000a0701e80 (f: 2691702272.000000, d: 2.490733e-265)
[2023-02-18T19:07:16.583Z] FPR14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-02-18T19:07:16.583Z] FPR15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-02-18T19:07:16.583Z] FPR16 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-02-18T19:07:16.583Z] FPR17 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-02-18T19:07:16.583Z] FPR18 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-02-18T19:07:16.583Z] FPR19 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-02-18T19:07:16.583Z] FPR20 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-02-18T19:07:16.583Z] FPR21 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-02-18T19:07:16.583Z] FPR22 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-02-18T19:07:16.583Z] FPR23 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-02-18T19:07:16.583Z] FPR24 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-02-18T19:07:16.583Z] FPR25 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-02-18T19:07:16.583Z] FPR26 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-02-18T19:07:16.583Z] FPR27 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-02-18T19:07:16.583Z] FPR28 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-02-18T19:07:16.583Z] FPR29 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-02-18T19:07:16.583Z] FPR30 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-02-18T19:07:16.583Z] FPR31 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-02-18T19:07:16.583Z] Module=/home/jenkins/workspace/Test_openjdk19_j9_extended.openjdk_ppc64_aix/openjdkbinary/j2sdk-image/lib/default/libj9vm29.so
[2023-02-18T19:07:16.583Z] Module_base_address=09000000043DE000
[2023-02-18T19:07:16.583Z] Target=2_90_20230117_46 (AIX 7.1)
[2023-02-18T19:07:16.583Z] CPU=ppc64 (16 logical CPUs) (0x200000000 RAM)
[2023-02-18T19:07:16.583Z] ----------- Stack Backtrace -----------
[2023-02-18T19:07:16.583Z] setWatchers+0x54 (0x0900000006442358 [libFieldAccessWatch.so+0x358])
[2023-02-18T19:07:16.583Z] onFieldModification+0x54 (0x0900000006442A78 [libFieldAccessWatch.so+0xa78])
[2023-02-18T19:07:16.583Z] jvmtiHookFieldModification+0x24c (0x0900000004039EF0 [libj9jvmti29.so+0x1bef0])
[2023-02-18T19:07:16.583Z] _ZL14J9HookDispatchPP15J9HookInterfacemPv+0x1e0 (0x0900000003C45764 [libj9hookable29.so+0x764])
[2023-02-18T19:07:16.583Z] old_slow_jitReportInstanceFieldWrite+0x198 (0x090000000486AAFC [libj9jit29.so+0xfdafc])
[2023-02-18T19:07:16.583Z] (0x090000000483B82C [libj9jit29.so+0xce82c])
[2023-02-18T19:07:16.583Z] sendResolveInvokeDynamic+0x36c (0x0900000004454750 [libj9vm29.so+0x76750])
[2023-02-18T19:07:16.583Z] resolveInvokeDynamic+0x158 (0x090000000447943C [libj9vm29.so+0x9b43c])
[2023-02-18T19:07:16.583Z] (0x09000000045D2690 [libj9vm29.so+0x1f4690])
[2023-02-18T19:07:16.583Z] (0x0900000004468AE8 [libj9vm29.so+0x8aae8])
[2023-02-18T19:07:16.583Z] runJavaThread+0x268 (0x0900000004450D4C [libj9vm29.so+0x72d4c])
[2023-02-18T19:07:16.583Z] _ZL23javaProtectedThreadProcP13J9PortLibraryPv+0x11c (0x09000000043E0F00 [libj9vm29.so+0x2f00])
[2023-02-18T19:07:16.583Z] omrsig_protect+0x4cc (0x0900000003D9F5F0 [libj9prt29.so+0x5c5f0])
[2023-02-18T19:07:16.583Z] javaThreadProc+0x68 (0x09000000043E0D6C [libj9vm29.so+0x2d6c])
[2023-02-18T19:07:16.583Z] thread_wrapper+0x558 (0x0900000003E088BC [libj9thr29.so+0x58bc])
[2023-02-18T19:07:16.583Z] _pthread_body+0xf0 (0x0900000000571E14 [libpthreads.a+0x3e14])
[2023-02-18T19:07:16.583Z] ---------------------------------------
[2023-02-18T19:07:16.583Z] JVMDUMP039I Processing dump event "gpf", detail "" at 2023/02/18 14:06:56 - please wait.

[2023-02-18T19:14:21.765Z] serviceability_jvmti_j9_0_FAILED

50x internal grinder - reproduced twice 1 2

DanHeidinga commented 1 year ago

fyi @gita-omr @zl-wang can you take a look?

zl-wang commented 1 year ago

@IBMJimmyk i got impression that jimmy did fieldWatch support in p PicBuilder ... so, please take a look at this one.

vij-singh commented 1 year ago

@IBMJimmyk Hi Jimmy - is there any new news on this Java 19 issue?

IBMJimmyk commented 1 year ago

Unfortunately, I haven't had a chance to take a look at this yet.

0xdaryl commented 1 year ago

@IBMJimmyk : please do so soon. We need to understand if this is something that must be fixed for JDK 19.

IBMJimmyk commented 1 year ago
[2023-02-18T19:07:16.583Z] setWatchers+0x54 (0x0900000006442358 [libFieldAccessWatch.so+0x358])
[2023-02-18T19:07:16.583Z] onFieldModification+0x54 (0x0900000006442A78 [libFieldAccessWatch.so+0xa78])

What is libFieldAccessWatch.so and where does it come from? I don't seem to see it in either the build or the artifacts from the failing test.

IBMJimmyk commented 1 year ago

I was able to talk with Lan Xia. The test picks up the libFieldAccessWatch.so from the testimage tar file associated with each grinder.

I have the corefile and was able to open it up in kca. The backtrace looks like this:

(kca) where
0x0900000003fafa2c {libj9vm29.so}{getObjectClass} [0x10023047e50]
0x0900000006ca4358 {libFieldAccessWatch.so}{setWatchers} [0x10023047ee0]
0x0900000006ca4a78 {libFieldAccessWatch.so}{onFieldModification} [0x10023047fa0]
0x090000000624aef0 {libj9jvmti29.so}{jvmtiHookFieldModification} [0x10023048020]
0x09000000043c0764 {libj9hookable29.so}{_ZL14J9HookDispatchPP15J9HookInterfacemPv} [0x10023048140]
0x090000000459bafc {libj9jit29.so}{old_slow_jitReportInstanceFieldWrite} [0x10023048290]
0x090000000456c82c {libj9jit29.so}{jitReportInstanceFieldWrite} [0x10023048370]
0x0900000003fcb750 {libj9vm29.so}{sendResolveInvokeDynamic} [0x10023048bf0]
0x0900000003ff043c {libj9vm29.so}{resolveInvokeDynamic} [0x10023048d30]
0x0900000004149690 {libj9vm29.so}{debugBytecodeLoopCompressed} [0x10023048e00]
0x0900000003fdfae8 {libj9vm29.so}{c_cInterpreter} [0x10023049650]

Failure is fairly early into getObjectClass:

0x900000003faf9e0 {libj9vm29.so}{getObjectClass} +0               7c0802a6 mflr      r0
0x900000003faf9e4 {libj9vm29.so}{getObjectClass} +1               f821ff71 stdu      r1, -0x90(r1)
0x900000003faf9e8 {libj9vm29.so}{getObjectClass} +2               f80100a0 std       r0, 0xa0(r1)
0x900000003faf9ec {libj9vm29.so}{getObjectClass} +3               fbe10088 std       r31, 0x88(r1)
0x900000003faf9f0 {libj9vm29.so}{getObjectClass} +4               607f0000 ori       r31, r3, 0    //r31 = 0x301bb200 so r3 was the same
0x900000003faf9f4 {libj9vm29.so}{getObjectClass} +5               fbc10080 std       r30, 0x80(r1)
0x900000003faf9f8 {libj9vm29.so}{getObjectClass} +6               fba10078 std       r29, 0x78(r1)
0x900000003faf9fc {libj9vm29.so}{getObjectClass} +7               3bc00000 li        r30, 0
0x900000003fafa00 {libj9vm29.so}{getObjectClass} +8               609d0000 ori       r29, r4, 0   //r29 comes from r4, r4 was 0x3020c4c0
0x900000003fafa04 {libj9vm29.so}{getObjectClass} +9               f8410028 std       r2, 0x28(r1)
0x900000003fafa08 {libj9vm29.so}{getObjectClass} +10              fbc30a50 std       r30, 0xa50(r3)
0x900000003fafa0c {libj9vm29.so}{getObjectClass} +11              e8030098 ld        r0, 0x98(r3)  //r0 = 0x20
0x900000003fafa10 {libj9vm29.so}{getObjectClass} +12              28200020 cmpldi    r0, 0x20
0x900000003fafa14 {libj9vm29.so}{getObjectClass} +13              4082015c bne       0x900000003fafb70 C>> +100 // this branch is not taken
0x900000003fafa18 {libj9vm29.so}{getObjectClass} +14              e81f00a0 ld        r0, 0xa0(r31) <<< ^+111
0x900000003fafa1c {libj9vm29.so}{getObjectClass} +15              e87d0000 ld        r3, 0(r29)      //----- r3 = 0xe004efc000000000. r29 was 0x3020c4c0
0x900000003fafa20 {libj9vm29.so}{getObjectClass} +16              2c200000 cmpdi     r0, 0
0x900000003fafa24 {libj9vm29.so}{getObjectClass} +17              41820138 beq       0x900000003fafb5c C>> +95
0x900000003fafa28 {libj9vm29.so}{getObjectClass} +18              eba10078 ld        r29, 0x78(r1)
0x900000003fafa2c {libj9vm29.so}{getObjectClass} +19              80030000 lwz       r0, 0(r3)       //<<<--------- CRASH here. r3 = 0xe004efc000000000

(kca) (0xe004efc000000000)/llx
100-Unable to locate memory for the range 0xe004efc000000000...0xe004efc000000007

I am currently trying to work out what getObjectClass was in the middle of doing when it crashed. I am also trying to work my way up the stack to figure out where the value went bad.

IBMJimmyk commented 1 year ago

I think the problem is a mismatch between a store with stw and a load with ld.

0x10010b92378 {sun/.../BytecodeDescriptor.parseMethod} +98                       e8701c28 ld        r3, 0x1c28(r16) JITTOC 0x300dc1b0 = 0xe004efc0 Ptr  <<< ^+2605
0x10010b9237c {sun/.../BytecodeDescriptor.parseMethod} +99                       e8630000 ld        r3, 0(r3)
0x10010b92380 {sun/.../BytecodeDescriptor.parseMethod} +100                      e8101c30 ld        r0, 0x1c30(r16) JITTOC 0x10010b94b50 {sun/.../BytecodeDescriptor.parseMethod} +2648
0x10010b92384 {sun/.../BytecodeDescriptor.parseMethod} +101                      804b0000 lwz       r2, 0(r11)
0x10010b92388 {sun/.../BytecodeDescriptor.parseMethod} +102                      5442002e rlwinm    r2, r2, 0, 0, 0x17
0x10010b9238c {sun/.../BytecodeDescriptor.parseMethod} +103                      80420024 lwz       r2, 0x24(r2)
0x10010b92390 {sun/.../BytecodeDescriptor.parseMethod} +104                      70420100 andi.     r2, r2, 0x100
0x10010b92394 {sun/.../BytecodeDescriptor.parseMethod} +105                      408226dc bne       0x10010b94a70 C>> +2592

0x10010b94a70 {sun/.../BytecodeDescriptor.parseMethod} +2592                      7c630278 xor       r3, r3, r0 <<< +105
0x10010b94a74 {sun/.../BytecodeDescriptor.parseMethod} +2593                      7c600278 xor       r0, r3, r0
0x10010b94a78 {sun/.../BytecodeDescriptor.parseMethod} +2594                      7c630278 xor       r3, r3, r0
0x10010b94a7c {sun/.../BytecodeDescriptor.parseMethod} +2595                      61640000 ori       r4, r11, 0
0x10010b94a80 {sun/.../BytecodeDescriptor.parseMethod} +2596                      900e0190 stw       r0, 0x190(r14)
0x10010b94a84 {sun/.../BytecodeDescriptor.parseMethod} +2597                      38ae0190 addi      r5, r14, 0x190
0x10010b94a88 {sun/.../BytecodeDescriptor.parseMethod} +2598          0:8         481d9409 bl        0x10010d6de90 Trampoline {libj9jit29.so}{jitReportInstanceFieldWrite} +0    // putfield 12 java/util/ArrayList.elementData ([Ljava/lang/Object;)

At +98, the value 0x300dc1b0 is loaded from the TOC into r3. It is a pointer to the value 0xe004efc0 which is put into r3 at +99.

(kca) (0x300dc1b0)/llx
%3 = 0x00000000300dc1b0: 00000000e004efc0

r0 and r3 are swapped then r0 (containing 0xe004efc0) is stored to 0x190(r14). This is done with stw. This is the word size store that will eventually lead to a mismatch. r5 is given the value of r14+0x190 (0x3020c690) which is the address that was just stored to. This will be used to read back the data later.

// r14 is currently 0x3020c500 at this point
(kca) (0x3020c500+0x190)/X
%7 = 0x000000003020c690: e004efc0

We then go to save registers as we transition to VM code

0x90000000456c5a4 {libj9jit29.so}{jitReportInstanceFieldWrite} +0               f8010130 std       r0, 0x130(r1)
0x90000000456c5a8 {libj9jit29.so}{jitReportInstanceFieldWrite} +1               7c0802a6 mflr      r0
0x90000000456c5ac {libj9jit29.so}{jitReportInstanceFieldWrite} +2               f8010238 std       r0, 0x238(r1)
0x90000000456c5b0 {libj9jit29.so}{jitReportInstanceFieldWrite} +3               f8410140 std       r2, 0x140(r1)
0x90000000456c5b4 {libj9jit29.so}{jitReportInstanceFieldWrite} +4               e84f01a8 ld        r2, 0x1a8(r15)
0x90000000456c5b8 {libj9jit29.so}{jitReportInstanceFieldWrite} +5               f9cf0020 std       r14, 0x20(r15)
0x90000000456c5bc {libj9jit29.so}{jitReportInstanceFieldWrite} +6               7c000026 mfcr      r0
0x90000000456c5c0 {libj9jit29.so}{jitReportInstanceFieldWrite} +7               f8010230 std       r0, 0x230(r1)
0x90000000456c5c4 {libj9jit29.so}{jitReportInstanceFieldWrite} +8               f8610148 std       r3, 0x148(r1)
0x90000000456c5c8 {libj9jit29.so}{jitReportInstanceFieldWrite} +9               f8810150 std       r4, 0x150(r1)
0x90000000456c5cc {libj9jit29.so}{jitReportInstanceFieldWrite} +10              f8a10158 std       r5, 0x158(r1) // r5 is still 0x3020c690 here
//and so on

jitReportInstanceFieldWrite eventually calls old_slow_jitReportInstanceFieldWrite. Which reads the registers the JIT stored to the stack to see what is in r5.

0x90000000459b998 {libj9jit29.so}{old_slow_jitReportInstanceFieldWrite} +14              e8830258 ld        r4, 0x258(r3)  //r4 = 0x10023048c60, J9VMThread.entryLocalStorage
0x90000000459b99c {libj9jit29.so}{old_slow_jitReportInstanceFieldWrite} +15              eba30008 ld        r29, 8(r3)
0x90000000459b9a0 {libj9jit29.so}{old_slow_jitReportInstanceFieldWrite} +16              88c50000 lbz       r6, 0(r5)
0x90000000459b9a4 {libj9jit29.so}{old_slow_jitReportInstanceFieldWrite} +17              88650001 lbz       r3, 1(r5)
0x90000000459b9a8 {libj9jit29.so}{old_slow_jitReportInstanceFieldWrite} +18              88050002 lbz       r0, 2(r5)   //r0 = 0x5
0x90000000459b9ac {libj9jit29.so}{old_slow_jitReportInstanceFieldWrite} +19              54c61838 slwi      r6, r6, 3
0x90000000459b9b0 {libj9jit29.so}{old_slow_jitReportInstanceFieldWrite} +20              54651838 slwi      r5, r3, 3
0x90000000459b9b4 {libj9jit29.so}{old_slow_jitReportInstanceFieldWrite} +21              54031838 slwi      r3, r0, 3   //r3 = 0x28
0x90000000459b9b8 {libj9jit29.so}{old_slow_jitReportInstanceFieldWrite} +22              e8840008 ld        r4, 8(r4)      //r4 = 0x100230484a0, jitGlobalStorageBase
0x90000000459b9bc {libj9jit29.so}{old_slow_jitReportInstanceFieldWrite} +23              881d23ce lbz       r0, 0x23ce(r29)
0x90000000459b9c0 {libj9jit29.so}{old_slow_jitReportInstanceFieldWrite} +24              70000001 andi.     r0, r0, 1
0x90000000459b9c4 {libj9jit29.so}{old_slow_jitReportInstanceFieldWrite} +25              7f84302a ldx       r28, r4, r6
0x90000000459b9c8 {libj9jit29.so}{old_slow_jitReportInstanceFieldWrite} +26              7f64282a ldx       r27, r4, r5
0x90000000459b9cc {libj9jit29.so}{old_slow_jitReportInstanceFieldWrite} +27              7f44182a ldx       r26, r4, r3  //r26 = 0x3020c690, this is valuePointer

In the codebase it looks like this: https://github.com/eclipse-openj9/openj9/blob/59bdae92b1d3026620b2601275d0b3da14690334/runtime/codert_vm/cnathelp.cpp#L3832 Keep in mind that r26 (0x3020c690) is still pointing to word size data.

0x90000000459bad8 {libj9jit29.so}{old_slow_jitReportInstanceFieldWrite} +94              e8da0000 ld        r6, 0(r26)
0x90000000459badc {libj9jit29.so}{old_slow_jitReportInstanceFieldWrite} +95              f8c10098 std       r6, 0x98(r1)  //U_64, newValue, 0xe004efc000000000

A bit later on we forget that and load doubleword data from r26. It gets stored to the stack so we can verify the problem.

//r1 is 0x0000010023048290 at this point
(kca) (0x0000010023048290+0x98)/llx
%2 = 0x0000010023048328: e004efc000000000

The original 0xe004efc0 we started with has now become 0xe004efc000000000. This is passed through a bunch of calls until it reaches the code I posted earlier:

0x900000003fafa1c {libj9vm29.so}{getObjectClass} +15              e87d0000 ld        r3, 0(r29)      //----- r3 = 0xe004efc000000000. r29 was 0x3020c4c0
0x900000003fafa20 {libj9vm29.so}{getObjectClass} +16              2c200000 cmpdi     r0, 0
0x900000003fafa24 {libj9vm29.so}{getObjectClass} +17              41820138 beq       0x900000003fafb5c C>> +95
0x900000003fafa28 {libj9vm29.so}{getObjectClass} +18              eba10078 ld        r29, 0x78(r1)
0x900000003fafa2c {libj9vm29.so}{getObjectClass} +19              80030000 lwz       r0, 0(r3)       //<<<--------- CRASH here. r3 = 0xe004efc000000000
(kca) (0xe004efc000000000)/llx
100-Unable to locate memory for the range 0xe004efc000000000...0xe004efc000000007

(kca) (0xe004efc0)/X
%4 = 0x00000000e004efc0: 30043a00

The key problem was the stw back here:

0x10010b94a80 {sun/.../BytecodeDescriptor.parseMethod} +2596                      900e0190 stw       r0, 0x190(r14)

I'm not sure why it does that and everywhere else seems to treat it as doubleword data. Right now I don't actually know what that value is or what it is be used for. So I don't exactly know how that load is being generated yet.

zl-wang commented 1 year ago

I am speculating here, but hopefully can be a tip for understanding: the stw is for store the object reference of the instance field on stack, passing to jitReportInstanceFieldWrite. take a log to see how that object reference is represented in the IL. I suspected it has something to do with compressed/de-compressed store (even if they have the same value ... i.e. shift-0). compressed/de-compressed should be stw/std respectively regardless the real value. PPC64LE can get away with it so far, because it is little endian such that ld just got the correct value (assuming the high-order 4-byte being 0). maybe Z (BE too) has the same issue or they already fixed it.

IBMJimmyk commented 1 year ago

I tried looking at the jitdump file that came with the crash and I think I see the same problem.

The trees look like this:

n5735n    treetop                                                                             [0xa0000004075f050] bci=[0,5,169] rc=0 vc=6633 vn=- li=255 udi=- nc=1
n5736n      ardbar  java/util/ArrayList.DEFAULTCAPACITY_EMPTY_ELEMENTDATA [Ljava/lang/Object;[#882  final Static] [flags 0xa0307 0x0 ]  [0xa0000004075f0a0] bci=[0,5,169] rc=2 vc=6633 vn=- >
n5737n        loadaddr  unknown class object[#883  Static] [flags 0x40018307 0x0 ]            [0xa0000004075f0f0] bci=[0,5,169] rc=1 vc=6633 vn=- li=255 udi=- nc=0
n5738n    treetop                                                                             [0xa0000004075f140] bci=[0,8,169] rc=0 vc=6633 vn=- li=255 udi=- nc=1
n5739n      awrtbari  java/util/ArrayList.elementData [Ljava/lang/Object;[#885  Shadow +16] [flags 0x607 0x0 ] (skipWrtBar )  [0xa0000004075f190] bci=[0,8,169] rc=2 vc=6633 vn=- li=255 udi>
n40n          ==>new
n8008n        l2i                                                                             [0xa00000060e2b6d0] bci=[0,5,169] rc=1 vc=0 vn=- li=- udi=- nc=1
n8007n          a2l                                                                           [0xa00000060e2b680] bci=[0,5,169] rc=1 vc=0 vn=- li=- udi=- nc=1
n5736n            ==>ardbar
n40n          ==>new

The awrtbari is the important node to look at.

The code to generate the call to jitReportInstanceFieldWrite is here. https://github.com/eclipse-openj9/openj9/blob/b9c0de5dc2fc13b07bc46dde42763d59a277dd42/runtime/compiler/p/codegen/J9TreeEvaluator.cpp#L1099-L1105

Compressed refs is enabled so TR::Compiler->om.sizeofReferenceField() returns a value of 4 on line 1099. At line 1100, the element size of 4 is used to get a storeOp of stw At line 1105, that stw is used to create the problematic store.

The load back in the VM eventually occurs here: https://github.com/eclipse-openj9/openj9/blob/b9c0de5dc2fc13b07bc46dde42763d59a277dd42/runtime/vm/jnimisc.cpp#L424

It assumes the jobject obj points to a 64bit value representing an object when it doesn't.

IBMJimmyk commented 1 year ago

So the problem looks to be that JIT'd code puts a compressed address on the stack and then passes it to the VM which thinks it is an uncompressed address. My proposed fix is to detect this case inside generateReportFieldAccessOutlinedInstructions and store a decompressed address to the stack instead.

I've written a first draft of the changes required for this and am currently in the middle of testing it.

0xdaryl commented 1 year ago

@IBMJimmyk : any update on how the testing of your proposed fix went? Thanks.

IBMJimmyk commented 1 year ago

My new fix seems to work and the original failure doesn't seem to happen anymore. But, in my 50 runs grinder I get this new error: https://hyc-runtimes-jenkins.swg-devops.com/job/Grinder/31598/consoleText

[2023-03-14T00:13:36.815Z] Unhandled exception
[2023-03-14T00:13:36.815Z] Type=Segmentation error vmState=0x00040000
[2023-03-14T00:13:36.815Z] J9Generic_Signal_Number=00000018 Signal_Number=0000000b Error_Value=00000000 Signal_Code=00000032
[2023-03-14T00:13:36.815Z] Handler1=09001000A0A28A10 Handler2=09001000A09E1608
[2023-03-14T00:13:36.815Z] R0=0000000000000002 R1=0000010022A6CBE0 R2=09001000A066BB18 R3=0000000030072D00
[2023-03-14T00:13:36.815Z] R4=FFF0AE20E00506E0 R5=0000000000000010 R6=000000000000AE78 R7=0000000000000000
[2023-03-14T00:13:36.815Z] R8=00000000000000C8 R9=0000010022A6D138 R10=0000010022A6D050 R11=0000010022A6CF00
[2023-03-14T00:13:36.815Z] R12=0900000003E84A4C R13=0000010022A79800 R14=00000000301F29A0 R15=0000010022A6E4F0
[2023-03-14T00:13:36.815Z] R16=0000000000000001 R17=09001000A090D948 R18=0000000000000000 R19=00000100227C3F06
[2023-03-14T00:13:36.815Z] R20=0000000000000000 R21=0000000000000000 R22=0000000000000000 R23=0000010022A6D138
[2023-03-14T00:13:36.815Z] R24=0000000030244E00 R25=0000010022A6D050 R26=000001001012B410 R27=0000000000000000
[2023-03-14T00:13:36.815Z] R28=0000010022A6CCF0 R29=0000000030244E00 R30=00000000FFDE5328 R31=00000000301EC800
[2023-03-14T00:13:36.815Z] IAR=0900000008CCDD28 LR=0900000008CCCFFC MSR=A00000000200D032 CTR=090000000059C314
[2023-03-14T00:13:36.815Z] CR=4200022B60000000 FPSCR=8200000000000000 XER=6000000082000000
[2023-03-14T00:13:36.815Z] FPR0 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-03-14T00:13:36.815Z] FPR1 c3e0000000000000 (f: 0.000000, d: -9.223372e+18)
[2023-03-14T00:13:36.815Z] FPR2 3ff0000000000000 (f: 0.000000, d: 1.000000e+00)
[2023-03-14T00:13:36.815Z] FPR3 433000000006a518 (f: 435480.000000, d: 4.503600e+15)
[2023-03-14T00:13:36.815Z] FPR4 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-03-14T00:13:36.815Z] FPR5 4330000000000000 (f: 0.000000, d: 4.503600e+15)
[2023-03-14T00:13:36.815Z] FPR6 411a946000000000 (f: 0.000000, d: 4.354800e+05)
[2023-03-14T00:13:36.815Z] FPR7 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-03-14T00:13:36.815Z] FPR8 3f6069c0ff9e20ca (f: 4288553216.000000, d: 2.003552e-03)
[2023-03-14T00:13:36.815Z] FPR9 402ecaabbae65cc0 (f: 3135659264.000000, d: 1.539584e+01)
[2023-03-14T00:13:36.816Z] FPR10 412e848000000000 (f: 0.000000, d: 1.000000e+06)
[2023-03-14T00:13:36.816Z] FPR11 43300000000f4240 (f: 1000000.000000, d: 4.503600e+15)
[2023-03-14T00:13:36.816Z] FPR12 4530000000000000 (f: 0.000000, d: 1.934281e+25)
[2023-03-14T00:13:36.816Z] FPR13 0000000000000001 (f: 1.000000, d: 4.940656e-324)
[2023-03-14T00:13:36.816Z] FPR14 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-03-14T00:13:36.816Z] FPR15 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-03-14T00:13:36.816Z] FPR16 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-03-14T00:13:36.816Z] FPR17 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-03-14T00:13:36.816Z] FPR18 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-03-14T00:13:36.816Z] FPR19 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-03-14T00:13:36.816Z] FPR20 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-03-14T00:13:36.816Z] FPR21 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-03-14T00:13:36.816Z] FPR22 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-03-14T00:13:36.816Z] FPR23 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-03-14T00:13:36.816Z] FPR24 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-03-14T00:13:36.816Z] FPR25 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-03-14T00:13:36.816Z] FPR26 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-03-14T00:13:36.816Z] FPR27 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-03-14T00:13:36.816Z] FPR28 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-03-14T00:13:36.816Z] FPR29 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-03-14T00:13:36.816Z] FPR30 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-03-14T00:13:36.816Z] FPR31 0000000000000000 (f: 0.000000, d: 0.000000e+00)
[2023-03-14T00:13:36.816Z] Module=/home/jenkins/workspace/Grinder/openjdkbinary/j2sdk-image/lib/default/libj9jvmti29.so
[2023-03-14T00:13:36.816Z] Module_base_address=0900000008CB9000
[2023-03-14T00:13:36.816Z] Target=2_90_20230313_67 (AIX 7.1)
[2023-03-14T00:13:36.816Z] CPU=ppc64 (8 logical CPUs) (0x200000000 RAM)
[2023-03-14T00:13:36.816Z] ----------- Stack Backtrace -----------
[2023-03-14T00:13:36.816Z] IPRA.$jvmtiInternalGetStackTrace+0x1b0 (0x0900000008CF5694 [libj9jvmti29.so+0x3c694])
[2023-03-14T00:13:36.816Z] jvmtiGetStackTrace+0x2c0 (0x0900000008CF5344 [libj9jvmti29.so+0x3c344])
[2023-03-14T00:13:36.816Z] IPRA.$_ZL17print_stack_traceP9_jvmtiEnvP7JNIEnv_P8_jobject+0x80 (0x090000000D573764 [libSuspendResume1.so+0x6764])
[2023-03-14T00:13:36.816Z] IPRA.$_ZL33test_get_thread_list_stack_tracesP7JNIEnv_biPP8_jobject+0xcc (0x090000000D571DD0 [libSuspendResume1.so+0x4dd0])
[2023-03-14T00:13:36.816Z] IPRA.$_ZL32test_jvmti_functions_for_threadsP7JNIEnv_biPP8_jobject+0xe0 (0x090000000D570084 [libSuspendResume1.so+0x3084])
[2023-03-14T00:13:36.816Z] Java_SuspendResume1_TestSuspendResume+0xd0 (0x090000000D56FA94 [libSuspendResume1.so+0x2a94])
[2023-03-14T00:13:36.816Z] (0x09000000087A253C [libj9vm29.so+0xe053c])
[2023-03-14T00:13:36.816Z] ffi_call+0x110 (0x09000000087A1E74 [libj9vm29.so+0xdfe74])
[2023-03-14T00:13:36.816Z] (0x09000000089105F0 [libj9vm29.so+0x24e5f0])
[2023-03-14T00:13:36.816Z] (0x090000000874EFE8 [libj9vm29.so+0x8cfe8])
[2023-03-14T00:13:36.816Z] runJavaThread+0x28c (0x0900000008736C50 [libj9vm29.so+0x74c50])
[2023-03-14T00:13:36.816Z] _ZL23javaProtectedThreadProcP13J9PortLibraryPv+0x11c (0x09000000086C5000 [libj9vm29.so+0x3000])
[2023-03-14T00:13:36.816Z] omrsig_protect+0x4fc (0x0900000003E1C1E0 [libj9prt29.so+0x5f1e0])
[2023-03-14T00:13:36.816Z] javaThreadProc+0x70 (0x09000000086C4E74 [libj9vm29.so+0x2e74])
[2023-03-14T00:13:36.816Z] thread_wrapper+0x14c (0x0900000003E85590 [libj9thr29.so+0x5590])
[2023-03-14T00:13:36.816Z] _pthread_body+0xf0 (0x0900000000571E14 [libpthreads.a+0x3e14])

[2023-03-14T00:14:46.795Z] serviceability_jvmti_j9_0_FAILED

I just got this so I haven't had a chance to look into it yet.

zl-wang commented 1 year ago

i suspected libSuspendResume1.so is also a shared library built on the fly for this specific test case. IPRA (Inter-procedural Register Assigner) actively kicked in the build. It crashed during unwinding the stack. I can smell some dangers here. maybe try disabling IPRA for this .so build.

IBMJimmyk commented 1 year ago

I opened a PR for the fix for the original segfault issue: https://github.com/eclipse-openj9/openj9/pull/16925

As for the issue with libSuspendResume1.so, I am currently looking at figuring out how to disable IPRA and seeing what happens.

IBMJimmyk commented 1 year ago

The problem with passing the wrong value to the VM was slightly different than I initially thought it would be. The JIT was correctly trying to pass an uncompressed address to the VM but just happened to use the wrong store instruction. It used stw instead of std. A fix has been checked in to handle this case and the JIT will correctly use a std instruction now. So the original segfault should be gone.

As for the issue with libSuspendResume1.so. I have a build with IPRA disabled. But, I haven't been able to repoduce the problem either with a unmodified build (that includes my store fix) or with the build with IPRA disabled.

IBMJimmyk commented 1 year ago

I tried running grinders to reproduce the issue in libSuspendResume1.so but it hasn't shown up. I did 150 runs of the build it initially appeared in and 150 runs with the latest code. As such, I don't think there's anything I can do about it. Since the fix for the original problem has been merged in, I think this issue can be closed now.

vij-singh commented 1 year ago

@JasonFengJ9 @zl-wang Are you guys ok good with closing this issue now?

zl-wang commented 1 year ago

@vij-singh yes, i suggested closing it.

JasonFengJ9 commented 1 year ago

Closing as per https://github.com/eclipse-openj9/openj9/issues/16750#issuecomment-1476228934 https://github.com/eclipse-openj9/openj9/issues/16750#issuecomment-1477823490