eclipse-openj9 / openj9

Eclipse OpenJ9: A Java Virtual Machine for OpenJDK that's optimized for small footprint, fast start-up, and high throughput. Builds on Eclipse OMR (https://github.com/eclipse/omr) and combines with the Extensions for OpenJDK for OpenJ9 repo.
Other
3.28k stars 721 forks source link

TestJlmRemoteThreadAuth crash LT1 vmState=0x00000000 #18149

Closed pshipton closed 1 year ago

pshipton commented 1 year ago

https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.system_aarch64_linux_Nightly_testList_1/531 TestJlmRemoteThreadAuth_0

STF 20:26:31.687 -   o Process LT1 has crashed unexpectedly

https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk17_j9_sanity.system_aarch64_linux_Nightly_testList_1/531/system_test_output.tar.gz

0SECTION       TITLE subcomponent dump routine
NULL           ===============================
1TICHARSET     UTF-8
1TISIGINFO     Dump Event "gpf" (00002000) received
1TIDATETIMEUTC Date: 2023/09/15 at 00:26:29:384 (UTC)
1TIDATETIME    Date: 2023/09/14 at 20:26:29:384
1TITIMEZONE    Timezone: UTC-4 (EDT)
1TINANOTIME    System nanotime: 1703831458358335
1TIFILENAME    Javacore filename:    /home/jenkins/workspace/Test_openjdk17_j9_sanity.system_aarch64_linux_Nightly_testList_1/aqa-tests/TKG/output_16947355463817/TestJlmRemoteThreadAuth_0/20230914-202346-TestJlmRemoteThreadAuth/results/javacore.20230914.202628.2165240.0002.txt
1TIREQFLAGS    Request Flags: 0x41 (exclusive+preempt)
1TIPREPSTATE   Prep State: 0x80 (trace_disabled)
1TIPREPINFO    Exclusive VM access not taken: data may not be consistent across javacore sections
NULL           ------------------------------------------------------------------------
0SECTION       GPINFO subcomponent dump routine
NULL           ================================
2XHOSLEVEL     OS Level         : Linux 4.18.0-348.7.1.el8_5.aarch64
2XHCPUS        Processors -
3XHCPUARCH       Architecture   : aarch64
3XHNUMCPUS       How Many       : 8
3XHNUMASUP       NUMA is either not supported or has been disabled by user
NULL
1XHEXCPCODE    J9Generic_Signal_Number: 00000018
1XHEXCPCODE    Signal_Number: 0000000B
1XHEXCPCODE    Error_Value: 00000000
1XHEXCPCODE    Signal_Code: 00000002
1XHEXCPCODE    Handler1: 0000FFFFAD5B2DA4
1XHEXCPCODE    Handler2: 0000FFFFAD517AF0
1XHEXCPCODE    InaccessibleAddress: 00000000009772B0
NULL
NULL
1XHREGISTERS   Registers:
2XHREGISTER      R0: 0000000000000016
2XHREGISTER      R1: 0000FFFFAD608DFC
2XHREGISTER      R2: 0000000000000000
2XHREGISTER      R3: 0000000000977D00
2XHREGISTER      R4: 0000000000971A00
2XHREGISTER      R5: 0000000000977CA0
2XHREGISTER      R6: 0000000000000000
2XHREGISTER      R7: 0000000000000001
2XHREGISTER      R8: 0000000000000008
2XHREGISTER      R9: 0000FFFF8E79311C
2XHREGISTER      R10: 0000FFFF8D0DE100
2XHREGISTER      R11: 00000000F028CDB0
2XHREGISTER      R12: 00000000F028CDB0
2XHREGISTER      R13: 00000000F028CD90
2XHREGISTER      R14: 00000000000062A0
2XHREGISTER      R15: 0000000000000000
2XHREGISTER      R16: 0000FFFFAD7A02C8
2XHREGISTER      R17: 0000FFFFADC04940
2XHREGISTER      R18: 00000000F0106D50
2XHREGISTER      R19: 0000000000971900
2XHREGISTER      R20: 0000FFFFAD79F000
2XHREGISTER      R21: 00000000FF0FC1C8
2XHREGISTER      R22: 0000000000443750
2XHREGISTER      R23: 0000000000971900
2XHREGISTER      R24: 0000000000000000
2XHREGISTER      R25: 0000FFFF8D0DE590
2XHREGISTER      R26: 0000FFFFADD60000
2XHREGISTER      R27: 0000FFFF8D0DF1E0
2XHREGISTER      R28: 0000FFFF8D0DF1E0
2XHREGISTER      R29: 0000000000971900
2XHREGISTER      R30: 00000000009772B0
2XHREGISTER      R31: 0000FFFF8D0DE540
2XHREGISTER      V0: 0000000000977CB8
2XHREGISTER      V1: FFFFFFFFFFFFFFFF
2XHREGISTER      V2: E3BEA6451A118789
2XHREGISTER      V3: 0000000000000098
2XHREGISTER      V4: 8A77D375292B0A98
2XHREGISTER      V5: 068960FE10A91930
2XHREGISTER      V6: 852471A5F420DD40
2XHREGISTER      V7: 0000000000000098
2XHREGISTER      V8: 0000FFFE54048D70
2XHREGISTER      V9: 0000000000977298
2XHREGISTER      V10: 0000000000000020
2XHREGISTER      V11: 0000000000977288
2XHREGISTER      V12: 000000000000000C
2XHREGISTER      V13: 0000000000977278
2XHREGISTER      V14: 0000FFFF8D0DE150
2XHREGISTER      V15: 0000000000000001
2XHREGISTER      V16: 0000000000000028
2XHREGISTER      V17: 00000000000000B0
2XHREGISTER      V18: 00BFD04B41499B65
2XHREGISTER      V19: C200000000000000
2XHREGISTER      V20: C79ED91518C6E4C1
2XHREGISTER      V21: C2EAF0E46BB84949
2XHREGISTER      V22: F6B7F9EF07E50CE1
2XHREGISTER      V23: 172CDE009048FBFF
2XHREGISTER      V24: F0B59F6E338E1372
2XHREGISTER      V25: A8E8601E3B7F6C62
2XHREGISTER      V26: C7A3A2C877735869
2XHREGISTER      V27: 138D24252925B968
2XHREGISTER      V28: 3A1B90B1E3789AFD
2XHREGISTER      V29: B6F206C0B3BC0443
2XHREGISTER      V30: 4C32ECDF3D33029A
2XHREGISTER      V31: 8B6966E65C612DE2
2XHREGISTER      PC: 00000000009772B0
2XHREGISTER      SP: 0000FFFF8D0DE540
2XHREGISTER      PSTATE: 0000000060001000
NULL
1XHFLAGS       VM flags:0000000000000000
pshipton commented 1 year ago

@knn-k fyi

knn-k commented 1 year ago

PC = R30 (LR) = InaccessibleAddress = 0x009772B0 It seems it loaded a wrong address into LR, and returned to that address.

knn-k commented 1 year ago

I ran the following Grinder jobs (30x2). https://openj9-jenkins.osuosl.org/job/Grinder/2902/ https://openj9-jenkins.osuosl.org/job/Grinder/2904/

I don't see the SEGV above, and there are different exceptions instead.

CL2 stderr java.rmi.UnmarshalException: Error unmarshaling return header; nested exception is: 
CL2 stderr  java.io.EOFException
CL2 stderr  at java.rmi/sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:255)
CL2 stderr  at java.rmi/sun.rmi.server.UnicastRef.invoke(UnicastRef.java:165)
CL2 stderr java.rmi.ConnectException: Connection refused to host: 147.28.142.201; nested exception is: 
CL2 stderr  java.net.ConnectException: Connection refused
CL2 stderr  at java.rmi/sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:626)
CL2 stderr  at java.rmi/sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:217)
Akira1Saitoh commented 1 year ago

Looks like SEGV was actually happening in the Grinder jobs.

[2023-09-16T10:24:24.166Z] STF 06:24:23.260 - Found dump at: /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16948592869997/TestJlmRemoteThreadAuth_0/20230916-061447-TestJlmRemoteThreadAuth/results/javacore.20230916.062422.2223525.0002.txt [2023-09-16T10:24:24.166Z] STF 06:24:23.261 - Found dump at: /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16948592869997/TestJlmRemoteThreadAuth_0/20230916-061447-TestJlmRemoteThreadAuth/results/core.20230916.062422.2223525.0001.dmp [2023-09-16T10:24:24.166Z] CL2 j> 2023/09/16 06:24:23.487 Error unmarshaling return header; nested exception is: [2023-09-16T10:24:24.166Z] CL2 java.io.EOFException [2023-09-16T10:24:24.166Z] CL2 stderr javacore file generated - /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16948592869997/TestJlmRemoteThreadAuth_0/20230916-061447-TestJlmRemoteThreadAuth/results/javacore.20230916.062422.2223525.0002.txt [2023-09-16T10:24:24.166Z] CL2 stderr core file generated - /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16948592869997/TestJlmRemoteThreadAuth_0/20230916-061447-TestJlmRemoteThreadAuth/results/core.20230916.062422.2223525.0001.dmp

Akira1Saitoh commented 1 year ago

It seems that the link register value saved into the native stack in c_cInterpreter gets overwritten by jitReleaseVMAccess helper called from JNI invocation sequence. When returning from c_cInterpreter, an incorrect value is loaded into the link register. Thus, it crashes on ret instruction.

(gdb) disassemble c_cInterpreter 
Dump of assembler code for function c_cInterpreter:
   0x0000fffff793144c <+0>:     sub     sp, sp, #0x3a0
   0x0000fffff7931450 <+4>:     stp     x19, x20, [sp]
   0x0000fffff7931454 <+8>:     stp     x21, x22, [sp, #16]
   0x0000fffff7931458 <+12>:    stp     x23, x24, [sp, #32]
   0x0000fffff793145c <+16>:    stp     x25, x26, [sp, #48]
   0x0000fffff7931460 <+20>:    stp     x27, x28, [sp, #64]
   0x0000fffff7931464 <+24>:    stp     x29, x30, [sp, #80]    <--- LR (x30) saved into 0xffffd1dbf208
   0x0000fffff7931468 <+28>:    stp     d8, d9, [sp, #96]
   0x0000fffff793146c <+32>:    stp     d10, d11, [sp, #112]
   0x0000fffff7931470 <+36>:    stp     d12, d13, [sp, #128]
   0x0000fffff7931474 <+40>:    stp     d14, d15, [sp, #144]
   0x0000fffff7931478 <+44>:    mov     x19, x0
   0x0000fffff793147c <+48>:    ldr     x27, [x19, #600]
   0x0000fffff7931480 <+52>:    add     x28, sp, #0xa0
   0x0000fffff7931484 <+56>:    str     x28, [x27, #8]
   0x0000fffff7931488 <+60>:    add     x28, sp, #0x1a0
   0x0000fffff793148c <+64>:    str     x28, [x27, #48]
   0x0000fffff7931490 <+68>:    mov     x0, x19
   0x0000fffff7931494 <+72>:    ldr     x27, [x19, #8]
   0x0000fffff7931498 <+76>:    ldr     x28, [x27, #168]
   0x0000fffff793149c <+80>:    blr     x28
(gdb) disassemble jitReleaseVMAccess
Dump of assembler code for function jitReleaseVMAccess:
   0x0000fffff66fba3c <+0>:     stp     x29, x30, [sp, #392]
   0x0000fffff66fba40 <+4>:     stp     x0, x1, [sp, #160]          <--- this instruction overwrites 0xffffd1dbf208
   0x0000fffff66fba44 <+8>:     stp     x2, x3, [sp, #176]

The location where the link register is saved in c_cInterpreter is J9CInterpreterStackFrame.preservedGPRs. The location where the GPR is saved in jitReleaseVMAccess is J9CInterpreterStackFrame.jitGPRs. https://github.com/eclipse-openj9/openj9/blob/20fb92b38cc4756d487fe8e2d70c32cd10ffb0fe/runtime/oti/j9nonbuilder.h#L6294-L6297

The reason why jitReleaseVMAccess overwrites preservedGPRs is that the native stack pointer is modified in JNI invocation sequence in order to pass the JNI arguments by the stack. The native stack pointer is modified before calling out to jitReleaseVMAccess. In this case, sp is subtracted by 80 and [sp, #160] points the location where c_cInterpreter saved the link register.

https://github.com/eclipse-openj9/openj9/blob/20fb92b38cc4756d487fe8e2d70c32cd10ffb0fe/runtime/compiler/aarch64/codegen/ARM64JNILinkage.cpp#L1061-L1111

Akira1Saitoh commented 1 year ago

On power, J9CInterpreterStackFrame has a space for JNI outgoing arguments, thus I think power does not have this problem. https://github.com/eclipse-openj9/openj9/blob/20fb92b38cc4756d487fe8e2d70c32cd10ffb0fe/runtime/oti/j9nonbuilder.h#L6226-L6241

On AArch64, J9CInterpreterStackFrame does not have such space. https://github.com/eclipse-openj9/openj9/blob/20fb92b38cc4756d487fe8e2d70c32cd10ffb0fe/runtime/oti/j9nonbuilder.h#L6294-L6297

Akira1Saitoh commented 1 year ago

x avoids this problem by implementing jitReleaseVMAccess helper differently than other helpers.

https://github.com/eclipse-openj9/openj9/blob/32a61988c5d34c32992429391be95b50256426a8/runtime/compiler/x/runtime/X86Codert.nasm#L1006C1-L1082

Akira1Saitoh commented 1 year ago

Opened #18227.

knn-k commented 1 year ago

I merged #18227 yesterday.

knn-k commented 1 year ago

Nightly sanity.system failed earlier this week in TestJlmRemoteThreadAuth_1. https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.system_aarch64_linux_Nightly/546/ I think it should be handled separately.

CL2 stderr java.rmi.UnmarshalException: Error unmarshaling return header; nested exception is: 
CL2 stderr  java.io.EOFException
CL2 stderr  at java.rmi/sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:255)
CL2 stderr  at java.rmi/sun.rmi.server.UnicastRef.invoke(UnicastRef.java:165)
CL2 stderr  at jdk.remoteref/jdk.jmx.remote.internal.rmi.PRef.invoke(Unknown Source)
CL2 stderr  at java.management.rmi/javax.management.remote.rmi.RMIConnectionImpl_Stub.invoke(RMIConnectionImpl_Stub.java:419)
CL2 stderr  at java.management.rmi/javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection.invoke(RMIConnector.java:1021)
CL2 stderr  at net.adoptopenjdk.test.jlm.resources.ThreadData.writeData(ThreadData.java:549)
CL2 stderr  at net.adoptopenjdk.test.jlm.remote.ThreadProfiler.getStatsViaServer(ThreadProfiler.java:199)
CL2 stderr  at net.adoptopenjdk.test.jlm.remote.ThreadProfiler.main(ThreadProfiler.java:99)
CL2 stderr Caused by: java.io.EOFException
CL2 stderr  at java.base/java.io.DataInputStream.readUnsignedByte(DataInputStream.java:290)
CL2 stderr  at java.base/java.io.DataInputStream.readByte(DataInputStream.java:268)
CL2 stderr  at java.rmi/sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:241)
CL2 stderr  ... 7 more
knn-k commented 1 year ago

I opened Issue #18280 for the UnmarshalException above.

knn-k commented 1 year ago

I think this issue can be closed now. Recent sanity.system jobs are not failing with TestJlmRemoteThreadAuth since #18227 was merged, except for the #18280 case above.