pshipton closed this issue 1 year ago
@knn-k fyi
PC = R30 (LR) = InaccessibleAddress = 0x009772B0. It seems a wrong address was loaded into LR, and the code then returned to that address.
I ran the following Grinder jobs (30x2). https://openj9-jenkins.osuosl.org/job/Grinder/2902/ https://openj9-jenkins.osuosl.org/job/Grinder/2904/
I don't see the SEGV above, and there are different exceptions instead.
CL2 stderr java.rmi.UnmarshalException: Error unmarshaling return header; nested exception is:
CL2 stderr java.io.EOFException
CL2 stderr at java.rmi/sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:255)
CL2 stderr at java.rmi/sun.rmi.server.UnicastRef.invoke(UnicastRef.java:165)
CL2 stderr java.rmi.ConnectException: Connection refused to host: 147.28.142.201; nested exception is:
CL2 stderr java.net.ConnectException: Connection refused
CL2 stderr at java.rmi/sun.rmi.transport.tcp.TCPEndpoint.newSocket(TCPEndpoint.java:626)
CL2 stderr at java.rmi/sun.rmi.transport.tcp.TCPChannel.createConnection(TCPChannel.java:217)
It looks like the SEGV was actually happening in the Grinder jobs.
[2023-09-16T10:24:24.166Z] STF 06:24:23.260 - Found dump at: /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16948592869997/TestJlmRemoteThreadAuth_0/20230916-061447-TestJlmRemoteThreadAuth/results/javacore.20230916.062422.2223525.0002.txt
[2023-09-16T10:24:24.166Z] STF 06:24:23.261 - Found dump at: /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16948592869997/TestJlmRemoteThreadAuth_0/20230916-061447-TestJlmRemoteThreadAuth/results/core.20230916.062422.2223525.0001.dmp
[2023-09-16T10:24:24.166Z] CL2 j> 2023/09/16 06:24:23.487 Error unmarshaling return header; nested exception is:
[2023-09-16T10:24:24.166Z] CL2 java.io.EOFException
[2023-09-16T10:24:24.166Z] CL2 stderr javacore file generated - /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16948592869997/TestJlmRemoteThreadAuth_0/20230916-061447-TestJlmRemoteThreadAuth/results/javacore.20230916.062422.2223525.0002.txt
[2023-09-16T10:24:24.166Z] CL2 stderr core file generated - /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_16948592869997/TestJlmRemoteThreadAuth_0/20230916-061447-TestJlmRemoteThreadAuth/results/core.20230916.062422.2223525.0001.dmp
It seems that the link register value saved on the native stack in c_cInterpreter gets overwritten by the jitReleaseVMAccess helper called from the JNI invocation sequence. When returning from c_cInterpreter, an incorrect value is loaded into the link register, so it crashes on the ret instruction.
(gdb) disassemble c_cInterpreter
Dump of assembler code for function c_cInterpreter:
0x0000fffff793144c <+0>: sub sp, sp, #0x3a0
0x0000fffff7931450 <+4>: stp x19, x20, [sp]
0x0000fffff7931454 <+8>: stp x21, x22, [sp, #16]
0x0000fffff7931458 <+12>: stp x23, x24, [sp, #32]
0x0000fffff793145c <+16>: stp x25, x26, [sp, #48]
0x0000fffff7931460 <+20>: stp x27, x28, [sp, #64]
0x0000fffff7931464 <+24>: stp x29, x30, [sp, #80] <--- LR (x30) saved into 0xffffd1dbf208
0x0000fffff7931468 <+28>: stp d8, d9, [sp, #96]
0x0000fffff793146c <+32>: stp d10, d11, [sp, #112]
0x0000fffff7931470 <+36>: stp d12, d13, [sp, #128]
0x0000fffff7931474 <+40>: stp d14, d15, [sp, #144]
0x0000fffff7931478 <+44>: mov x19, x0
0x0000fffff793147c <+48>: ldr x27, [x19, #600]
0x0000fffff7931480 <+52>: add x28, sp, #0xa0
0x0000fffff7931484 <+56>: str x28, [x27, #8]
0x0000fffff7931488 <+60>: add x28, sp, #0x1a0
0x0000fffff793148c <+64>: str x28, [x27, #48]
0x0000fffff7931490 <+68>: mov x0, x19
0x0000fffff7931494 <+72>: ldr x27, [x19, #8]
0x0000fffff7931498 <+76>: ldr x28, [x27, #168]
0x0000fffff793149c <+80>: blr x28
(gdb) disassemble jitReleaseVMAccess
Dump of assembler code for function jitReleaseVMAccess:
0x0000fffff66fba3c <+0>: stp x29, x30, [sp, #392]
0x0000fffff66fba40 <+4>: stp x0, x1, [sp, #160] <--- this instruction overwrites 0xffffd1dbf208
0x0000fffff66fba44 <+8>: stp x2, x3, [sp, #176]
The location where the link register is saved in c_cInterpreter is J9CInterpreterStackFrame.preservedGPRs. The location where the GPRs are saved in jitReleaseVMAccess is J9CInterpreterStackFrame.jitGPRs.
https://github.com/eclipse-openj9/openj9/blob/20fb92b38cc4756d487fe8e2d70c32cd10ffb0fe/runtime/oti/j9nonbuilder.h#L6294-L6297
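As a rough cross-check of the frame layout against the disassembly above, the structure can be modeled with Python's ctypes. This is only a sketch: the field names preservedGPRs and jitGPRs come from this thread, while the names preservedFPRs and jitFPRs and all four array sizes are my assumptions, chosen so that the totals match the constants seen in the disassembly (0x3a0, 0xa0, 0x1a0, and 392). The header linked above is the authoritative definition.

```python
import ctypes


class FrameModel(ctypes.Structure):
    """Hypothetical model of the AArch64 J9CInterpreterStackFrame layout.

    Field sizes are assumptions inferred from the disassembly, not copied
    from j9nonbuilder.h.
    """
    _fields_ = [
        ("preservedGPRs", ctypes.c_uint64 * 12),     # assumed: x19..x30
        ("preservedFPRs", ctypes.c_uint64 * 8),      # assumed: d8..d15
        ("jitGPRs", ctypes.c_uint64 * 32),           # assumed: one slot per GPR
        ("jitFPRs", ctypes.c_uint8 * (32 * 16)),     # assumed: 16 bytes per FPR
    ]


frame_size = ctypes.sizeof(FrameModel)
jit_gprs_offset = FrameModel.jitGPRs.offset
jit_fprs_offset = FrameModel.jitFPRs.offset
x29_slot_in_jit_gprs = jit_gprs_offset + 29 * 8

print(f"frame size      = {frame_size:#x}")        # compare: sub sp, sp, #0x3a0
print(f"jitGPRs offset  = {jit_gprs_offset:#x}")   # compare: add x28, sp, #0xa0
print(f"jitFPRs offset  = {jit_fprs_offset:#x}")   # compare: add x28, sp, #0x1a0
print(f"jitGPRs[29] at  = {x29_slot_in_jit_gprs}") # compare: stp x29, x30, [sp, #392]
```

If the assumed sizes are right, jitGPRs begins at offset 160 from the frame base, which is why jitReleaseVMAccess uses [sp, #160] for x0 and x1 and expects sp to still point at the frame base.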
The reason why jitReleaseVMAccess overwrites preservedGPRs is that the native stack pointer is modified in the JNI invocation sequence in order to pass the JNI arguments on the stack. The native stack pointer is modified before calling out to jitReleaseVMAccess. In this case, sp is decremented by 80, so [sp, #160] points to the location where c_cInterpreter saved the link register.
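The address arithmetic can be checked with a small script. This is a sketch that uses only the values quoted in this thread (the saved LR address 0xffffd1dbf208, the two stp offsets, and the 80-byte sp adjustment); the helper function and variable names are mine, introduced for illustration.

```python
# Values taken from the gdb output and the explanation in this thread.
LR_SAVE_ADDRESS = 0xFFFFD1DBF208  # where c_cInterpreter stored x30
FRAME_PAIR_OFFSET = 80            # stp x29, x30, [sp, #80] in c_cInterpreter
JNI_SP_ADJUSTMENT = 80            # sp is decremented by 80 for JNI arguments
HELPER_PAIR_OFFSET = 160          # stp x0, x1, [sp, #160] in jitReleaseVMAccess


def stp_targets(sp, offset, first_reg, second_reg):
    """Map each 8-byte slot written by 'stp first, second, [sp, #offset]'
    to the register stored there."""
    return {sp + offset: first_reg, sp + offset + 8: second_reg}


# x30 is the second register of the pair, so it lands 8 bytes past the offset.
sp_in_interpreter = LR_SAVE_ADDRESS - (FRAME_PAIR_OFFSET + 8)
sp_in_helper = sp_in_interpreter - JNI_SP_ADJUSTMENT

saved = stp_targets(sp_in_interpreter, FRAME_PAIR_OFFSET, "x29", "x30")
clobbered = stp_targets(sp_in_helper, HELPER_PAIR_OFFSET, "x0", "x1")

# Report every slot that both stores write to.
for address in sorted(saved.keys() & clobbered.keys()):
    print(f"{address:#x}: saved {saved[address]} overwritten by {clobbered[address]}")
```

Under these numbers both slots of the pair collide: the saved x29 slot receives x0, and the saved x30 (LR) slot receives x1, which matches the bogus return address seen in the crash.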
On Power, J9CInterpreterStackFrame has space for JNI outgoing arguments, so I think Power does not have this problem.
https://github.com/eclipse-openj9/openj9/blob/20fb92b38cc4756d487fe8e2d70c32cd10ffb0fe/runtime/oti/j9nonbuilder.h#L6226-L6241
On AArch64, J9CInterpreterStackFrame does not have such space.
https://github.com/eclipse-openj9/openj9/blob/20fb92b38cc4756d487fe8e2d70c32cd10ffb0fe/runtime/oti/j9nonbuilder.h#L6294-L6297
x avoids this problem by implementing the jitReleaseVMAccess helper differently from the other helpers.
Opened #18227.
I merged #18227 yesterday.
Nightly sanity.system failed earlier this week in TestJlmRemoteThreadAuth_1. https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.system_aarch64_linux_Nightly/546/ I think it should be handled separately.
CL2 stderr java.rmi.UnmarshalException: Error unmarshaling return header; nested exception is:
CL2 stderr java.io.EOFException
CL2 stderr at java.rmi/sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:255)
CL2 stderr at java.rmi/sun.rmi.server.UnicastRef.invoke(UnicastRef.java:165)
CL2 stderr at jdk.remoteref/jdk.jmx.remote.internal.rmi.PRef.invoke(Unknown Source)
CL2 stderr at java.management.rmi/javax.management.remote.rmi.RMIConnectionImpl_Stub.invoke(RMIConnectionImpl_Stub.java:419)
CL2 stderr at java.management.rmi/javax.management.remote.rmi.RMIConnector$RemoteMBeanServerConnection.invoke(RMIConnector.java:1021)
CL2 stderr at net.adoptopenjdk.test.jlm.resources.ThreadData.writeData(ThreadData.java:549)
CL2 stderr at net.adoptopenjdk.test.jlm.remote.ThreadProfiler.getStatsViaServer(ThreadProfiler.java:199)
CL2 stderr at net.adoptopenjdk.test.jlm.remote.ThreadProfiler.main(ThreadProfiler.java:99)
CL2 stderr Caused by: java.io.EOFException
CL2 stderr at java.base/java.io.DataInputStream.readUnsignedByte(DataInputStream.java:290)
CL2 stderr at java.base/java.io.DataInputStream.readByte(DataInputStream.java:268)
CL2 stderr at java.rmi/sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:241)
CL2 stderr ... 7 more
I opened Issue #18280 for the UnmarshalException above.
I think this issue can be closed now. Recent sanity.system jobs are not failing with TestJlmRemoteThreadAuth since #18227 was merged, except for the #18280 case above.
https://openj9-jenkins.osuosl.org/job/Test_openjdk17_j9_sanity.system_aarch64_linux_Nightly_testList_1/531 TestJlmRemoteThreadAuth_0
https://openj9-artifactory.osuosl.org/artifactory/ci-openj9/Test/Test_openjdk17_j9_sanity.system_aarch64_linux_Nightly_testList_1/531/system_test_output.tar.gz