Closed ghost closed 2 years ago
@0xdaryl fyi
@dsouzai : would you mind triaging this crash please?
@siyy123 could you upload the coredump somewhere? Also could you add -XcompilationThreads1 -Xjit:verbose={compilePerformance},vlog=vlog -Xaot:aotrtDebugLevel=30,rtLog=rtLog
(in the run where you have -Xaot:count=0
, do -Xaot:count=0,aotrtDebugLevel=30,rtLog=rtLog
because if you add multiple -Xaot
/-Xjit
only the last of each is considered). If you can upload the coredump, javacore, vlog, and rtLog from the run that generated the coredump then I can try and use windbg to take a look at the problem.
for -Xaot:count=0: https://1drv.ms/u/s!AotJ8siZ9lZhij2YHSfNJWqUQxZG?e=pfA7si
and for -Xquickstart: https://1drv.ms/u/s!AotJ8siZ9lZhij7YTYQ8qgnmCie1?e=TOpNG2
Sorry, wasn't able to get to this this week; will try to take a look on Monday.
Not really sure what's going on as this is quite deep in the codegen. There does seem to be something off though. I've been looking at the -Xquickstart
run.
The java stack at the time of the crash looks like this:
> !stack 0x010a8100
<10a8100> !j9method 0x000000000310C630 org/lwjgl/openal/AL10.► ♠♂?♀*
<10a8100> !j9method 0x000000000310C610 org/lwjgl/openal/AL10.alGetError()I
<10a8100> !j9method 0x0000000001883848 com/tmtravlr/soundfilters/ModifiedLWJGLOpenALLibrary.checkALError()Z
<10a8100> !j9method 0x0000000001883608 com/tmtravlr/soundfilters/ModifiedLWJGLOpenALLibrary.init()V
<10a8100> !j9method 0x00000000030FF7B0 paulscode/sound/SoundSystem.CommandNewLibrary(Ljava/lang/Class;)V
<10a8100> !j9method 0x00000000030FFCD0 paulscode/sound/SoundSystem.CommandQueue(Lpaulscode/sound/CommandObject;)Z
<10a8100> !j9method 0x0000000003108018 paulscode/sound/CommandThread.run()V
<10a8100> JNI call-in frame
<10a8100> Native method frame
From the stackslots:
Signature: org/lwjgl/openal/AL10.alGetError()I !bytecodes 0x000000000310C610
ROM Method: !j9rommethod 0x00000217C1554278
Next Method: !j9method 0x000000000310C630
> !stackslots 0x010a8100
<10a8100> *** BEGIN STACK WALK, flags = 00400001 walkThread = 0x00000000010A8100 ***
<10a8100> ITERATE_O_SLOTS
<10a8100> RECORD_BYTECODE_PC_OFFSET
<10a8100> Initial values: walkSP = 0x000000000314A8F8, PC = 0x0000000000000006, literals = 0x0000000000000000, A0 = 0x000000000314A970, j2iFrame = 0x0000000000000000, ELS = 0x000000EAF96FF500, decomp = 0x0000000
000000000
<10a8100> JIT JNI call-out frame: bp = 0x000000000314A918, sp = 0x000000000314A8F8, pc = 0x0000000000000006, cp = 0x000000000310D190, arg0EA = 0x000000000314A970, flags = 0x0000000020000000
<10a8100> Method: org/lwjgl/openal/AL10.► ♠♂?♀* !j9method 0x000000000310C630
<10a8100> JIT frame: bp = 0x000000000314A948, pc = 0x00007FF988A9DA00, unwindSP = 0x000000000314A920, cp = 0x000000000310D190, arg0EA = 0x000000000314A948, jitInfo = 0x000002184C0AE068
<10a8100> Method: org/lwjgl/openal/AL10.alGetError()I !j9method 0x000000000310C610
<10a8100> Bytecode index = 0, inlineDepth = 0, PC offset = 0x00000000000000A8
<10a8100> stackMap=0x000002184C0AE161, slots=I16(0x0000) parmBaseOffset=I16(0x0008), parmSlots=U16(0x0000), localBaseOffset=I16(0x0000)
<10a8100> Described JIT temps starting at 0x000000000314A948 for IDATA(0x0000000000000000) slots
<10a8100> JIT-RegisterMap = UDATA(0x0000000000000000)
<10a8100> JIT-Frame-RegisterMap[0x000000000314A930] = UDATA(0x0000000000000000) (jit_rbx)
<10a8100> JIT-Frame-RegisterMap[0x000000000314A938] = UDATA(0x0000000000000000) (jit_r9)
<10a8100> I2J values: PC = 0x00000217BFE707F0, A0 = 0x000000000314A970, walkSP = 0x000000000314A958, literals = 0x0000000001883848, JIT PC = 0x00007FF99C46AC10, pcAddress = 0x000000EAF96FF528, decomp = 0x0000000
000000000
<10a8100> Bytecode frame: bp = 0x000000000314A968, sp = 0x000000000314A958, pc = 0x00000217BFE707F0, cp = 0x00000000018824D0, arg0EA = 0x000000000314A970, flags = 0x0000000000000000
<10a8100> Method: com/tmtravlr/soundfilters/ModifiedLWJGLOpenALLibrary.checkALError()Z !j9method 0x0000000001883848
<10a8100> Bytecode index = 0
<10a8100> Using local mapper
<10a8100> Locals starting at 0x000000000314A970 for 0x0000000000000001 slots
<10a8100> O-Slot: a0[0x000000000314A970] = 0x00000007FE8ECD90
...
The top frame is a JIT JNI call-out frame. The native stack looks like this:
...
[0x11] ntdll!KiUserExceptionDispatcher + 0x2e
[0x12] j9jit29!fast_jitAcquireVMAccess + 0x1c
[0x13] j9jit29!jitAcquireVMAccess + 0xaf
There is a helper call to acquire vmaccess in org/lwjgl/openal/AL10.alGetError()I
so looking at that disassembly:
00007ff9`88a9d958 483b6550 cmp rsp, qword ptr [rbp+50h]
00007ff9`88a9d95c 0f8607010000 jbe 00007ff9`88a9da69
00007ff9`88a9d962 4883ec28 sub rsp, 28h
00007ff9`88a9d966 4c894c2418 mov qword ptr [rsp+18h], r9
00007ff9`88a9d96b 48895c2410 mov qword ptr [rsp+10h], rbx
00007ff9`88a9d970 4c8d542428 lea r10, [rsp+28h]
00007ff9`88a9d975 48c7453800000000 mov qword ptr [rbp+38h], 0
00007ff9`88a9d97d 6800000000 push 0
00007ff9`88a9d982 6800000000 push 0
00007ff9`88a9d987 48b800daa988f97f0000 mov rax, 7FF988A9DA00h
00007ff9`88a9d991 50 push rax
00007ff9`88a9d992 6800000020 push 20000000h
00007ff9`88a9d997 48b830c6100300000000 mov rax, 310C630h
00007ff9`88a9d9a1 50 push rax
00007ff9`88a9d9a2 48c7452806000000 mov qword ptr [rbp+28h], 6
00007ff9`88a9d9aa 48c7453000000000 mov qword ptr [rbp+30h], 0
00007ff9`88a9d9b2 48896520 mov qword ptr [rbp+20h], rsp
00007ff9`88a9d9b6 488ba5000a0000 mov rsp, qword ptr [rbp+0A00h]
00007ff9`88a9d9bd 55 push rbp
00007ff9`88a9d9be 4883ec28 sub rsp, 28h
00007ff9`88a9d9c2 488bcd mov rcx, rbp
00007ff9`88a9d9c5 48ba00208e0200000000 mov rdx, 28E2000h
00007ff9`88a9d9cf 4883c230 add rdx, 30h
00007ff9`88a9d9d3 48c785b809000001000000 mov qword ptr [rbp+9B8h], 1
00007ff9`88a9d9de 83bd9800000020 cmp dword ptr [rbp+98h], 20h
00007ff9`88a9d9e5 0f8577000000 jne 00007ff9`88a9da62
00007ff9`88a9d9eb 488b6c2458 mov rbp, qword ptr [rsp+58h]
00007ff9`88a9d9f0 49bb904c904c00800100 mov r11, 180004C904C90h
00007ff9`88a9d9fa 0000 add byte ptr [rax], al
00007ff9`88a9d9fc d3488b ror dword ptr [rax-75h], cl
00007ff9`88a9d9ff c84883c4 enter 8348h, 0C4h
00007ff9`88a9da03 285d48 sub byte ptr [rbp+48h], bl
00007ff9`88a9da06 c785b809000000000000 mov dword ptr [rbp+9B8h], 0
00007ff9`88a9da10 83bd9800000020 cmp dword ptr [rbp+98h], 20h
00007ff9`88a9da17 7542 jne 00007ff9`88a9da5b
00007ff9`88a9da19 4889a5000a0000 mov qword ptr [rbp+0A00h], rsp
00007ff9`88a9da20 488b6520 mov rsp, qword ptr [rbp+20h]
00007ff9`88a9da24 48036530 add rsp, qword ptr [rbp+30h]
00007ff9`88a9da28 48f744240800000300 test qword ptr [rsp+8], 30000h
00007ff9`88a9da31 7521 jne 00007ff9`88a9da54
00007ff9`88a9da33 4883c428 add rsp, 28h
00007ff9`88a9da37 48837d4800 cmp qword ptr [rbp+48h], 0
00007ff9`88a9da3c 753a jne 00007ff9`88a9da78
00007ff9`88a9da3e 488bc1 mov rax, rcx
00007ff9`88a9da41 89442424 mov dword ptr [rsp+24h], eax
00007ff9`88a9da45 4c8b4c2418 mov r9, qword ptr [rsp+18h]
00007ff9`88a9da4a 488b5c2410 mov rbx, qword ptr [rsp+10h]
00007ff9`88a9da4f 4883c428 add rsp, 28h
00007ff9`88a9da53 c3 ret
00007ff9`88a9da54 e897629c13 call j9jit29!jitCollapseJNIReferenceFrame (00007ff9`9c463cf0)
00007ff9`88a9da59 ebd8 jmp 00007ff9`88a9da33
00007ff9`88a9da5b e8f0cc9c13 call j9jit29!jitAcquireVMAccess (00007ff9`9c46a750)
00007ff9`88a9da60 ebb7 jmp 00007ff9`88a9da19
00007ff9`88a9da62 e88b3b4113 call j9jit29!jitReleaseVMAccess (00007ff9`9beb15f2)
00007ff9`88a9da67 eb82 jmp 00007ff9`88a9d9eb
00007ff9`88a9da69 bf28000000 mov edi, 28h
00007ff9`88a9da6e e8fd549c13 call j9jit29!jitStackOverflow (00007ff9`9c462f70)
00007ff9`88a9da73 e9eafeffff jmp 00007ff9`88a9d962
00007ff9`88a9da78 e8e3a99c13 call j9jit29!jitThrowCurrentException (00007ff9`9c468460)
this bit looks very wrong:
00007ff9`88a9d9f0 49bb904c904c00800100 mov r11, 180004C904C90h
00007ff9`88a9d9fa 0000 add byte ptr [rax], al
00007ff9`88a9d9fc d3488b ror dword ptr [rax-75h], cl
00007ff9`88a9d9ff c84883c4 enter 8348h, 0C4h
00007ff9`88a9da03 285d48 sub byte ptr [rbp+48h], bl
00007ff9`88a9da06 c785b809000000000000 mov dword ptr [rbp+9B8h], 0
The codegen generates this https://github.com/eclipse-openj9/openj9/blob/f07b484d053ede27e3680a7ab941ed1c93f6b9f5/runtime/compiler/x/amd64/codegen/AMD64JNILinkage.cpp#L765-L778
In the rtlog, I see this:
TR_JNITargetAddress 0000021832008A46
size 22 type 12 flags 0 reloFlags 0
inlined site index FFFFFFFFFFFFFFFF
constant pool 0000013DB10E3758
cpIndex 0000000000000013
reloLocation: from 0000021832008A66 at 00007FF988A9D9F2 (offset d2)
computeNewConstantPool: newCP 000000000310D190
getMethodFromCP: found static method 000000000310C630
JNI call relocation: found JNI target address 0000000180004C90
applyRelocation: registered JNI Call redefinition site
where the 0000000180004C90
can be seen in mov r11, 180004C904C90h
.
~What's confusing though is that according to this log, the relocation was performed at 00007FF988A9D9F2
, which is true:~
~00007FF9
88A9D9F2 000180004C904C90`~
~However there's two issues:~
~1. According to the vlog, the start of the method is 0x00007FF988A9D958
which is after 0x00007FF988A9D9F2
.~
~2. If the relo infra patched 0x00007FF988A9D9F2
then why is 0x00007ff988a9d9f0
fubar.~
EDIT: Clearly losing my mind. 0x00007FF988A9D958
is NOT after 0x00007FF988A9D9F2
. I guess the relo seems ok in terms of materializing the address and everything; maybe the offset it's supposed to patch is wrong?
I'll have to look into this some more.
I looked at the AOT code as stored in the SCC, and this is what it's supposed to look like:
00000218`32008d57 488b6c2458 mov rbp, qword ptr [rsp+58h]
00000218`32008d5c 49bb904c008001000000 mov r11, offset (00000001`80004c90)
00000218`32008d66 41ffd3 call r11
00000218`32008d69 488bc8 mov rcx, rax
00000218`32008d6c 4883c428 add rsp, 28h
00000218`32008d70 5d pop rbp
00000218`32008d71 48c785b809000000000000 mov qword ptr [rbp+9B8h], 0
Looking at the bad code in the code cache:
00007ff9`88a9d9f0 49bb904c904c00800100 mov r11, 180004C904C90h
00007ff9`88a9d9fa 0000 add byte ptr [rax], al
it looks like the the JNI target address was patched at 00007ff9
88a9d9f4(offset d4) instead of
00007ff988a9d9f2
(offset d2). However, the relo infra claims that it patched it at offset d2. So I guess there's (at least) two possibilities here:
TR_PatchJNICallSite
runtime assumption.). I don't know how much I believe in the TR_PatchJNICallSite
idea because the address to be patched is the same address supposedly used as the location to relocate.
Well I figured out what's wrong; I don't know how we didn't hit this sooner...
As can be seen in the disassembly, the address of the mov
instruction is 00007ff988a9d9f0
. As acquired from the reloLog, the value of reloLocation
is 00007ff988a9d9f2
which is the immediate field of the mov
instruction. Now, as part of relocation, we register a runtime assumption for this JNI call site:
https://github.com/eclipse-openj9/openj9/blob/f9835f865f2e37266c7dabb1d52e849c2a9708a6/runtime/compiler/runtime/RelocationRecord.cpp#L2005-L2006
If we go and look at what happens when the assumption is compensated: https://github.com/eclipse-openj9/openj9/blob/f9835f865f2e37266c7dabb1d52e849c2a9708a6/runtime/compiler/runtime/RuntimeAssumptions.cpp#L148 https://github.com/eclipse-openj9/openj9/blob/f9835f865f2e37266c7dabb1d52e849c2a9708a6/runtime/compiler/runtime/JitRuntime.cpp#L1046-L1049
callPC
here is going to be the value of reloLocation
which points to the immediate field. However, _patchJNICallSite
assumes that the address to patch is the start of the mov
instruction. So the end result is we end up patching at 00007ff988a9d9f2
+2 = 00007ff988a9d9f4
, which is what we're seeing in the JIT'd code at the time of the crash.
I don't think the fix is trivial though; the relo code is common across all platforms so I can't just update the value of reloLocation
in applyRelocation
(unless I do something hacky like an ifdef). I'll have to look at the code to see how best to implement the solution.
@siyy123 in the meantime, I suppose you could run with -Xaot:disableDirectToJNIInline
(after destroying any current SCC). If that doesn't work, then you'll have to do -Xaot:disableDirectToJNI
(after destroying the SCC).
-Xaot:disableDirectToJNIInline seems to work for me. It would not core dump anymore for both -Xquickstart and -Xaot:count=0 options.
Java -version output
openjdk version "1.8.0_322" IBM Semeru Runtime Open Edition (build 1.8.0_322-b04) Eclipse OpenJ9 VM (build openj9-0.30.0-m1, JRE 1.8.0 Windows 10 amd64-64-Bit Compressed References 20211130_265 (JIT enabled, AOT enabled) OpenJ9 - ea06f7479 OMR - eb9cf8525 JCL - a3346d8d7a based on jdk8u322-b04)
Summary of problem
I'm trying to launch minecraft 1.12.2 with rlcraft 2.9 modpack (forge-1.12.2-14.23.5.2860) using openj9-0.30.0-m1 or openj9-0.30.0-m2 with option -Xshareclasses -Xquickstart or -Xshareclasses -Xaot:count=0 and with a 2GB SCC. After destroying the SCC the first launch seems to be ok, but after the first launch it will not start anymore (with -Xaot:count=0 seems need to launch more times) and will crash when or just after initializing minecraft engine.
Diagnostic files
logs when using -Xshareclasses -Xquickstart:
javacore.20220118.221225.17824.0002.txt
logs when using -Xshareclasses -Xaot:count=0:
javacore.20220118.223207.17532.0002.txt
the dumps are with openj9-0.30.0-m1 but the bug still happens in openj9-0.30.0-m2.