Open JasonFengJ9 opened 1 year ago
@hzongaro Please take a look
Julian @zl-wang, I'm assuming for now that this is specific to Power code generation. May I ask you to assign this for investigation?
@bhavanisn please take a look ...
@bhavanisn @zl-wang Any new updates?
@bhavanisn you can add the details of your progress here.
There is no solid update on this. As the issue is not reproducible, I am still looking into the core dump and jitdump to get more details on the crash. This is my analysis so far:
gdb shows the frame where it crashed, but the values are optimized out.
(gdb) frame 21
#21 0x00007e18b3c82e64 in OMR::PPCConstant<long>::patchRequestors (this=<optimized out>, cg=0x7e17c6876a20, addr=138644082150016)
at /home/jenkins/workspace/build-scripts/jobs/jdk21u/jdk21u-linux-ppc64le-openj9-IBM/workspace/build/src/omr/compiler/p/codegen/OMRConstantDataSnippet.hpp:109
(gdb) p/x 138644082150016
$1 = 0x7e1897444a80
(gdb) info frame
Stack level 21, frame at 0x7e1895b16510:
pc = 0x7e18b3c82e64 in OMR::PPCConstant<long>::patchRequestors
(/home/jenkins/workspace/build-scripts/jobs/jdk21u/jdk21u-linux-ppc64le-openj9-IBM/workspace/build/src/omr/compiler/p/codegen/OMRConstantDataSnippet.hpp:109);
saved pc = 0x7e18b3c820a0
called by frame at 0x7e1895b165a0, caller of frame at 0x7e1895b163f0
source language c++.
Arglist at 0x7e1895b163f0, args: this=<optimized out>, cg=0x7e17c6876a20, addr=138644082150016
Locals at 0x7e1895b163f0, Previous frame's sp is 0x7e1895b16510
Saved registers:
r14 at 0x7e1895b16480, r15 at 0x7e1895b16488, r16 at 0x7e1895b16490, r17 at 0x7e1895b16498, r18 at 0x7e1895b164a0, r19 at 0x7e1895b164a8, r20 at 0x7e1895b164b0,
r21 at 0x7e1895b164b8, r22 at 0x7e1895b164c0, r23 at 0x7e1895b164c8, r24 at 0x7e1895b164d0, r25 at 0x7e1895b164d8, r26 at 0x7e1895b164e0, r27 at 0x7e1895b164e8,
r28 at 0x7e1895b164f0, r29 at 0x7e1895b164f8, r30 at 0x7e1895b16500, r31 at 0x7e1895b16508, pc at 0x7e1895b16520, lr at 0x7e1895b16520
Code where it asserts : https://github.com/eclipse/omr/blob/master/compiler/p/codegen/OMRConstantDataSnippet.hpp#L97-L113
(gdb) p/x addr
$6 = 0x7e1897444a80
(gdb) p instr
$2 = (TR::Instruction *) 0x7e17c5154a40
(gdb) p *(TR::Instruction *) 0x7e17c5154a40
$3 = {<J9::Instruction> = {<OMR::Power::Instruction> = {<OMR::Instruction> = {_vptr.Instruction = 0x7e18b3fb44f8 <vtable for TR::PPCTrg1ImmInstruction+16>,
**_binaryEncodingBuffer = 0x0, _binaryLength = 0 '\000',** _estimatedBinaryLength = 0 '\000', _opcode = {<OMR::Power::InstOpCode> = {<OMR::InstOpCode> = {
_mnemonic = OMR::InstOpCode::pld}, ........
To start with, I am trying to find the values of offset and cursor, which are calculated from addr (passed as an argument) and instr (taken from the dump) as below.
uint32_t *cursor = reinterpret_cast<uint32_t*>(instr->getBinaryEncoding() + instr->getBinaryLength() - 8);
Since the dump shows both _binaryEncodingBuffer = 0x0 and _binaryLength = 0, cursor = -8 (0xFFFFFFF8).
intptr_t offset = reinterpret_cast<uint8_t*>(addr) - reinterpret_cast<uint8_t*>(cursor);
offset = 0x7e1897444a80 - 0xFFFFFFF8 = 0x7E1797444A88 << which looks incorrect
Need to look further to understand this.
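To sanity-check the arithmetic above, here is a small sketch that redoes the subtraction with the values from the dump and tests the result against the signed 34-bit displacement that a prefixed load such as pld can encode. The helper and constant names are hypothetical, and the cursor value 0xFFFFFFF8 is taken as-is from the analysis above rather than re-derived.

```cpp
#include <cstdint>

// Hypothetical helper: true if `offset` fits the signed 34-bit displacement
// of a Power prefixed load (pld), i.e. the range [-2^33, 2^33 - 1].
constexpr bool fitsPrefixedDisplacement(int64_t offset)
   {
   return offset >= -(INT64_C(1) << 33) && offset <= (INT64_C(1) << 33) - 1;
   }

// Hypothetical helper mirroring `offset = addr - cursor` from the analysis.
constexpr int64_t computeOffset(uint64_t addr, uint64_t cursor)
   {
   return static_cast<int64_t>(addr - cursor);
   }

constexpr uint64_t kAddr   = UINT64_C(0x7e1897444a80); // addr from frame 21
constexpr uint64_t kCursor = UINT64_C(0xFFFFFFF8);     // cursor as derived above

// The subtraction reproduces the value quoted in the analysis...
static_assert(computeOffset(kAddr, kCursor) == INT64_C(0x7E1797444A88), "offset mismatch");
// ...and that value is far outside what pld can encode.
static_assert(!fitsPrefixedDisplacement(computeOffset(kAddr, kCursor)), "unexpectedly in range");
```

Under these assumptions the offset cannot be encoded, which is consistent with the assert firing in patchRequestors.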
As it wasn't reproduced in the original grinder or for diagnosis, and hasn't yet been seen again, removing from the Java 21 plan.
1) It looks like this is possibly reproducible when run on a POWER10 machine. 2) Obviously, binaryEncodingBuffer==0 means the instruction hasn't been encoded yet (at least, its encodingBuffer hasn't been set yet). That is wrong and unexpected: you certainly cannot back-patch the offset with an unknown encodingBuffer.
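As a rough illustration of why back-patching needs an encoded instruction, the sketch below uses a hypothetical stand-in struct (not the OMR TR::Instruction class) and refuses to patch when the buffer was never set. The real code asserts instead of returning a status; this only shows the precondition.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for an instruction's encoding state; not the OMR class.
struct EncodedInstr
   {
   uint8_t *binaryEncoding; // nullptr until binary encoding has run
   uint8_t  binaryLength;   // 0 until binary encoding has run
   };

// Returns false instead of patching when the instruction was never encoded.
// On success, overwrites the last 8 bytes of the encoding with `patch`.
inline bool patchLast8Bytes(const EncodedInstr &instr, uint64_t patch)
   {
   if (instr.binaryEncoding == nullptr || instr.binaryLength < 8)
      return false;
   std::memcpy(instr.binaryEncoding + instr.binaryLength - 8, &patch, sizeof(patch));
   return true;
   }
```

With the values from the dump (buffer 0x0, length 0), a guard like this would reject the patch rather than compute a cursor from a null base.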
From the jit dump, the code where the crash occurred is part of the OutOfLine HelperCall code.
------------ start out-of-line instructions
[ 0x7e17badf62e0] 37 Outlined Label L7843:
[ 0x7e17badf74b0] 37 ori GPR_ 0x7e17badf7440, gr15, 0x0
[ 0x7e17badf7630] 35 addi GPR_ 0x7e17badf75c0, [851968] # SymRef boolean[#3819 Static] [flags 0x18307 0x0 ]
[ 0x7e17badf7740] 37 ori &GPR_ 0x7e17badf76d0, &GPR_ 0x7e17badf5940, 0x0
[ 0x7e17badf7b60] 37 pld D_GPR_ 0x7e17badf7850, 0000000000000000
[ 0x7e17badf7c00] 37 bl 00007E18B3CAB440 ; Direct Call "jitCheckCast"
PRE: [D_GPR_ 0x7e17badf73d0 : gr2] [GPR_ 0x7e17badf7440 : gr3] [GPR_ 0x7e17badf75c0 : gr4] [&GPR_ 0x7e17badf76d0 : gr5] [D_GPR_ 0x7e17badf77e0 : gr11] [D_GPR_ 0x7e17badf7850 : gr12] [D_GPR_ 0x7e17badf78c0 : gr0] [D_GPR_ 0x7e17badf7930 : gr6] [D_GPR_ 0x7e17badf79a0 : gr7] [D_GPR_ 0x7e17badf7a10 : gr8] [D_GPR_ 0x7e17badf7a80 : gr9] [D_GPR_ 0x7e17badf7af0 : gr10]
POST: [D_GPR_ 0x7e17badf73d0 : gr2] [GPR_ 0x7e17badf7440 : gr3] [GPR_ 0x7e17badf75c0 : gr4] [&GPR_ 0x7e17badf76d0 : gr5] [D_GPR_ 0x7e17badf77e0 : gr11] [D_GPR_ 0x7e17badf7850 : gr12] [D_GPR_ 0x7e17badf78c0 : gr0] [D_GPR_ 0x7e17badf7930 : gr6] [D_GPR_ 0x7e17badf79a0 : gr7] [D_GPR_ 0x7e17badf7a10 : gr8] [D_GPR_ 0x7e17badf7a80 : gr9] [D_GPR_ 0x7e17badf7af0 : gr10]
[ 0x7e17badf7cb0] 37 b Label L7842
[ 0x7e17badf7da0] 37 Label L7845:
------------ end out-of-line instructions
Tracing back from the OOL label L7843 to the mainline code didn't yield any results. Looking further in the compilation logs by searching backwards from the branch b Label L7842, it turns out that the mainline code did not generate the branch to the OOL helper call. Only the labels for that code are generated.
============================================================
; Live regs: GPR=1 FPR=0 CCR=0 VRF=0 VSX_SCALAR=0 VSX_VECTOR=0 {&GPR_ 0x7e17badf5940}
------------------------------
n20706n ( 0) checkcast [#86] [ 0x7e17c29637f0] bci=[239,37,491] rc=0 vc=861 vn=- li=1147 udi=- nc=2
n30567n ( 1) ==>aRegLoad (in &GPR_ 0x7e17badf5940) (X!=0 SeenRealReference )
n20705n ( 1) loadaddr boolean[#3819 Static] [flags 0x18307 0x0 ] [ 0x7e17c29637a0] bci=[239,35,491] rc=1 vc=861 vn=- li=1147 udi=- nc=0
------------------------------
checkcast: Emitting HelperCall for failure
Omitting CCR save/restore for helper calls
------------------------------
n20706n ( 0) checkcast [#86] [ 0x7e17c29637f0] bci=[239,37,491] rc=0 vc=861 vn=- li=1147 udi=- nc=2
n30567n ( 0) ==>aRegLoad (in &GPR_ 0x7e17badf5940) (X!=0 SeenRealReference )
n20705n ( 0) loadaddr boolean[#3819 Static] [flags 0x18307 0x0 ] [ 0x7e17c29637a0] bci=[239,35,491] rc=0 vc=1575 vn=- li=- udi=- nc=0
------------------------------
[ 0x7e17badf6010] 37 Label L7841:
[ 0x7e17badf6120] 37 Label L7840:
PRE:
POST: [CCR_ 0x7e17badf5fa0 : cr0] [&GPR_ 0x7e17badf5940 : ???]
[ 0x7e17badf6200] 37 Label L7842:
============================================================
; Live regs: GPR=0 FPR=0 CCR=0 VRF=0 VSX_SCALAR=0 VSX_VECTOR=0 {}
------------------------------
n25266n ( 0) return [ 0x7e17c2fcc940] bci=[239,37,491] rc=0 vc=861 vn=- li=1147 udi=- nc=0
When the testcase passes, the trace shows a different variation, which helped in analyzing the issue.
In all "good" cases there is always more than one sequence generated. https://github.com/eclipse-openj9/openj9/blob/6d4bb3715ff0f1b15bae28cdac948eaced54d121/runtime/compiler/p/codegen/J9TreeEvaluator.cpp#L3814
This while loop goes through each sequence and generates code for it, except for the terminal sequence, which can be the Helper/failure case (handled outside of the loop).
checkcast:Interpreter profiling instance class: [0000000000207600] java/lang/invoke/DirectMethodHandle, probability=0.5
checkcast: Emitting NullTest
checkcast: Emitting ProfiledClassTest
checkcast: Emitting SuperClassTest
checkcast: Emitting HelperCall for failure
Omitting CCR save/restore for helper calls
------------------------------
The branch to the next sequence label is generated inside the loop. So when we have numSequencesRemaining>1, everything works fine.
https://github.com/eclipse-openj9/openj9/blob/6d4bb3715ff0f1b15bae28cdac948eaced54d121/runtime/compiler/p/codegen/J9TreeEvaluator.cpp#L3900
In the failure case, only one sequence is generated (this might be a very rare case, considering we did not hit it until now). In this case the while loop is not entered, so the branch to nextSequenceLabel is not generated. The mainline code therefore never branches to the OOL code, which is left unencoded.
checkcast: Emitting HelperCall for failure
Omitting CCR save/restore for helper calls
------------------------------
The fix would be to generate a branch to the OOL code when that is the only sequence generated.
This is similar to the code in aarch64: https://github.com/eclipse-openj9/openj9/blob/6d4bb3715ff0f1b15bae28cdac948eaced54d121/runtime/compiler/aarch64/codegen/J9TreeEvaluator.cpp#L2532-L2536
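The control flow described above can be sketched with a toy model. This is not the J9TreeEvaluator code: the function names, the emitted strings and the branchWhenHelperOnly switch are all hypothetical, and the model only tracks whether the mainline ends up containing any branch that leads toward the out-of-line helper call.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model only. `sequences` always ends with the terminal "HelperCall"
// sequence, which is emitted out of line and never inside the loop.
inline std::vector<std::string> emitCheckCast(const std::vector<std::string> &sequences,
                                              bool branchWhenHelperOnly)
   {
   std::vector<std::string> mainline;
   std::size_t numSequencesRemaining = sequences.size();
   std::size_t i = 0;

   // Non-terminal sequences: each test is followed by a branch to the next
   // sequence label. For the last test, that label is the helper call.
   while (numSequencesRemaining > 1)
      {
      mainline.push_back("test " + sequences[i]);
      mainline.push_back("branch nextSequenceLabel");
      ++i;
      --numSequencesRemaining;
      }

   // If the loop never ran, nothing in the mainline targets the helper call
   // unless a branch is emitted explicitly here (the proposed fix).
   if (sequences.size() == 1 && branchWhenHelperOnly)
      mainline.push_back("branch helperCallLabel");

   mainline.push_back("doneLabel:");
   return mainline;
   }

// True if the mainline contains at least one branch, which in this model
// means the out-of-line helper call is reachable.
inline bool reachesOutOfLine(const std::vector<std::string> &mainline)
   {
   for (const std::string &instr : mainline)
      if (instr.rfind("branch ", 0) == 0)
         return true;
   return false;
   }
```

In this model, a list such as NullTest, ProfiledClassTest, SuperClassTest, HelperCall reaches the helper through the loop, while a list containing only HelperCall reaches it only when branchWhenHelperOnly is true.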
Failure link
From an internal build (ubu22lert-4): Rerun in Grinder - Change TARGET to run only the failed test targets.
Optional info
Failure output (captured from console output)
50x internal grinder - passed