pshipton closed this issue 6 years ago.
@zl-wang - Power platform failure. Wondering if it has anything to do with @jdmpapin 's or @dsouzai 's recent work. @vijaysun-omr
The new AOT changes have not been merged yet; the PRs are currently being reviewed. So I doubt it's related to the work being done by @dsouzai
Thanks @vijaysun-omr . @zl-wang, @gita-omr is this a Power specific issue, if so, who should own it? See the last two comments.
I doubt it is Power specific. @IBMJimmyk is investigating.
FYI @IBMJimmyk is working on this. Thanks @gita-omr and @zl-wang . @IBMJimmyk - thanks for looking into this, as previously, frequent updates are very appreciated, this is blocking RC1.
From the Javacore file: $r26 = 0xFFF0A478
```
0x3fff624ef2d0 {java/.../FoldNonvoidHandle.invokeExact_thunkArchetype_X} +26 || 809a002c lwz r4,+44(r26)
0x3fff624ef2d4 {java/.../FoldNonvoidHandle.invokeExact_thunkArchetype_X} +27 || 80ba0028 lwz r5,+40(r26)
0x3fff624ef2d8 {java/.../FoldNonvoidHandle.invokeExact_thunkArchetype_X} +28 || 63430000 ori r3,r26,0x0
0x3fff624ef2dc {java/.../FoldNonvoidHandle.invokeExact_thunkArchetype_X} +29 -1:15 ||| 480000bd bl 0x3fff624ef398 U>> +76 Snippet-> {java/lang/invoke/MethodHandle.undoCustomizationLogic}
0x3fff624ef2e0 {java/.../FoldNonvoidHandle.invokeExact_thunkArchetype_X} +30 ||| 63430000 ori r3,r26,0x0
0x3fff624ef2e4 {java/.../FoldNonvoidHandle.invokeExact_thunkArchetype_X} +31 -1:25 |||| 48000099 bl 0x3fff624ef37c U>> +69 Snippet-> {java/lang/invoke/MethodHandle.doCustomizationLogic}
0x3fff624ef2e8 {java/.../FoldNonvoidHandle.invokeExact_thunkArchetype_X} +32 |||| 83fa0028 lwz r31,+40(r26) //$r31 = 0xFFF0A2A0
0x3fff624ef2ec {java/.../FoldNonvoidHandle.invokeExact_thunkArchetype_X} +33 |||| 807a002c lwz r3,+44(r26) //$r3 = 0x80B0F210
0x3fff624ef2f0 {java/.../FoldNonvoidHandle.invokeExact_thunkArchetype_X} +34 -1:40 |||| 08830000 tdi TO_EQ,r3,+0
0x3fff624ef2f4 {java/.../FoldNonvoidHandle.invokeExact_thunkArchetype_X} +35 |||| 80430010 lwz r2,+16(r3) //$r2 = 0x80B8B668
0x3fff624ef2f8 {java/.../FoldNonvoidHandle.invokeExact_thunkArchetype_X} +36 |||| 63840000 ori r4,r28,0x0 //$r4 = 0x0
0x3fff624ef2fc {java/.../FoldNonvoidHandle.invokeExact_thunkArchetype_X} +37 |||| 63250000 ori r5,r25,0x0 //$r5 = 0x6e
0x3fff624ef300 {java/.../FoldNonvoidHandle.invokeExact_thunkArchetype_X} +38 |||| e8020008 ld r0,+8(r2) //$r0 = 0x3FFF80CE290C
0x3fff624ef304 {java/.../FoldNonvoidHandle.invokeExact_thunkArchetype_X} +39 |||| 7c0903a6 mtspr CTR,r0
0x3fff624ef308 {java/.../FoldNonvoidHandle.invokeExact_thunkArchetype_X} +40 -1:50 |||| 4e800421 bctrl // Branch to 0x3FFF80CE290C
```
```
0x00003fff80ce290c: addi r14,r14,-48
0x00003fff80ce2910: std r3,0(r14)
0x00003fff80ce2914: std r15,8(r14)
0x00003fff80ce2918: std r5,40(r14)
0x00003fff80ce291c: std r4,32(r14)
0x00003fff80ce2920: std r3,24(r14)
0x00003fff80ce2924: mflr r4
0x00003fff80ce2928: std r4,16(r14)
0x00003fff80ce292c: ld r3,432(r15)
0x00003fff80ce2930: ld r4,-32248(r3)
0x00003fff80ce2934: ld r3,-32216(r3)
0x00003fff80ce2938: mtctr r4
0x00003fff80ce293c: mr r4,r14
0x00003fff80ce2940: mr r5,r14
0x00003fff80ce2944: bctrl
0x00003fff80ce2948: ld r4,0(r14) //$r4 = 0x00003fff7c0cfdb4
0x00003fff80ce294c: mtctr r4 //$CTR = 0x00003fff7c0cfdb4
0x00003fff80ce2950: ld r4,16(r14)
0x00003fff80ce2954: mtlr r4
0x00003fff80ce2958: ld r3,24(r14)
0x00003fff80ce295c: ld r4,32(r14)
0x00003fff80ce2960: ld r5,40(r14)
0x00003fff80ce2964: addi r14,r14,48
0x00003fff80ce2968: bctr //Branch to 0x00003fff7c0cfdb4 and crash since it doesn't contain a valid instruction
```
From looking at the core file and javacore file it looks like this is the path taken just before the crash.
This looks invokedynamic related. r3 was stored at 0(r14), and r4 was later loaded from 0(r14). Could you check whether the incoming r3 matches the value 0x00003fff7c0cfdb4? Hopefully the call at 0x00003fff80ce2944 didn't change the value at 0(r14).
Unfortunately the call does change it.
```
//I can read from +44 from $r26 and see that $r3 = 0x80B0F210.
0x3fff624ef2ec {java/.../FoldNonvoidHandle.invokeExact_thunkArchetype_X} +33 |||| 807a002c lwz r3,+44(r26)

//$r3 should get stored to 0(r14)
0x00003fff80ce2910: std r3,0(r14)

//This call is in between the store and the load
0x00003fff80ce2944: bctrl

//There's a different value when you read it back out. $r4 = 0x00003fff7c0cfdb4
0x00003fff80ce2948: ld r4,0(r14)
```
The data at 0x80B0F210 isn't executable code either so I think overwriting the value is probably expected. The problem is the value is overwritten with a location that is still bad.
For the call at 0x00003fff80ce2944, that is a strange convention to say the least: you set up all the arguments in registers, but return value at stack-top. The called function is one from a table recorded in J9VMThread --- 0x00003fff80ce2930: ld r4,-32248(r3) which comes from 0x00003fff80ce292c: ld r3,432(r15).
That strange convention is suspicious to me. Let's see what that table is, and clarify that strange convention.
The section of code starting at 0x00003fff80ce290c looks to be _initialInvokeExactThunkGlue. The bctrl at 0x00003fff80ce2944 looks to call out to jitCallCFunction.
```
0x00003fff80ce2934: ld r3,-32216(r3) //$r3 = 0x3fff80cb54f4 = start of initialInvokeExactThunk_unwrapper
0x00003fff80ce2938: mtctr r4
0x00003fff80ce293c: mr r4,r14
0x00003fff80ce2940: mr r5,r14
0x00003fff80ce2944: bctrl //call to jitCallCFunction
```
r4 is a pointer to parameters and r5 is a pointer to where jitCallCFunction will write back return values. It looks like a bad return value is coming back for some reason.
Current plan is to try and reproduce the problem with -Xjit:verbose={j2iThunks} option set to gather more information about what might have gone wrong.
From the core file, this is the path taken leading up to the crash. I've annotated the important locations where a register or memory value changes.
$r3 starts at 0x80b0f210 and is an object of type java/lang/invoke/DirectHandle
$r15 starts at 0x2b9800 and is the J9VMThread
$r14 starts at 0x63c2f0 but is immediately changed to 0x63c2c0
```
0x00003fff80ce290c <startproc._initialInvokeExactThunkGlue+0>: addi r14,r14,-48 //$r14 = 0x63c2c0
0x00003fff80ce2910 <startproc._initialInvokeExactThunkGlue+4>: std r3,0(r14) //0(r14) = 0x80b0f210
0x00003fff80ce2914 <startproc._initialInvokeExactThunkGlue+8>: std r15,8(r14) //8(r14) = $r15 = 0x2b9800
0x00003fff80ce2918 <startproc._initialInvokeExactThunkGlue+12>: std r5,40(r14)
0x00003fff80ce291c <startproc._initialInvokeExactThunkGlue+16>: std r4,32(r14)
0x00003fff80ce2920 <startproc._initialInvokeExactThunkGlue+20>: std r3,24(r14)
0x00003fff80ce2924 <startproc._initialInvokeExactThunkGlue+24>: mflr r4
0x00003fff80ce2928 <startproc._initialInvokeExactThunkGlue+28>: std r4,16(r14)
0x00003fff80ce292c <startproc._initialInvokeExactThunkGlue+32>: ld r3,432(r15) //$r3 = 0x3fff815e1300
0x00003fff80ce2930 <startproc._initialInvokeExactThunkGlue+36>: ld r4,-32248(r3) //$r4 = 0x3fff813b7fe8 = start of jitCallCFunction
0x00003fff80ce2934 <startproc._initialInvokeExactThunkGlue+40>: ld r3,-32216(r3) //$r3 = 0x3fff80cb54f4 = start of initialInvokeExactThunk_unwrapper
0x00003fff80ce2938 <startproc._initialInvokeExactThunkGlue+44>: mtctr r4 //$ctr = 0x3fff813b7fe8
0x00003fff80ce293c <startproc._initialInvokeExactThunkGlue+48>: mr r4,r14 //$r4 = 0x63c2c0, address of input parameter to initialInvokeExactThunk_unwrapper
0x00003fff80ce2940 <startproc._initialInvokeExactThunkGlue+52>: mr r5,r14 //$r5 = 0x63c2c0, address of where to store return value from initialInvokeExactThunk_unwrapper
0x00003fff80ce2944 <startproc._initialInvokeExactThunkGlue+56>: bctrl //Jump to jitCallCFunction (0x3fff813b7fe8)
0x00003fff80ce2948 <startproc._initialInvokeExactThunkGlue+60>: ld r4,0(r14) //Read from 0x63c2c0. $r4 = 0x3fff7c0cfdb4, this is garbage and bad
0x00003fff80ce294c <startproc._initialInvokeExactThunkGlue+64>: mtctr r4 //$ctr = 0x3fff7c0cfdb4
0x00003fff80ce2950 <startproc._initialInvokeExactThunkGlue+68>: ld r4,16(r14)
0x00003fff80ce2954 <startproc._initialInvokeExactThunkGlue+72>: mtlr r4
0x00003fff80ce2958 <startproc._initialInvokeExactThunkGlue+76>: ld r3,24(r14)
0x00003fff80ce295c <startproc._initialInvokeExactThunkGlue+80>: ld r4,32(r14)
0x00003fff80ce2960 <startproc._initialInvokeExactThunkGlue+84>: ld r5,40(r14)
0x00003fff80ce2964 <startproc._initialInvokeExactThunkGlue+88>: addi r14,r14,48
0x00003fff80ce2968 <startproc._initialInvokeExactThunkGlue+92>: bctr //Branch to 0x3fff7c0cfdb4 and crash immediately
```
The jump to jitCallCFunction at 0x00003fff80ce2944 eventually reaches old_slow_jitCallCFunction inside runtime/codert_vm/cnathelp.cpp.
old_slow_jitCallCFunction calls the function initialInvokeExactThunk_unwrapper and passes in the addresses to its input parameters and where to store its results. In this case both addresses are the same: 0x63c2c0. When the ld at 0x00003fff80ce2948 eventually reads the result from 0x63c2c0, it gets a bad value.
initialInvokeExactThunk_unwrapper is inside runtime/compiler/runtime/JitRuntime.cpp. At the start it attempts to read its two input values using the passed in address. From the corefile, it looks like it should have read 0x2b9800 as the J9VMThread and 0x80b0f210 as the methodHandle. 0x2b9800 looks like it really was the J9VMThread and 0x80b0f210 was a Java object of type java/lang/invoke/DirectHandle. initialInvokeExactThunk_unwrapper then makes a call to initialInvokeExactThunk and stores the result to resPtr. This store looks like it was the last thing to touch the address 0x63c2c0 and seems to be responsible for the bad value that is read back out.
The -Xjit:verbose={j2iThunks} option was expected to give more insight into what was happening inside initialInvokeExactThunk but I have not yet been able to reproduce the problem with this -Xjit option active.
@DanHeidinga - @zl-wang mentioned to me - this "looks like a problem in VM code. JIT-ed code (Thunk_archetype for JSR292) calls jitCallCFunction with valid arguments (J9VMThread, and MethodHandle object). That C function is VM code. It returns bad address on stack top."
@gacholio can you take a look?
I don't understand the comment from @zl-wang - there's nothing wrong with jitCallCFunction (it's so trivial, there pretty much couldn't be):
```c
typedef void (*twoVoidFunc)(void*, void*);

void J9FASTCALL
old_slow_jitCallCFunction(J9VMThread *currentThread)
{
	OLD_SLOW_ONLY_JIT_HELPER_PROLOGUE(3);
	DECLARE_JIT_PARM(twoVoidFunc, functionPointer, 1);
	DECLARE_JIT_PARM(void*, argumentPointer, 2);
	DECLARE_JIT_PARM(void*, returnValuePointer, 3);
	functionPointer(argumentPointer, returnValuePointer);
	SLOW_JIT_HELPER_EPILOGUE();
}
```
So if there's some kind of mismatch in arguments, it's a mismatch between the generated code that calls the helper and the argument array used in the target function.
I think the problem is with what old_slow_jitCallCFunction calls. old_slow_jitCallCFunction calls out to initialInvokeExactThunk_unwrapper.
```c
void initialInvokeExactThunk_unwrapper(void **argsPtr, void **resPtr)
{
	J9VMThread *vmThread = (J9VMThread *)argsPtr[1];
	j9object_t methodHandle = (j9object_t)argsPtr[0];
	*resPtr = initialInvokeExactThunk(methodHandle, vmThread);
}
```
What seems to be happening is that initialInvokeExactThunk returns a bad value, which is written to the resPtr location and goes on to cause a problem when it is read later. Right now it seems like good values for methodHandle and vmThread are being passed into initialInvokeExactThunk.
Right, by "the C function" I meant initialInvokeExactThunk rather than the jitCallCFunction helper, i.e. what is eventually called by jitCallCFunction.
initialInvokeExactThunk is JIT code:
https://github.com/eclipse/openj9/blob/450399bbd37af823f4360f033ac0e48dbdcd1d6c/runtime/compiler/env/VMJ9.cpp#L9101-L9111
@zl-wang Is this the code you think is incorrect?
For the record, as I understand it this failure occurs roughly 1 in every 20 to 30 runs.
And it fails in FoldHandle. Is that something rarely used, e.g. not created by regular Java lambdas?
Hard to say, it depends on the code being run, and if it uses MethodHandles. Nashorn uses it.
@gita-omr I don't think there's anything special about FoldHandle. We could hit this same issue in any of the MH subclasses
Jimmy will add an update soon. But yes, looks like a general (albeit very intermittent) MH problem.
I ran the test 120 more times since yesterday. 60 runs under the original options and 60 runs with -Xjit:verbose={j2iThunks}. With the original options all 60 runs passed. With the verbose option, 59 out of 60 passed. A grinder yesterday showed 1 failure in 30 tries. There were previous failures even before yesterday but I'm not sure how many runs were done to produce them. Based on these results, I think the failure rate is actually less than 1/30.
I'm currently looking at the new logs I got and hopefully it will show something useful.
A quick update on the nature of the problem: last night @IBMJimmyk and @jdmpapin did a very thorough debugging session and were able to find the contents of the j2iThunk table in the core file. This is the table initialInvokeExactThunk searches through. It looks like the table contains a valid j2iThunk for the signature we are looking for. However, the latest failing verbose log shows that an incorrect value was returned. This could be due to a race condition (although reads and writes to the table seem to be properly synchronized). @IBMJimmyk is currently trying to confirm the whole theory based on the new verbose log and core file.
From the end of findThunkFromTerseSignature inside runtime/compiler/env/J2IThunk.cpp:
```cpp
else
	{
	OMR::CriticalSection critialSection(_monitor);
	match = root()->get(terseSignature, _nodes, false);
	}
return match ? match->_thunk : NULL;
```
After the code exits the critical section, the Node object pointed to by match can move (its backing storage can be reallocated). If that happens before we read match->_thunk, and something else is written to that memory location, we can end up with a bad result.
This is a problem. With any luck it will turn out to be the problem. I will be running tests to check.
I created a special build that adds a 10us delay just after the critical section, before the code reads match->_thunk. With that build, 5 out of 6 runs failed, and they failed in a similar way. I also added debugging code to verify that reading match->_thunk just before leaving the critical section never gave the bad result. I am now fairly certain I have found the problem.
Nice @IBMJimmyk ! This should also help speed up the proof that the test passes (no need to wait 60 times for a reproduction). Doing the same experiment above with the fix might do it.
It seems like people are already happy with the fix but for the record I tried running this test 100 times overnight with my fix and they all passed.
The fix was so clearly the right thing to do that I wanted to get it in ASAP :) Thanks a lot!
https://ci.eclipse.org/openj9/job/Test-extended.system-JDK11-linux_ppc-64_cmprssptrs_le/26/
There is a core file in systemtest_test_output.tar.gz
.\openjdk-tests\TestConfig\test_output_15378579504933\SharedClasses.SCM23.MultiThread_0\20180925-071823-SharedClasses\results\