Closed AlenBadel closed 5 years ago
There a number of methods that this is reproducible with.
To name a few:
java/util/HashMap.putVal(ILjava/lang/Object;Ljava/lang/Object;ZZ)Ljava/lang/Object
java/util/HashMap.resize()[Ljava/util/HashMap$Node
java/lang/Object.<init>()V
It can be reproduced with simply
java -version
This fails with the following options:
optLevel=noOpt
lastOptIndex=0
disableInlining
count=0
Running through this in gdb, during the transition from execution within the jit compiled method, and fast_jitInstanceOf
the TOC register r2 is attempted to be populated by register r12.
0x7ffff64376e4 <fast_jitInstanceOf(J9VMThread*, J9Class*, j9object_t)> addis r2,r12,60 │*
│0x7ffff64376e8 <fast_jitInstanceOf(J9VMThread*, J9Class*, j9object_t)+4> addi r2,r2,-19940 │*
The address in r12 before these two instructions is not obvious to me, but it is a valid address. After the two instructions above, an invalid (assumed TOC) address is fed into register r2.
As execution eventually reaches inlineCheckCast
we get the exception when the system tries loading from an invalid address inside of register r2.
Backtrace
#0 0x00007ffff56e869c in 00000017.plt_call._ZN12VM_VMHelpers15inlineCheckCastEP7J9ClassS1_b () from /home/abadel/break/new/jdk/lib/compressedrefs/libj9jit29.so
#1 0x00007ffff6437740 in fast_jitInstanceOf (currentThread=0x44500,
castClass=0xa5600, object=0x7ffe03870) at cnathelp.cpp:2676
#2 0x00007fffc03607c4 in ?? ()
#3 0x00007ffff7182e24 in sendClinit (
currentThread=<error reading variable: Cannot access memory at address 0x88>,
clazz=<error reading variable: Cannot access memory at address 0x80>,
reserved1=<error reading variable: Cannot access memory at address 0x78>,
reserved2=<error reading variable: Cannot access memory at address 0x70>,
reserved3=<error reading variable: Cannot access memory at address 0x68>)
at callin.cpp:419
**There's a fair bit of missing debug data in this debug build.
@AlenBadel will add an update shortly, so I am jumping slightly ahead. I wanted to ask @gacholio : what is the convention between JITted code and the helpers in cnathelp.cpp in terms of setting the TOC?
The C code follows the C calling convention of the platform - there are no exceptions. The wrapper helpers handle this. If the C code is called directly from the compiled code, you will need to follow the appropriate convention there.
@gacholio @zl-wang, apparently with -O0 (@AlenBadel is verifying) gcc++ creates this at the beginning of external methods:
0x7ffff64376e4 <fast_jitInstanceOf(J9VMThread, J9Class, j9object_t)> addis r2,r12,60 │
│0x7ffff64376e8 <fast_jitInstanceOf(J9VMThread, J9Class*, j9object_t)+4> addi r2,r2,-19940
JIT definitely does not set R12 to be the entry point. How should the methods in cnathelp.cpp get the right TOC then?
My feeling is we are the first ones who are trying to build DEBUG VM on PPC LE :)
No mystery here. JIT-ed code call-able helpers: the TOC should be gotten by loading from J9VMThread->jitTOC. That has been the case for: AIX32/64, BE Linux64, and LE Linux. Suppose there is such a load ... I don't see why the global-entry pre-prologue (addis/addi based off r12) can be harmful, i.e. even if r2 is set to garbage by the pre-prologue, it will be corrected by the load mentioned above (before TOC is used in that routine).
So, is this issue due to: in non-debug build, JIT-ed code calls the wrapper containing the load; in debug build, JIT-ed code calls the C routine directly? JIT-ed code doesn't set up either r12 or r2 definitely. I just cannot see how this comes by.
See:
https://github.com/eclipse/openj9/blob/master/runtime/oti/phelpers.m4#L269
Sorry, I missed where that CALL_INDIRECT macro is used. From the earlier GAC's comment it sounded like it's JIT responsibility to set R2 if calling directly from JITtecd code. But then global entry would override that.
GAC comment to me: JIT-ed code should call the wrapper as we did always. If you call the C code directly, you need to set up r12.
So, it is not an issue of production build vs. debug build. It is an issue which helper target you call. As it is, JIT-ed code doesn't set up either r12 or r2 (for helper calls... we do set up r12 or r2 for JNI calls). If the target is wrong, it will not work for both production and debug builds.
Sure, debug build just exposed some confusion in the calling conventions for helper calls. We need to figure out what they actually should be and what is the correct target.
I can verify that the inclusion of the -O0 option directly causes this issue. Either when
export UMA_DO_NOT_OPTIMIZE_CCODE="1"
is set
or -O0 is included in VMDEBUG.
Given that UMA_DO_NOT_OPTIMIZE_CCODE expands to
UMA_OPTIMIZATION_CFLAGS += -O0
UMA_OPTIMIZATION_CXXFLAGS += -O0
Earlier I thought that this was only seen in Java10, and Java11 builds however I can confirm it spans to all debug builds on Linux PPCLE including Java 8. With Java8 it's not obvious during the build, but is reproducible with setting -Xjit=count=0
and running java -version
I can also confirm that the register values of r12, and r2 don't correspond to values of the entry point address, nor do they correspond to the address of the jitTOC respectively. I checked before the direct call to getInstanceOf, at the beginning of getInstanceOf execution and right before the segmentation.
As Gita has mentioned, r12 and r2 aren't used much in production builds and I wasn't able to find any instances where they would used in a way as they contained an expected value, or they're being used as a TOC.
Even in the method that is called inside getInstanceOf, VMInlineCast assumes the data while there's no trace of such instructions on production.
0x7ffff56e8694 <00000017.plt_call._ZN12VM_VMHelpers15inlineCheckCastEP7J9ClassS1_b> std r2,24(r1) │
│0x7ffff56e8698 <00000017.plt_call._ZN12VM_VMHelpers15inlineCheckCastEP7J9ClassS1_b+4> addis r12,r2,1 │
│0x7ffff56e869c <00000017.plt_call._ZN12VM_VMHelpers15inlineCheckCastEP7J9ClassS1_b+8> ld r12,24312(r12) │
Is the compiled code loading r2 before calling the helper? The regular helpers load it before every call to the C code (INIT_GOT macro).
Last night Alen stepped through all the way from JITted code to fast_jitInstanceIOff in cnathelp.cpp and he did not see r2 set anywhere.
When we call other helpers via PicBuilder.m4 we do set r2.
Nobody seems to ever set r12.
So perhaps the fast path from the compiled code is always incorrect, and we get lucky in production builds. r2 is set in pnathelp.m4 for the normal helpers.
Yes, looks like that from the Alen’s Investigation above.
What about setting r12? Or we should somehow jump to the local entry?
In the macros, r12 is used as the function pointer for indirect calls, nothing else. I don't recall if that was necessary, or simply conveinient.
define({FUNC_PTR},{r12})
define({CALL_INDIRECT},{
ifelse($1,,,{mr FUNC_PTR,$1})
std r2,TOC_SAVE_OFFSET(r1)
mtctr r12
bctrl
ld r2,TOC_SAVE_OFFSET(r1)
})
This assumes that r2 is already loaded with the jitTOC.
The other way around, GAC. as long as it is the right target of helper for JIT-ed code, it is the helper's responsibility to restore the TOC if it is needed. That has been the reason we initialized vm->jitTOC (and consequentially J9VMThread->jitTOC).
@zl-wang I think you're missing the point. We're talking about calling the C code directly from the compiled code, not going via the helper (which does all the TOC work).
In general, the helpers don't use the indirect call - on LE the functions are called directly.
define({CALL_DIRECT},{
ori r0,r0,0
bl BRANCH_SYMBOL($1)
ori r0,r0,0
})
BRANCH_SYMBOL does no decoration. I don't recall why the call is surrounded by the NOPs.
For LE, that (r12 setting) is the requirement of the ABI definition. (outside module) caller sets r12 to the callee global entry ... (shared lib) loader relocates TOC base out of it.
If JIT-ed code calls C code directly (JNI code in essence), we do set r12 (on LE) to the global entry.
where does JIT-ed code call C code directly except JNI targets? that is the question to be answered.
no matter the actual call is made through bl or bctrl, the ABI definition needs to be fulfilled (AIX glink or LE plink or something code for bl, for example). i guessed we are mixing up the meaning of direct/indirect call here. I referred to call-directly as meaning JIT-ed code seeing it as a target (instead of meaning it through bl).
i can at least see why you have the second nop: to restore TOC after return. maybe the first nop as well: to save the TOC before call. Typically, the save is done in the glink code (AIX or BE Linux64) or plt_link code (LE).
for example, you call read() in your code. the compiler generates bl to read()'s glink or plt_link inside your module.
Some of the C helpers (in particular instanceof) are called directly.
Sounds like the JIT will not need to load r12 for helper calls, if r2 is loaded - we're not using the linker/loader, so the target C code will simply expect r2 to be loaded. If you're going to generate a bctrl instead of a bl, r12 may as well be used to load CTR for complete safety.
Well, looks like if compiled with -O0 target code expects r12 to be set, from which it derives r2. Copying from above:
0x7ffff64376e4 <fast_jitInstanceOf(J9VMThread, J9Class, j9object_t)> addis r2,r12,60 │
│0x7ffff64376e8 <fast_jitInstanceOf(J9VMThread, J9Class*, j9object_t)+4> addi r2,r2,-199
I guessed this issue existed on AIX and BE Linux64 as well (TOC requirement). We have been just lucky so far switching to call the C target(s) directly (instead of the helper wrapper target(s)). Need to identify the set of C helpers called directly.
Yes, for LE: ABI requires r12 to point at the callee global entry. At the global entry, two instructions are used to calculate the TOC-base off r12. 60/-199 above are filled in resulting from loader relocations.
I've actually been thinking for a while that I'd like to switch the direct calls in the helper wrappers to indirect calls, which would have the side-effect of loading r12. I expect that means I can stop loading r2, and we could likely eliminate the jitTOC entirely.
there are many uses of jitTOC in JIT runtime assembly (picBuilder, Crypto, arrayCopy, and recompilation etc etc).
AIX and BE Linux64 ABI(s): they did require caller set up r2 before call. But we didn't do that for general helper calls. GAC/You set up r2 instead in helper with loading from jitTOC.
LE Linux ABI: it requires caller to set up r12 for callee global-entry address. r2 is calculated from r12. We didn't do that either. It is assumed you set up r2 in helper with loading from jitTOC as well, if TOC is needed.
You'll probably want to make a change in this method, which is where most of the work for calls is done:
In that method we kill r12
; you'll need to change that to put the entry point in r12
instead. The linkage
parameter tells you what you're generating code for (one of TR_Private
, TR_Helper
, TR_CHelper
) so you can check for TR_CHelper
. Alternatively you can put the TOC in r2
and call the local entry point if you want I guess. Not sure which is preferable, but doing it the global entry way probably makes the code easier to understand for observers.
Doing the same thing as JNI is probably the right answer (loading r12 and using it to call indirect). This is what I'm attempting in the helpers.
Now that I think about it this is going to end up being slightly inefficient because later on in the compilation process we'll be able to decide if the helper is reachable via a direct call or not, and if not we'll redirect the call through a trampoline which will correctly load r12
. By also loading r12
during argument setup we'll end up doing it twice if we end up going through a trampoline. A bit of a shame.
That's likely why this bug hasn't been seen until now, the helper needs to end up being reachable via a direct call and has to make use of the TOC and the C compiler has to not optimize the usage away.
How about r2 on AIX and BE Linux64 for TR_CHelper?
You are right, r2
is not set up either and needs to be.
Maybe we can remove r2/r12 setup from the trampolines for C helpers and assume jitted code will always do it, that way we don't pay the penalty twice.
No, we didn't do anything about r2 in trampoline. If it is decided to use bctrl for the call, then it will not go through trampoline anymore.
The drawback of course is target address prediction. bctrl is predicted by the countCache; while bl doesn't need prediction.
The reason the C calls from the helper wrappers work without loading r12 today is that the linker is assuming that r2 is already loaded for calls within the same shared library (note the +8 which the linker added below, skipping the r2 load from r12):
0x3fffb6671ad8 <jitInstanceOf+136>: nop
0x3fffb6671adc <jitInstanceOf+140>: bl 0x3fffb6659858 <old_fast_jitInstanceOf+8>
0x3fffb6671ae0 <jitInstanceOf+144>: nop
In pure C code, the compiler also seems to be assuming that any calls using bl
(i.e. within the same shared library) will return with r2 pointing to the TOC (it does not save/restore r2 across the call, and generates code assuming r2 is still the caller TOC, which is necessarily the same for all functions in the shared library).
0x3fffb70e604c <sendClinit+524>: mr r3,r29
0x3fffb70e6050 <sendClinit+528>: bl 0x3fffb713caf8 <c_cInterpreter>
0x3fffb70e6054 <sendClinit+532>: nop
I'm not sure why the C compiler isn't doing the +8 trick with the above call (update: it's because I need a .localentry
directive). I'll update the code to load the TOC from r12 on LE and add the directive to allow the optimized call.
The pre-NOP I'm generating seems unnecessary, but the C compiler is generating the post-NOP (perhaps to reload r2 if necessary after the call in some situations?).
So, for direct C helpers calls from the compiled code, either r12 must be loaded with the function address, or r2 must be loaded with the TOC and the call must be to helper+8.
So, the call sequences need to be:
AIX32/64 / Linux64 BE
load r2 with jitTOC
bl helper
Linux64 LE
load r2 with jitTOC
bl helper+8
OR
load r12 with helper
bl helper
Linux32
bl helper
This means the trampolines for C helper calls need not contain any loading of r2 or r12, though I suppose as they're doing an indirect branch anyway, they may as well use r12 on LE.
Given that the TOC setup in the C code is technically variable-sized, it's probably safer to load r12 rather than r2 on LE, and it leaves open the possibility of removing the inline load r12 if a trampoline is going to be used.
@zl-wang While investigating all this, I came across:
https://github.com/eclipse/omr/blob/master/util/omrutil/unix/linux/ppc/64le/gettoc.s#L31
Should that not be addi
rather than addis
?
I'm basing this on what I see in the compiled code, and the doc here:
https://www.ibm.com/developerworks/library/l-improve-performance-openpower-abi/index.html
I bet this works because of the localentry directive, so those first two instructions are never executed, since this code is statically-linked with every module that uses it.
Fixing this here: https://github.com/eclipse/omr/pull/3014
As part of the TOC optimization work I'm doing in the VM, the jitTOC will be stored in the TOC slot in the current C stack frame, so the JIT will have the option of encoding the r2 load on the old ABI platforms as either potentially multiple instructions to load the compile-time constant, or a single instruction to fetch it from r1 (the offset will be offsetof(J9CInterpreterStackFrame, currentTOC)
).
Ah, seems this is unnecessary as the jitTOC is already propagated into a slot in the J9VMThread which is also a single instruction fetch.
@gacholio Yes, that should be an addi.
FYi @AlenBadel @gita-omr I've tagged this for the 0.11.0 release as we're not calling the c code correctly and that may result in issues.
@AlenBadel @gita-omr Is this still an open issue? Will it close in time for the 0.11.0 release?
@AlenBadel is actively working on it. There are a few places that need to be checked. We are hoping for 0.11.0 release.
This only occurs when building on PPCLE, and with java versions 10,11. Java 8 builds perfectly.
This occurs during the DDR stage, when it uses the built jdk.
Passes with -Xint.
To reproduce,
I checked out the OpenJDK 11 with OpenJ9 project per https://www.eclipse.org/openj9/oj9_build.html
Enabled debugging per https://blog.openj9.org/2018/06/05/debugging-openj9-in-docker-with-gdb/
To build:
We get the following error