Why does the JIT need to load the VFT offset in a register before a virtual call?

fjeremic commented 4 years ago

While reviewing #8154 I wanted to convince myself of my understanding of virtual dispatches on Z. I went through a simple virtual call and documented the call flow. I'm copying the comment here for further discussion. Two questions arise at the end:

Sorry I'm still not following. What is this LA instruction you are referring to? I just debugged this with an example to convince myself of how it works. Here is the call flow:

We load the VFT offset into GPR0
We branch to the target method stored in the VFT offset memory location (some negative offset off the J9Class*) via the BASR instruction in the JIT mainline code
If the target method has not been compiled the pointer stored in the VFT offset will be the pointer of the icallVMprJavaSendPatchupVirtual function [1]. This function stores the value of GPR0 into J9VMThread.tempSlot and then makes a call to old_slow_icallVMprJavaSendPatchupVirtual [2]
The old_slow_icallVMprJavaSendPatchupVirtual function takes the J9VMThread.tempSlot which is the VFT offset and the return address and calls jitVTableIndex which should return the VFT index. Notice how on z/Architecture this function just returns the second parameter [3]? We do not decode the instruction stream on Z, so it appears. This is not true on other platforms.
Execution returns back to old_slow_icallVMprJavaSendPatchupVirtual which then takes the VFT index and looks up the thunk based on the signature of the method we want to call. These thunks are one per signature and dispatch the calls from JIT to interpreter. The rest of this function call looks up the thunk for the particular call we want to make and then it overwrites the VFT entry in the J9Class* with the address of the thunk. This is so we don't have to pay the cost of looking up the thunk the next time we have to dispatch this virtual call. The last thing this method does is store the thunk value to J9VMThread.tempSlot [4].
Execution returns to icallVMprJavaSendPatchupVirtual which then dispatches a call to the thunk via an indirect branch off of J9VMThread.tempSlot [5]
The thunk in my case is just a simple instruction sequence which takes care of putting arguments in the right place on the stack, and then it calls j2iVirtual [6]. This ends up branching to the interpreter loop to execute the method.

As far as I can see the only reason why we are loading the VFT offset in a register is because looking at the implementation of j2iVirtual [6] the VM seems to want this VFT offset to be in J9VMThread.tempSlot [7]. It seems because of the possibility of having to dispatch via an interface. I don't quite understand the details of how this works, but I definitely would like to.

For me two questions arise:

Why is the implementation of jitVTableIndex more complicated on non-Z platforms? What is this API supposed to do exactly?
Why does the VM need the VFT offset in J9VMThread.tempSlot? For resolved calls we should be able to decode this value from the JIT instruction stream. Given that JIT to interpreter calls are supposed to be rare, I'd like to know if there is an opportunity to remove the VFT load prior to virtual calls on all platforms as that can save quite a bit of path length.

[1] https://github.com/eclipse/openj9/blob/dd96e6747e1a51ca757d501a7c6bf110f108f192/runtime/codert_vm/znathelp.m4#L621-L633 [2] https://github.com/eclipse/openj9/blob/dd96e6747e1a51ca757d501a7c6bf110f108f192/runtime/codert_vm/cnathelp.cpp#L2733-L2752 [3] https://github.com/eclipse/openj9/blob/dd96e6747e1a51ca757d501a7c6bf110f108f192/runtime/oti/JITInterface.hpp#L217-L219 [4] https://github.com/eclipse/openj9/blob/dd96e6747e1a51ca757d501a7c6bf110f108f192/runtime/codert_vm/cnathelp.cpp#L2751 [5] https://github.com/eclipse/openj9/blob/dd96e6747e1a51ca757d501a7c6bf110f108f192/runtime/codert_vm/znathelp.m4#L632 [6] https://github.com/eclipse/openj9/blob/dd96e6747e1a51ca757d501a7c6bf110f108f192/runtime/codert_vm/znathelp.m4#L604-L619 [7] https://github.com/eclipse/openj9/blob/dd96e6747e1a51ca757d501a7c6bf110f108f192/runtime/codert_vm/znathelp.m4#L615-L617

fjeremic commented 4 years ago

Tagging @0xdaryl @zl-wang @joransiu who might know the answer to the first question, and @gacholio who may be able to answer the second question.

zl-wang commented 4 years ago

Putting the offset in a register before dispatch is only required for interface dispatch, since the offset is not obviously available anywhere else. Also, we didn't pass the necessary info for vm helper to re-lookup the offset either.

For the more complicated index method on p and x, it is exactly to retrieve the offset from the instruction sequence for a virtual call where the offset is obviously available (after resolution and patch). I am wondering why the offset is not made available in the virtual dispatch sequence on z. Maybe it cannot be made available at a fixed (relative) location according to the particular CPU type.

gacholio commented 4 years ago

P has fixed-sized instructions, so it's easy to look back and determine virtual vs interface. I assume it's not so easy on Z, so the index is always in the register.

fjeremic commented 4 years ago

P has fixed-sized instructions, so it's easy to look back and determine virtual vs interface. I assume it's not so easy on Z, so the index is always in the register.

It seems that we already emit a NOP to disambiguate between virtual and interface dispatch: https://github.com/eclipse/openj9/blob/386b29b971be293ac255c9c7a004ad24eeb1fd9c/runtime/compiler/z/codegen/S390PrivateLinkage.cpp#L2137-L2140

I wonder if this is some historical reason, because we seem to load the VFT offset for both virtual and interface calls always. The discussion seems to suggest the offset is only needed for interface dispatch, so we could get away by not loading the offset for the virtual dispatch and implementing some logic in jitVTableIndex to fetch it out of the instruction stream.

How exactly does j2iVirtual work on Power then? From reading the code it seems to want the VFT offset even for virtual calls to be in gr12: https://github.com/eclipse/openj9/blob/386b29b971be293ac255c9c7a004ad24eeb1fd9c/runtime/codert_vm/pnathelp.m4#L561-L577

Who loads this value if we don't do it in the JIT mainline?

gacholio commented 4 years ago

See https://github.com/eclipse/openj9/blob/386b29b971be293ac255c9c7a004ad24eeb1fd9c/runtime/oti/JITInterface.hpp#L148-L223 for the instruction decode logic.

gacholio commented 4 years ago

The comment in the nathelp code is somewhat misleading - it's only there for the non-interface case (on platforms other than Z).

gacholio commented 4 years ago

The other issue is that Z can't encode negative immediate offsets, so I'm not sure you're going to be able to come up with a more compact call sequence in the virtual case, since the negative offset will need to be in a register anyway.

fjeremic commented 4 years ago

The other issue is that Z can't encode negative immediate offsets, so I'm not sure you're going to be able to come up with a more compact call sequence in the virtual case, since the negative offset will need to be in a register anyway.

That's only true for 12-bit displacements. Extended 20-bit displacements are signed.

zl-wang commented 4 years ago

so that, i suspected it is the side effect of CPU-type specific reason. is the 20bit displacement available on the lowest denominator you supported now? if yes, you may change the situation. a further issue is to deal with possible overflow of 20-bit displacemnt (very unlikely though).

fjeremic commented 4 years ago

The comment in the nathelp code is somewhat misleading - it's only there for the non-interface case (on platforms other than Z).

Sorry I'm not sure I follow. As I understand VM_JITInterface::jitVTableIndex only gets called once inside of old_slow_icallVMprJavaSendPatchupVirtual. That function seems to fetch the J2I thunk and store it into the VFT slot which we called, presumably so we never call old_slow_icallVMprJavaSendPatchupVirtual again if we are calling the same virtual method of that class.

I did not see the thunk call VM_JITInterface::jitVTableIndex, so who exactly loads gr12 in the Power case? Does the thunk perhaps branch to j2iTransition instead: https://github.com/eclipse/openj9/blob/386b29b971be293ac255c9c7a004ad24eeb1fd9c/runtime/codert_vm/pnathelp.m4#L545-L559

By "misleading" do you mean that j2iVirtual in pnathelp.m4 is only branched to for interface calls (NOT virtual calls)? And virtual calls end up dispatching to j2iTransition?

fjeremic commented 4 years ago

so that, i suspected it is the side effect of CPU-type specific reason. is the 20bit displacement available on the lowest denominator you supported now? if yes, you may change the situation. a further issue is to deal with possible overflow of 20-bit displacemnt (very unlikely though).

Yeah, the 20-bit displacement instructions are available on our minimum ALS of z9. It must have been the historic reason why this was never changed. We should be able to use 20-bit displacement instructions which support negative displacements to implement this, and avoid having to load the VFT offset for virtual calls in the mainline.

gacholio commented 4 years ago

By "misleading" do you mean that j2iVirtual in pnathelp.m4 is only branched to for interface calls (NOT virtual calls)? And virtual calls end up dispatching to j2iTransition?

No, I mean the register is only loaded for the interface case. Both interface and virtual end up at j2iVirtual, and the interpreter calls jitVtableIndex in order to decode the method from the index.

gacholio commented 4 years ago

The VM has no stake in this - if you can come up with a better instruction sequence that can be disambiguated, go for it.

fjeremic commented 4 years ago

No, I mean the register is only loaded for the interface case. Both interface and virtual end up at j2iVirtual, and the interpreter calls jitVtableIndex in order to decode the method from the index.

So then on Power for virtual calls we are storing a bogus value (whatever is in gr12 at the time) to J9VMThread.tempSlot [1] and the interpreter then only for virtual calls will ignore whatever is in gr12 and will call jitVtableIndex to find the true VFT offset?

[1] https://github.com/eclipse/openj9/blob/386b29b971be293ac255c9c7a004ad24eeb1fd9c/runtime/codert_vm/pnathelp.m4#L573

gacholio commented 4 years ago

Right, the decode code defaults to reading the register, then looks back in the instruction stream to see if it's actually an encoded virtual call (which currently does not occur on Z) and if so, decodes the index from the call.

fjeremic commented 4 years ago

Thanks, and just to confirm, j2iVirtual in pnathelp.m4 will branch to the bytecode loop here: https://github.com/eclipse/openj9/blob/bf841bbb5c6209cab4f459f00ef35916f20f41b9/runtime/vm/BytecodeInterpreter.hpp#L9142-L9145

Which makes a call to j2iVirtualMethod: https://github.com/eclipse/openj9/blob/bf841bbb5c6209cab4f459f00ef35916f20f41b9/runtime/vm/BytecodeInterpreter.hpp#L694-L699

Which then calls jitVtableIndex to find the VFT index, and that for Z is trivial, and for other platforms it decodes the instruction stream to look for the offset (if it's a virtual call)?

Describing this explicitly as I'm going to ask someone to work on getting rid of the load on Z.

gacholio commented 4 years ago

Correct.

eclipse-openj9 / openj9

Why does the JIT need to load the VFT offset in a register before a virtual call? #8999