DynamoRIO / dynamorio

Dynamic Instrumentation Tool Platform

linking improvements on ARM: support load-into-PC as an exit cti #1611

Open derekbruening opened 9 years ago

derekbruening commented 9 years ago

Today we have made two decisions for simpler far-reaching links on ARM that keep exit ctis as normal branch instructions (OP_b, unconditional or conditional):

1) Turn on -indirect_stubs

2) Turn off -cbr_single_stub

This issue covers reversing those changes by adding support for load-into-PC as an exit cti.

Let's include some notes from those prior decisions:


*** TODO -no_indirect_stubs: exit cti must be OP_ldr

For reachability we want:

ldr pc, [r10 + tls_ibl_offs]

So do we update instr_is_ubr_arch() to claim that's a ubr?!? We'd need to update decode_fragment() and copy_fragment(), and instr_set_target() (called from decode_fragment())? Plus the whole interp-mangle-emit sequence would need tweaking, as currently it relies on passing the target via a jump instr. And it's not a good idea to change the jump to OP_ldr inside emit, as we have some passes there that assume length won't change: best to keep mangling to mangle().
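For illustration only, a rough sketch of what recognizing a load-into-PC might look like, using DR's public IR accessors (hypothetical helper name; the real instr_is_ubr_arch() has more cases, and predication/writeback handling is simplified):

    #include "dr_api.h"

    /* Hypothetical sketch: treat an unpredicated "ldr pc, [...]" as a ubr.
     * The real check would live alongside the existing OP_b cases.
     */
    static bool
    is_load_into_pc_ubr(instr_t *instr)
    {
        if (instr_get_opcode(instr) != OP_ldr)
            return false;
        /* Dst 0 of a load is the register being written; writing the PC
         * redirects control, so unpredicated it behaves like a ubr.
         */
        return instr_num_dsts(instr) > 0 &&
               opnd_is_reg(instr_get_dst(instr, 0)) &&
               opnd_get_reg(instr_get_dst(instr, 0)) == DR_REG_PC &&
               instr_get_predicate(instr) == DR_PRED_NONE;
    }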

Or, we can set -indirect_stubs.

Will we have to tweak everything about exit ctis anyway for direct link reachability? No: the current proposals keep an OP_b cti and use the stub for far-away targets.

We could use landing pads more easily here, b/c the targets are fixed and few: just ibl link and unlink entries.

We could go back to inlining the ibl.

Also, what about A64? We'll have to steal a reg? Xref all the same discussions over direct linking.

Decision: going with -indirect_stubs. Later we may measure perf and decide that "ldr pc, [r10+xxx]" is too expensive and we really need landing pads or sthg that is all direct jumps, or we may come up with a soln that also works for 64-bit; either way, it doesn't seem worth changing all the code to handle "ldr pc, mem" as an exit cti now.

TOFILE: case on branch reachability, performance of various solutions, avoiding -indirect_stubs, working w/ A64. Measure perf of "ldr pc, mem" vs series of direct jmps.


*** TODO exit cti linking to other fragments

I guess we have to make them indirect b/c 32MB is just too short.

Something like:

ldr pc, [pc + 8]

Then a link/unlink is a data write: needs no icache flush. However, xref the -no_indirect_stubs discussion where making OP_ldr an exit cti will take a bit of work, and we'll have to pay for an indirect branch even when a direct one would reach.
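To illustrate the data-write point (a sketch only; the slot location and names here are illustrative, not DR internals): because the "ldr pc, [pc + 8]" instruction bytes never change, link and unlink reduce to one aligned pointer-sized store to the slot the ldr reads.

    #include <stdatomic.h>
    #include <stdint.h>

    /* Sketch: relinking an "ldr pc, [pc + 8]" exit is a plain data write.
     * The next execution of the ldr simply loads the new target; no
     * instruction bytes change, so no icache flush is needed.
     */
    static void
    relink_ldr_pc_slot(uintptr_t *slot, uintptr_t new_target)
    {
        atomic_store_explicit((_Atomic uintptr_t *)slot, new_target,
                              memory_order_release);
    }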

Can we use the stub when far away? Then we can leave OP_b always as the exit cti, and have it point directly at the target when it reaches. Ideally we'd store the target when far in the stub itself to save space, but we need atomic link/unlink, so we'll have to clobber the 1st instr of the stub. That requires not clobbering the other instrs in the stub. So we'd need another ptr-sized slot at the end of each stub, and we always have an extra instr: but we gain direct instead of indirect branches when they reach, which should be likely for most code since it's co-located in the cache. So we have:

Unlinked:

    b stub
  stub:
    str r0, [r10, #r0-slot]
    movw r0, #bottom-half-&linkstub
    movt r0, #top-half-&linkstub
    ldr pc, [r10, #fcache-return-offs]
    <ptr-sized slot>

Linked, target < 32MB away (or < 1MB for T32 cbr):

    b target
  stub:
    str r0, [r10, #r0-slot]
    movw r0, #bottom-half-&linkstub
    movt r0, #top-half-&linkstub
    ldr pc, [r10, #fcache-return-offs]
    <ptr-sized slot>

Linked, target > 32MB away (or > 1MB for T32 cbr):

    b stub
  stub:
    ldr pc, [pc + 12]
    movw r0, #bottom-half-&linkstub
    movt r0, #top-half-&linkstub
    ldr pc, [r10, #fcache-return-offs]
    <target>

We'll have to turn off -cbr_single_stub as we can't have an unlinked fall-through reaching a linked stub.
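For reference, a sketch of the reachability test that would pick between the two linked layouts above (the constants follow the +-32MB A32 OP_b and +-1MB T32 cbr limits quoted; the function name is illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    #define REACH_A32_B   (32 * 1024 * 1024)  /* +-32MB for an A32 OP_b */
    #define REACH_T32_CBR ( 1 * 1024 * 1024)  /* +-1MB for a T32 cbr */

    /* Sketch: if the exit OP_b can encode the target, patch it to branch
     * there directly; otherwise leave it pointing at the stub, whose first
     * instr becomes "ldr pc, [pc + 12]" through the trailing target slot.
     */
    static bool
    exit_b_reaches(uintptr_t branch_pc, uintptr_t target, bool t32_cbr)
    {
        intptr_t reach = t32_cbr ? REACH_T32_CBR : REACH_A32_B;
        intptr_t delta = (intptr_t)target - (intptr_t)branch_pc;
        return delta >= -reach && delta < reach;
    }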

What about AArch64? Do we have to spill a register (probably we'd use the stub's spill of r0), and have prefixes on every fragment with a "direct link" entry point? OP_b there can reach +-128MB. Maybe we do not put in direct prefixes by default and you have to flush to add them? For simplicity, we flush once and add to all, rather than partitioning the cache, giving up perf for simplicity on large apps? OTOH after flushing we may not need them (a reset of startup code).

Can we use landing pads? We'd need a dedicated landing pad slot for every branch crossing 32MB (128MB for A64). It could work for pcaches or sthg, or if we never run out of -vm_reserve and can plan where all cache units go, but for organically grown live caches that spill over -vm_reserve and end up in random spots it seems difficult.

*** TODO once impl far-through-stub linking, switch OP_blx to use it instead of ibl

    /* Unfortunately while there is OP_blx with an immed, OP_bx requires
     * indirection through a register.  We thus need to swap modes separately,
     * but our ISA doesn't support mixing modes in one fragment, making
     * a local "blx next_instr" not easy.  We have two potential solutions:
     *   A) Implement far linking through stub's "ldr pc, [pc + 8]" and use
     *      it for blx.  We need to implement that anyway for reachability,
     *      but as it's not implemented yet, I'm going w/ B) for now.
     *   B) Pretend this is an indirect branch and use the ibl.
     *      This is slower so FIXME i#1551: switch to A once we have far links.
     */

Except we do need to set the dcontext isa mode: but we can do that in-fragment w/ mangling, right? Though what if we get a fault?


derekbruening commented 3 years ago

Just updating the status here: the section "*** TODO exit cti linking to other fragments" above, where it lists the 3 linking cases, is implemented for 32-bit ARM. But 64-bit ARM does not handle long branches today: #4273.