It is well known that indirect branches are a major factor in the performance of dynamic binary translation, optimisation and instrumentation (see, for example, "Optimizing Indirect Branches in Dynamic Binary Translators", 2016), although in the case of dynamic binary instrumentation the overhead of instrumentation may dominate everything else.
DynamoRIO's handling of indirect branches on AArch64 is currently rather inefficient; we have so far not paid much attention to performance. This issue documents the current implementation of indirect branches and lists a few improvements that may be worth implementing. Some of these things may also be applicable to other architectures.
The sequence of instructions executed under DynamoRIO to simulate a single RET instruction in the original application typically looks like this:
0x4c7ea954: str x2, [x28,#16] # Save X2 to TLS.
0x4c7ea958: mov x2, x30 # Move target address to X2.
0x4c7ea95c: b 0x4c7ea960 # Gratuitous branch to following instr.
0x4c7ea960: stp x0, x1, [x28] # Save X0 and X1 to TLS.
0x4c7ea964: mov x0, #0xae08 # Load value to identify this location;
0x4c7ea968: movk x0, #0x4c81, lsl #16 # in this case we could load the value
0x4c7ea96c: movk x0, #0x0, lsl #32 # with just 2 instrs but we always use 4.
0x4c7ea970: movk x0, #0x0, lsl #48 #
0x4c7ea974: ldr x1, [x28,#120] # We use LDR+BR to get to look-up code
0x4c7ea978: br x1 # even when it is in range of direct B.
0x4c521400: str x0, [x28,#24] # Save X0 to TLS ... again!
0x4c521404: ldp x1, x0, [x28,#168] # Load hash mask and base from TLS. The
0x4c521408: and x1, x1, x2 # hash_mask is 0x3f: could use immediate?
0x4c52140c: add x1, x0, x1, lsl #4 # Shift depends on ibl_hash_func_offset.
0x4c521410: ldr x0, [x1] # Load app addr from hash table; use LDP?
0x4c521414: cbz x0, 0x4c521440 # Why check for zero before match?
0x4c521418: sub x0, x0, x2 # Could use EOR here.
0x4c52141c: cbnz x0, 0x4c521438 # Check keys match: no, so branch.
0x4c521438: ldr x0, [x1,#16]! # Load next app addr from hash table.
0x4c52143c: b 0x4c521414 # This direct branch could be avoided.
0x4c521414: cbz x0, 0x4c521440 # Check for zero.
0x4c521418: sub x0, x0, x2
0x4c52141c: cbnz x0, 0x4c521438 # Check keys match: wrong again.
0x4c521438: ldr x0, [x1,#16]!
0x4c52143c: b 0x4c521414
0x4c521414: cbz x0, 0x4c521440
0x4c521418: sub x0, x0, x2
0x4c52141c: cbnz x0, 0x4c521438 # Check keys match: wrong yet again.
0x4c521438: ldr x0, [x1,#16]!
0x4c52143c: b 0x4c521414
0x4c521414: cbz x0, 0x4c521440
0x4c521418: sub x0, x0, x2
0x4c52141c: cbnz x0, 0x4c521438 # Check keys match: fourth time lucky.
0x4c521420: ldp x0, x2, [x28] # Reload original X0 and X1 from TLS (into X0 and X2).
0x4c521424: str x0, [x28,#8] # Save X0 to TLS ... for the third time?
0x4c521428: ldr x0, [x1,#8] # Load translated addr from hash table.
0x4c52142c: mov x1, x2 # Move original X1 value (now in X2) into X1.
0x4c521430: ldr x2, [x28,#16] # Restore X2 from TLS.
0x4c521434: br x0 # Highly unpredictable indirect branch!
0x4c7ea984: ldr x0, [x28,#8] # Fragment prefix restores X0 from TLS.
Things to do:
Probably the biggest component of the overhead comes from the linear probing in the hash table: the value we want is in the hash table, but we use a linear search to find it, and in this case it is the fourth value that we examine. On a low-performance CPU the loads may be expensive, and on a high-performance CPU the conditional branch at 0x4c52141c is likely to be unpredictable unless the branch predictor is very clever. The values we skip over came from C library start-up code: probably we will never use them again. A bigger hash table would help, but it would have to be much bigger. Other things that might help: flushing old values from the table, flushing the whole table, or arranging for values added most recently to be found first. But those would be global changes to how the hash tables work.
Since AArch64 instructions are 4-byte aligned, the low two bits of a branch target are always zero, so ibl_hash_func_offset should probably equal 2, and ibl_indcall_hash_offset at least 2.
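For example (a sketch only, assuming the 64-entry table and 16-byte entries visible in the trace above, and assuming the mask is applied to the raw target address), the index computation with an offset of 2 could be expressed as:

and x1, x2, #0xfc          // bits 2-7 of the target; bits 0-1 are always zero
add x1, x0, x1, lsl #2     // entry address: base + hash * 16

With the offset at 0, a mask applied to the raw 4-byte-aligned address can only ever select a quarter of the buckets.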
The performance could perhaps be significantly improved just by adjusting some option defaults (in core/optionsx.h).
Are we using different hash tables for BR, BLR, and RET? Should we?
The AArch64 hash table look-up code could be tested and measured independently of DynamoRIO and easily replaced without changing anything else, so it looks like "low-hanging fruit". Unrolling the loop in the look-up code might help on some hardware but would probably be harmful on other hardware. However, using a different conditional branch for the first match check than for subsequent match checks might be generally helpful, as the probability of a match could be significantly different in the two cases. The app and translated addresses could be loaded together with LDP. One could avoid the direct branch in the loop, but ideally the first value we examine would be the right one, so that we would not loop at all; perhaps that is the path to optimise primarily. We could avoid one (predictable) conditional branch, for the cases we care about, by checking for a match before we check for zero. A sketch combining these ideas appears below.
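Here is what such a restructured loop might look like (untested; the 16-byte {app addr, translated addr} entry layout, the register assignments, and the assumption that X3 is free are all illustrative, and the miss path is not shown):

// x2 = target app address, x1 = address of the first candidate entry
ldp x0, x3, [x1]           // LDP loads app addr and translated addr together
cmp x0, x2                 // first probe: check for a match before zero,
b.eq hit                   // as a first-probe hit is the common case
cbz x0, not_found          // empty slot: target is not in the table
loop:
ldp x0, x3, [x1, #16]!     // subsequent probes: pre-indexed LDP
cbz x0, not_found          // zero key terminates the probe chain
cmp x0, x2
b.ne loop                  // bottom-tested: no direct branch in the loop
hit:
br x3                      // branch to the fragment (register restores omitted)

The first match check gets its own forward branch, biased towards "hit", while the backward branch only executes on collisions.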
There are unnecessary loads, stores and branches (including an unnecessarily indirect one) before and after the hash table look-up. Since the argument to an indirect branch is most often X30, perhaps that register should be used as the argument to the hash-table look-up code. Having X30 as the register restored in the fragment prefix might also help in general, as one could then branch to the fragment with BLR instead of BR. There is a global register-allocation problem to be solved here.
It would be nice to translate BR to BR and RET to RET so as to help the CPU's branch predictor. This is feasible but it would be a big change to DynamoRIO: you would also have to translate BL to BL, which means putting the correct value into X30 after the BL.
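For illustration only (the per-call-site stub and all addresses are hypothetical, and this is just one possible scheme), a translated BL might look something like this:

bl fixup_stub              // primes the return-address stack with the
                           // cache address of the continuation
// ... translated code after the call site continues here ...

fixup_stub:
movz x30, #0xae10          // hypothetical app return address (upper bits
movk x30, #0x4c81, lsl #16 // omitted), replacing the value set by the BL
b translated_callee        // enter the translated callee

If the callee's RET is translated back to a RET once the look-up has mapped the app return address to the continuation fragment, the return-address-stack prediction should be correct whenever that fragment directly follows the BL.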
Having all the app's indirect branches go through a single indirect branch (in this case at 0x4c521434) makes that branch very difficult to predict. Rather than jump to code that does the look-up and the branch, it is generally better to call a function that does the look-up and do the indirect branch after returning, so that there is a separate BR/RET in the translation cache for each BR/RET in the original code (the sketch at the end of this issue is structured this way). Could this be implemented on AArch64 without changing the way DynamoRIO works on other architectures?
There are other standard optimisations described in the literature but they are mostly architecture-independent and would require additional complexity: adding alternative methods for handling an indirect branch rather than just improving the current universal method (a shared hash table).
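One such method from the literature, sketched here purely for illustration (the cached values would be patched at run time, and X2 is assumed to hold the target as in the trace above): an inline cache at the branch site that compares the target against the most recently seen target before falling back to the shared table:

movz x0, #0x1234           // last seen app target (hypothetical value,
movk x0, #0x5678, lsl #16  // patched when the cache is filled)
cmp x0, x2                 // does the target match the cached one?
b.ne shared_lookup         // no: fall back to the shared hash table
b cached_fragment          // yes: direct, predictable branch to its fragment

This trades extra code at every indirect-branch site for avoiding the shared look-up in the common case.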
Here is what the sequence of instructions to simulate a RET using a hash table could perhaps look like in the best case (untested):
// At the translated RET site:
stp x29, x30, [x28, #?]    // save x29, x30 to TLS
mov x29, x30               // target address into x29
bl address_translator      // call the shared look-up routine

// The shared look-up routine:
address_translator:
stp x0, x1, [x28, #?]      // save x0, x1 to TLS
ldr x0, [x28, #?]          // load base addr of hash table
and x1, x29, #0xfc         // use bits 2-7 as hash value
add x0, x0, x1, lsl #2     // entry address: base + hash * 16
ldp x0, x1, [x0]           // load key and value together
eor x0, x0, x29            // compare key with target addr
cbnz x0, wrong             // branch away if not a match (slow path not shown)
mov x29, x30               // return addr (set by the BL) into x29
mov x30, x1                // translated addr into x30
ldp x0, x1, [x28, #?]      // restore x0, x1
ret x29                    // return from address translator

// Back at the call site:
ldr x29, [x28, #?]         // restore x29
ret x30                    // "return" to translated address

// In the target fragment:
ldr x30, [x28, #?]         // fragment prefix restores x30
The unnecessary MOVK instructions are no longer emitted. Going forward, I think it would be good to have some (automatic) performance tracking, to make sure we get the desired benefit.