DynamoRIO / dynamorio

Dynamic Instrumentation Tool Platform

Reduce overhead of indirect branch on AArch64 #2390

Open egrimley opened 7 years ago

egrimley commented 7 years ago

It is well known that indirect branches are a major factor in the performance of dynamic binary translation, optimisation and instrumentation (see, for example, "Optimizing Indirect Branches in Dynamic Binary Translators", 2016), although in the case of dynamic binary instrumentation the overhead of instrumentation may dominate everything else.

DynamoRIO's handling of indirect branches on AArch64 is currently rather inefficient; we have so far not paid much attention to performance. This issue documents the current implementation of indirect branches and lists a few improvements that may be worth implementing. Some of these things may also be applicable to other architectures.

The sequence of instructions executed under DynamoRIO to simulate a single RET instruction in the original application typically looks like this:

   0x4c7ea954:  str     x2, [x28,#16]           # Save X2 to TLS.
   0x4c7ea958:  mov     x2, x30                 # Move target address to X2.
   0x4c7ea95c:  b       0x4c7ea960              # Gratuitous branch to following instr.
   0x4c7ea960:  stp     x0, x1, [x28]           # Save X0 and X1 to TLS.
   0x4c7ea964:  mov     x0, #0xae08             # Load value to identify this location;
   0x4c7ea968:  movk    x0, #0x4c81, lsl #16    # in this case we could load the value
   0x4c7ea96c:  movk    x0, #0x0, lsl #32       # with just 2 instrs but we always use 4.
   0x4c7ea970:  movk    x0, #0x0, lsl #48       #
   0x4c7ea974:  ldr     x1, [x28,#120]          # We use LDR+BR to get to look-up code
   0x4c7ea978:  br      x1                      # even when it is in range of direct B.

   0x4c521400:  str     x0, [x28,#24]           # Save X0 to TLS ... again!
   0x4c521404:  ldp     x1, x0, [x28,#168]      # Load hash mask and base from TLS. The
   0x4c521408:  and     x1, x1, x2              # hash_mask is 0x3f: could use immediate?
   0x4c52140c:  add     x1, x0, x1, lsl #4      # Shift depends on ibl_hash_func_offset.
   0x4c521410:  ldr     x0, [x1]                # Load app addr from hash table; use LDP?

   0x4c521414:  cbz     x0, 0x4c521440          # Why check for zero before match?
   0x4c521418:  sub     x0, x0, x2              # Could use EOR here.
   0x4c52141c:  cbnz    x0, 0x4c521438          # Check keys match: no, so branch.

   0x4c521438:  ldr     x0, [x1,#16]!           # Load next app addr from hash table.
   0x4c52143c:  b       0x4c521414              # This direct branch could be avoided.

   0x4c521414:  cbz     x0, 0x4c521440          # Check for zero.
   0x4c521418:  sub     x0, x0, x2
   0x4c52141c:  cbnz    x0, 0x4c521438          # Check keys match: wrong again.

   0x4c521438:  ldr     x0, [x1,#16]!
   0x4c52143c:  b       0x4c521414

   0x4c521414:  cbz     x0, 0x4c521440
   0x4c521418:  sub     x0, x0, x2
   0x4c52141c:  cbnz    x0, 0x4c521438          # Check keys match: wrong yet again.

   0x4c521438:  ldr     x0, [x1,#16]!
   0x4c52143c:  b       0x4c521414

   0x4c521414:  cbz     x0, 0x4c521440
   0x4c521418:  sub     x0, x0, x2
   0x4c52141c:  cbnz    x0, 0x4c521438          # Check keys match: fourth time lucky.

   0x4c521420:  ldp     x0, x2, [x28]           # Reload original X0 and X1 from TLS.
   0x4c521424:  str     x0, [x28,#8]            # Save X0 to TLS ... for the third time?
   0x4c521428:  ldr     x0, [x1,#8]             # Load translated addr from hash table.
   0x4c52142c:  mov     x1, x2                  # Put X1 value into X1.
   0x4c521430:  ldr     x2, [x28,#16]           # Restore X2 from TLS.
   0x4c521434:  br      x0                      # Highly unpredictable indirect branch!

   0x4c7ea984:  ldr     x0, [x28,#8]            # Fragment prefix restores X0 from TLS.
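As a sketch of what the look-up loop above computes, here is the same logic in C. The struct layout (key at offset 0, translated address at offset 8, 16-byte entries, zero key as end-of-chain sentinel) is inferred from the trace; the names are illustrative, not DynamoRIO's actual data structures.

```c
#include <stdint.h>

/* One 16-byte hash-table entry, matching the "ldr x0, [x1]" /
 * "ldr x0, [x1,#8]" / "ldr x0, [x1,#16]!" offsets in the trace. */
typedef struct {
    uint64_t app_addr;   /* key: target address in the original program */
    uint64_t frag_addr;  /* value: address of the translated fragment   */
} ibl_entry_t;

/* Mirrors the CBZ/SUB/CBNZ loop: index = target & hash_mask, then probe
 * forward until the key matches or a zero sentinel ends the chain.
 * Returns the translated address, or 0 to signal a miss (exit to the
 * runtime to translate the target). */
uint64_t ibl_lookup(const ibl_entry_t *table, uint64_t hash_mask,
                    uint64_t target) {
    const ibl_entry_t *e = &table[target & hash_mask];
    while (e->app_addr != 0) {           /* cbz  x0, miss           */
        if (e->app_addr == target)       /* sub  x0, x0, x2 / cbnz  */
            return e->frag_addr;         /* ldr  x0, [x1,#8]        */
        e++;                             /* ldr  x0, [x1,#16]!      */
    }
    return 0;
}
```

In the trace above, the loop body runs four times before the key matches, and each iteration costs a load, a compare-and-branch pair, and an extra unconditional branch back to the top.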

Things to do:

Here is what the sequence of instructions to simulate a RET using a hash table could look like in the best case (untested):

        stp     x29, x30, [x28, #?]     // save x29, x30
        mov     x29, x30                // target address into x29
        bl      address_translator      // call address translator

        stp     x0, x1, [x28, #?]       // save x0, x1
        ldr     x0, [x28, #?]           // load base addr of hash table
        and     x1, x29, #0xfc          // use bits 2-7 as hash value
        add     x0, x0, x1, lsl #2      // compute address in hash table
        ldp     x0, x1, [x0]            // load key and value from table
        eor     x0, x0, x29             // compare key with target addr
        cbnz    x0, wrong               // branch away if not a match
        mov     x29, x30                // put return addr into x29
        mov     x30, x1                 // put translated addr into x30
        ldp     x0, x1, [x28, #?]       // restore x0, x1
        ret     x29                     // return from address translator

        ldr     x29, [x28, #?]          // restore x29
        ret     x30                     // "return" to translated address

        ldr     x30, [x28, #?]          // fragment prefix restores x30
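The fast path above is a direct-mapped, single-probe table: bits 2-7 of the target index a 64-entry table of (key, value) pairs, and one comparison decides hit or miss, with no probe chain. The BL into the translator and the RET back should also pair up on the hardware return-address stack, so the final branch is predicted, unlike the BR at the end of the current sequence. A hypothetical C model of the single probe (names and the 64-entry size are illustrative):

```c
#include <stdint.h>

/* One 16-byte entry, as loaded by "ldp x0, x1, [x0]" above. */
typedef struct {
    uint64_t app_addr;   /* key   */
    uint64_t frag_addr;  /* value */
} ibl_entry_t;

/* Direct-mapped probe: and x1, x29, #0xfc ; add x0, x0, x1, lsl #2
 * selects entry (target >> 2) & 0x3f; eor/cbnz does one comparison.
 * Returns the translated address, or 0 for the "wrong" (miss) path. */
uint64_t ibl_lookup_direct(const ibl_entry_t table[64], uint64_t target) {
    const ibl_entry_t *e = &table[(target >> 2) & 0x3f];
    return (e->app_addr == target) ? e->frag_addr : 0;
}
```

Since A64 instructions are 4-byte aligned, bits 0-1 of a return address are always zero, which is why the hash starts at bit 2; a miss would fall through to a slower chained look-up or an exit to the runtime.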
derekbruening commented 7 years ago

Xref #1662, #1671, #31, #32

fhahn commented 6 years ago

The unnecessary movk instructions are no longer emitted. I think going forward it would be good to have some (automatic) performance tracking, to make sure we get the desired benefits.