It is well known that indirect branches are a major factor in the performance of dynamic binary translation, optimisation and instrumentation (see, for example, "Optimizing Indirect Branches in Dynamic Binary Translators", 2016), although in the case of dynamic binary instrumentation the overhead of instrumentation may dominate everything else.
DynamoRIO's handling of indirect branches on AArch64 is currently rather inefficient; we have so far not paid much attention to performance. This issue documents the current implementation of indirect branches and lists a few improvements that may be worth implementing. Some of these things may also be applicable to other architectures.
The sequence of instructions executed under DynamoRIO to simulate a single RET instruction in the original application typically looks like this:
0x4c7ea954: str x2, [x28,#16] # Save X2 to TLS.
0x4c7ea958: mov x2, x30 # Move target address to X2.
0x4c7ea95c: b 0x4c7ea960 # Gratuitous branch to following instr.
0x4c7ea960: stp x0, x1, [x28] # Save X0 and X1 to TLS.
0x4c7ea964: mov x0, #0xae08 # Load value to identify this location;
0x4c7ea968: movk x0, #0x4c81, lsl #16 # in this case we could load the value
0x4c7ea96c: movk x0, #0x0, lsl #32 # with just 2 instrs but we always use 4.
0x4c7ea970: movk x0, #0x0, lsl #48 #
0x4c7ea974: ldr x1, [x28,#120] # We use LDR+BR to get to look-up code
0x4c7ea978: br x1 # even when it is in range of direct B.
0x4c521400: str x0, [x28,#24] # Save X0 to TLS ... again!
0x4c521404: ldp x1, x0, [x28,#168] # Load hash mask and base from TLS. The
0x4c521408: and x1, x1, x2 # hash_mask is 0x3f: could use immediate?
0x4c52140c: add x1, x0, x1, lsl #4 # Shift depends on ibl_hash_func_offset.
0x4c521410: ldr x0, [x1] # Load app addr from hash table; use LDP?
0x4c521414: cbz x0, 0x4c521440 # Why check for zero before match?
0x4c521418: sub x0, x0, x2 # Could use EOR here.
0x4c52141c: cbnz x0, 0x4c521438 # Check keys match: no, so branch.
0x4c521438: ldr x0, [x1,#16]! # Load next app addr from hash table.
0x4c52143c: b 0x4c521414 # This direct branch could be avoided.
0x4c521414: cbz x0, 0x4c521440 # Check for zero.
0x4c521418: sub x0, x0, x2
0x4c52141c: cbnz x0, 0x4c521438 # Check keys match: wrong again.
0x4c521438: ldr x0, [x1,#16]!
0x4c52143c: b 0x4c521414
0x4c521414: cbz x0, 0x4c521440
0x4c521418: sub x0, x0, x2
0x4c52141c: cbnz x0, 0x4c521438 # Check keys match: wrong yet again.
0x4c521438: ldr x0, [x1,#16]!
0x4c52143c: b 0x4c521414
0x4c521414: cbz x0, 0x4c521440
0x4c521418: sub x0, x0, x2
0x4c52141c: cbnz x0, 0x4c521438 # Check keys match: fourth time lucky.
0x4c521420: ldp x0, x2, [x28] # Reload original X0 and X1 from TLS (into X0 and X2).
0x4c521424: str x0, [x28,#8] # Save X0 to TLS ... for the third time?
0x4c521428: ldr x0, [x1,#8] # Load translated addr from hash table.
0x4c52142c: mov x1, x2 # Move original X1 value (now in X2) into X1.
0x4c521430: ldr x2, [x28,#16] # Restore X2 from TLS.
0x4c521434: br x0 # Highly unpredictable indirect branch!
0x4c7ea984: ldr x0, [x28,#8] # Fragment prefix restores X0 from TLS.
Things to do:
Probably the biggest component of the overhead comes from the linear probing in the hash table: the value we want is in the hash table, but we use a linear search to find it, and in this case it is the fourth value that we examine. On a low-performance CPU the loads may be expensive, and on a high-performance CPU the conditional branch at 0x4c52141c is likely to be unpredictable unless the branch predictor is very clever. The values we skip over came from C library start-up code: probably we will never use them again. A bigger hash table would help, but it would have to be much bigger. Other things that might help: flushing old values from the table, flushing the whole table, or arranging for values added most recently to be found first. But those would be global changes to how the hash tables work.
Since AArch64 instructions are 4-byte aligned, the low two bits of a branch target are always zero, so ibl_hash_func_offset should probably equal 2, and ibl_indcall_hash_offset at least 2.
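For example (a sketch only, assuming the 64-entry table and 16-byte entries visible in the trace above, and assuming the mask is applied to the raw target address), the index computation with an offset of 2 could be expressed as:

and x1, x2, #0xfc          // bits 2-7 of the target; bits 0-1 are always zero
add x1, x0, x1, lsl #2     // entry address: base + hash * 16

With the offset at 0, a mask applied to the raw 4-byte-aligned address can only ever select a quarter of the buckets.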
The performance could perhaps be significantly improved just by adjusting some option defaults (in core/optionsx.h).
Are we using different hash tables for BR, BLR, and RET? Should we?
The AArch64 hash table look-up code could be tested and measured independently of DynamoRIO and easily replaced without changing anything else, so it looks like "low-hanging fruit". Unrolling the loop in the look-up code might help on some hardware but would probably be harmful on other hardware. However, using a different conditional branch for the first match check than for subsequent match checks might be generally helpful, as the probability of a match could be significantly different in the two cases. The app and translated addresses could be loaded together with LDP. One could avoid the direct branch in the loop, but ideally the first value we examine would be the right one, so that we would not loop at all; perhaps that is the path to optimise primarily. We could avoid one (predictable) conditional branch, for the cases we care about, by checking for a match before we check for zero. A sketch combining these ideas appears below.
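Here is what such a restructured loop might look like (untested; the 16-byte {app addr, translated addr} entry layout, the register assignments, and the assumption that X3 is free are all illustrative, and the miss path is not shown):

// x2 = target app address, x1 = address of the first candidate entry
ldp x0, x3, [x1]           // LDP loads app addr and translated addr together
cmp x0, x2                 // first probe: check for a match before zero,
b.eq hit                   // as a first-probe hit is the common case
cbz x0, not_found          // empty slot: target is not in the table
loop:
ldp x0, x3, [x1, #16]!     // subsequent probes: pre-indexed LDP
cbz x0, not_found          // zero key terminates the probe chain
cmp x0, x2
b.ne loop                  // bottom-tested: no direct branch in the loop
hit:
br x3                      // branch to the fragment (register restores omitted)

The first match check gets its own forward branch, biased towards "hit", while the backward branch only executes on collisions.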
There are unnecessary loads, stores and branches (including an unnecessarily indirect one) before and after the hash table look-up. Since the argument to an indirect branch is most often X30, perhaps that register should be used as the argument to the hash-table look-up code. Having X30 as the register restored in the fragment prefix might also help in general, as one could then branch to the fragment with BLR instead of BR. There is a global register-allocation problem to be solved here.
It would be nice to translate BR to BR and RET to RET so as to help the CPU's branch predictor. This is feasible but it would be a big change to DynamoRIO: you would also have to translate BL to BL, which means putting the correct value into X30 after the BL.
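For illustration only (the per-call-site stub and all addresses are hypothetical, and this is just one possible scheme), a translated BL might look something like this:

bl fixup_stub              // primes the return-address stack with the
                           // cache address of the continuation
// ... translated code after the call site continues here ...

fixup_stub:
movz x30, #0xae10          // hypothetical app return address (upper bits
movk x30, #0x4c81, lsl #16 // omitted), replacing the value set by the BL
b translated_callee        // enter the translated callee

If the callee's RET is translated back to a RET once the look-up has mapped the app return address to the continuation fragment, the return-address-stack prediction should be correct whenever that fragment directly follows the BL.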
Having all the app's indirect branches go through a single indirect branch (in this case at 0x4c521434) makes that branch very difficult to predict. Rather than jump to code that does the look-up and the branch, it is generally better to call a function that does the look-up and do the indirect branch after returning, so that there is a separate BR/RET in the translation cache for each BR/RET in the original code (the sketch at the end of this issue is structured this way). Could this be implemented on AArch64 without changing the way DynamoRIO works on other architectures?
There are other standard optimisations described in the literature but they are mostly architecture-independent and would require additional complexity: adding alternative methods for handling an indirect branch rather than just improving the current universal method (a shared hash table).
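One such method from the literature, sketched here purely for illustration (the cached values would be patched at run time, and X2 is assumed to hold the target as in the trace above): an inline cache at the branch site that compares the target against the most recently seen target before falling back to the shared table:

movz x0, #0x1234           // last seen app target (hypothetical value,
movk x0, #0x5678, lsl #16  // patched when the cache is filled)
cmp x0, x2                 // does the target match the cached one?
b.ne shared_lookup         // no: fall back to the shared hash table
b cached_fragment          // yes: direct, predictable branch to its fragment

This trades extra code at every indirect-branch site for avoiding the shared look-up in the common case.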
Here is what the sequence of instructions to simulate a RET using a hash table could perhaps look like in the best case (untested):
// At the translated RET site:
stp x29, x30, [x28, #?]    // save x29, x30 to TLS
mov x29, x30               // target address into x29
bl address_translator      // call the shared look-up routine

// The shared look-up routine:
address_translator:
stp x0, x1, [x28, #?]      // save x0, x1 to TLS
ldr x0, [x28, #?]          // load base addr of hash table
and x1, x29, #0xfc         // use bits 2-7 as hash value
add x0, x0, x1, lsl #2     // entry address: base + hash * 16
ldp x0, x1, [x0]           // load key and value together
eor x0, x0, x29            // compare key with target addr
cbnz x0, wrong             // branch away if not a match (slow path not shown)
mov x29, x30               // return addr (set by the BL) into x29
mov x30, x1                // translated addr into x30
ldp x0, x1, [x28, #?]      // restore x0, x1
ret x29                    // return from address translator

// Back at the call site:
ldr x29, [x28, #?]         // restore x29
ret x30                    // "return" to translated address

// In the target fragment:
ldr x30, [x28, #?]         // fragment prefix restores x30
The unnecessary MOVK instructions are no longer emitted. Going forward, I think it would be good to have some (automatic) performance tracking, to make sure we get the desired benefit.