drreg: register spilling mediation framework

derekbruening commented 9 years ago

From bruen...@google.com on July 10, 2011 16:52:53

We plan to build a "drreg" extension that provides register liveness analysis and stealing. One thing we didn't plan to do, but may want to, is to add a feature to permanently steal a register. When a tool only needs access to a few fields, directly-addressable TLS should be more performant; but when it has a lot of fields, a stolen register could be more efficient.

Original issue: http://code.google.com/p/dynamorio/issues/detail?id=511

derekbruening commented 9 years ago

From bruen...@google.com on July 10, 2011 13:53:03

Summary: drreg: provide permanently-stolen reg support?

derekbruening commented 9 years ago

From bruen...@google.com on March 09, 2012 06:51:05

xref other extensions from the master extension discussion: drutil ( issue #295 ), drwrap ( issue #296 ), drmgr ( issue #402 ), drsyscall ( https://code.google.com/p/drmemory/issues/detail?id=822 ), drcallstack ( https://code.google.com/p/drmemory/issues/detail?id=823 ), drmalloc ( https://code.google.com/p/drmemory/issues/detail?id=824 ), umbra ( https://code.google.com/p/drmemory/issues/detail?id=825 )

derekbruening commented 9 years ago

From bruen...@google.com on May 11, 2012 11:44:53

** INFO drmgr and drreg discussion :meeting_DrM_2012_1_25:

xref issue #164/PR 494720: add a post-instru pass that checks for errors in register preservation

drreg and drmgr discussion highlights:

drmgr may be complete enough but we need more examples of usage. integrating into drmemory with each module of drmem being a separate drmgr component will help.
maybe later add fancy virtual registers that are mapped to physical registers in a later pass; for now, use virtual labels that are mapped to physical registers up front
whole-bb for the min # used: for pattern that would be 0
local save+restore for extras
have drreg just be a reg picker and help track where values are: whether the app value is in the real reg, in a spill slot, or swapped via xchg
support steal reg across whole app?

** TODO drreg framework :meeting_DrM_2012_5_11:

How does the client tell us the lifetime of values in scratch registers, so that drreg knows whether to let an app instr clobber a scratch reg or whether it has to preserve the scratch reg value? Proposal: keep it simple for now: either a whole-bb reg is preserved across the whole bb, or its lifetime is just between app instrs. For locally-requested spills, the lifetime ends at the next app instr.

API to request whole-bb reg during analysis phase:

uint flags = liveness, and whether going to use "mark used" flag
bitvector of which regs are acceptable
if called during instru, gives you a local reg
eflags are treated like an extra whole-bb reg: updated in spill slot around

API to request app value to be restored (for OP_lea for shadow map, e.g.)
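
A minimal sketch of how such a request could choose a register from the flags and the bitvector of acceptable registers. This is a toy model only: `toy_request_reg`, the `TOY_REQ_*` flags, and the bitmask encoding are hypothetical names made up for illustration and are not part of DynamoRIO.

```c
#include <stdint.h>

/* Hypothetical request flags, named for illustration only. */
enum {
    TOY_REQ_LIVENESS = 0x1, /* prefer registers whose app value is dead */
    TOY_REQ_MARK_USED = 0x2 /* caller promises to call a "mark used" routine */
};

#define TOY_NUM_REGS 8
#define TOY_REG_NONE (-1)

/* Picks a scratch register.
 * acceptable: bit i set means register i may be handed out.
 * dead:       bit i set means the app value in register i is dead.
 * in_use:     bit i set means register i is already reserved.
 * Returns the register index, or TOY_REG_NONE if nothing is available.
 */
int
toy_request_reg(uint32_t flags, uint32_t acceptable, uint32_t dead, uint32_t in_use)
{
    uint32_t all_regs = (1u << TOY_NUM_REGS) - 1;
    uint32_t candidates = acceptable & ~in_use & all_regs;
    /* With liveness enabled, narrow to dead registers when any qualify:
     * a dead register needs no spill or restore at all.
     */
    if ((flags & TOY_REQ_LIVENESS) != 0 && (candidates & dead) != 0)
        candidates &= dead;
    for (int i = 0; i < TOY_NUM_REGS; i++) {
        if ((candidates & (1u << i)) != 0)
            return i;
    }
    return TOY_REG_NONE;
}
```

The "mark used" flag is only declared here; a lazy-spilling variant would consult it when deciding whether a spill has to be emitted at all.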

but want more than just pick best regs and keep app values in spill slots for proper restoring: also want:

track who's using what scratch registers so different functions and modules of client don't conflict: but really this is an artifact of using whole-bb for regs used locally. if could instead have such uses use a local API and have the spills/restores and regs used be later optimized across whole bb, that would eliminate the conflict problem: but that's a lot more complicated for drreg to do post-passes and to restore state on a fault. for this tracking want a state kept for each reg: avail, holds value that needs to be preserved, etc., and then the liveness can be smaller granularity than whole bb
allow sharing values in scratch regs among components of the client: so the state can hold an identifier or something? but can you search for which reg has a given identifier, or just use it for asserts?
lazy spilling: API routine to mark when used or can drreg analyze and infer which ones were used?
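
A rough sketch of the per-register state discussed in the list above, assuming a small fixed register file. Every name here (`toy_reg_table_t`, `toy_reserve`, `owner_id`, and so on) is hypothetical and chosen for illustration. The sketch shows both uses of an owner identifier raised above: a search by identifier and an assert on unreserve. It also tracks a "mark used" bit so that an unused reservation needs no restore.

```c
#include <assert.h>

#define TOY_NUM_REGS 8

typedef enum {
    TOY_REG_AVAIL,   /* nobody holds the register */
    TOY_REG_RESERVED /* holds a tool value that must be preserved */
} toy_reg_state_t;

typedef struct {
    toy_reg_state_t state;
    int owner_id;  /* hypothetical component identifier, 0 if none */
    int ever_used; /* set by "mark used"; drives lazy spilling */
} toy_reg_info_t;

typedef struct {
    toy_reg_info_t regs[TOY_NUM_REGS];
} toy_reg_table_t;

void
toy_table_init(toy_reg_table_t *t)
{
    for (int i = 0; i < TOY_NUM_REGS; i++) {
        t->regs[i].state = TOY_REG_AVAIL;
        t->regs[i].owner_id = 0;
        t->regs[i].ever_used = 0;
    }
}

/* Reserves the first available register for owner_id; returns -1 if none. */
int
toy_reserve(toy_reg_table_t *t, int owner_id)
{
    for (int i = 0; i < TOY_NUM_REGS; i++) {
        if (t->regs[i].state == TOY_REG_AVAIL) {
            t->regs[i].state = TOY_REG_RESERVED;
            t->regs[i].owner_id = owner_id;
            t->regs[i].ever_used = 0;
            return i;
        }
    }
    return -1;
}

/* Searches for the register reserved by owner_id; returns -1 if not found. */
int
toy_find_by_owner(const toy_reg_table_t *t, int owner_id)
{
    for (int i = 0; i < TOY_NUM_REGS; i++) {
        if (t->regs[i].state == TOY_REG_RESERVED && t->regs[i].owner_id == owner_id)
            return i;
    }
    return -1;
}

void
toy_mark_used(toy_reg_table_t *t, int reg)
{
    assert(reg >= 0 && reg < TOY_NUM_REGS);
    assert(t->regs[reg].state == TOY_REG_RESERVED);
    t->regs[reg].ever_used = 1;
}

/* Releases reg. Returns 1 if the app value must be restored, i.e., the
 * register was actually used, and 0 if spill and restore can be skipped.
 */
int
toy_unreserve(toy_reg_table_t *t, int reg, int owner_id)
{
    assert(reg >= 0 && reg < TOY_NUM_REGS);
    /* The identifier doubles as a conflict check between components. */
    assert(t->regs[reg].state == TOY_REG_RESERVED && t->regs[reg].owner_id == owner_id);
    int needs_restore = t->regs[reg].ever_used;
    t->regs[reg].state = TOY_REG_AVAIL;
    t->regs[reg].owner_id = 0;
    t->regs[reg].ever_used = 0;
    return needs_restore;
}
```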

have -conservative option for whether to use dead regs (can turn on if worried about fault or debugger examining)

for using Extensions:

for API calls into extension to insert instru, seems should always pass in which regs to use
for extension doing a drmgr pass on its own, seems better for extension to get reg from drreg directly and use it: but if multiple people call drreg for scratch reg can drreg safely give them all the same reg?
provide permanently-stolen reg support?

Summary: drreg: register spilling mediation framework
Labels: -Priority-Low Priority-Medium

derekbruening commented 9 years ago

From bruen...@google.com on June 14, 2012 10:31:15

xref issue #52 , issue #53

derekbruening commented 9 years ago

From bruen...@google.com on July 15, 2013 13:52:20

pasting more notes:

* TODO later thoughts

virtual regs would be by far the easiest to use. however, they require a reg allocation pass that may need to build CFG => extra overhead that not every tool will want to live with. so should there be 2 different interfaces? one that uses virtual regs and one that doesn't?

drmem is not linear: for check_ignore_unaddr mem2mem, jmps down below and then can jump back up to check_ignore_resume, so reg allocator would have to build CFG and not just do linear scan. but, that should all go away w/ -replace_malloc, if want to assume that will be long-term soln. update: actually it's not clear it can go away.

derekbruening commented 9 years ago

* TODO revised interface: no up-front whole-bb, just simple reserve+unreserve interface

Originally I had these in drreg_options_t:

    /**
     * The number of scratch registers that need to be reserved across
     * more than one application instruction.  drreg turns these into
     * "whole-basic-block" scratch registers.
     */
    uint num_whole_bb;
    /**
     * Whether to spill the arithmetic flags across each basic block,
     * to minimize per-instruction spills and restores.
     */
    bool aflags_whole_bb;

The only reason to hardcode the number of whole-bb regs up front, and to spill them at the very top of the bb and restore at the very bottom, is to simplify state translation (xl8):

    /* Our state restoration model: Only whole-bb scratch regs and aflags
     * need to be restored (i.e., all local scratch regs are restored
     * before any app instr).  For each such reg or aflags, we guarantee
     * that either the app value is in TLS at each app instr (where fault
     * might happen) or the app value is dead and it's ok to have garbage
     * in TLS b/c the app will write it before reading (this is all modulo
     * the app's own fault handler going off on a different path (xref
     * DRi#400): so we're slightly risky here).
     */

From a pure user interface point of view, drreg would be much simpler if it had no concept of global/whole-bb vs local, and instead for whichever regs are still reserved in crossing an app instr (e.g., drmem's shadow xl8 reg), drreg restores or updates if the app reads or writes.

Xref discussion above about the original interface of drreg just picking whole-bb regs and the client having to parcel out who is using which when: this simpler interface here is much nicer to use.

If too many regs are reserved across an app instr, we may run out of places to store the values.

To handle fault xl8, we can decode the in-cache bb and walk forward looking at spills + restores like DR xl8 does. We'll add dr_is_tls_access() query to DR (looks for both raw and DR TLS).
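
A toy sketch of the forward walk described above, assuming the in-cache bb has already been decoded into a simple array. The instruction encoding and all names (`toy_instr_t`, `toy_state_at_fault`) are hypothetical; a real implementation would decode actual instructions and recognize TLS accesses with a query such as the proposed dr_is_tls_access().

```c
#include <assert.h>

#define TOY_NUM_REGS 4
#define TOY_NO_SLOT (-1)

typedef enum { TOY_APP, TOY_SPILL, TOY_RESTORE } toy_kind_t;

typedef struct {
    toy_kind_t kind;
    int reg;  /* register spilled or restored; ignored for TOY_APP */
    int slot; /* TLS slot index; ignored for TOY_APP */
} toy_instr_t;

/* Walks instrs[0..fault_idx) and records, for each register, the slot that
 * holds its app value at the faulting instruction, or TOY_NO_SLOT if the
 * app value is in the register itself.
 */
void
toy_state_at_fault(const toy_instr_t *instrs, int fault_idx,
                   int slot_of_reg[TOY_NUM_REGS])
{
    for (int r = 0; r < TOY_NUM_REGS; r++)
        slot_of_reg[r] = TOY_NO_SLOT;
    for (int i = 0; i < fault_idx; i++) {
        const toy_instr_t *in = &instrs[i];
        if (in->kind == TOY_APP)
            continue;
        assert(in->reg >= 0 && in->reg < TOY_NUM_REGS);
        if (in->kind == TOY_SPILL)
            slot_of_reg[in->reg] = in->slot; /* app value now lives in TLS */
        else
            slot_of_reg[in->reg] = TOY_NO_SLOT; /* app value back in the reg */
    }
}
```

On a fault, the handler would then copy each recorded slot back into the corresponding register of the machine context before presenting state to the app.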

Might still want "local" hints so drreg saves less-used-in-bb regs for non-local requests? Plus, once we go beyond the raw slots and into DR slots, those are not supposed to be used across app instrs, so should drreg complain about non-"local" requests for those?

Would any unreserved regs warrant keeping spilled across app instrs? What is perf diff between lazy restore at next app use vs keep spilled and update spill location at next app use? Identical for app read I guess; app write is restore + real unreserve vs re-spill to tls slot. So seems the answer is no: fixed # of whole-bb regs is a perf cost.

derekbruening commented 9 years ago

Xref #1771

derekbruening commented 9 years ago

* TODO add lazy restore of aflags

consecutive drx_insert_counter_update() (after #1771 adds drxmgr test using drreg in the counter routine):

pre:

> bin32/drrun -loglevel 4 -c suite/tests/bin/libclient.drxmgr-test.dll.so -- suite/tests/bin/common.eflags
> head -2000 `ls -1td logs/*0|head -1`/l* | grep -A 20 ', tag'
Fragment 1, tag 0xf7774a70, flags 0x1000030, shared, size 70:
  0x49765004  9f                   lahf    -> %ah
  0x49765005  0f 90 c0             seto    -> %al
  0x49765008  64 a3 4c 00 00 00    mov    %eax -> %fs:0x4c[4byte]
  0x4976500e  81 05 08 20 77 f7 01 add    $0x00000001 0xf7772008[4byte] -> 0xf7772008[4byte]
              00 00 00
  0x49765018  64 a1 4c 00 00 00    mov    %fs:0x4c[4byte] -> %eax
  0x4976501e  04 7f                add    $0x7f %al -> %al
  0x49765020  9e                   sahf   %ah
  0x49765021  9f                   lahf    -> %ah
  0x49765022  0f 90 c0             seto    -> %al
  0x49765025  64 a3 4c 00 00 00    mov    %eax -> %fs:0x4c[4byte]
  0x4976502b  81 05 0c 20 77 f7 03 add    $0x00000003 0xf777200c[4byte] -> 0xf777200c[4byte]
              00 00 00
  0x49765035  64 a1 4c 00 00 00    mov    %fs:0x4c[4byte] -> %eax
  0x4976503b  04 7f                add    $0x7f %al -> %al
  0x4976503d  9e                   sahf   %ah
  0x4976503e  89 e0                mov    %esp -> %eax
  0x49765040  68 77 4a 77 f7       push   $0xf7774a77 %esp -> %esp 0xfffffffc(%esp)[4byte]
  0x49765045  e9 d2 ff 01 00       jmp    $0x4978501c

post:

> head -2000 `ls -1td logs/*0|head -1`/l* | grep -A 20 ', tag'
Fragment 1, tag 0xf7786a70, flags 0x1000030, shared, size 63:
  0xef7c1005  9f                   lahf    -> %ah
  0xef7c1006  0f 90 c0             seto    -> %al
  0xef7c1009  64 a3 4c 00 00 00    mov    %eax -> %fs:0x4c[4byte]
  0xef7c100f  81 05 08 40 78 f7 01 add    $0x00000001 0xf7784008[4byte] -> 0xf7784008[4byte]
              00 00 00
  0xef7c1019  81 05 0c 40 78 f7 03 add    $0x00000003 0xf778400c[4byte] -> 0xf778400c[4byte]
              00 00 00
  0xef7c1023  89 e0                mov    %esp -> %eax
  0xef7c1025  64 a3 50 00 00 00    mov    %eax -> %fs:0x50[4byte]
  0xef7c102b  64 a1 4c 00 00 00    mov    %fs:0x4c[4byte] -> %eax
  0xef7c1031  04 7f                add    $0x7f %al -> %al
  0xef7c1033  9e                   sahf   %ah
  0xef7c1034  64 a1 50 00 00 00    mov    %fs:0x50[4byte] -> %eax
  0xef7c103a  68 77 6a 78 f7       push   $0xf7786a77 %esp -> %esp 0xfffffffc(%esp)[4byte]
  0xef7c103f  e9 d8 ff 01 00       jmp    $0xef7e101c

Adding to the drreg test:

interp: start_pc = 0x08048f0b
  0x08048f0b  ba f4 f1 00 00       mov    $0x0000f1f4 -> %edx
  0x08048f10  ba f4 f1 00 00       mov    $0x0000f1f4 -> %edx
  0x08048f15  0f 95 c0             setnz   -> %al
        reads flag before writing it!
  0x08048f18  39 e2                cmp    %edx %esp
        wrote overflow flag before reading it!
  0x08048f1a  eb 00                jmp    $0x08048f1c
interp: direct jump at 0x08048f1a
end_pc = 0x08048f1c

instrument_basic_block ******************

before instrumentation:
TAG  0x08048f0b
 +0    L3              ba f4 f1 00 00       mov    $0x0000f1f4 -> %edx
 +5    L3              ba f4 f1 00 00       mov    $0x0000f1f4 -> %edx
 +10   L3              0f 95 c0             setnz   -> %al
 +13   L3              39 e2                cmp    %edx %esp
 +15   L3              eb 00                jmp    $0x08048f1c
END 0x08048f0b

drreg test #4
drreg test #4
drreg test #4
drreg_reserve_aflags @0x00000000: spilling aflags
drreg test #4
drreg_event_bb_insert_late @0x08048f15 aflags=0x8: lazily restoring aflags
drreg test #4
drreg_event_bb_insert_late @0x08048f18: re-spilling aflags after app write
drreg test #4
drreg_event_bb_insert_late @0x08048f1a aflags=0x11f: lazily restoring aflags

after instrumentation:
TAG  0x08048f0b
 +0    L3              ba f4 f1 00 00       mov    $0x0000f1f4 -> %edx
 +5    L3              ba f4 f1 00 00       mov    $0x0000f1f4 -> %edx
 +10   m4 @0xef85d31c  64 a3 50 00 00 00    mov    %eax -> %fs:0x00000050[4byte]
 +16   m4 @0xef7ca0a4  9f                   lahf    -> %ah
 +17   m4 @0xef85ed1c  64 a3 4c 00 00 00    mov    %eax -> %fs:0x0000004c[4byte]
 +23   m4 @0xef7c9604  64 a1 50 00 00 00    mov    %fs:0x00000050[4byte] -> %eax
 +29   m4 @0xef85df04  3d 00 00 00 00       cmp    %eax $0x00000000
 +34   m4 @0xef7cb9e4                       <label>
 +34   m4 @0xef85d4a4  64 a3 50 00 00 00    mov    %eax -> %fs:0x00000050[4byte]
 +40   m4 @0xef7cddf4  64 a1 4c 00 00 00    mov    %fs:0x0000004c[4byte] -> %eax
 +46   m4 @0xef85d0bc  9e                   sahf   %ah
 +47   m4 @0xef7c9538  64 a1 50 00 00 00    mov    %fs:0x00000050[4byte] -> %eax
 +53   L3              0f 95 c0             setnz   -> %al
 +56   m4 @0xef7ceb4c  64 a3 50 00 00 00    mov    %eax -> %fs:0x00000050[4byte]
 +62   m4 @0xef7c9fd8  9f                   lahf    -> %ah
 +63   m4 @0xef7cb37c  64 a3 4c 00 00 00    mov    %eax -> %fs:0x0000004c[4byte]
 +69   m4 @0xef7cd05c  64 a1 50 00 00 00    mov    %fs:0x00000050[4byte] -> %eax
 +75   L3              39 e2                cmp    %edx %esp
 +77   m4 @0xef7cc8f0  64 a3 50 00 00 00    mov    %eax -> %fs:0x00000050[4byte]
 +83   m4 @0xef7cce48  64 a1 4c 00 00 00    mov    %fs:0x0000004c[4byte] -> %eax
 +89   m4 @0xef85d5d4  04 7f                add    $0x7f %al -> %al
 +91   m4 @0xef7ce710  9e                   sahf   %ah
 +92   m4 @0xef85fd04  64 a1 50 00 00 00    mov    %fs:0x00000050[4byte] -> %eax
 +98   L3              eb 00                jmp    $0x08048f1c
END 0x08048f0b
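
The difference between the pre and post listings above can be modeled with a small state machine that counts spills and restores. This is a simplified sketch under stated assumptions, not drreg's implementation: the event names and `toy_simulate_aflags` are hypothetical, an app write is assumed to overwrite all flags without reading them, and an app read is assumed to leave the spill slot valid.

```c
#include <assert.h>

typedef enum {
    TOY_TOOL_RESERVE,     /* tool asks for aflags */
    TOY_TOOL_UNRESERVE,   /* tool is done with aflags */
    TOY_APP_READS_FLAGS,  /* next app instr reads the flags */
    TOY_APP_WRITES_FLAGS, /* next app instr writes all flags w/o reading */
    TOY_BB_END            /* end of the basic block */
} toy_event_t;

typedef struct {
    int spills;
    int restores;
} toy_cost_t;

/* Counts the spills and restores emitted for a sequence of events.
 * lazy != 0: restores are deferred until the app needs the flags.
 * lazy == 0: every unreserve restores immediately.
 */
toy_cost_t
toy_simulate_aflags(const toy_event_t *events, int n, int lazy)
{
    toy_cost_t cost = { 0, 0 };
    int spilled = 0;  /* app flags currently live only in the spill slot */
    int reserved = 0; /* tool currently holds aflags */
    for (int i = 0; i < n; i++) {
        switch (events[i]) {
        case TOY_TOOL_RESERVE:
            assert(!reserved);
            reserved = 1;
            if (!spilled) {
                cost.spills++;
                spilled = 1;
            }
            break;
        case TOY_TOOL_UNRESERVE:
            assert(reserved);
            reserved = 0;
            if (!lazy) {
                cost.restores++;
                spilled = 0;
            }
            break;
        case TOY_APP_READS_FLAGS:
            if (spilled) {
                cost.restores++; /* lazily restore before the app read */
                if (!reserved)
                    spilled = 0;
            }
            break;
        case TOY_APP_WRITES_FLAGS:
            if (spilled) {
                if (reserved)
                    cost.spills++; /* re-spill the new app value */
                else
                    spilled = 0; /* old value is dead: no restore needed */
            }
            break;
        case TOY_BB_END:
            if (spilled) {
                cost.restores++;
                spilled = 0;
            }
            reserved = 0;
            break;
        }
    }
    return cost;
}
```

For two back-to-back counter updates (reserve, unreserve, reserve, unreserve, end of bb), the eager policy costs two spills and two restores while the lazy policy costs one of each, which mirrors the reduction from two lahf/sahf pairs to one in the listings.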

derekbruening commented 9 years ago

* TODO support use in shared gencode outside of bb event, and in instru2instru phase

DrMem's pattern mode wants to save aflags w/ liveness analysis in the instru2instru phase (in pattern_instrument_repstr()).

Maybe for this pattern case we can add flags saving in insert phase and in instru2instru just move flags code around? But that could mess up the liveness assumptions.

Xref the non-drmgr API discussion about a parallel set of routines: drregi_*. But can we avoid whole separate routines? We can tell whether in drmgr insert phase w/o extra arg that breaks compat -- but where is liveness info stored?

We could keep the same API signatures if we store liveness in pt, just like we're doing w/ drmgr. We'd just add something like drreg_analyze_instrlist()? But what about the current index? Store instr ptrs with live info and search for the instr? OTOH, should we not assume that the instrlist is static, since the analysis is in instru2instru?

Or, we add nothing extra in the API, and liveness is computed on the spot in each routine, if called not during the drmgr insert phase? We could at least start w/ that, and add explicit storage via new API routine later as an optimization w/o breaking compatibility.

Gencode support seems a subset: the user probably wants to spill while appending, and thus there are no further instrs, and thus everything is live.
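
A minimal sketch of computing liveness on the spot with a forward scan, assuming each instruction is summarized by read and write bitmasks. The names (`toy_instr_t`, `toy_reg_is_dead`) are hypothetical, and partial-register writes and control flow are ignored. Note how the gencode case falls out: when appending at the end there are no further instrs, so the scan finds nothing and everything is treated as live.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t reads;  /* bit i set: instr reads register i */
    uint32_t writes; /* bit i set: instr fully writes register i */
} toy_instr_t;

/* Returns 1 if reg is dead at position where (written before being read
 * in instrs[where..n)), and 0 if it is live or its fate is unknown.
 */
int
toy_reg_is_dead(const toy_instr_t *instrs, int n, int where, int reg)
{
    assert(reg >= 0 && reg < 32);
    assert(where >= 0 && where <= n);
    uint32_t bit = 1u << reg;
    for (int i = where; i < n; i++) {
        if ((instrs[i].reads & bit) != 0)
            return 0; /* read first: the app value is needed */
        if ((instrs[i].writes & bit) != 0)
            return 1; /* overwritten first: the app value is dead */
    }
    return 0; /* fell off the end: conservatively assume live */
}
```

Recomputing this on every call costs a scan per request, which is the trade-off against storing liveness explicitly via a separate analysis routine.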

derekbruening commented 7 years ago

drreg is complete enough that we should be able to close this soon. With 00d39c7 in place, it now matches the efficiency of the hand-coded spilling in the samples and as part of #1273 I was able to convert all the remaining samples to use drreg. I'm going to wait until the Dr. Memory port to drreg is complete as a few issues remain there that might require additions or possibly changes to the interface.

d749f93 i#511 drreg: initial framework, liveness analysis, and aflags implementation
97393af i#511 drreg: register reservation
2883a1f i#511 drreg: register reservation: implement drreg_get_app_value
942fd5c i#511 drreg: support reserved regs across app instrs
4b16998 i#511 drreg: lazy restore of GPR regs
58e8582 i#511 drreg: export instr_is_reg_spill_or_restore()
bf99438 i#511 drreg: add fault handling
7e39d3d i#511 drreg: handle labels added by other components
998070a i#511 drreg: mark as experimental
ee3ad18 i#511 drreg: add vector convenience routines
513bcdb i#511 drreg: lazily restore aflags
1f4378c i#511 drreg: restore lazy aflags on a fault
1367094 i#511 drreg: add drreg_reservation_info()
83833a9 i#511 drreg: advance priorities outside of Dr. Memory ranges
7dd5b31 i#511 drreg: add error handling callback
71bbc5b i#511 drreg: add support for aflags preservation outside insert phase
8ee8d81 i#511 drreg: add support for register preservation outside insert phase
23f2abe i#511 drreg: add drreg_is_register_dead()
2469668 i#511 drreg: add drreg_reserve_dead_register()
ac8d95e i#1273 drmgr, i#511 drreg: convert countcalls to use drmgr + drreg
b2c9aea i#511 drreg: convert drcachesim to use drreg
00d39c7 i#511 drreg: keep aflags in eax where possible

derekbruening commented 7 years ago

Adding a note here to remember to remove the "work in progress...interface may be in flux" from drreg.dox when closing this.

Another note: the drreg barrier needed when invoking things like dr_insert_mbr_instrumentation() needs to be better documented.

One annoying thing I hit when converting the samples was that a sample not using drreg itself, but using drx_insert_counter_update() and drmgr, was forced to init drreg on its own. Perhaps we can solve this by having drx_init call drreg_init but in a "weak" way so that any pre-existing or later call overrides it?
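
One way to let several components initialize a shared library independently is to reference-count the init calls and merge their options. The sketch below is a toy model with hypothetical names (`toy_state_t`, `toy_init`, `toy_exit`); summing the requested spill slots and OR-ing the conservative setting are assumptions made for this example rather than a statement of what drreg does.

```c
#include <assert.h>

typedef struct {
    int init_count;      /* number of outstanding init calls */
    int num_spill_slots; /* combined request from all callers */
    int conservative;    /* on if any caller asked for it */
} toy_state_t;

/* May be called by every component that needs the library. */
void
toy_init(toy_state_t *s, int num_spill_slots, int conservative)
{
    assert(num_spill_slots >= 0);
    s->init_count++;
    /* Each caller may hold its slots at the same time, so requests add up. */
    s->num_spill_slots += num_spill_slots;
    s->conservative = s->conservative || conservative;
}

/* Returns 1 when the last user has exited and resources can be freed. */
int
toy_exit(toy_state_t *s)
{
    assert(s->init_count > 0);
    s->init_count--;
    if (s->init_count == 0) {
        s->num_spill_slots = 0;
        s->conservative = 0;
        return 1;
    }
    return 0;
}
```

With this shape, an extension can call the init routine with its own minimal needs and a client that also uses the library simply calls it again with its own options.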

derekbruening commented 7 years ago

359f8ee i#511 drreg: document lazy restore "barriers"
583b2fc i#511 drreg: simplify usage by combining multiple inits

derekbruening commented 7 years ago

Xref #1963

johnfxgalea commented 5 years ago

Are there any more available notes on the support of virtual registers by any chance? In particular, I am wondering whether the allocation of physical registers would be done in the final phase related to Instrumentation-to-instrumentation transformations? Moreover, how would DynamoRIO reason over virtual registers in previous stages? I assume there needs to be some new class of special registers that the IR may handle?

derekbruening commented 5 years ago

A virtual register feature was never implemented and never got far enough to have a detailed design. If you were interested in adding such a feature I would suggest filing a separate issue and writing some kind of design proposal.
