Redo rep string expansion to avoid the instruction fetch on subsequent iterations

derekbruening commented 3 years ago

This is related to #4915 where we may remove the instr fetches for subsequent rep string iterations, currently marked as special "non-fetched" instructions, from the offline drcachesim trace post-processing. Here though the goal is remove the instruction fetch for all clients by changing the expansion.

Today the single-block-external-loop expansion results in a string operation instruction in each iteration:

$ bin64/drrun -c api/bin/libmemtrace_simple.so -- suite/tests/bin/allasm_repstr && cat `ls -1td api/bin/*.log|head -1`
Format: <data address>: <data size>, <(r)ead/(w)rite/opcode>
0x401018:  2, rep movs
0x40200e:  1, r
0x402000:  1, w
0x401018:  2, rep movs
0x40200f:  1, r
0x402001:  1, w
0x401018:  2, rep movs
0x402010:  1, r
0x402002:  1, w
0x401018:  2, rep movs
0x402011:  1, r
0x402003:  1, w
0x401018:  2, rep movs
0x402012:  1, r
0x402004:  1, w

There are 3 ways we could remove it (aside from an offline post-processing model as in #4915):

A) Use runtime state:

if TLS.flag == 0
  record instr fetch
  set TLS.flag to 1
<today's expansion but without emulated instr fetch>
if loop is going to be not taken (check ecx)
  set TLS.flag to 0
loop

This is expensive.

B) Unroll the 1st iter in the expansion. This adds a bunch more branches and we'll want to try to ensure it won't mess up drreg or other linear-flow libraries.

C) Unroll the 1st iter across 2 blocks: 1st iter block plus rest-of-loop block. This may be harder than B due to needing unique PC entries.

Probably B would be the way to go.

derekbruening commented 2 years ago

Re-visiting since we want this for #3995 to ensure the instruction counts used line up with PMU data: I don't understand B) unless it makes the whole loop internal to the block (after the unrolled 1st iter). That seems dangerous: DR relies on loops going between blocks to bound signal delivery times, perform performant "unlink" flushes, build traces, and other operations.

derekbruening commented 2 years ago

For C), I assume it's thinking of making the PC of the 2nd block after the rep byte. But how do we know the app doesn't itself jump to that point? It's unlikely but possible.

derekbruening commented 2 years ago

Trying to re-write the choices here:

A) On entry if TLS flag not set, record instr fetch and set flag; on final iter, clear flag. I assume have to try to clear flag on abnormal loop break too: a signal. So adds overhead and extra branches.

B) Keep entire loop inside block and unroll 1st iter to do fetch. Has all the downsides of a loop inside a block: messes up DR's trace building, signal delivery delay bounds, unlink flushing, shared fragment deletion promptness, etc.

C) Try to split across two blocks, one for 1st iter and 2nd for rest. Use PC+1 (skipping rep byte) for tag for 2nd and hope app never targets that PC. When see a plain string op have to decode backward I guess or have some recorded state so know it's this artificial loop body.

D) #4915: Remove all but 1st fetch in raw2trace, and add analysis to do the same when dumping a trace buffer to get the live instr counts to match too -- which also fixes online. But then instead of a single contiguous buffer being written/transmitted you have to do it in many pieces to skip the extra fetches, or waste time memcpying it on top of itself back to contiguous.

derekbruening commented 2 years ago

For drmemtrace, instruction counting for delaying traces, gaps between trace windows (xref #3995), and trace window durations are all inflated by these non-fetched instructions.

jackgallagher-arm commented 1 year ago

For the AArch64 scatter/gather expansion we decided to unroll the loop to avoid this issue. If we implement a better solution for repstr it might be worth revisiting scatter/gather expansion too.

DynamoRIO / dynamorio

Redo rep string expansion to avoid the instruction fetch on subsequent iterations #4948