better support for clients de-referencing far, PC-relative, or stolen-reg-base app memory values: have DR mangle them

derekbruening commented 8 years ago

Split from #1823

This seems like a more general problem than just an issue with dr_insert_mbr_instrumentation(). While often inserted code that de-references app memory is app code, this is not always the case, and we want to support tool code doing so. Similarly to the stolen register on ARM, we should at least document that the burden falls on clients to detect and handle far refs on Linux rather than blindly copying opnds. We should also add convenience routines if it is too awkward to work around this using regular API routines.

A similar problem exists with PC-relative memory references: if a tool blindly copies the operand, is DR supposed to try and mangle it, potentially spilling a register and maybe messing up the tool's instrumentation assumptions?

One proposal would be a drutil_insert_get_mem_value() routine in an extension that handles the corner cases.

derekbruening commented 8 years ago

\ TODO on amd64 today tools rely on DR mangling pc-rel instrs, which could fail

If we have mangle_rel_addr() only operate on app instrs, these all fail:

debug-internal-64: 219 tests passed, **** 5 tests failed: ****
        code_api|client.count-ctis 
        code_api|client.count-ctis-noopt 
        code_api|sample.modxfer 
        code_api|sample.modxfer_app2lib 
        code_api|sample.instrcalls 

<Application /work/dr/build_suite/build_debug-internal-64/suite/tests/bin/client.count-ctis (5413) DynamoRIO usage error : encode error: rip-relative reference out of 32-bit reach>

For count-ctis-noopt:

interp: start_pc = 0x00007f4b1f4c8aa0
  0x00007f4b1f4c8aa0  ff 25 22 05 22 00    jmp    <rel> 0x00007f4b1f6e8fc8[8byte]

bb ilist before mangling:
TAG  0x00007f4b1f4c8aa0
 +0    m4 @0x000000004e78b120  65 48 89 0c 25 18 00 mov    %rcx -> %gs:0x18[8byte]
                               00 00
 +9    m4 @0x000000004e78b1a0  48 8b 0d c1 e5 fc d0 mov    <rel> 0x00007f4b1f6e8fc8[8byte] -> %rcx

It relies on DR re-relativizing the app's PC-rel mem ref to pass as an arg to a clean call.

This could fail if the cache is not 32-bit-reachable from the app code (which is not guaranteed).

For now I'll put back the mangling of meta instrs for x86 to maintain the status quo (on ARM the reach is too small and we'll assert on almost anything) but this remains an issue.

derekbruening commented 8 years ago

This kind of utility would also simplify handling app memrefs involving the stolen register on ARM, which is up to the client to properly handle and is kind of a pain.

derekbruening commented 8 years ago

Note that using drutil_insert_get_mem_addr() and loading from that address does solve the corner cases for a client. drutil_insert_get_mem_value() might just call drutil_insert_get_mem_addr() to handle the far and pc-rel corner cases, and then load from the address into a passed-in scratch reg.

derekbruening commented 8 years ago

While the utility routine will be a convenience, this issue is about more than just a utility routine, as A) clients won't know to use it: it's very natural to just take the memref opnd; B) it's awkard to use in places like a clean call arg: it takes multiple steps and explicit scratch regs. We're hoping for a solution to clients using memref opnds directly. Can we make that simpler, or at the least can we detect the app memref and raise a warning plus augment the clean call docs to warn about such things. There may not be a simple solution and it may need a fancier feature like #510 that allows chaining the call to get the mem value directly into the clean call.

derekbruening commented 7 years ago

** TODO too hard to make library routine for post-app-inst instru like this!

Inserting instru after the current app instr messes up drreg.

Having a callback where the library routine waits until the next instr and then calls the tool: hard to combine w/ tool's other instru.

We discussed this offline a little bit. It seems relatively straightforward for a client to implement instrumentation to get the stored value:

1) In the insert phase for the target store, get the store address into a register. Set some user_data state to tell the next step to do some work.

2) In the insert phase for after the target store, since the user_data state is set, go and load from the register to get the store value. Use it in the desired manner.

A call has to be specially handled, as does a non-branch bb-final store (if bb is truncated, e.g.).

Where it gets hard is making a library routine to do this, as the whole framework is based on inserting instrumentation and then acting on it immediately. Maybe it's just not worth making a library routine, and we make a sample instead showing how to do it?

If we really want a library routine, one approach not mentioned previously is to duplicate the store operation. Emulating would be a pain, but re-executing a copy of the instruction could work. Any non-copy operation must have a register as the other operand, and a copy is easy to emulate. Get the address, load the value there, and then execute a copy of the store.

E.g.:

L3   add [eax,esi,4], ecx

=>

m4   <spill edx if not dead>
m4   lea edx, [eax,esi,4]    // result of drutil_get_mem_addr
m4   mov edx, [edx]
m4   add edx, ecx                // "emulate" using the processor
m4   <to-be-stored value is now in edx: act on it>
m4   <restore edx if not dead>
L3   add [eax,esi,4], ecx

It can be optimized of course but that's the idea.

egrimley commented 7 years ago

72be5de seems to have broken client.drx_buf-test on ARM and AArch64.

toshipiazza commented 7 years ago

@egrimley Hmm that's odd, I tried on my raspberry pi (so ARM, not AArch64) and everything seemed to work fine. I'll take a closer look to make sure it really does work on ARM. However, I don't have any AArtch64 hardware, would you be able to take a look at that?

derekbruening commented 7 years ago

@egrimley could you point at the CDash failure to make it easier to see the details of how it failed?

egrimley commented 7 years ago

I'll ask what's happening about CDash.

With client.drx_buf-test, on ARM I see an infinite loop, while on AArch64 I see a CHECK failure in event_exit: num_faults is 202 instead of 204. But perhaps that's expected: there's some "NYI on AArch64" in 72be5de.

egrimley commented 7 years ago

The test passes on Debian armhf jessie, but fails on Debian armhf stretch, where it looks as though test_asm_123 (as instrumented) does something that makes pthread_join wait for ever.

toshipiazza commented 7 years ago

I'll do more research on the ARM failure (I actually didn't modify test_asm_123 in my commit so I'm wondering what's up with that). As for the AArch64 failure, I think the following patch should fix it:

diff --git a/suite/tests/client-interface/drx_buf-test.dll.c b/suite/tests/client-interface/drx_buf-test.dll.c
index 89782e9..e1d928a 100644
--- a/suite/tests/client-interface/drx_buf-test.dll.c
+++ b/suite/tests/client-interface/drx_buf-test.dll.c
@@ -436,7 +436,7 @@ event_exit(void)
      * because the callback is called on thread_exit(). Finally, two more for
      * drx_buf_insert_buf_memcpy().
      */
-    CHECK(num_faults == NUM_ITER * 2 + 2 + 2,
+    CHECK(num_faults == NUM_ITER * 2 + 2 + IF_AARCH64_ELSE(0, 2),
             "the number of faults don't match up");
     if (!drmgr_unregister_bb_insertion_event(event_app_instruction))
         CHECK(false, "exit failed");

Of course I can't test this but would you be able to verify that this fixes the AArch64 test?

egrimley commented 7 years ago

Yes, that patch fixes the AArch64 test. Which Raspberry Pi do you have?

toshipiazza commented 7 years ago

I think I have the Raspberry Pi 2 model, which is running Debian Jessie.

egrimley commented 7 years ago

Then perhaps you could try Debian Stretch, in a chroot or as an alternative boot.

toshipiazza commented 7 years ago

Interesting: I chrooted using debootstrap to pull debian stretch, and I was able to replicate the failure in debian stretch -- natively. The drx_buf-test.c program itself fails natively even without running it under dynamorio.

Even more puzzling is that when I copy that program out of the chroot and back into debian jessie, it doesn't crash! What is going on here?

egrimley commented 7 years ago

Perhaps it's the different glibc. What does GDB say?

toshipiazza commented 7 years ago

Looks like the problem was that on Debain Stretch, $r4 was assumed not clobbered, but in Debian Jessie it was restored correctly. The pthread handling code immediately dereferenced $r4 after test_thread_asm quit. Saving them for each of the assembly tests fixed the crash.

So close: it was a difference in libpthread rather than in glibc.

Submitted a PR: GH-2193

egrimley commented 7 years ago

You can't "push" a register on AArch64, and X4 doesn't need to be saved anyway. Pushing a single register on ARM does not conform to the calling convention, though it should be safe in a leaf function. Perhaps a simpler solution would be to use r12 instead of r4 on ARM. See: https://github.com/DynamoRIO/dynamorio/pull/2195

If you do find you need to save and restore registers on ARM/AArch64, take a look at SAVE_PRESERVED_REGS, added in f5adcfe.

derekbruening commented 2 years ago

I think this should be re-opened. The get-mem-value part is resolved as we provided a sample, after deciding a library routine had too many complexities. But we didn't use the sample's approach to update our other samples and tests that have problems with blindly copying far and pc-rel operands. We still have the hack that mangles meta instrs on x86 but not ARM; and in fact because of the IF_ARM used we are now mangling meta rel-addr instrs on AArch64.

derekbruening commented 2 years ago

What use cases are at issue here?

The tool copying an app memref opnd that is pc-relative, has an x86 segment, or on aarchxx uses the stolen register, and using it in a meta instr (e.g., an arg to a clean call for an indirect branch target).

This will normally only be if the tool wants the value stored at that memref address, because the tool should use drutil_insert_get_mem_addr() to get the address and it handles all these corner cases. (We proposed a drutil_insert_get_mem_value() but concluded it was quite difficult to create as a library routine and we made the memval sample instead.)

What are the consequences if DR does not mangle these?

pc-relative: on encoding it may not reach (pretty common for code cache to not reach app for 64-bit) and it's too late there to insert extra instrs, so encoding fails. On x86 there is a nice assert explaining the exact problem; other arches may have vague "can't encode" asserts.
segment and stolen reg: these will encode but will silently produce the wrong value. (They might crash if the offset goes beyond DR's mmap.)

DR could identify app-main-segment uses in tool code and warn about them, but cannot easily distinguish app from tool uses of the stolen reg or the DR segment.

What are the consequences if DR does mangle these?

Mangling might add several additional instructions (e.g., a "movz;movk;movk" sequence to materialize an address into a register on aarch64) inside a tool sequence. Some tools might assume they have tight control over the exact instruction ordering, so this could confuse restore_state handlers or other code that walks the code cache sequence.

DR's mangling may use scratch registers and slots. DR's slot 0 is not exposed to tools, so if DR uses that, there should be no conflict.

For pc-relative mangling:

x86 uses MANGLE_RIPREL_SPILL_SLOT = TLS_REG0_SLOT (regardless of which reg) so there is no problem. MANGLE_FAR_SPILL_SLOT = TLS_REG1_SLOT => only used for far cti so ok.
for aarchxx: mangle_rel_addr() uses slot0 and stolen-reg slot. so no problem there.

For the stolen register:

mangle_rel_addr spills any of r0-r5 to its own slot, so we have a problem.

For segments:

mangle_seg_ref: spills any of a,b,c,d reg to its own slot, so we have a problem. mangle_seg_ref_opnd: for 16-bit base/idx, spills a 2nd a,b,c,d reg to its own slot! So even if we changed mangle_seg_ref to always use slot 0, we'd still have a problem w/ a 16-bit seg base/idx reg.

Slots beyond slot 0 can overlap with slots used by tools or drreg. The DR use is local and self-contained. On a fault in the copied opnd: first DR would restore from the spilled slot; then the tool or drreg would likely also do a restore from the same slot, if it does a walk like drreg does and thinks DR's spill is its own. So it might "just work". If the DR restore went last that might be safer.

We need to ensure that DR will actually restore the value, though, as normally it ignores tool code. Instructions added by DR will be marked "our_mangling": so spills and restores and things like the "movz;movk;movk". But the instr with the app opnd will not be. We'll have to make sure the right thing happens; right now the translate code might complain.

How can DR identify which stolen reg or DR-segment opnds to mangle?

pc-relative opnds are easy to identify and mangle, as are opnds using the app's segment.

But for the stolen reg or a DR-segment-using opnd, DR cannot easily distinguish its use in a tool instr from a tool accessing DR's scratch space.

We could add a flag to opnd_t saying "app opnd"?

Is there space?
Are there cases where the flag would incorrectly be left on? E.g., where the app changes an app opnd for use in a tool instr and it is no longer in need of mangling. Seems unlikely: DR is the one making scratch refs; other way where client takes app opnd w/o stolen reg and inserts stolen reg might happen but we've already documented to avoid explicitly using the stolen reg..
Compatibility concerns? For an old client and new DR: client-created opnds should never be app opnds; only DR-decoded ones.

Whatever the decision, we should document whether the client needs to worry or not

Which way to go?

Both Abhinav and I leaning toward DR mangling these tool opnds: b/c of compatibility issues and complexity if we get rid of today's mangling.

If it's too complex to fully cover the corner cases above, for a short-term solution for #5316 we could have drbbdup look for pc-rel and call instrlist_insert_mov_immed_ptrsz (or, just remove the IF_ARM(xxxx) and mangle as-is on arm).

I don't understand my old comment on ARM

"on ARM the reach is too small and we'll assert on almost anything"

Doesn't that mean we'd want the mangling enabled?

Also note that on ARM DR seems to decode into a base-disp opnd with DR_REG_PC as the base instead of a rel-addr: see the test I added in PR #5306.

qinjingyuan commented 1 year ago

Hello, excuse me, is there any way to obtain the access traces of PC, addr, size, rw, and memval, which is a combination of memtracex_x86.c and memval_simple. c.

derekbruening commented 1 year ago

Hello, excuse me, is there any way to obtain the access traces of PC, addr, size, rw, and memval, which is a combination of memtracex_x86.c and memval_simple. c.

Please use the users list https://groups.google.com/forum/#!forum/DynamoRIO-Users for questions.

DynamoRIO / dynamorio