DynamoRIO / dynamorio

Dynamic Instrumentation Tool Platform
Other
2.67k stars 562 forks source link

asynch signal interactions with client clean calls #569

Open derekbruening opened 9 years ago

derekbruening commented 9 years ago

From bruen...@google.com on October 02, 2011 20:07:11

delayable signal delivery time may be unbounded if arrives in middle of client clean call since won't unlink underlying fragment unless signal arrives directly inside it. not a problem for base DR, only with client inserting clean calls.

Original issue: http://code.google.com/p/dynamorio/issues/detail?id=569

derekbruening commented 7 months ago

This also causes external detach-via-signal attempts to fail: https://groups.google.com/g/dynamorio-users/c/Yl5g5jM-NY8

To solve, presumably we'd have to store the code cache PC or linkstub or some other value into a slot prior to invoking the clean call, and then check that slot to identify the underlying fragment so we can unlink it. It would have be cleared on return from the call.

derekbruening commented 7 months ago

Alternatively we can obtain the return address from the dstack and use it to locate the fragment. This is lower overhead but more complex? Though we already rely on locating the mcontext on the dstack in dr_get_mcontext() so perhaps the dstack layout is easy to navigate from the bottom.

derekbruening commented 7 months ago

Pasting from the user list thread linked above:

I think we want to call find_next_fragment_from_gencode() for detach too to handle the clean call save/restore code.  If it's not in those and not in_fcache, I think what you would do is see whether whereami is DR_WHERE_CLEAN_CALLEE.  If that's the case, that should indicate it's in the initial callee or something it called.  Sanity check by confirming it's on the dstack and not an app stack.  Then you know there's an mcontext laid out on the dstack and I believe the return address slot should be in a constant location (IIRC clean call optimizations that only save some regs still use a full mcontext layout?).  Then you pclookup and unlink that fragment and when the clean call returns it will go back to dispatch.  There can still be delays if the clean call blocks for i/o or something but at least it's not unbounded.

derekbruening commented 7 months ago

We should share this code between detach and pending app signal handling.

artemshcherbak25 commented 7 months ago

I may not have fully understood your last answer, but I did this experiment: In the handle_nudge_signal function which call during signal handling I added else branch for the “not in cache” case. find_next_fragment_from_gencode() and make unlink for this fragment.

    if (safe_is_in_fcache(dcontext, (byte *)sc->SC_XIP, (byte *)sc->SC_XSP) &&
       dcontext->interrupted_for_nudge == NULL) {
                        ...
    } else{
        fragment_t *f = find_next_fragment_from_gencode(dcontext, sc);
        dcontext->underlying_fragment = f;  //save fragment in new field

        if (f != NULL) {
             if (unlink_fragment_for_signal(dcontext, f_test, FCACHE_ENTRY_PC(f_test))) {
                 dcontext->interrupted_for_nudge = f;
             }
        }
    }

After that, I noticed that we now go into dispatcher every time, whether we're in the cache or not. (which should have happened since we made unlink) but the detach_externally_on_new_stack() function works differently and in the case when we are not in the cache we do not translate_mcontext, and as I understood, there is called recreate_app_state_internal, which restores our application to the original state. inside recreate_app_state_internal() there is a condition in which we do not enter when not in cache:

else if (in_fcache(mcontext->pc)) {
} else {
        fragment_t *f = tdcontext->underlying_fragment;
       do_the_same_things_like_for_case_in_cache;
}

Obviously, it did not work fully and after detach the application fell off SIGSEGV. As I understand, somehow we need to do recreate_app_state_internal for the case not in the cache, but I have concerns that it is not easy and may not work when we are somewhere else in the third place (not in cache not in clean call).

Maybe we should consider an alternative, somehow call unlink_fragment_for_signal at a safe point (when we are in the code cache). Somehow postpone the unlink call until returning to the cache?