DynamoRIO / dynamorio

Dynamic Instrumentation Tool Platform
Other
2.64k stars 558 forks source link

HANG suspend signal sent to new thread dropped, while os_thread_suspend() waits forever #1283

Open derekbruening opened 9 years ago

derekbruening commented 9 years ago

From bruen...@google.com on October 05, 2013 12:14:28

A user hit this with DR custom build 4.1.2293, using an Umbra-based client: start=0x0000000072000000 end=0x000000007201c000 prot=5 comment=.../umbra/build/bin/libaio.so

Thread #9344 is building a fragment (adds a client clean call to it) in log.6.9344.html:

+197 m4 @0x000000004ae12348 e8 99 42 7c 29 call $0x00000000720168de %rsp -> %rsp 0xfffffff8 (%rsp)

Fragment 2226, tag 0x00007f2bbfe5d455, flags 0x28000030, size 269:

synch with all threads my id = 9344 Giving 4 permission and seeking 4 state

In the middle of making the fragment (b/c no "Entry into F2226..." line) it triggers a synchall. There's no "dr_suspend" nor "FLUSH" in log files, and we see:

swaptions.0.9337.html:fcache_reset_all_caches_proactively: thread 9344 suspending all threads

So it's a reset. It walks through the threads, though it's not very successful as they're all building bbs (thread-private caches I suppose):

grep -E -i '^Synching' log.6.9344.html Synching with thread 9337, giving 0, requesting 4, blocking=0 Synching with thread 9340, giving 0, requesting 4, blocking=0 Synching with thread 9341, giving 0, requesting 4, blocking=0 Synching with thread 9342, giving 0, requesting 4, blocking=0 Synching with thread 9343, giving 0, requesting 4, blocking=0 Synching with thread 9345, giving 0, requesting 4, blocking=0 Synching with thread 9346, giving 0, requesting 4, blocking=0 Synching with thread 9347, giving 0, requesting 4, blocking=0 Synching with thread 9348, giving 0, requesting 4, blocking=0 Synching with thread 9349, giving 0, requesting 4, blocking=0 Synching with thread 9350, giving 0, requesting 4, blocking=0 Synching with thread 9351, giving 0, requesting 4, blocking=0 Synching with thread 9352, giving 0, requesting 4, blocking=0 Synching with thread 9353, giving 0, requesting 4, blocking=0 Synching with thread 9354, giving 0, requesting 4, blocking=0 Synching with thread 9355, giving 0, requesting 4, blocking=0 Synching with thread 9356, giving 0, requesting 4, blocking=0 Synching with thread 9357, giving 0, requesting 4, blocking=0 Synching with thread 9358, giving 0, requesting 4, blocking=0 Synching with thread 9359, giving 0, requesting 4, blocking=0 Synching with thread 9360, giving 0, requesting 4, blocking=0 Synching with thread 9361, giving 0, requesting 4, blocking=0 Synching with thread 9362, giving 0, requesting 4, blocking=0

% grep 'not at safe' log.6.9344.html thread 9337 not at safe spot (pc=0x00000000712a346c) for 4 thread 9340 not at safe spot (pc=0x00000000712a346c) for 4 thread 9341 not at safe spot (pc=0x00000000712a346c) for 4 ...

Not sure where in DR .so 0x712a346c is.

The final thread to have a suspend attempt is doing its thread init and it prints: SYSLOG_ERROR: ERROR: master_signal_handler with no siginfo (issue #26?): tid=9362, sig=12

This thread has: Client 'Umbra' thread init

And instrument_thread_init() is after os_thread_init(), but remember that signal_thread_inherit() isn't called until after that, in new_thread_setup.

We just ignore the SIGUSR2 in that case b/c it's delayable. And the suspender sits and waits, which is the last thing in the log for that thread:

About to try synch with thread 9362 Synching with thread 9362, giving 0, requesting 4, blocking=0 Looping on synch with thread 9362

It doesn't loop all the way around even once and is sitting on that new thread, who dropped the signal on the ground.

So the bug is that we drop the signal on the ground for a new thread, yet os_thread_suspend() does not have an escape valve (despite the various max iter counts in synch.c).

This is related to issue #26 , but it seems a good idea to have a timeout param to os_thread_suspend() which we'll have to plumb through to our futexes, and make sure all callers retry or ask for no timeout (and won't hit issue #26 ).

Original issue: http://code.google.com/p/dynamorio/issues/detail?id=1283

derekbruening commented 9 years ago

From etemden...@gmail.com on October 05, 2013 09:48:09

It works when I use "-no_enable_reset" flag.

Thanks, Etem

Looks like there's a synchall for reset involved. Does it work with the option "-no_enable_reset"?