DynamoRIO / dynamorio

Dynamic Instrumentation Tool Platform

ASSERT (drmgr-test) meta-instr faulted? must set translation field and handle fault! #1369

Open derekbruening opened 9 years ago

derekbruening commented 9 years ago

From bruen...@google.com on February 19, 2014 10:57:54

This has happened at least twice in the last year or so when I've run the test suite on win7. I can't reproduce it when run manually.
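For background, the usage error enforces the client-facing rule that any client-inserted (meta) instruction that can fault must carry an application translation so DR can restore app state at the fault. A minimal sketch of the expected pattern using the public API (illustrative only, not the test's code):

```c
#include "dr_api.h"

/* Sketch: inserting a meta load that might fault.  A faulting meta
 * instr with no translation triggers exactly this usage error, so we
 * attach the app pc of the instrumented instr as its translation.
 * The register choices here are arbitrary (x86 names). */
static void
insert_meta_load(void *drcontext, instrlist_t *bb, instr_t *where)
{
    instr_t *ld = XINST_CREATE_load(drcontext,
                                    opnd_create_reg(DR_REG_XAX),
                                    OPND_CREATE_MEMPTR(DR_REG_XCX, 0));
    /* INSTR_XL8 sets the translation field to an app address. */
    instrlist_meta_preinsert(bb, where,
                             INSTR_XL8(ld, instr_get_app_pc(where)));
}
```

(The client must also handle the fault itself, e.g., via a restore-state event.)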

72/88 Testing: code_api|client.drmgr-test
72/88 Test: code_api|client.drmgr-test
Command: "D:/derek/dr/build_suite/build_debug-internal-64/bin64/drrun.exe" "-s" "60" "-quiet" "-exit0" "-ops" "-stderr_mask 0xC -msgbox_mask 0 -dumpcore_mask 0xfd -staged  -code_api -client_lib 'D:/derek/dr/build_suite/build_debug-internal-64/suite/tests/bin/client.drmgr-test.dll.dll;0;'  -code_api " "-use_dll" "D:/derek/dr/build_suite/build_debug-internal-64/lib64/debug/dynamorio.dll" "D:/derek/dr/build_suite/build_debug-internal-64/suite/tests/bin/client.drmgr-test.exe"
Directory: D:/derek/dr/build_suite/build_debug-internal-64/suite/tests
"code_api|client.drmgr-test" start time: Jul 11 15:26 Eastern Daylight Time
Output:
----------------------------------------------------------
About to create thread
in wnd_callback 0x0000000000000024 0
in wnd_callback 0x0000000000000081 0
in wnd_callback 0x0000000000000083 0
in wnd_callback 0x0000000000000001 0
in wnd_callback 0x0000000000008001 3 0
About to crash
<Application D:\derek\dr\build_suite\build_debug-internal-64\suite\tests\bin\client.drmgr-test.exe (8408) DynamoRIO usage error : meta-instr faulted?  must set translation field and handle fault!>
<Usage error: meta-instr faulted?  must set translation field and handle fault! (D:\derek\dr\git\src\core\x86\arch.c, line 2924)
version 3.2.1458, custom build
-client_lib 'D:/derek/dr/build_suite/build_debug-internal-64/suite/tests/bin/client.drmgr-test.dll.dll;0;' -code_api -probe_api -msgbox_mask 0 -dumpcore_mask 253 -stderr_mask 12 -stack_size >
<end of output>
Test time =   1.92 sec
----------------------------------------------------------
Test Fail Reason:
Required regular expression not found.Regex=[^About to create thread^M?
in wnd_callback 0x0_0000024 0^M?
in wnd_callback 0x0_0000081 0^M?
in wnd_callback 0x0_0000083 0^M?
in wnd_callback 0x0_0000001 0^M?
in wnd_callback 0x0_0008001 3 0^M?
About to crash^M?
Inside handler^M?
in wnd_callback 0x0_0008001 0 2^M?
Got message 0x0*0008001 1 3^M?
All done^M?
all done^M?
$
]
"code_api|client.drmgr-test" end time: Jul 11 15:26 Eastern Daylight Time
"code_api|client.drmgr-test" time elapsed: 00:00:01

It happened on 32-bit once too -- from my notes from long ago, apparently never filed:

86/93 Testing: code_api|client.drmgr-test
86/93 Test: code_api|client.drmgr-test
Command: "F:/derek/build_suite/build_debug-internal-32/bin32/drrun.exe" "-s" "60" "-quiet" "-exit0" "-ops" "-stderr_mask 0xC -msgbox_mask 0 -dumpcore_mask 0xfd -staged  -code_api -client_lib 'F:/derek/build_suite/build_debug-internal-32/suite/tests/bin/client.drmgr-test.dll.dll;0;'  -code_api " "-use_dll" "F:/derek/build_suite/build_debug-internal-32/lib32/debug/dynamorio.dll" "F:/derek/build_suite/build_debug-internal-32/suite/tests/bin/client.drmgr-test.exe"
Directory: F:/derek/build_suite/build_debug-internal-32/suite/tests

Original issue: http://code.google.com/p/dynamorio/issues/detail?id=1369

derekbruening commented 8 years ago

Related assert in translate.c

Non-det (can't repro):

debug-internal-32: 192 tests passed, **** 1 tests failed: ****
        code_api|client.drmgr-test =>  Application D:\derek\dr\build_suite\build_debug-internal-32\suite\tests\bin\client.drmgr-test.exe (6064).  Internal Error: DynamoRIO debug check failure: D:\derek\dr\git\src\core\translate.c:807 false 
derekbruening commented 7 years ago

This happened in AppVeyor:

https://ci.appveyor.com/project/DynamoRIO/dynamorio/build/1.0.157/job/q6loeg0tadiuf3u9

in event_thread_init_ex
in event_thread_init
in pre_sys_B
in pre_sys_A
in post_sys_B
in post_sys_A
About to create thread
in wnd_callback 0x0000000000000024 0
in wnd_callback 0x0000000000000081 0
in wnd_callback 0x0000000000000083 0
in wnd_callback 0x0000000000000001 0
in wnd_callback 0x0000000000008001 3 0
About to crash
<Application C:\projects\dynamorio\build\build_debug-internal-64\suite\tests\bin\client.drmgr-test.exe (3824) DynamoRIO usage error : meta-instr faulted?  must set translation field and handle fault!>
<Usage error: meta-instr faulted?  must set translation field and handle fault! (C:\projects\dynamorio\core\translate.c, line 724)
version 6.2.17220, custom build
-no_dynamic_options -client_lib 'C:\projects\dynamorio\build\build_debug-internal-64\suite\tests\bin\client.drmgr-test.dll.dll;0;' -code_api -probe_api -msgbox_mask 0 -dumpcore_mask 125 -stderr_mask 12 -stack_size 56K -max_elide_jmp 0 -max_elide_call 0 -no_inline_ignored_syscalls -staged -native_exec_default_list '' -no_
C:\projects\dynamorio\build\build_debug-internal-64\suite\tests\bin\client.drmgr-test.dll.dll=0x00007ff738080000
C:\windows/system32/KERNEL32.dll=0x000000765dd00000
C:\windows/system32/KERNELBASE.dll=0x000000765de50000
C:\projects\dynamorio\build\build_debug-internal-64\ext\lib64\debug/drmgr.dll=0x00007ff740130000>
derekbruening commented 7 years ago

The simpler translation failure also happened in AppVeyor:

https://ci.appveyor.com/project/DynamoRIO/dynamorio/build/1.0.180/job/2ufv6l8duo8m5bic

code_api|client.drmgr-test =>  Application C:\projects\dynamorio\build\build_debug-internal-32\suite\tests\bin\client.drmgr-test.exe (6092).  Internal Error: DynamoRIO debug check failure: C:\projects\dynamorio\core\translate.c:817 false 
abhinav92003 commented 4 years ago

Reviving this thread as this issue is seen frequently on a proprietary app running on AArch64.

abhinav92003 commented 3 years ago

The assert failure occurs in recreate_app_state_from_ilist at https://github.com/DynamoRIO/dynamorio/blob/1bc72b6651f00a1db8ed0863056cd89705d75a26/core/translate.c#L910

recreate_app_state_from_ilist receives an ilist and a start_cache address. Usually the instrs in the ilist match those starting at start_cache, but in the failing invocation of recreate_app_state_from_ilist they don't match.

The code iterates over the ilist; when cpc=1247216016, the ilist has ldr +0xf0(%sp)[8byte] -> %x30, but decoding the code cache shows svc $0x0000 at cpc. (record_pending_signal also saw the syscall at this address, which is why it attempted to deliver the signal now.) This is where we hit the assert, as the ldr turns out to be a meta-instr.
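In outline, the walk that hits the assert looks like this (a heavily simplified paraphrase of the translate.c logic, not the verbatim code):

```c
/* Simplified paraphrase of the recreate_app_state_from_ilist walk:
 * advance a cache pc in lockstep with the recreated ilist and map the
 * faulting cache pc to the instr it falls within. */
static app_pc
translate_from_ilist(dcontext_t *dcontext, instrlist_t *ilist,
                     cache_pc start_cache, cache_pc target)
{
    cache_pc cpc = start_cache;
    for (instr_t *inst = instrlist_first(ilist); inst != NULL;
         inst = instr_get_next(inst)) {
        int len = instr_length(dcontext, inst);
        if (target < cpc + len) {
            /* App instrs always carry a translation; a meta-instr does
             * only if the client set one -- a meta-instr without one is
             * the "meta-instr faulted?" usage error. */
            return instr_get_translation(inst);
        }
        cpc += len;
    }
    return NULL; /* target not within this fragment */
}
```

When the ilist and the cache are out of sync, this lockstep walk attributes the faulting cache pc to the wrong instr -- here a meta ldr instead of the app svc.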

The ilist is created by recreate_app_state_internal at https://github.com/DynamoRIO/dynamorio/blob/1bc72b6651f00a1db8ed0863056cd89705d75a26/core/translate.c#L1292. Checking that logic now.

abhinav92003 commented 3 years ago

DR attempts to create the ilist for the available fragment using recreate_fragment_ilist at https://github.com/DynamoRIO/dynamorio/blob/08f48f398ff1f4c913be450207380e9d6f59bc01/core/translate.c#L1303

During recreate_fragment_ilist, build_bb_ilist creates the ilist using the fragment's tag (which is the app pc). I see that the app instrs from the bb are added to the ilist just fine. But the client meta instrs added to the ilist by client_process_bb seem very different from the ones originally in the fragment. https://github.com/DynamoRIO/dynamorio/blob/08f48f398ff1f4c913be450207380e9d6f59bc01/core/arch/interp.c#L4125

abhinav92003 commented 3 years ago

I found that two different sets of instrumentation are being applied by the client (drcachesim) -- via event_delay_app_instruction and event_app_instruction. drcachesim switches from the former to the latter once the delay threshold specified by -trace_after_instrs is reached. After the threshold, for fragments created before it, ilist recreation uses event_app_instruction whereas the code cache version was constructed with event_delay_app_instruction, which leads to the mismatch and the assert failure.

To verify: 3 consecutive attempts with -trace_after_instrs hit the assert failure, but 3 consecutive attempts without it didn't.
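For reference, the switch has roughly this shape (a sketch using drmgr's registration API; the tracer's actual code differs in detail):

```c
/* Sketch of the phase switch; both callbacks use drmgr's insertion
 * signature.  Illustrative only. */
static dr_emit_flags_t
event_delay_app_instruction(void *drcontext, void *tag, instrlist_t *bb,
                            instr_t *instr, bool for_trace,
                            bool translating, void *user_data);
static dr_emit_flags_t
event_app_instruction(void *drcontext, void *tag, instrlist_t *bb,
                      instr_t *instr, bool for_trace, bool translating,
                      void *user_data);

static void
switch_to_tracing_instrumentation(void)
{
    drmgr_unregister_bb_insertion_event(event_delay_app_instruction);
    drmgr_register_bb_instrumentation_event(NULL /* no analysis */,
                                            event_app_instruction, NULL);
    /* Fragments built earlier still hold delay-phase code, but
     * recreating their ilists now runs event_app_instruction: the
     * mismatch behind the assert. */
}
```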

abhinav92003 commented 3 years ago

I see that drcachesim's tracer indeed invokes dr_unlink_flush_region when the delay threshold is reached; this seems intended to flush out stale fragments. https://github.com/DynamoRIO/dynamorio/blob/5cbe8113c919b557f7bfc3daafffc9b1fef7938c/clients/drcachesim/tracer/tracer.cpp#L1393

However, based on the documentation, it shouldn't be invoked with size=0, which seems to be the case in the call above. https://github.com/DynamoRIO/dynamorio/blob/5cbe8113c919b557f7bfc3daafffc9b1fef7938c/core/lib/instrument_api.h#L6265 The documentation also mentions restrictions on which options this routine is available with. Verifying whether all these conditions are met...

derekbruening commented 3 years ago

I thought you had printed out the instrumentation applied to the original block and to the recreated block and they were identical?

derekbruening commented 3 years ago

I thought you had printed out the instrumentation applied to the original block and to the recreated block and they were identical?

And that the recreate walk was starting at a shifted point, beyond which the code was identical between the two: a clean call. I thought the instrumentation prior to the threshold was an inlined counter increment, ruling out the delay.

derekbruening commented 3 years ago

For using dr_unlink_flush_region: it is not synchronous, so there is a problem there with the instru switch. One solution is to use dr_flush_region instead, which is synchronous but more expensive -- but given that this only happens once, that's probably OK. Another is to store translations, which adds its own overhead. Another is to track which blocks used which type of instru inside the tracer.
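A sketch of the dr_flush_region option (the flush-everything arguments here are illustrative):

```c
#include "dr_api.h"

/* Sketch: the synchronous option.  dr_flush_region suspends all
 * threads, removes the fragments, and only then resumes them, so no
 * stale delay-phase fragment keeps running after it returns. */
static void
flush_everything(void)
{
    if (!dr_flush_region(NULL, (size_t)-1))
        DR_ASSERT_MSG(false, "full flush failed");
}
```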

derekbruening commented 3 years ago

Another is to track which blocks used which type of instru inside the tracer.

This may not be simple with duplicate tags for thread-private blocks and intra-trace blocks.

abhinav92003 commented 3 years ago

Another is to track which blocks used which type of instru inside the tracer.

This may not be simple with duplicate tags for thread-private blocks and intra-trace blocks.

I'm first trying to replace dr_unlink_flush_region with dr_flush_region and dr_redirect_execution to the next app pc.
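Roughly (a sketch; the mcontext handling is my paraphrase, though hit_instr_count_threshold is the tracer routine that shows up in the stack trace in my next comment):

```c
/* Sketch: synchronously flush all fragments, then resume at next_pc. */
static void
hit_instr_count_threshold(app_pc next_pc)
{
    void *drcontext = dr_get_current_drcontext();
    dr_mcontext_t mc = { sizeof(mc), DR_MC_ALL };
    dr_get_mcontext(drcontext, &mc);
    dr_flush_region(NULL, (size_t)-1); /* synchronous synchall flush */
    mc.pc = next_pc;
    dr_redirect_execution(&mc); /* does not return */
}
```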

Seeing a new debug assert failure:

 Internal Error: DynamoRIO debug check failure: dynamorio/core/arch/clean_call_opt_shared.c:216 (try__except) == &global_try_except || (try__except) == NULL || (try__except) == &get_thread_private_dcontext()->try_except

This seems to come up when I add the dr_flush_region call to the hit_instr_count_threshold/check_instr_count_threshold inlined callee. Checking...

abhinav92003 commented 3 years ago
 Internal Error: DynamoRIO debug check failure: dynamorio/core/arch/clean_call_opt_shared.c:216 (try__except) == &global_try_except || (try__except) == NULL || (try__except) == &get_thread_private_dcontext()->try_except

Here's a stacktrace for this:

 [Ignore #0 - I added this method to set a breakpoint, it contains just an assert(false)] 
(gdb) bt
#0 fail_with_assert () at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/core/arch/clean_call_opt_shared.c:218
#1  0x0000fffff7e8a668 in decode_callee_instr (dcontext=0xfffde8c6ec40, ci=0xfffdeadc3df0, instr_pc=0x72018b34 <clean_call()> "\375{\276\251\375\003") at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/core/arch/clean_call_opt_shared.c:242
#2  0x0000fffff7e8b570 in decode_callee_ilist (dcontext=0xfffde8c6ec40, ci=0xfffdeadc3df0) at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/core/arch/clean_call_opt_shared.c:389
#3  0x0000fffff7e8d228 in analyze_clean_call (dcontext=0xfffde8c6ec40, cci=0xfffdeb2dd230, where=0xfffdea122650, callee=0x72018b34 <clean_call()>, save_fpstate=false, always_out_of_line=true, num_args=0, args=0x0) at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/core/arch/clean_call_opt_shared.c:739
#4  0x0000fffff7d76de0 in dr_insert_clean_call_ex_varg (drcontext=0xfffde8c6ec40, ilist=0xfffdea120be0, where=0xfffdea122650, callee=0x72018b34 <clean_call()>, save_flags=DR_CLEANCALL_ALWAYS_OUT_OF_LINE, num_args=0, args=0x0) at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/core/lib/instrument.c:5606
#5  0x0000fffff7d77760 in dr_insert_clean_call_ex (drcontext=0xfffde8c6ec40, ilist=0xfffdea120be0, where=0xfffdea122650, callee=0x72018b34 <clean_call()>, save_flags=DR_CLEANCALL_ALWAYS_OUT_OF_LINE, num_args=0) at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/core/lib/instrument.c:5734
#6  0x00000000720195cc in instrument_clean_call (drcontext=0xfffde8c6ec40, ilist=0xfffdea120be0, where=0xfffdea122650, reg_ptr=9) at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/clients/drcachesim/tracer/tracer.cpp:783
#7  0x000000007201b8a0 in event_app_instruction (drcontext=0xfffde8c6ec40, tag=0xfffff74f0c38, bb=0xfffdea120be0, instr=0xfffdea122650, for_trace=false, translating=true, user_data=0x48c90d70) at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/clients/drcachesim/tracer/tracer.cpp:1191
#8  0x0000000073004ff4 in drmgr_bb_event_do_insertion_per_instr (drcontext=0xfffde8c6ec40, tag=0xfffff74f0c38, bb=0xfffdea120be0, inst=0xfffdea122650, for_trace=0 '\000', translating=1 '\001', iter_insert=0xfffdeb2dd788, pair_data=0x48c90cf0, quartet_data=0x48c90d58) at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/ext/drmgr/drmgr.c:872
#9  0x00000000730054c0 in drmgr_bb_event_do_instrum_phases (drcontext=0xfffde8c6ec40, tag=0xfffff74f0c38, bb=0xfffdea120be0, for_trace=0 '\000', translating=1 '\001', pt=0x48c90460, local_info=0xfffdeb2dd750, pair_data=0x48c90cf0, quartet_data=0x48c90d58) at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/ext/drmgr/drmgr.c:977
#10 0x0000000073005cc8 in drmgr_bb_event (drcontext=0xfffde8c6ec40, tag=0xfffff74f0c38, bb=0xfffdea120be0, for_trace=0 '\000', translating=1 '\001') at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/ext/drmgr/drmgr.c:1141
#11 0x0000fffff7d6d038 in instrument_basic_block (dcontext=0xfffde8c6ec40, tag=0xfffff74f0c38 "\346\003", bb=0xfffdea120be0, for_trace=false, translating=true, emitflags=0xfffdeb2de5d4) at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/core/lib/instrument.c:1688
#12 0x0000fffff7e5ee00 in client_process_bb (dcontext=0xfffde8c6ec40, bb=0xfffdeb2de8a0) at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/core/arch/interp.c:2756
#13 0x0000fffff7e63728 in build_bb_ilist (dcontext=0xfffde8c6ec40, bb=0xfffdeb2de8a0) at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/core/arch/interp.c:4125
#14 0x0000fffff7e68d84 in recreate_bb_ilist (dcontext=0xfffde8c6ec40, pc=0xfffff74f0c38 "\346\003", pretend_pc=0xfffff74f0c38 "\346\003", stop_pc=0x0, flags=0, res_flags=0xfffdeb2deaec, res_exit_type=0x0, check_vm_area=true, mangle=true, vmlist_out=0x0, call_client=true, for_trace=false) at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/core/arch/interp.c:5277
#15 0x0000fffff7e69380 in recreate_fragment_ilist (dcontext=0xfffde8c6ec40, pc=0x0, f_res=0xfffdeb2debd8, alloc_res=0xfffdeb2debcf, mangle=true, call_client=true) at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/core/arch/interp.c:5370
#16 0x0000fffff7d81994 in recreate_app_state_internal (tdcontext=0xfffde8c6ec40, mcontext=0xfffdeb2df330, just_pc=false, owning_f=0x0, restore_memory=false) at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/core/translate.c:1303
#17 0x0000fffff7d83184 in recreate_app_state (tdcontext=0xfffde8c6ec40, mcontext=0xfffdeb2df330, restore_memory=false, f=0x0) at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/core/translate.c:1536
#18 0x0000fffff7d5679c in at_safe_spot (trec=0xfffde8c6f2e8, mc=0xfffdeb2df330, desired_state=THREAD_SYNCH_SUSPENDED_VALID_MCONTEXT_OR_NO_XFER) at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/core/synch.c:622
#19 0x0000fffff7d57e38 in synch_with_thread (id=230509, block=false, hold_initexit_lock=true, caller_state=THREAD_SYNCH_NONE, desired_state=THREAD_SYNCH_SUSPENDED_VALID_MCONTEXT_OR_NO_XFER, flags=2) at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/core/synch.c:1040
#20 0x0000fffff7d596f0 in synch_with_all_threads (desired_synch_state=THREAD_SYNCH_SUSPENDED_VALID_MCONTEXT_OR_NO_XFER, threads_out=0xfffff7fc7000 <flush_threads>, num_threads_out=0xfffff7fc7008 <flush_num_threads>, cur_state=THREAD_SYNCH_NO_LOCKS_NO_XFER, flags=2) at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/core/synch.c:1426
#21 0x0000fffff7be4b78 in flush_fragments_synchall_start (ignored=0xfffdeadb4a00, base=0x0, size=18446744073709551615, exec_invalid=false) at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/core/fragment.c:6013
#22 0x0000fffff7be702c in flush_fragments_synch_unlink_priv (dcontext=0xfffdeadb4a00, base=0x0, size=18446744073709551615, own_initexit_lock=false, exec_invalid=false, force_synchall=true) at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/core/fragment.c:6538
#23 0x0000fffff7be8350 in flush_fragments_in_region_start (dcontext=0xfffdeadb4a00, base=0x0, size=18446744073709551615, own_initexit_lock=false, free_futures=false, exec_invalid=false, force_synchall=true) at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/core/fragment.c:6839
#24 0x0000fffff7be8b3c in flush_fragments_from_region (dcontext=0xfffdeadb4a00, base=0x0, size=18446744073709551615, force_synchall=true) at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/core/fragment.c:6913
#25 0x0000fffff7d7b814 in dr_flush_region (start=0x0, size=18446744073709551615) at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/core/lib/instrument.c:7224
#26 0x000000007201c200 in hit_instr_count_threshold (next_pc=0xfffff29042d8 "s\002@\371\340\003\023\252\341\003\024\252\342\003\025\252v\325\a\224\200\002\025\213\364OC\251\366WB\251\367\v@\371\375{Ĩ\300\003_\326i\002@\371\310\002\027\213h\006") at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/clients/drcachesim/tracer/tracer.cpp:1404
#27 0x000000007201c314 in check_instr_count_threshold (incby=5, next_pc=0xfffff29042d8 "s\002@\371\340\003\023\252\341\003\024\252\342\003\025\252v\325\a\224\200\002\025\213\364OC\251\366WB\251\367\v@\371\375{Ĩ\300\003_\326i\002@\371\310\002\027\213h\006") at /usr/local/google/home/sharmaabhinav/dr/src/i1369-3/clients/drcachesim/tracer/tracer.cpp:1420
#28 0x0000000047713014 in ?? ()

I was able to resolve this by relaxing the assert at https://github.com/DynamoRIO/dynamorio/blob/5cbe8113c919b557f7bfc3daafffc9b1fef7938c/core/arch/clean_call_opt_shared.c#L216 to use TRY_EXCEPT_ALLOW_NO_DCONTEXT with GLOBAL_DCONTEXT instead of TRY_EXCEPT. This seems reasonable to me because the current thread's dcontext is no longer available from frame #21 onwards above (see the ignored arg).

After this patch, I'm still seeing the "meta-instr faulted" assert failure. This happens before dr_flush_region gets a chance to complete. The failing thread is one that DR did attempt to synch_with_thread with, but got a THREAD_SYNCH_RESULT_NOT_SAFE result in many attempts.

derekbruening commented 3 years ago

to use TRY_EXCEPT_ALLOW_NO_DCONTEXT with GLOBAL_DCONTEXT instead of TRY_EXCEPT. This seems reasonable to me because the current thread's dcontext is no longer available

The current thread's dcontext is always available through its TLS pointer (get_thread_private_dcontext()), unless you're in early process init or late process exit code.

abhinav92003 commented 3 years ago

The current thread's dcontext is always available through its TLS pointer (get_thread_private_dcontext()), unless you're in early process init or late process exit code.

TRY_EXCEPT_ALLOW_NO_DCONTEXT(GLOBAL_DCONTEXT,...) internally does the same -- if !dynamo_initialized, it uses global_try_except; otherwise it uses get_thread_private_dcontext()->try_except. This might be safer than using TRY_EXCEPT(get_thread_private_dcontext(), ...) directly.

After this patch, I'm still seeing the "meta-instr faulted" assert failure. This happens before dr_flush_region gets a chance to complete. The failing thread is one that DR did attempt to synch_with_thread with, but got a THREAD_SYNCH_RESULT_NOT_SAFE result in many attempts.

@derekbruening Any thoughts about this? This seems like another race to me. It occurs during handling of SIGPROF (same as the original failure), not during handling of the SIGUSR2 which is sent by DR during synch_with_thread. It seems that dr_flush_region isn't able to make it in time, right?

As I understand it, the sequence of events for the thread T that sees the "meta-instr faulted" assert failure is:

1. T enters master_signal_handler to handle SIGPROF.
2. DR sends SIGUSR2 to T during synch_with_thread.
3. As a result, T enters master_signal_handler to handle SIGUSR2.
4. T exits master_signal_handler for SIGUSR2.
5. synch_with_thread returns THREAD_SYNCH_RESULT_NOT_SAFE for the synch with T.
6. T resumes handling of SIGPROF and encounters the assert failure.

derekbruening commented 3 years ago

The current thread's dcontext is always available through its TLS pointer (get_thread_private_dcontext()), unless you're in early process init or late process exit code.

TRY_EXCEPT_ALLOW_NO_DCONTEXT(GLOBAL_DCONTEXT,...) internally does the same -- if !dynamo_initialized, it uses global_try_except; otherwise it uses get_thread_private_dcontext()->try_except. This might be safer than using TRY_EXCEPT(get_thread_private_dcontext(), ...) directly.

Yes, safer when in code that might be called in early init or late exit: but I don't think that clean call analysis is such code.

After this patch, I'm still seeing the "meta-instr faulted" assert failure. This happens before dr_flush_region gets a chance to complete. The failing thread is one that DR did attempt to synch_with_thread with, but got a THREAD_SYNCH_RESULT_NOT_SAFE result in many attempts.

@derekbruening Any thoughts about this?

Did you do the flush first and only when it's finished enable the second-phase instrumentation?

derekbruening commented 3 years ago

Did you do the flush first and only when it's finished enable the second-phase instrumentation?

Hmm, it does seem to be impossible with the current interface, doesn't it. The flushing is all geared toward the client being able to free bookkeeping. It may be that the only solution is to store translations, or to add a callback to the synchall flush called at the all-suspended point. Interesting that this has not come up before with phased instrumentation.

abhinav92003 commented 3 years ago

Did you do the flush first and only when it's finished enable the second-phase instrumentation?

I hadn't tried changing the current order, which is: first the change in instrumentation, then the flush. My thought was that there would be a race even with the order reversed (flush first, then change the instrumentation): since other threads are still running, some bbs may get the old instrumentation again. Nonetheless, I tried changing the order now. This let me get past the "meta-instr faulted" assert failure (I'm seeing some SIGSEGV now; not sure why, though I did confirm that dr_flush_region and dr_redirect_execution finish and d_r_dispatch gets invoked successfully). But I suspect the race is still there, so this design might not fix the issue.

It may be that the only solution is to store translations, or to add a callback to the synchall flush called at the all-suspended point.

Storing translations leads to more memory overhead (up to 20% per the documentation). Is that acceptable? I thought briefly about storing them only in debug builds, but I don't think that's a good idea because it would reduce the usefulness of debug runs (debug behavior would diverge from release).
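For concreteness, storing translations can be requested per-fragment by returning DR_EMIT_STORE_TRANSLATIONS from the bb event (sketch below; the instrumentation body is elided):

```c
/* Sketch: ask DR to store translation info at emit time so state
 * recreation reads the stored mapping instead of re-running the
 * (by now switched) instrumentation events. */
static dr_emit_flags_t
event_bb(void *drcontext, void *tag, instrlist_t *bb, bool for_trace,
         bool translating)
{
    /* ... instrumentation ... */
    return DR_EMIT_STORE_TRANSLATIONS;
}
```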

or to add a callback to the synchall flush called at the all-suspended point

And change the instrumentation in the callback, right? But won't DR continue to build new fragments while the flush and callback are in progress? Or can we pause all threads somehow until the callback completes?

derekbruening commented 3 years ago

But I suspect that the race is still there, so this design might not fix the issue.

Yes, that is what I meant by it being impossible.

Storing translations leads to more memory overhead (up to 20% per the documentation). Is that acceptable?

For most uses, like standalone tracing of SPEC or something, it's fine. For resource-budgeted uses in cloud/datacenter settings: maybe measure it on a large app to get a real number. The other concern is that storing translations is not used much, so there may be stability risks -- but it would be good long-term to find any such issues anyway.

or to add a callback to the synchall flush called at the all-suspended point

And change the instrumentation in the callback, right?

Yes.

But won't DR continue to build new fragments while the flush and callback are in progress? Or can we pause all threads somehow until the callback completes?

This "synchall" type of flush suspends all the threads.

abhinav92003 commented 3 years ago

I'm still seeing the "meta-instr faulted" assert failure. This happens before dr_flush_region gets a chance to complete.

Did you do the flush first and only when it's finished enable the second-phase instrumentation?

Hmm, it does seem to be impossible with the current interface, doesn't it.

I added an optional callback arg to dr_flush_region, which gets executed before the threads are resumed by flush_fragments_in_region_finish in flush_fragments_from_region. In the callback, I change the instrumentation before letting the threads go. Doing this resolved the "meta-instr faulted" assert failure.
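Roughly (a sketch; the routine name and exact signature below are my rendering for illustration, not necessarily the final interface):

```c
/* Sketch of the extended flush: the callback runs at the all-suspended
 * point, after the fragments are gone and before any thread resumes. */
typedef void (*flush_completion_cb_t)(void *user_data);
bool dr_flush_region_ex(app_pc start, size_t size,
                        flush_completion_cb_t flush_completion,
                        void *user_data);

static volatile bool tracing_enabled; /* hypothetical phase flag */

static void
switch_phase_while_suspended(void *user_data)
{
    /* No thread can build or execute fragments here, so swapping the
     * instrumentation phase cannot race with stale fragments. */
    tracing_enabled = true;
}

static void
hit_threshold(void)
{
    dr_flush_region_ex(NULL, (size_t)-1, switch_phase_while_suspended, NULL);
}
```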

I'm seeing some SIGSEGV now

Back to the seg fault that I also saw with the other workaround (reversing the flush and change-instrumentation order): I see that the thread that coordinates the flush enters d_r_dispatch successfully, as do various other threads. Some threads seg fault and don't reach d_r_dispatch.

I'm currently trying to understand what's different between the threads that segfault and the ones that don't. Initially, all threads are suspended in master_signal_handler_C for the suspend signal. But I see that the segfaulting ones have dcontext->next_tag in a different region (something like fffff74f0c5c, instead of 4a272400):

(gdb) info proc mappings
Mapped address spaces:

          Start Addr           End Addr       Size     Offset objfile
          0x4a1b0000         0x4a1c0000    0x10000        0x0 
...
          0x4a240000         0x4a270000    0x30000        0x0 
          0x4a270000         0x4a280000    0x10000        0x0 
          0x4a280000         0x4a290000    0x10000        0x0 
...
      0xfffff74d0000     0xfffff74e0000    0x10000        0x0 /usr/grte/v5/lib64/librt-2.27.so
      0xfffff74e0000     0xfffff7500000    0x20000        0x0 /usr/grte/v5/lib64/libpthread-2.27.so
      0xfffff7500000     0xfffff7510000    0x10000    0x10000 /usr/grte/v5/lib64/libpthread-2.27.so
abhinav92003 commented 3 years ago

I'll create a draft PR to show intermediate progress, so that it's easier to provide feedback.

derekbruening commented 3 years ago

@abhinav92003 -- are you sure that PR #4491 was related to this issue? Stepping back, this was filed for this assert happening on x86 in the drmgr-test. Your fix was for phased instrumentation doing a racy phase switch, right? I don't think that's related to drmgr-test on x86? Probably should re-open?

abhinav92003 commented 3 years ago

PR #4491 definitely fixed this assert on the proprietary app.

I didn't reproduce the issue on those tests, as (based on the comments) it seemed to occur very infrequently, whereas it occurred in 4 of 5 runs on the proprietary app. I guess it's possible that the tests hit the assert for a different reason than what PR #4491 fixes.

derekbruening commented 3 years ago

This same assert appeared on the cleancallsig test: https://github.com/DynamoRIO/dynamorio/pull/4703/checks?check_run_id=1803908764


2021-02-01T04:22:35.6531031Z 169: Test command: /home/runner/work/dynamorio/dynamorio/build_debug-internal-64/bin64/drrun "-s" "90" "-quiet" "-debug" "-killpg" "-stderr_mask" "0xC" "-dumpcore_mask" "0" "-code_api" "-c" "/home/runner/work/dynamorio/dynamorio/build_debug-internal-64/suite/tests/bin/libclient.strace.dll.so" "--" "/home/runner/work/dynamorio/dynamorio/build_debug-internal-64/suite/tests/bin/client.strace"
2021-02-01T04:22:35.6532869Z 169: Test timeout computed to be: 600
2021-02-01T04:22:35.6533257Z 165: all done
2021-02-01T04:22:35.6534642Z 165: <Application /home/runner/work/dynamorio/dynamorio/build_debug-internal-64/suite/tests/bin/client.cleancallsig (22421) DynamoRIO usage error : meta-instr faulted?  must set translation field and handle fault!>
2021-02-01T04:22:35.6536528Z 165: <Usage error: meta-instr faulted?  must set translation field and handle fault! (/home/runner/work/dynamorio/dynamorio/core/translate.c, line 997)
2021-02-01T04:22:35.6537436Z 165: version 8.0.18659, custom build
2021-02-01T04:22:35.6539652Z 165: -no_dynamic_options -client_lib '/home/runner/work/dynamorio/dynamorio/build_debug-internal-64/suite/tests/bin/libclient.cleancallsig.dll.so;0;' -client_lib64 '/home/runner/work/dynamorio/dynamorio/build_debug-internal-64/suite/tests/bin/libclient.cleancallsig.dll.so;0;' -code_api -stderr_mask 12 -stack_size 56K -signal_
2021-02-01T04:22:35.6541423Z 165: 0x00007f9bf39c9080 0x00007f9df3c0714d
2021-02-01T04:22:35.6541990Z 165: 0x00007f9bf39c90d0 0x00007f9df3d24cc0
2021-02-01T04:22:35.6542544Z 165: 0x00007f9bf39c91d0 0x00007f9df3d272e0
2021-02-01T04:22:35.6543118Z 165: 0x00007f9bf39ca4e0 0x00007f9df3d278a0
2021-02-01T04:22:35.6543701Z 165: 0x00007f9bf39ca520 0x00007f9df3cfdee6
2021-02-01T04:22:35.6544262Z 165: 0x00007f9bf39ca580 0x00007f9df3e0763c
2021-02-01T04:22:35.6544840Z 165: 0x00007f9bf39caec0 0x00007f9df3e0d388
2021-02-01T04:22:35.6545410Z 165: 0x00007f9bf39cb1d0 0x00007f9df3e10647
2021-02-01T04:22:35.6545969Z 165: 0x00007f9bf39cb370 0x00007f9df3dceea0
2021-02-01T04:22:35.6546529Z 165: 0x00007fff3cf3c040 0x000000000000000a
2021-02-01T04:22:35.6547647Z 165: /home/runner/work/dynamorio/dynamorio/build_debug-internal-64/lib64/debug/libdynamorio.so=0x00007f9df3b22000
2021-02-01T04:22:35.6549236Z 165: /home/runner/work/dynamorio/dynamorio/build_debug-internal-64/suite/tests/bin/libclient.cleancallsig.dll.so=0x0000000072000000>
derekbruening commented 3 years ago

Again on cleancallsig: https://github.com/DynamoRIO/dynamorio/pull/4757/checks?check_run_id=1990951037