DynamoRIO / dynamorio

Dynamic Instrumentation Tool Platform

Collect multiple trace samples from a single benchmark run #3995

Open prasun3 opened 4 years ago

prasun3 commented 4 years ago

Add a pointer to any prior users list discussion. Currently we can collect one trace sample during a benchmark run. We use '-trace_after_instrs' and '-exit_after_tracing' to select a trace point.

Is your feature request related to a problem? Please describe. We need to run the benchmark multiple times to collect traces from throughout the benchmark's execution. For a long-running benchmark, this can take a very long time.

Describe the solution you'd like Have a way to ‘trace x million insts every y million insts’. A more advanced method would be to have a config file listing the sampling windows.

Do you have any implementation in mind for this feature? No

Describe alternatives you've considered Currently we plan to run the benchmark multiple times, each with a different sampling point.

Additional context None.

johnfxgalea commented 4 years ago

Thank you for the request. I guess one quick solution is to use a script that runs the benchmark multiple times with updated param values for -trace_after_instrs. However, I do understand this is not ideal for long benchmarks. Would be happy to look at a PR if you wish to contribute the functionality.

derekbruening commented 4 years ago

Presumably the approach of recording a single long trace covering all desired execution windows and then splitting it up into pieces offline will not work due to such a trace simply being too large to easily store?

prasun3 commented 4 years ago

That is right. We see pretty large trace sizes and run into disk space issues.

derekbruening commented 4 years ago

Xref an existing feature request to use annotations added to the application to delineate phase regions and have the tracer recognize the annotations and enable/disable recording at the boundaries: #2478.

Also note that another method of creating multiple traces from one execution is to insert start/stop commands into the application. This is well-supported today, in particular with static linkage of the tracer into the application, and we have a number of regression tests of this. E.g., see https://github.com/DynamoRIO/dynamorio/blob/master/clients/drcachesim/tests/burst_static.cpp

Those two approaches both require modifying the application. This issue covers specifying boundaries for an unmodified application.

derekbruening commented 4 years ago

@prasun3 -- would your use case prefer to modify the application with annotations in the source code to delineate precise tracing regions? Or you would prefer this feature as filed to trace a certain number of instructions without regard to any corresponding application phases or code boundaries?

prasun3 commented 4 years ago

Or you would prefer this feature as filed to trace a certain number of instructions without regard to any corresponding application phases or code boundaries?

We would prefer this approach -- based on instruction count.

derekbruening commented 4 years ago

My proposal is to implement this using #4134. The existing -trace_after_instrs does a flush, which is expensive. If we want to swap a number of times it may be more efficient to use the multi-version support being added for #4134.

derekbruening commented 4 years ago

Xref #3107 as another proposal for delimiting tracing regions

5surim commented 3 years ago

@derekbruening Hello Derek,

I found this thread while looking for a way to do periodic tracing in a single run. Your new drcachesim options -trace_for_instrs and -retrace_every_instrs will be very useful in my case for periodic trace bursts of an unmodified application. The options have been added, but their behavior does not seem to be implemented yet. Are you still planning to add the implementation? (TODO: Implement these using the new drbbdup framework by repeatedly alternating among the cases.)

Best, Surim

derekbruening commented 3 years ago

Right, there were still a number of issues here and it ended up de-prioritized and so was not finished. First I'm going to dump my notes from a year ago:

TODO use i#4134 drbbdup to swap between instrumentation cases [6/11]

DONE app2app used to pass user_data w/ info on repstr to analysis => storing in TLS

CLOSED: [2020-04-19 Sun 22:40]

DONE drmgr_is_first_nonlabel_instr => John added drbbdup_is_first_nonlabel_instr()

CLOSED: [2020-04-19 Sun 22:40]

Could deduce it in orig analysis cb and pass it through. => John added drbbdup_is_first_nonlabel_instr()

TODO drwrap interactions: priority, and how do drwrap for one case but not another?

Another issue: drcachesim needs its insert instrumentation callback to go after drwrap's (see memtrace_pri comments). It looks like drbbdup hardcodes the insert event to be at DRMGR_PRIORITY_DRBBDUP = -1500, which is too early since DRMGR_PRIORITY_INSERT_DRWRAP is 500.

I'm not sure I can work around this one. The app2app I could move the user data into TLS, and the first_nonlabel I could store myself. But the priority does not seem solvable from the outside.

I guess this raises a larger issue: we want the drwrap clean call to only be inserted for one of our drbbdup cases. But with no control over drwrap, we can't arrange that, and drwrap is going to go insert its clean call at the start before the drbbdup case dispatch I would guess.

How to solve?

If drbbdup were integrated inside drmgr: would non-bbdup-aware users of drmgr like drwrap be invoked for the default case insertion and not the other cases? So the user can either have wrapping for just one case (has to be default)? Provide an option where user can pick one case, or all cases?

Or, drwrap has to turn its control model around and have the user call "do_drwrap_instru" from its insertion event, with drwrap not registering instru events (but still registering modload, etc.)?

Leaning toward the latter: add a drwrap global flag.

DONE let user provide TLS memop for encoding to avoid mem2mem move

CLOSED: [2020-04-19 Sun 23:20]

Looks like this today (here coming from non-TLS but I would change that):

load case encoding:
 +37   m4 @0x00007f4384a5ab20  48 bf e0 8c a1 04 44 mov    $0x00007f4404a18ce0 -> %rdi
                               7f 00 00
 +47   m4 @0x00007f4384a5b880  8b 3f                mov    (%rdi)[4byte] -> %edi
 +49   m4 @0x00007f4384a5a920  65 48 89 3c 25 00 01 mov    %rdi -> %gs:0x00000100[8byte]
                               00 00
 +58   m4 @0x00007f4384a5b600                       <label>
 +58   m4 @0x00007f4384a5ae00  48 b8 01 00 00 00 00 mov    $0x0000000000000001 -> %rax
                               00 00 00
 +68   m4 @0x00007f4384a5b6d0  65 48 39 04 25 00 01 cmp    %gs:0x00000100[8byte] %rax
                               00 00

Either I provide TLS opnd, or I can query the TLS offset drbbdup is using and I can write to it.

Wait, actually I can just write directly to it: assuming the slot isn't cleared or otherwise touched by drbbdup.

Maybe drbbdup can document that it guarantees to not write to the slot itself, so users know they can have insert_encode be a nop (why not let it be NULL?)

DONE barrier on load for cmp in every bb? or send signal to every thread: is only API for that dr_suspend_all_other_threads?? => atomic load, though NYI inside drbbdup for non-x86

CLOSED: [2020-04-19 Sun 23:20]

The issue for a globally changing case is memory visibility. If I make my encoding address a single global, I would need a barrier here for aarchxx. The alternative is to have the encoding be TLS and when the value changes go force all the threads to update their values -- although I'm not sure how to do that. If I could run code as each thread I could have each go do an acquire load. I can do that on UNIX by sending a signal. I suppose I could do NtSetContextThread on Windows twice to run my code sequence. I would want to add DR API support for those. dr_suspend_all_other_threads() could do it I guess by setting the mcontext and then resuming -- but there's no per-thread control point to restore w/o having a wait point and another suspend-the-world.

I don't know how the performance of a barrier at the top of every block compares to a heavyweight interrupt-all-threads-on-change approach. Certainly the barrier is much, much easier to implement (esp. when #4215 is done).

How do you envision a typical use case changing the encoding? Are all the use cases you're thinking of wanting local changes where only the current thread changes its own encoding independently of all others?

Should there then be an option for whether to use an acquire load or a regular load? (This is why I was proposing a register interface before.)

DONE calling drbbdup_register_case_encoding and passing the default encoding: it lets me make a duplicate, and calls both => John fixed

CLOSED: [2020-04-19 Sun 23:21]

DONE labels not preserved from analysis to insert phase => now preserved

CLOSED: [2020-04-19 Sun 22:45]

Used for elision, and same func used in raw2trace so don't want special iterator

TODO add for_trace, translating, and dr_emit_flags_t to cb's to avoid needing _ex versions later for complex users?
TODO runtime_case_opnd problems

1) For a global: can I use opnd_create_rel_addr() on all platforms? Client lib is reachable by default (for x86 2G reachability; how far can A64 reach? LDR only reaches +-1MB!!) We need drbbdup's XINST_CREATE_load to auto-convert to a pc-rel load on AArch64.

2) The size of drbbdup_options_t.runtime_case_opnd is not specified. For my use I'd like to make it std::atomic, but that won't work if my code writes just one byte while drbbdup reads 4 or 8 bytes. I also get runtime errors if I use a non-pointer size: ERROR: Could not find encoding for: mov 0x00007f006f43fce0[4byte] -> %rax

TODO redundant spill-restore code

After dispatching to BBDUP_MODE_COUNT, the flags are restored and immediately re-spilled before the drx_insert_counter_update():

 +44   m4 @0x00007fe75f0a6880  48 39 05 91 79 0d 80 cmp    0x00007fe7df17a298[8byte] %rax
 +51   m4 @0x00007fe75f0a5920  0f 85 48 00 00 00    jnz    @0x00007fe75f0a6348[8byte]
 +57   m4 @0x00007fe75f0a5e00  65 48 a1 10 01 00 00 mov    %gs:0x00000110[8byte] -> %rax
                               00 00 00 00
 +68   m4 @0x00007fe75f0a66d0  04 7f                add    $0x7f %al -> %al
 +70   m4 @0x00007fe75f0a5ca0  9e                   sahf   %ah
 +71   m4 @0x00007fe75f0a5e68  65 48 a1 08 01 00 00 mov    %gs:0x00000108[8byte] -> %rax
                               00 00 00 00
 +82   m4 @0x00007fe75f0a6750  65 48 a3 e8 00 00 00 mov    %rax -> %gs:0x000000e8[8byte]
                               00 00 00 00
 +93   m4 @0x00007fe75f0a5ee8  9f                   lahf    -> %ah
 +94   m4 @0x00007fe75f0a6050  0f 90 c0             seto    -> %al
 +97   m4 @0x00007fe75f0a61c8  48 83 05 e0 13 fc 7f add    $0x0000000000000002 <rel> 0x00007fe7df063ce8[8byte] -> <rel> 0x00007fe7df063ce8[8byte]
                               02

It's b/c drbbdup doesn't use drreg for that flags restore.

TODO can't use for existing code w/o aarch support!

Need barrier to read encoding for aarch

TODO use drbbdup for regular instru too? is there overhead w/ only 1 case?

Can tell it not to duplicate at all on a per-bb basis.

TODO i#4226: problem: drbbdup has fragment_deleted event!

ext/drbbdup/drbbdup.c: dr_register_delete_event(deleted_frag);

TODO re-implement delayed tracing using drbbdup too

TODO function tracing: needs drwrap changes discussed above

For function tracing, we need to invoke drwrap only for the full tracing case and not for the instruction counting case. The plan is to add a drwrap mode where drwrap does not use its own insertion event and instead the user invokes drwrap from its insertion event.

TODO AArch64: drbbdup needs to handle reachability and encoding issues with loading the global case value into a register

TODO Set opts.dup_limit to 1 as noted by John

TODO Set event_bb_analyze_orig and event_bb_analyze_orig_cleanup to NULL because they do nothing. Same goes for event_bb_retrieve_mode. (Again as noted by John.)

TODO change -max_trace_size to swap to instr-count instead of continuing to trace when limit is reached?

TODO support a config file for variable lengths of each burst?

TODO measure perf: ensure no overhead w/ only 1 case

TODO measure perf vs flush for delay

derekbruening commented 3 years ago

So I started on this a year ago and as the two commit messages show I started on the first step of refactoring the existing delayed tracing to use drbbdup. At the time, drbbdup was just being developed, and I shared the branch https://github.com/DynamoRIO/dynamorio/tree/i3995-multi-burst for discussions with the drbbdup author @johnfxgalea who FTR had these comments which I do not think were acted upon yet on my side:

Thanks, I had a look and overall the implementation seems to be good.

Just some minor issues:

1) The dup limit is the number of additional cases, excluding the default case. Therefore, you could have set this to 1. Essentially, you defined an additional slot for nothing. However, this is not a big deal as drbbdup does not produce a wasted duplication of the basic block because it only acts on defined cases.

opts.dup_limit = 2;

2) You could have set event_bb_analyze_orig and event_bb_analyze_orig_cleanup to NULL because they do nothing. Same goes for event_bb_retrieve_mode

3) I see a lot of changes from using "instr" to "where" during the insertion stage, due to no fault of your own but as a requirement stemming from drbbdup. I looked at the docs and they don't seem to motivate the reasons behind this requirement. Essentially, drbbdup cannot duplicate a syscall/cti instruction but must leave such instructions at the end of the basic block. In order to provide different case instrumentation for these instructions, instrumentation must be inserted with respect to "where". I'll update the docs.

After this the feature was de-prioritized and other work took precedence.

For the refactoring step of having the existing delay use drbbdup:

And then:

Un-assigning to me for now as I am not sure when I would have time for it. If someone else wants to pick it up that would be appreciated.

prasun3 commented 3 years ago

Reading some of this had me wondering: with the bbdup mechanism, would it be possible to implement switching between different tracers too? For example, we could collect an "L0_filter" trace (which could be used for cache warmup) and then switch to a full instruction trace?

johnfxgalea commented 3 years ago

@prasun3 That is a pretty good idea! In theory, it should be possible but not sure about the technical effort required. Personally, I created drbbdup for research on taint analysis, so much of my investigation on the approach focuses on that application.

I'd be happy to take the initiative in taking on this PR, but all my time available for DynamoRIO maintenance is being spent on drreg atm.

L-Chambers commented 3 years ago

Is there any sort of timeline for the DrCacheSim -trace_for_instrs and -retrace_every_instrs flags to be implemented?

Thank you.

derekbruening commented 3 years ago

Is there any sort of timeline for the DrCacheSim -trace_for_instrs and -retrace_every_instrs flags to be implemented?

Please go ahead and pick up the branch where it was left if you are interested in this feature -- as the comments above note it is not clear someone else will have time to take this on.

derekbruening commented 2 years ago

We have renewed interest in this and may revive it: first checking whether anyone else has put work into this that was not pushed to a branch? I didn't see any pushes beyond my initial work from before.

We're looking at two features:

  1. Trace for N instructions every M instructions

  2. Specify many windows via precise start and length points. The first feature could use this mechanism but this likely requires a config file and it might be nice to have convenience parameters that don't need a separate file.

One final idea: for long periods of no tracing: detach to native and use PMU to count instrs for re-attach, for very low overhead when not tracing.

derekbruening commented 2 years ago

It looks like drbbdup doesn't build for AArchXX today.

From above:

For AArch64, drbbdup needs to handle reachability and encoding issues with loading the global case value into a register. Probably this was done already a year ago? @johnfxgalea may know offhand.

No, this was not done. And the compare also needs to be fixed up, so we append this to the list from above:

TODO runtime_case_opnd problems

3) drbbdup_insert_compare_encoding() needs to use an immediate if possible regardless of the pointer size but based on the encoding value; if the value is large, it needs to use a 2nd scratch reg as AArchXX has no compare-with-memory opcode.

derekbruening commented 2 years ago

A problem I hit when integrating drbbdup into dr$sim is with -satisfy_w_xor_x which we need for our internal uses:

<Application /home/bruening/dr/git/build_x64_dbg_tests/suite/tests/bin/simple_app (453153) DynamoRIO usage error : reachable executable client memory is not supported with -satisfy_w_xor_x>
#0  report_dynamorio_problem (dcontext=0x0, dumpcore_flag=16, exception_addr=0x0, report_ebp=0x0, fmt=0x7ffff7ec58ff "Usage error: %s (%s, line %d)")
    at /home/bruening/dr/git/src/core/utils.c:2107
#1  0x00007ffff7c5012a in external_error (file=0x7ffff7eeeb48 "/home/bruening/dr/git/src/core/lib/instrument.c", line=2872, 
    msg=0x7ffff7eef710 "reachable executable client memory is not supported with -satisfy_w_xor_x") at /home/bruening/dr/git/src/core/utils.c:201
#2  0x00007ffff7d63e09 in dr_nonheap_alloc (size=4096, prot=7) at /home/bruening/dr/git/src/core/lib/instrument.c:2872
#3  0x00007fffb3caf003 in init_fp_cache (clean_call_func=0x7fffb3caeb37 <drbbdup_handle_new_case>)
    at /home/bruening/dr/git/src/ext/drbbdup/drbbdup.c:1564
#4  0x00007fffb3caf926 in drbbdup_init (ops_in=0x7fffffffbd60) at /home/bruening/dr/git/src/ext/drbbdup/drbbdup.c:1793
#5  0x00007fffb3b76c09 in instrumentation_init () at /home/bruening/dr/git/src/clients/drcachesim/tracer/tracer.cpp:761
#6  0x00007fffb3b7c86d in drmemtrace_client_main (id=0, argc=2, argv=0x7ffd73b9f9e8)
    at /home/bruening/dr/git/src/clients/drcachesim/tracer/tracer.cpp:2289
#7  0x00007fffb3b7cca8 in dr_client_main (id=0, argc=2, argv=0x7ffd73b9f9e8) at /home/bruening/dr/git/src/clients/drcachesim/tracer/tracer.cpp:2361
#8  0x00007ffff7d5ed82 in instrument_init () at /home/bruening/dr/git/src/core/lib/instrument.c:766

The custom code cache used by drbbdup is only needed for dynamic handling of cases. @johnfxgalea is there any way to know at drbbdup_init time whether there will ever be dynamic handling or would I need to add a new flag? dr$sim does not use dynamic handling and the easiest solution here is to not set up this cache for dr$sim's use of drbbdup.

johnfxgalea commented 2 years ago

The current dynamic handling flag is maintained by drbbdup managers (one per basic block) and set by the user upon bb construction. There is no universal flag which you can use at the moment that denotes whether dynamic handling would ever be used.

I think there are two options: 1) Introduce a universal flag, something like is_dynamic_gen_never, and then add assertions to ensure that drbbdup managers' dynamic handling flags conform with this universal flag. One does not want is_dynamic_gen_never to be set to true and then have a manager's enable_dynamic_handling be true.

2) Create the code cache lazily, upon first sight of enable_dynamic_handling being true. (Be careful of races upon creation, i.e., use a lock.) In this fashion, since dr$sim does not use dynamic generation, the code cache would never be created.

johnfxgalea commented 2 years ago

Made a quick PR (https://github.com/DynamoRIO/dynamorio/pull/5358) that takes the lazy approach.

derekbruening commented 2 years ago

@johnfxgalea -- see also questions at https://github.com/DynamoRIO/dynamorio/issues/5356#issuecomment-1041126899 about more drbbdup changes needed

derekbruening commented 2 years ago

The branch https://github.com/DynamoRIO/dynamorio/tree/i3995-multi-burst has been subsumed by PR #5393 so I am deleting it now.

derekbruening commented 2 years ago

We have a number of follow-up cleanup/extension items. Perhaps some should be split into their own issues:

derekbruening commented 2 years ago

Pasting some design notes for this feature. Maybe this could turn into a design doc on the web page:

Design Point: Separate Traces v. Merged-with-Markers

Focusing on a use case of a series of 50 10-billion-instruction traces for a SPEC benchmark, there are two main ways to store them. We could create 50 independent sets of trace files, each with its own metadata and separate set of sharded data files. A simulator could either simulate all 50 separately and aggregate just the resulting statistics, or a single instance of a simulator could fast-forward between each sample to maintain architectural state and simulate the full execution that way.

The alternative is to store all the data in one set of data files, with metadata markers inserted to indicate the division points between the samples. This doesn’t support the separate simulation model up front, though we could provide an iterator interface that skips ahead to a target window and stops at the end of that window (or the simulator could be modified to stop when it sees a sample separation marker). However, this will not be as efficient for parallel simulation with separate simulator instances for each window, since the skipping ahead will take some time. This arrangement does more easily support the fast-forward single-simulator-instance approach, and more readily fits with upstream online simulation.

In terms of implementation, there are several factors to consider here.

Separate raw files

If we want separate final traces, at first the simplest approach is to produce a separate set of raw files for each tracing window. These would be post-processed separately and independently.

However, implementing this split does not fit well with the current interfaces. To work with other filesystems, we have separated out the i/o, and in particular directory creation.

For upstream use with files on the local disk, we could add creation of a new directory (and duplication of the module file) for each window by the tracing thread that hits the end-of-window trigger. The other threads would each create a new output raw file each time they transitioned to a new window (see also the Proposal A discussion below).

Splitting during raw2trace

Alternatively, we could keep a single raw file for each thread and split it up into per-window final trace files during postprocessing by the raw2trace tool. We would use markers inserted at the window transition points to identify where to separate.

raw2trace would need to create a new output dir and duplicate the trace headers and module file. Like for separate raw files, this goes against the current i/o separation where today we pass in a list of all the input and output files up front and raw2trace never opens a file on its own, to better support proprietary filesystems with upstream code.

Another concern here is hitting file size limits with a single raw file across many sample traces. For the example above of 50 10-billion-instruction traces, if we assume an average of 2 dynamic instructions per raw entry, each window might contain 5GB of data, reaching 250GB for all 50. Furthermore, the final trace is even larger.

The file size problem gets worse if we use a constant sampling interval across SPEC2017. Some SPEC2017 benchmarks have many more instructions than others. The bwaves_s benchmark has 382 trillion instructions, so a constant interval might result in it having 50x more data than other benchmarks, exceeding the file size limit. A constant number of samples is preferred for this reason.

Splitting during analysis

Given the complexities of splitting in earlier steps, and given that we may want to use a single simulator instance to process all of the sample traces, and given that for upstream online analysis we will likely also have a single simulator instance: perhaps we should not try to split the samples and instead treat the 50 samples as a single trace with internal markers indicating the window division.

Online and offline analyzers can use window ID markers to fast-forward and align each thread to the next window. Maybe the existing serial iterator can have built-in support for aligning the windows.

If single-file final traces exist, we would need to update all our existing analyzers to handle the gaps in the traces: resetting state for function and callstack trackers, and keeping per-window numbers for statistics gatherers.

We can also create an analyzer that splits a final trace up if we do want separate traces.

Decision: Split during analysis

Separate files seem to be the most flexible and useful setup for our expected use cases, in particular parallel simulation. But given that separating early in the pipeline is complex, we'll split in the analysis phase, initially with a manual tool, since we do not plan to have automatically gathered multi-window traces.

We’ll update some simple inspection and sanity tools (view, basic_counts, and invariant_checker) to handle and visualize windows, but we’ll assume that trace windows will be split before being analyzed by any more complex analysis tools. For online traces we will probably stick with multi-window-at-once.

We’ll create a tool to manually split up multi-window trace files.

Design Point: Continuous Control v. Re-Attach

One method of obtaining multiple traces is to repeat today’s bursts over and over, with a full detach from the application after each trace. However, each attach point is expensive, with long periods of profiling and code cache pre-population. While a scheme of sharing the profiling and perhaps code cache could be developed while keeping a full detach, a simpler approach is to remain in control but switch from tracing to instruction counting in between tracing windows. Instruction counting is needed to determine where to start the next window in any case.

Instruction counting through instrumentation is not cheap, incurring perhaps a 1.5x slowdown. Compared to the 50x overhead while tracing, however, it is acceptable. If lower overhead is desired in the future, a scheme using a full detach and hardware performance counters to count instructions can be investigated. The decision for the initial implementation, however, is to use the simpler alternating tracing and counting instrumentation windows.

Design Point: Instrumentation Dispatch v. Flushing

As the prior section concluded, we plan to alternate between tracing and instruction counting. There are two main approaches to varying instrumentation during execution: inserting all cases up front with a dispatch to the desired current scheme, and replacing instrumentation by flushing the system’s software code cache when changing schemes.

Flushing is an expensive process, and can be fragile, as the lower-overhead forms of flushing open up race conditions between threads executing the old and new code cache contents. This complexity is one reason we are instead using a dispatch approach for our initial implementation.

With dispatch, we insert both tracing and counting instrumentation for each block in the software code cache. Dispatch code at the top of the block selects which scheme to use. The current mode, either tracing or counting, is stored in memory and needs to be synchronized across all threads.

The simplest method of synchronizing the instrumentation mode is to store it in a global variable, have the dispatch code use a load-acquire to read it, and modify it with a store-release. There is overhead to a load-acquire at the top of every block, but experimentation shows that it is reasonable compared to the overhead of the instrumentation itself even for instruction counting mode, and its simplicity makes it our choice for the initial implementation.

The mechanisms for creating the dispatch and the separate copies for each mode are provided for us by the upstream drbbdup library. This library was, however, missing some key pieces we had to add.

derekbruening commented 2 years ago

Handling Phase Transitions

For a normal memtrace burst, we completely detach from the server at the end of our desired trace duration. This detach process synchronizes with every application thread.

For multi-window traces, we are using multi-case dispatched instrumentation where we change the instrumentation type for each window. There is no detach to wake up all the threads and have them flush their trace buffers, and we are deliberately trying to avoid a global synchronization point. Yet we would prefer perfect transitions between windows, whether that means separate raw files or accurately placed markers.

Key step: Add end-of-block phase change check

We do flush prior to a syscall, so a thread at a kernel wait point should have an empty buffer and not be a concern.

The main concern is a thread not in a wait state that happens to not be scheduled consistently for a long time and so does not fill up its buffer until well after the window ends.

We can augment the current end-of-block flush check which today looks for the buffer being full. We can add a check for the prior window having ended, by having a global window ordinal and storing its value per thread at the start of filling up a new buffer. (This is better than simply checking the drbbdup mode value for being in non-tracing mode as that will not catch a double mode change.) If the prior window has ended, we can flush the buffer, or simply add a marker, depending on the scheme (see below).

A thread that receives a signal mid-block (it would have to be a synchronous signal as DR waits until the end of the block for asynchronous) will skip its end-of-block checks and redirect to run the app's signal handler: but it would hit the checks for the first block of the handler.

The worst case inaccuracy here is a thread that starts writing in window N but ends up unscheduled until a much later window M. But at most one basic block's worth of trace entries will be placed into window N even though it executed later. Thus we have "basic block accuracy", which is pretty good, as typically a basic block contains only a few instructions.

Proposal A: Separate raw files split at flush time

If we're splitting raw files (see above), we would use the end-of-block window-change flush to emit a thread exit and create a new file. In post-processing, we'd add any missing thread exits for threads that don't have them, to cover waiting threads that never reached a flush.

As discussed above, the trigger thread would create a new directory for each window. A just-finished buffer is written to the directory corresponding to the window of its start point.

A thread that is unscheduled for a long time could have a nearly-full buffer that is not written out until many windows later, but it would be written to the old directory for the old window. The next buffer would go to a new file in the new window, with no files in the in-between window directories.

(Originally we thought this scheme would have buffer-level inaccuracy, and talked about using timestamps at the start and end of each buffer to detect it: but that would only be the case if the buffer were written out to the current window's directory.)

Proposal B: Label buffers with owning window

If we add the window ordinal to every buffer header, we can identify which window they belong to, and avoid the need to separate raw files. A window-end flush ensures a buffer belongs solely to the window identified in its header; the next buffer will have the new window value.

This scheme can be used with file splitting during raw2trace, or waiting until analysis. Each thread has one raw file which contains all windows during the execution.

Proposal C: Trigger thread identifies buffer transition point of the other threads

For this proposal, the thread triggering the end of the window walks the other threads and identifies the phase transition point inside the buffer, telling the post-processor where to split them.

I considered having the triggerer also flush the buffers, but that is challenging due to a race with the owner also flushing. Plus, it still requires post-processing help to identify the precise point for splitting the buffer (without synchronization the triggerer can only get close).

To avoid barriers on common case trace buffer writes, we use a lazy scheme where the triggerer does not modify the trace buffers themselves, but instead marks which portion has been written using a separate variable never accessed in a fastpath.

Implementation:

This scheme ends up with block-level accuracy since the trigger thread's marked transition point must be adjusted to essentially a block boundary in post-processing. Thus, it does not seem any better than the other schemes, and it is more complex.

Online Traces

It makes sense for offline to treat each window trace as separate and simulate them separately (though aggregating the results to cover the whole application).

But for online: we have the same instance of the simulator or analysis tool throughout the whole application run. It will get confused if it throws away thread bookkeeping on a thread exit for a window.

Either we have a window-controller simulator that spins up and down a new instance of the real target simulator/tool for each window, or we introduce new "end phase/start phase" markers. If we have split offline traces, those markers would exist only for online use, which does not sound appealing. Simulators/tools would need special handling for them: reporting statistics for the phase while continuing to aggregate for a multi-phase report, or something similar.

We might want combined files for offline too, as discussed above. That would unify the two, which is appealing.

derekbruening commented 2 years ago

For x86 we also want to eliminate counting of non-fetched rep string instructions, which make the instruction counts used for the windows and gaps not match what the PMU reports: this is #4948.