
improve memory trace performance #2001

derekbruening opened this issue 7 years ago

derekbruening commented 7 years ago

This is something of a broad issue covering analyzing and improving the performance of the following:

Xref #1929: memtrace slowness due to unbuffered printf
Xref #790: online trace compression

I wanted a place to dump some of my notes on this. The #790 notes are somewhat duplicated in that issue:

memtrace_binary sample perf: 70x (SSD) to 180x (HDD); 4x-25x (ave 18x) w/o disk; w/ no PC 36x (SSD)

mcf test:

% cd /extsw/spec2006/mcf-test
% /usr/bin/time ./mcf_base.gcc-64bit inp.in
1.63user 0.03system 0:01.67elapsed 99%CPU (0avgtext+0avgdata 159668maxresident)k
% /usr/bin/time /work/dr/git/exports/bin64/drrun -c /work/dr/git/exports/samples/bin64/libmemtrace_x86_binary.so -- ./mcf_base.gcc-64bit inp.in
18.41user 10.45system 5:05.96elapsed 9%CPU (0avgtext+0avgdata 164712maxresident)k

=> clearly i/o bound: 9% CPU. produces a 41GB file with 1.3 billion memrefs. slowdown: 183x

More like 70x on laptop, and higher %CPU:

% /usr/bin/time  ./mcf_base.clang-64bit inp.in
2.00user 0.09system 0:02.09elapsed 99%CPU (0avgtext+0avgdata 159648maxresident)k
% /usr/bin/time ~/dr/git/build_x64_rel/bin64/drrun -c ~/dr/git/build_x64_rel/api/bin/libmemtrace_x86_binary.so -- ./mcf_base.clang-64bit inp.in
15.18user 21.44system 2:23.12elapsed 25%CPU (0avgtext+0avgdata 165128maxresident)k

Because it's got an SSD? Or also b/c CPU is slower (so higher CPU-to-disk ratio; also slower native)?

Disabling dr_write_file:

% /usr/bin/time /work/dr/git/exports/bin64/drrun -c /work/dr/git/exports/samples/bin64/libmemtrace_x86_binary.so -- ./mcf_base.gcc-64bit inp.in
9.28user 0.04system 0:09.41elapsed 99%CPU (0avgtext+0avgdata 164908maxresident)k

=> 5.6x. That's with PC, a read/write flag, size, and address per entry. It should be easy to improve by 2x by removing the read/write flag and size (statically recoverable) and only including the PC once per bb, or even less often.

But it's much worse on other SPEC benchmarks. A ref run of everything was taking too long, so these are just the benchmarks that had completed at the point I killed the run, 9 hours in:

% spec2k6cmp result/CINT2006.045.ref.txt result/res-native-allc-int
400.perlbench         0.00 (     0 /    207)
401.bzip2            16.28 (  5029 /    309)
403.gcc              22.69 (  4447 /    196)
429.mcf               4.27 (   994 /    233)
445.gobmk            18.95 (  5761 /    304)
456.hmmer            25.61 (  7095 /    277)
458.sjeng            18.37 (  5788 /    315)
462.libquantum        0.00 (     0 /    299)
464.h264ref           0.00 (     0 /    327)
471.omnetpp           0.00 (     0 /    271)
473.astar             0.00 (     0 /    267)
483.xalancbmk         0.00 (     0 /    150)
average              17.70

Qin: "if memtrace is 100x, if you can make the profile 1/5 the size, can hit 20x"

Can shrink some fields, but not to 1/5. Online gzip compression should easily give 1/5. Simple test: I see >20x gzip compression (though w/ naive starting format):

% /work/dr/git/exports/bin64/drrun -c /work/dr/git/build_x64_dbg/api/bin/libmemtrace_x86_binary.so -- ls
% ls -sh /work/dr/git/build_x64_dbg/api/bin/*.log
12M /work/dr/git/build_x64_dbg/api/bin/memtrace.ls.15577.0000.log
% gzip /work/dr/git/build_x64_dbg/api/bin/*.log
% ls -sh /work/dr/git/build_x64_dbg/api/bin/*.log*
528K /work/dr/git/build_x64_dbg/api/bin/memtrace.ls.15577.0000.log.gz

Removing the PC field:

% /usr/bin/time /work/dr/git/exports/bin64/drrun -c /work/dr/git/exports/samples/bin64/libmemtrace_x86_binary.so -- ./mcf_base.gcc-64bit inp.in
13.26user 11.07system 3:31.28elapsed 11%CPU (0avgtext+0avgdata 164684maxresident)k

=> still i/o bound: 11% CPU. produces a 31GB file. slowdown: 126x

On laptop:

% /usr/bin/time ~/dr/git/build_x64_rel/bin64/drrun -c ~/dr/git/build_x64_rel/api/bin/libmemtrace_x86_binary.so -- ./mcf_base.clang-64bit inp.in
13.14user 15.50system 1:15.94elapsed 37%CPU (0avgtext+0avgdata 164736maxresident)k

Up to 37%CPU, and 36x slowdown.

drcachesim tracer performance => 2x slower b/c of icache entries

Switching from the mcf test to the bzip2 test b/c it's a little closer to the 18x average slowdown for the memtrace sample when not writing to disk, and so is more representative:

native:

% /usr/bin/time ./bzip2_base.gcc-64bit dryer.jpg 2
1.95user 0.00system 0:01.96elapsed 100%CPU (0avgtext+0avgdata 17176maxresident)k

No disk writes at all:

% /usr/bin/time /work/dr/git/exports/bin64/drrun -c /work/dr/git/exports/samples/bin64/libmemtrace_x86_binary.so -- ./bzip2_base.gcc-64bit dryer.jpg 2
30.23user 0.01system 0:30.49elapsed 99%CPU (0avgtext+0avgdata 24300maxresident)k

That's 15.6x.

% /usr/bin/time /work/dr/git/build_x64_rel/bin64/drrun -c /work/dr/git/build_x64_rel/clients/lib64/release/libdrmemtrace.so -- ./bzip2_base.gcc-64bit dryer.jpg 2
60.32user 0.01system 1:00.34elapsed 99%CPU (0avgtext+0avgdata 25564maxresident)k

30.8x! 2x vs memtrace, b/c it's including icache info, presumably.

Currently trace_entry_t is 4+8 => 16 bytes b/c of alignment (we didn't pack it; was that b/c we only cared about 32-bit?).
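
For reference, a rough sketch of the layout change being measured below (field names and types are assumptions; the real trace_entry_t in the drcachesim sources may differ):

    #include <stdint.h>

    /* Unpacked: 2+2 bytes of type+size get padded to 8 so the 8-byte addr
     * field is aligned, giving 16 bytes per entry on 64-bit. */
    typedef struct {
        uint16_t type;
        uint16_t size;
        uint64_t addr; /* PC or data address */
    } entry_unpacked_t; /* sizeof == 16 */

    /* Packed: drop the padding so each entry is 12 bytes. */
    #pragma pack(push, 1)
    typedef struct {
        uint16_t type;
        uint16_t size;
        uint64_t addr;
    } entry_packed_t; /* sizeof == 12 */
    #pragma pack(pop)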

Packing trace_entry_t w/o any other changes to the struct:

55.70user 0.01system 0:55.72elapsed 99%CPU (0avgtext+0avgdata 25560maxresident)k

Also compressing size+type from 4 bytes into 2 bytes (might need an extra escape entry for memsz > 256):

54.11user 0.00system 0:54.12elapsed 100%CPU (0avgtext+0avgdata 25264maxresident)k
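
One way the 2-byte type+size encoding could look (the split below is an assumption; a size that does not fit in the size field would be signaled with a follow-on escape entry, as noted above):

    #include <stdint.h>

    /* Hypothetical 2-byte header: 8 bits of type, 8 bits of size, so any
     * memsz that does not fit needs an extra escape entry carrying the
     * real size. */
    typedef struct {
        uint8_t type;
        uint8_t size;
    } entry_header_t; /* sizeof == 2 */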

Also shrinking pc/addr field to 4 bytes:

53.57user 0.01system 0:53.59elapsed 99%CPU (0avgtext+0avgdata 25060maxresident)k

Also removing INSTR_BUNDLE (always has preceding abs pc so redundant):

48.63user 0.01system 0:48.99elapsed 99%CPU (0avgtext+0avgdata 24952maxresident)k

10.7x = Also removing all instr entries (thus there's no PC at all):

20.88user 0.01system 0:21.05elapsed 99%CPU (0avgtext+0avgdata 24896maxresident)k

Having the instr bundles and all the instr boundary info coming from the tracer seems worth it for online simulation, where having the simulator go dig it up from disassembly of a giant binary is going to be slower than the tracer providing it. But for offline, it does seem like we want to really optimize the tracing -- thus we need a split tracer!

14.3x = Adding back one instr entry per bb (1st instr in bb):

27.94user 0.00system 0:27.95elapsed 100%CPU (0avgtext+0avgdata 24768maxresident)k

Significant cost for instr-entry-per-bb: 33% more expensive. Maybe we can leverage traces to bring it down, having one instr entry per trace + a bit per cbr + an extra entry per mbr crossed?!?

#790: try online compression with zlib

With the private loader, we should be able to just use the zlib library directly.

% /usr/bin/time /work/dr/git/exports/bin64/drrun -c /work/dr/git/exports/samples/bin64/libmemtrace_x86_gz.so -- ./mcf_base.gcc-64bit inp.in
484.16user 1.91system 8:13.96elapsed 98%CPU (0avgtext+0avgdata 165552maxresident)k

It produces a 4GB file (vs 41GB uncompressed binary) but it is much slower! 295x vs native, 1.6x vs uncompressed. 98% CPU, too.

try zlib format instead of gz format, where we can set high speed => Z_BEST_SPEED is faster than uncompressed for HDD, but still not for SSD

Z_BEST_SPEED

Have to use the deflate interface directly and the zlib compression format. The gz interface uses the gzip compression format and apparently has no interface to set the speed vs size.
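
A minimal sketch of that approach, using zlib's deflate interface at Z_BEST_SPEED (buffer management and error handling omitted; the memtrace_x86_gz sample's actual code differs):

    #include <string.h>
    #include <zlib.h>

    /* Compress each filled trace buffer with deflate at Z_BEST_SPEED
     * (level 1), trading compression ratio for throughput. */
    static z_stream strm;

    static void
    compress_init(void)
    {
        memset(&strm, 0, sizeof(strm));
        deflateInit(&strm, Z_BEST_SPEED);
    }

    /* Returns the number of compressed bytes written to 'out'. */
    static size_t
    compress_buffer(void *buf, size_t len, void *out, size_t out_max)
    {
        strm.next_in = (Bytef *)buf;
        strm.avail_in = (uInt)len;
        strm.next_out = (Bytef *)out;
        strm.avail_out = (uInt)out_max;
        deflate(&strm, Z_SYNC_FLUSH); /* keep each buffer independently flushable */
        return out_max - strm.avail_out;
    }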

% /usr/bin/time /work/dr/git/exports/bin64/drrun -c /work/dr/git/exports/samples/bin64/libmemtrace_x86_gz.so -- ./mcf_base.gcc-64bit inp.in
181.29user 1.91system 3:10.70elapsed 96%CPU (0avgtext+0avgdata 165220maxresident)k

It produces a 4.5GB file and is significantly faster than uncompressed, but it's still 114x vs native.

On the laptop it makes a 4.3GB file (I should have saved it to see whether it's really different) and:

229.51user 2.56system 3:53.45elapsed 99%CPU (0avgtext+0avgdata 167372maxresident)k

So even Z_BEST_SPEED is slower than uncompressed on an SSD!

derekbruening commented 7 years ago

Xref #1738: optimize cache simulator

For drcachesim, the tracer's trace_entry_t has padding for 64-bit. We can eliminate the padding if having 4-byte-aligned 8-byte accesses is not a bigger perf loss than the gain from shrinking the memory and pipe footprint. This is almost certainly true for x86, but we should measure for ARM and AArch64.

derekbruening commented 7 years ago

Optimized trace format for offline traces with post-processing: 67x SSD, 42x SSD just data; 13.7x w/o disk; 10.5x just data

See full numbers in #1729 at https://github.com/DynamoRIO/dynamorio/issues/1729#issuecomment-290462969 and https://github.com/DynamoRIO/dynamorio/issues/1729#issuecomment-290464388

derekbruening commented 7 years ago

Xref #2299

zhaoqin commented 7 years ago

One idea to reduce memtrace overhead is to use asynchronous writes so that profile collection and trace dumping can proceed in parallel. The basic implementation is to create a sideline thread pool and a producer-consumer queue: the application threads produce the traces and put them into the queue, while the sideline threads consume them and write them to disk. A minimal sketch of such a queue appears after the list below.

There are several factors that may affect the performance, for example:

  1. The overall workload: the sideline threads and passing traces through the queue are additional work compared to the synchronous-write implementation. There would be little improvement if the original workload already exceeds what the system can handle, so this is less likely to help memtrace for multi-threaded applications.
  2. Lock contention on the queue: all application threads and sideline threads need exclusive access to the queue to produce and consume traces, which could become the performance bottleneck.
  3. The working set: when application threads and sideline threads generate and write out profiles, more memory buffers are in use, and a larger working set can cause more cache misses.
  4. The disk bandwidth: the sideline threads' performance depends on the disk bandwidth, and having too many sideline threads may even hurt performance due to resource contention.
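
A minimal sketch of the hand-off, assuming a fixed-capacity queue of filled trace buffers and plain pthread synchronization (names are illustrative, not from the actual implementation; inside a DR client one would likely use DR's own primitives such as dr_mutex_lock() and dr_create_client_thread() instead):

    #include <pthread.h>
    #include <stddef.h>

    /* Hypothetical fixed-capacity queue of filled trace buffers. */
    #define QUEUE_CAPACITY 64

    typedef struct {
        void *bufs[QUEUE_CAPACITY];
        size_t sizes[QUEUE_CAPACITY];
        int head, tail, count;
        pthread_mutex_t lock;
        pthread_cond_t not_empty, not_full;
    } trace_queue_t;

    /* Application thread: hand off a filled buffer instead of writing it. */
    static void
    queue_push(trace_queue_t *q, void *buf, size_t size)
    {
        pthread_mutex_lock(&q->lock);
        while (q->count == QUEUE_CAPACITY)
            pthread_cond_wait(&q->not_full, &q->lock);
        q->bufs[q->tail] = buf;
        q->sizes[q->tail] = size;
        q->tail = (q->tail + 1) % QUEUE_CAPACITY;
        q->count++;
        pthread_cond_signal(&q->not_empty);
        pthread_mutex_unlock(&q->lock);
    }

    /* Sideline thread: take the next buffer and write it to disk. */
    static void *
    queue_pop(trace_queue_t *q, size_t *size_out)
    {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0)
            pthread_cond_wait(&q->not_empty, &q->lock);
        void *buf = q->bufs[q->head];
        *size_out = q->sizes[q->head];
        q->head = (q->head + 1) % QUEUE_CAPACITY;
        q->count--;
        pthread_cond_signal(&q->not_full);
        pthread_mutex_unlock(&q->lock);
        return buf;
    }
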
zhaoqin commented 7 years ago

I created a micro-benchmark for the experiment, which basically creates a few threads and performs the same task in each thread:

    for (i = 0; i < num_threads; i++) {
        pthread_create(&thread[i], NULL, thread_func, NULL);
    }
    for (i = 0; i < num_threads; i++) {
        pthread_join(thread[i], NULL);
    }

Two hardware platforms were tested:

  1. Desktop: Xeon(R) CPU E5-1650 0 @ 3.20GHz, 6 cores with hyper-threading, 12288 KB cache, hard drive (/dev/sda): Timing cached reads: 23184 MB in 2.00 seconds = 11619.28 MB/sec; Timing buffered disk reads: 520 MB in 3.01 seconds = 172.81 MB/sec
  2. Laptop: Core(TM) i7-4712HQ CPU @ 2.30GHz, 4 cores with hyper-threading, 6144 KB cache, SSD: Timing cached reads: 20210 MB in 2.00 seconds = 10128.99 MB/sec; Timing buffered disk reads: 802 MB in 3.00 seconds = 266.92 MB/sec

Experimental results on Desktop: Execution time (seconds):

| app threads \ sideline threads | native | 0 (w/o write) | 0 (w/ write) | 1 | 2 | 4 | 8 | size |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.316 | 1.005 | 11.086 | 10.865 | 9.614 | 10.562 | 10.182 | 2.2GB |
| 2 | 0.325 | 1.033 | 31.843 | 31.663 | 31.469 | 30.895 | 31.362 | 4.4GB |
| 4 | 0.341 | 1.091 | 66.276 | 64.038 | 62.763 | 68.738 | 71.858 | 8.7GB |

Experimental results on Laptop: Execution time (seconds):

| app threads \ sideline threads | native | 0 (w/o write) | 0 (w/ write) | 1 | 2 | 4 | 8 | size |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.325 | 1.005 | 4.765 | 4.610 | 4.393 | 4.863 | 7.789 | 2.2GB |
| 2 | 0.336 | 1.024 | 9.018 | 13.671 | 9.929 | 11.146 | 13.484 | 4.4GB |
| 4 | 0.380 | 1.091 | 22.057 | 23.689 | 22.821 | 24.968 | 29.630 | 8.7GB |
| 8 | 0.473 | 1.667 | 49.481 | 60.944 | 62.887 | 60.068 | 59.672 | 18GB |

The data suggest that disk write bandwidth is the limiting factor in my experiment. It takes 10 sec to write 2.2GB and 30 sec to write 4.4GB on my desktop, i.e., ~200MB/sec, and 4 sec to write 2.2GB and 9 sec to write 4.4GB on my laptop, i.e., ~500MB/sec. A single thread writing the profile already reaches the bandwidth limit, so adding sideline threads does not really speed anything up, but it does make the performance vary a lot.

zhaoqin commented 7 years ago

Pull request #2319 contains the thread pool implementation.

derekbruening commented 6 years ago

Here are some simple opt ideas from prior discussions:

** TODO opt: lean proc for buffer-full clean call
like memtrace_x86.c does

** TODO opt: re-measure fault buffer-full handling

** TODO opt: avoid lea and just store base reg

Instead of:

mov %ecx -> 0x44(%eax)
=>
spill %edx
lea 0x44(%eax) -> %edx
store %edx -> buffer
restore %edx

Just do:

store %eax -> buffer

And reconstruct +0x44 in post-proc.

Keep lea for index reg:
mov %ecx -> 0x44(%eax,%ebx)
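
A rough sketch of this idea using the DR/drutil API (register reservation via drreg and writing of the rest of the entry are omitted; this is illustrative, not the actual tracer code):

    #include "dr_api.h"
    #include "drutil.h"

    /* Store just the base register of 'memref' into the trace buffer slot at
     * (reg_ptr + 0); the static displacement (e.g. +0x44 above) is recovered
     * in post-processing. Falls back to materializing the full effective
     * address when an index register is present. */
    static void
    insert_save_addr(void *drcontext, instrlist_t *ilist, instr_t *where,
                     opnd_t memref, reg_id_t reg_ptr, reg_id_t reg_scratch)
    {
        if (opnd_is_base_disp(memref) && opnd_get_index(memref) == DR_REG_NULL &&
            opnd_get_base(memref) != DR_REG_NULL) {
            /* Fast path: no lea; just store the base register's value. */
            instrlist_meta_preinsert(
                ilist, where,
                XINST_CREATE_store(drcontext, OPND_CREATE_MEMPTR(reg_ptr, 0),
                                   opnd_create_reg(opnd_get_base(memref))));
        } else {
            /* Keep the lea-style sequence: compute the effective address into
             * reg_scratch. If reg_ptr gets clobbered as the scratch here, it
             * must be re-loaded afterward. */
            drutil_insert_get_mem_addr(drcontext, ilist, where, memref,
                                       reg_scratch, reg_ptr);
            instrlist_meta_preinsert(
                ilist, where,
                XINST_CREATE_store(drcontext, OPND_CREATE_MEMPTR(reg_ptr, 0),
                                   opnd_create_reg(reg_scratch)));
        }
    }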

** TODO opt: single entry for consecutive same-base memrefs

mov %ecx -> 0x44(%eax)
mov %edx -> 0x48(%eax)
mov %ebx -> 0x4c(%eax)
=>
store %eax -> buffer
<reconstruct 2nd and 3rd entries>
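
A sketch of the corresponding post-processing step, assuming the static displacements for the bb's memrefs are available from disassembly (names here are hypothetical):

    #include <stdint.h>

    /* Static info recorded per memref in the bb (from disassembly). */
    typedef struct {
        int32_t disp;  /* e.g. 0x44, 0x48, 0x4c in the example above */
        uint16_t size; /* access size in bytes */
    } static_ref_t;

    /* Expand the single stored base value into the n memref addresses. */
    static void
    expand_same_base(uintptr_t base_val, const static_ref_t *refs, int n,
                     uintptr_t *addrs_out)
    {
        for (int i = 0; i < n; i++)
            addrs_out[i] = base_val + refs[i].disp;
    }
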
toshipiazza commented 6 years ago

I might add that I'd be interested in seeing a re-evaluation of the faulting-buffer performance. IIRC, we opted not to land drx_buf into the clients because at least one drcachesim test timed out with the faulting buffer implementation.

At the time I didn't really do any benchmarks, and the tests I did run were on a crummy VM. The single test case I evaluated was also heavily multithreaded, and I'm wondering whether potential over-locking like in #2114 could have indirectly contributed to the problem.

I have an implementation of the trace samples which use drx_buf here if anyone's interested.