derekbruening opened this issue 7 years ago
Xref #1738: optimize cache simulator
For drcachesim, the tracer's trace_entry_t has padding for 64-bit. We can eliminate the padding if the cost of 4-byte-aligned 8-byte accesses is smaller than the gain from shrinking the memory and pipe footprint. This is almost certainly true for x86, but we should measure for ARM and AArch64; a measurement sketch follows the xrefs below.
See full numbers in #1729 at https://github.com/DynamoRIO/dynamorio/issues/1729#issuecomment-290462969 and https://github.com/DynamoRIO/dynamorio/issues/1729#issuecomment-290464388
Xref #2299
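For the ARM/AArch64 question, a standalone micro-benchmark along these lines could compare 8-byte stores at a 16-byte stride (always 8-byte aligned) vs a 12-byte stride (every other store only 4-byte aligned, as with a packed 12-byte entry). This is purely an illustrative sketch; none of these names come from the tracer:

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERS (1u << 26)
#define SLOTS 1024

static char buf[SLOTS * 16 + 8] __attribute__((aligned(64)));

static double
time_stores(size_t stride)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint32_t i = 0; i < ITERS; i++) {
        /* volatile so the compiler can't elide the stores */
        *(volatile uint64_t *)(buf + (i % SLOTS) * stride) = i;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int
main(void)
{
    printf("16-byte entries (aligned):   %.3fs\n", time_stores(16));
    printf("12-byte entries (unaligned): %.3fs\n", time_stores(12));
    return 0;
}
```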
One idea to reduce memtrace overhead is to use asynchronous writes so that profile collection and trace dumping can proceed in parallel. The basic implementation is to create a sideline thread pool and a producer-consumer queue: the application threads produce trace buffers and put them into the queue, while the sideline threads consume them and write them to disk.
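A minimal sketch of that scheme with plain pthreads (illustrative names only; PR #2319 is the real implementation):

```c
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

#define QUEUE_CAP 16

typedef struct {
    void *buf[QUEUE_CAP];       /* filled trace buffers */
    size_t size[QUEUE_CAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} trace_queue_t;

static int trace_fd;            /* output file */

/* Called by an app thread when its trace buffer fills up. */
static void
queue_push(trace_queue_t *q, void *buf, size_t size)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAP)   /* block if the queue is full */
        pthread_cond_wait(&q->not_full, &q->lock);
    q->buf[q->tail] = buf;
    q->size[q->tail] = size;
    q->tail = (q->tail + 1) % QUEUE_CAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Each sideline thread runs this loop, consuming buffers. */
static void *
sideline_writer(void *arg)
{
    trace_queue_t *q = (trace_queue_t *)arg;
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0)
            pthread_cond_wait(&q->not_empty, &q->lock);
        void *buf = q->buf[q->head];
        size_t size = q->size[q->head];
        q->head = (q->head + 1) % QUEUE_CAP;
        q->count--;
        pthread_cond_signal(&q->not_full);
        pthread_mutex_unlock(&q->lock);
        write(trace_fd, buf, size);  /* the actual disk write */
    }
    return NULL;
}
```

With a queue like this, the app threads only pay for an enqueue when a buffer fills; whether that wins depends on the disk bandwidth, as the numbers below show.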
There are several factors that may affect the performance, for example:
I created a micro-benchmark for the experiment, which creates a few threads and performs the same task in each thread:
```c
#include <pthread.h>

/* Launch the workers, then wait for them all to finish. */
for (i = 0; i < num_threads; i++) {
    pthread_create(&thread[i], NULL, thread_func, NULL);
}
for (i = 0; i < num_threads; i++) {
    pthread_join(thread[i], NULL);
}
```
Two hardware platforms were tested:
2. Laptop: Core(TM) i7-4712HQ CPU @ 2.30GHz, 4 cores with hyper-threading, 6144 KB cache, SSD. Timing cached reads: 20210 MB in 2.00 seconds = 10128.99 MB/sec; timing buffered disk reads: 802 MB in 3.00 seconds = 266.92 MB/sec.
Experimental results on Desktop: Execution time (seconds):
app threads \ sideline threads | native | 0 (w/o write) | 0 (w/ write) | 1 | 2 | 4 | 8 | trace size |
---|---|---|---|---|---|---|---|---|
1 | 0.316 | 1.005 | 11.086 | 10.865 | 9.614 | 10.562 | 10.182 | 2.2GB |
2 | 0.325 | 1.033 | 31.843 | 31.663 | 31.469 | 30.895 | 31.362 | 4.4GB |
4 | 0.341 | 1.091 | 66.276 | 64.038 | 62.763 | 68.738 | 71.858 | 8.7GB |
Experimental results on Laptop: Execution time (seconds):
app threads \ sideline threads | native | 0 (w/o write) | 0 (w/ write) | 1 | 2 | 4 | 8 | trace size |
---|---|---|---|---|---|---|---|---|
1 | 0.325 | 1.005 | 4.765 | 4.610 | 4.393 | 4.863 | 7.789 | 2.2GB |
2 | 0.336 | 1.024 | 9.018 | 13.671 | 9.929 | 11.146 | 13.484 | 4.4GB |
4 | 0.380 | 1.091 | 22.057 | 23.689 | 22.821 | 24.968 | 29.630 | 8.7GB |
8 | 0.473 | 1.667 | 49.481 | 60.944 | 62.887 | 60.068 | 59.672 | 18GB |
The data suggest that disk write bandwidth is the limiting factor in my experiment. It takes 10 sec to write 2.2GB and 30 sec to write 4.4GB on my desktop, i.e., ~200MB/sec, and 4 sec to write 2.2GB and 9 sec to write 4.4GB on my laptop, i.e., ~500MB/sec. A single profile-writing thread already reaches the bandwidth limit, so more sideline threads do not really speed anything up, but they do make the performance vary a lot.
Pull request #2319 contains the thread pool implementation.
Here are some simple opt ideas from prior discussions:
** TODO opt: lean proc for buffer-full clean call
like memtrace_x86.c does
** TODO opt: re-measure fault buffer-full handling
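For reference, the general fault-based scheme (which I believe is what drx_buf's faulting buffer does) removes the buffer-full check entirely: a guard page after the buffer makes the first overflowing store fault, and the fault handler flushes. A rough POSIX-level sketch of the idea, not DR's actual mechanism:

```c
#include <signal.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096
#define BUF_PAGES 8

static char *trace_buf;   /* BUF_PAGES writable pages + 1 guard page */

static void
handle_fault(int sig, siginfo_t *info, void *ucxt)
{
    /* Sketch only: a real handler would check that info->si_addr lies
     * in the guard page, flush trace_buf to disk, and redirect the
     * buffer pointer (via the ucontext) back to trace_buf so the
     * faulting store re-executes at the start of the buffer. */
}

static void
setup_buffer(void)
{
    trace_buf = mmap(NULL, (BUF_PAGES + 1) * PAGE_SIZE,
                     PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    /* Guard page: the first store past the buffer end faults, so the
     * instrumented code needs no explicit buffer-full check. */
    mprotect(trace_buf + BUF_PAGES * PAGE_SIZE, PAGE_SIZE, PROT_NONE);

    struct sigaction act;
    memset(&act, 0, sizeof(act));
    act.sa_sigaction = handle_fault;
    act.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &act, NULL);
}
```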
** TODO opt: avoid lea and just store base reg
Instead of:
```
mov %ecx -> 0x44(%eax)
=>
spill %edx
lea 0x44(%eax) -> %edx
store %edx -> buffer
restore %edx
```
Just do:
```
store %eax -> buffer
```
And reconstruct the +0x44 displacement in post-processing.
Keep the lea for the index-register case:
```
mov %ecx -> 0x44(%eax,%ebx)
```
** TODO opt: single entry for consecutive same-base memrefs
```
mov %ecx -> 0x44(%eax)
mov %edx -> 0x48(%eax)
mov %ebx -> 0x4c(%eax)
=>
store %eax -> buffer
<reconstruct 2nd and 3rd entries>
```
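A sketch of the offline side of both ideas: the runtime entry holds only the base register's value, and post-processing re-adds the static displacements recovered from disassembly (hypothetical names throughout):

```c
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t addr_t;

/* Static info per memref, recovered by disassembling the bb:
 * e.g. disp = 0x44, 0x48, 0x4c for the three stores above. */
typedef struct {
    int disp;
} static_memref_t;

/* Expand one recorded base value into the full set of addresses for
 * consecutive same-base memrefs. */
static void
expand_entry(addr_t base_val, const static_memref_t *refs, size_t n,
             addr_t *out_addrs)
{
    for (size_t i = 0; i < n; i++)
        out_addrs[i] = base_val + (addr_t)refs[i].disp;
}
```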
I might add that I'd be interested in seeing a re-evaluation of the faulting buffer performance. IIRC, we opted not to land drx_buf into the clients because at least one drcachesim test timed out with the faulting buffer implementation.
At the time I didn't really do any benchmarks, and what tests I did run were on a crummy VM. The single test case I evaluated was also heavily multithreaded, and I'm wondering whether potential over-locking like in #2114 could have indirectly contributed to the problem.
I have an implementation of the trace samples that uses drx_buf here if anyone's interested.
This is something of a broad issue covering analyzing and improving the performance of the following:
Xref #1929: memtrace slowness due to unbuffered printf
Xref #790: online trace compression
I wanted a place to dump some of my notes on this. The #790 notes are somewhat duplicated in that issue:
memtrace_binary sample perf: 70x (SSD) to 180x (HDD); 4x-25x (avg 18x) w/o disk; 36x w/ no PC (SSD)
mcf test:
=> clearly I/O bound: 9% CPU. Produces a 41GB file with 1.3 billion memrefs. Slowdown: 183x.
More like 70x on laptop, and higher %CPU:
Because it's got an SSD? Or also b/c CPU is slower (so higher CPU-to-disk ratio; also slower native)?
Disabling dr_write_file:
=> 5.6x. That's PC, read/write, size, and address per entry. It should be easy to improve by 2x by removing read/write and size (statically recoverable) and only including the PC once per bb, or even less often.
But it's much worse on other SPEC benchmarks. A ref run of everything was taking too long; these are the benchmarks that had finished at the point I killed the run, 9 hours in:
Qin: "if memtrace is 100x, if you can make the profile 1/5 the size, can hit 20x"
Can shrink some fields, but not to 1/5. Online gzip compression should easily give 1/5. Simple test: I see >20x gzip compression (though w/ naive starting format):
Removing the PC field:
=> still I/O bound: 11% CPU. Produces a 31GB file. Slowdown: 126x.
On laptop:
Up to 37% CPU, and a 36x slowdown.
drcachesim tracer performance => 2x slower b/c of icache entries
Switching from mcf test to bzip2 test b/c it's a little closer to the 18x average performance for the memtrace sample not writing to disk and so is more representative:
native:
No disk writes at all:
That's 15.6x.
30.8x! 2x vs memtrace, b/c it's including icache info, presumably.
Currently trace_entry_t is 4+8 => 16 bytes b/c of alignment (we didn't pack it; b/c we only care about 32-bit?).
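Roughly, assuming a layout like the following (a sketch, not the exact trace_entry.h contents):

```c
#include <stdint.h>

/* Unpacked: 2+2 bytes of type+size, then 4 bytes of padding so the
 * 8-byte addr is 8-byte aligned => sizeof == 16 on 64-bit. */
typedef struct {
    unsigned short type;
    unsigned short size;
    uint64_t addr;    /* pc, or memref address */
} trace_entry_unpacked_t;

/* Packed (gcc/clang syntax): sizeof == 12, but in an array the addr
 * field of every other entry is only 4-byte aligned. */
typedef struct __attribute__((packed)) {
    unsigned short type;
    unsigned short size;
    uint64_t addr;
} trace_entry_packed_t;
```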
Packing trace_entry_t w/o any other changes to the struct:
Also compressing size+type from 4 bytes into 2 bytes (might need an extra escape entry for memsz > 256):
Also shrinking pc/addr field to 4 bytes:
Also removing INSTR_BUNDLE (always has preceding abs pc so redundant):
10.7x = Also removing all instr entries (thus there's no PC at all):
Having the instr bundles and all the instr boundary info come from the tracer seems worth it for online simulation, where having the simulator dig it up from disassembly of a giant binary is going to be slower than the tracer providing it. But for offline, it does seem like we want to really optimize the tracing; thus we need a split tracer!
14.3x = Adding back one instr entry per bb (1st instr in bb):
Significant cost for instr-entry-per-bb: 33% more expensive. Maybe we can leverage traces to bring it down, having one instr entry per trace + a bit per cbr + an extra entry per mbr crossed?!?
#790: try online compression with zlib
With the private loader, we should be able to just use the zlib library directly.
It produces a 4GB file (vs 41GB uncompressed binary) but it is much slower! 295x vs native, 1.6x vs uncompressed. 98% CPU, too.
Try the zlib format instead of the gz format, where we can set high speed => Z_BEST_SPEED is faster than uncompressed for HDD, but still not for SSD.
Z_BEST_SPEED: have to use the deflate interface directly and the zlib compression format; the gz interface uses the gzip compression format and apparently has no interface to set speed vs size.
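A minimal sketch of that path using the standard zlib deflate API (error handling omitted; the buffer size is arbitrary):

```c
#include <string.h>
#include <unistd.h>
#include <zlib.h>

static z_stream strm;
static unsigned char out[64 * 1024];

static void
init_deflate(void)
{
    memset(&strm, 0, sizeof(strm));
    deflateInit(&strm, Z_BEST_SPEED);  /* zlib format, fastest level */
}

/* Compress one trace buffer and append the output to fd.  Pass
 * finish != 0 for the final buffer to flush the stream. */
static void
write_compressed(int fd, const void *buf, size_t size, int finish)
{
    strm.next_in = (Bytef *)buf;
    strm.avail_in = (uInt)size;
    do {
        strm.next_out = out;
        strm.avail_out = sizeof(out);
        deflate(&strm, finish ? Z_FINISH : Z_NO_FLUSH);
        write(fd, out, sizeof(out) - strm.avail_out);
    } while (strm.avail_out == 0);
}
```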
It produces a 4.5GB file and is significantly faster than uncompressed, but it's still 114x vs native.
On the laptop it makes a 4.3GB file (I should have saved it to see if it was really different) and:
So even Z_BEST_SPEED is slower than uncompressed on an SSD!