DynamoRIO / dynamorio

Dynamic Instrumentation Tool Platform

online drcachesim is too slow (8x slower than cachegrind): regression? #3614

Open derekbruening opened 5 years ago

derekbruening commented 5 years ago

This issue is about improving the overall drcachesim online analysis performance. Xref #1738 on optimizing the cache simulation code. Xref #2001 on optimizing the tracer.

Running SPEC2006 mcf on the test input, we are currently very slow: 8x slower than cachegrind, in fact (!!). I believe this is a big regression, since I recall running this same performance test years ago and seeing times in the 20-second range. Even the basic_counts tool is slow, which makes this separate from #1738. I have not yet profiled this: that would be the first step.
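A sketch of what that first profiling step could look like, assuming Linux perf is available on the machine (the drrun path matches the transcripts in this issue; treat this as an unverified command sketch):

```shell
# Sample the whole online run (DR core + client + simulator) with call stacks,
# then inspect where the time goes. -g enables call-graph sampling.
perf record -g -- /work/dr/git/build_x64_rel_tests/bin64/drrun -t drcachesim -- \
    ./mcf ../../data/test/input/inp.in
perf report
```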

This is a 64-bit release build:

$ /usr/bin/time ./mcf ../../data/test/input/inp.in
MCF SPEC CPU2006 version 1.10
Copyright (c) 1998-2000 Zuse Institut Berlin (ZIB)
Copyright (c) 2000-2002 Andreas Loebel & ZIB
Copyright (c) 2003-2005 Andreas Loebel

nodes                      : 5985
active arcs                : 102404
simplex iterations         : 63475
objective value            : 3180065918
new implicit arcs          : 2393292
active arcs                : 2495696
simplex iterations         : 118645
objective value            : 2060055866
erased arcs                : 2387557
checksum                   : 2997477
done
1.58user 0.04system 0:01.70elapsed 95%CPU (0avgtext+0avgdata 159752maxresident)k

$ /usr/bin/time /work/dr/git/build_x64_rel_tests/bin64/drrun -- ./mcf ../../data/test/input/inp.in
<...>
1.66user 0.03system 0:01.71elapsed 98%CPU (0avgtext+0avgdata 162812maxresident)k

$ /usr/bin/time /work/dr/git/build_x64_rel_tests/bin64/drrun -t drcachesim -- ./mcf ../../data/test/input/inp.in
<...>
---- <application exited with code 0> ----
Cache simulation results:
Core #0 (1 thread(s))
  L1I stats:
    Hits:                    3,367,529,621
    Misses:                          1,830
    Invalidations:                       0
    Miss rate:                        0.00%
  L1D stats:
    Hits:                    1,002,703,388
    Misses:                    369,946,321
    Invalidations:                       0
    Prefetch hits:              47,349,279
    Prefetch misses:           322,597,042
    Miss rate:                       26.95%
Core #1 (0 thread(s))
Core #2 (0 thread(s))
Core #3 (0 thread(s))
LL stats:
    Hits:                      356,111,139
    Misses:                     13,837,012
    Invalidations:                       0
    Prefetch hits:             254,939,620
    Prefetch misses:            67,657,422
    Local miss rate:                  3.74%
    Child hits:              4,417,582,288
    Total miss rate:                  0.29%
224.86user 16.13system 3:21.27elapsed 119%CPU (0avgtext+0avgdata 167692maxresident)k

$ /usr/bin/time /work/dr/git/build_x64_rel_tests/bin64/drrun -t drcachesim -simulator_type basic_counts -- ./mcf ../../data/test/input/inp.in
<...>
---- <application exited with code 0> ----
Basic counts tool results:
Total counts:
  3223195651 total (fetched) instructions
         433 total non-fetched instructions
           0 total prefetches
  1178112041 total data loads
   194412943 total data stores
           1 total threads
    28369030 total scheduling markers
           0 total transfer markers
           0 total function id markers
           0 total function return address markers
           0 total function argument markers
           0 total function return value markers
           0 total other markers
Thread 18131 counts:
  3223195651 (fetched) instructions
         433 non-fetched instructions
           0 prefetches
  1178112041 data loads
   194412943 data stores
    28369030 scheduling markers
           0 transfer markers
           0 function id markers
           0 function return address markers
           0 function argument markers
           0 function return value markers
           0 other markers
124.72user 13.88system 1:44.85elapsed 132%CPU (0avgtext+0avgdata 167700maxresident)k

$ /usr/bin/time valgrind --tool=cachegrind -- ./mcf ../../data/test/input/inp.in
==18032== Cachegrind, a cache and branch-prediction profiler
==18032== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==18032== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==18032== Command: ./mcf ../../data/test/input/inp.in
==18032== 
--18032-- warning: L3 cache found, using its data for the LL simulation.

<...>
==18032== 
==18032== I   refs:      3,223,256,389
==18032== I1  misses:            1,764
==18032== LLi misses:            1,741
==18032== I1  miss rate:          0.00%
==18032== LLi miss rate:          0.00%
==18032== 
==18032== D   refs:      1,346,785,609  (1,178,206,897 rd   + 168,578,712 wr)
==18032== D1  misses:      437,406,928  (  423,864,952 rd   +  13,541,976 wr)
==18032== LLd misses:       81,227,588  (   73,769,939 rd   +   7,457,649 wr)
==18032== D1  miss rate:          32.5% (         36.0%     +         8.0%  )
==18032== LLd miss rate:           6.0% (          6.3%     +         4.4%  )
==18032== 
==18032== LL refs:         437,408,692  (  423,866,716 rd   +  13,541,976 wr)
==18032== LL misses:        81,229,329  (   73,771,680 rd   +   7,457,649 wr)
==18032== LL miss rate:            1.8% (          1.7%     +         4.4%  )
24.04user 0.05system 0:24.20elapsed 99%CPU (0avgtext+0avgdata 180484maxresident)k
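As a quick cross-check, the elapsed wall-clock times in the transcripts above work out to roughly these slowdowns (a small back-of-the-envelope script; the constants are copied from the /usr/bin/time lines):

```python
# Elapsed times in seconds, from the /usr/bin/time output above.
native = 1.70          # ./mcf alone
drrun = 1.71           # plain drrun, no tool
drcachesim = 201.27    # drrun -t drcachesim (3:21.27)
basic_counts = 104.85  # basic_counts tool (1:44.85)
cachegrind = 24.20     # valgrind --tool=cachegrind

print(f"drcachesim vs native:     {drcachesim / native:.1f}x")    # ~118x
print(f"basic_counts vs native:   {basic_counts / native:.1f}x")  # ~62x
print(f"cachegrind vs native:     {cachegrind / native:.1f}x")    # ~14x
print(f"drcachesim vs cachegrind: {drcachesim / cachegrind:.1f}x")  # ~8.3x
```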
itamarst commented 3 years ago

Tried this out myself and got the same result: it's much slower than Cachegrind.

derekbruening commented 3 years ago

All our usage nowadays is with offline traces, so it is not clear core contributors will have time to spend on this: hopefully someone new is motivated to take a look.

itamarst commented 3 years ago

Totally get people not having time; it's just that I'm not quite understanding the issue.

With offline traces, the collection and analysis are done separately, yes? Is there any reason to think that would be faster? If so, I should be trying that.

Or is it just that drcachesim isn't being maintained because core contributors are working on other things?

derekbruening commented 3 years ago

drcachesim's offline tracing and analysis tools are heavily used today and are actively maintained.

Offline trace gathering is optimized to not record what can be reconstructed later, with a post-processing pass to fill in that information. Online traces are not optimized in that way. E.g., online traces record an entry for every single instruction, while offline traces record one entry per basic block; similarly, offline traces omit entries for statically known or similar-to-neighbor addresses, while online traces do not. You could imagine online tracing trying something similar, but the post-processing overhead there would add to the online overhead and it's not clear how much of a win it would be.
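To make the entry-count difference concrete, here is a rough estimate using the mcf counts from the basic_counts run above. The average basic-block size and the fraction of memory-reference entries that offline can elide are made-up illustrative assumptions, not measured values:

```python
# Counts from the basic_counts output above.
instrs = 3_223_195_651                   # total (fetched) instructions
memrefs = 1_178_112_041 + 194_412_943    # data loads + data stores

avg_bb_size = 8     # ASSUMED average instructions per basic block
elided_frac = 0.5   # ASSUMED fraction of memref entries offline can omit

# Online: one entry per instruction plus one per memory reference.
online_entries = instrs + memrefs
# Offline: one entry per basic block plus the non-elidable memrefs.
offline_entries = instrs / avg_bb_size + memrefs * (1 - elided_frac)

print(f"online:  ~{online_entries / 1e9:.2f}e9 entries")
print(f"offline: ~{offline_entries / 1e9:.2f}e9 entries")
print(f"ratio:   ~{online_entries / offline_entries:.1f}x fewer entries offline")
```

Under these assumptions offline writes roughly 4x fewer trace entries, which is where its reduced recording cost comes from.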

Offline's overhead is dominated by I/O. The instrumentation itself is in the ~15x slowdown range, but the overall overhead is more like 50x-100x depending on the I/O capabilities. A simple online analysis could beat that, but a heavyweight online analysis may not.

Offline lets us run multiple heavyweight analyses on the same trace without affecting trace gathering overhead, and run new analyses on old traces.
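Sketched as commands, that offline record-once-analyze-many workflow looks something like the following (the -offline and -indir options follow the drcachesim documentation as I understand it, and the output directory name is a guess based on the usual drmemtrace.<app>.<pid>.<id>.dir pattern; treat this as an unverified sketch):

```shell
# Record the trace once, to a drmemtrace.* output directory:
bin64/drrun -t drcachesim -offline -- ./mcf ../../data/test/input/inp.in

# Re-analyze the same trace as many times as you like, with different tools:
bin64/drrun -t drcachesim -indir drmemtrace.mcf.*.dir
bin64/drrun -t drcachesim -indir drmemtrace.mcf.*.dir -simulator_type basic_counts
```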

itamarst commented 3 years ago

Hm. Taking a step back: my goal is not really profiling, it's benchmarking. Specifically, I want the equivalent of the numbers perf stat spits out, but not tied to the underlying hardware, so that you can get consistent numbers even when running in cloud VMs or cloud CI runners. You can do this with Cachegrind (https://pythonspeed.com/articles/consistent-benchmarking-in-ci/), but that suffers from (A) slowness and (B) lack of realism, thus my interest in drcachesim. My hopes are (A) faster runs and (B) prefetching modeled for more accurate cache-miss metrics.

So I guess my question is, what would be the best way to achieve speed with drcachesim given this use case?

derekbruening commented 3 years ago

The good news is that the trace analysis tools for generating the numbers will work in either online or offline mode, as that's how we designed the analysis interface.

For speed with today's code, for the use case of a single end-to-end run from application execution to analysis results, we'd have to sum offline's tracing of the run itself, its post-processing step, and its analysis of the final trace file, and compare that total against the online analysis. Even if offline's 3 separate steps summed are faster today, offline requires disk space for the trace in its entirety, and a long run may not have enough space. Plus, there may be low-hanging fruit that would speed up online.

If everything were optimized, you would expect online to out-perform end-to-end offline for any single analysis, just because offline is doing more work for its multiple steps with post-processing reconstruction. Offline can parallelize its recording step, even if the analysis is serial, but will still pay for the serial analysis when it gets to that step. So if the application is not perturbed by the slowdowns of serial analysis, I would think online would be the way to go. If the application is perturbed then you have to go with offline, with the complication of storage space.