derekbruening opened 5 years ago
Tried this out myself and had the same result: it's much slower than Cachegrind.
All our usage nowadays is with offline traces, so it is not clear core contributors will have time to spend on this: hopefully someone new is motivated to take a look.
Totally get people not having time; I'm just not quite understanding the issue.
With offline traces the collection and analysis are done separately, yes? Any reason to think that would be any faster? In that case I should be trying it.
Or is it just that drcachesim isn't being maintained because core contributors are working on other things?
drcachesim's offline tracing and analysis tools are heavily used today and are actively maintained.
Offline trace gathering is optimized to not record what can be reconstructed later, with a post-processing pass to fill in that information. Online traces are not optimized in that way. E.g., online traces record an entry for every single instruction, while offline traces record one entry per basic block; similarly, offline traces omit entries for statically known or similar-to-neighbor addresses, while online traces do not. You could imagine online tracing trying something similar, but the post-processing overhead there would add to the online overhead and it's not clear how much of a win it would be.
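To make the per-instruction vs. per-basic-block difference concrete, here is a rough model (this is not drcachesim's actual trace format, and the average block size is an assumption) of how recording one entry per block shrinks the raw trace volume:

```python
# Toy model of trace volume: online emits one entry per executed
# instruction; offline emits one entry per basic block and reconstructs
# the per-instruction detail in a later post-processing pass.

def trace_entries(num_instrs, avg_bb_size, per_instruction):
    """Number of trace entries emitted for num_instrs executed instructions.

    per_instruction=True models online tracing; False models offline
    tracing (one entry per basic block; memory references would still
    need entries, but statically known addresses can be omitted --
    neither is modeled here).
    """
    if per_instruction:
        return num_instrs
    return num_instrs // avg_bb_size

online = trace_entries(1_000_000_000, avg_bb_size=5, per_instruction=True)
offline = trace_entries(1_000_000_000, avg_bb_size=5, per_instruction=False)
print(online // offline)  # with a 5-instruction average block, 5x fewer entries
```

The real savings depend on the workload's average block size and on how many memory-reference entries can be elided, so treat the 5x as illustrative only.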
Offline's overhead is dominated by i/o. The instrumentation itself is in the ~15x slowdown range, but the overall overhead is more like 50x-100x depending on the i/o capabilities. A simple online analysis could beat that, but a heavyweight online analysis may not.
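A quick back-of-envelope on those figures shows how much of offline's wall-clock time is i/o rather than instrumentation (the mid-range 70x total is an assumed value inside the quoted 50x-100x band):

```python
# Split offline tracing wall-clock time into instrumentation vs i/o,
# using the slowdown figures quoted above: ~15x for instrumentation
# alone, 50x-100x end to end depending on i/o capabilities.

def offline_breakdown(native_secs, instr_slowdown=15, total_slowdown=70):
    """Return (total, instrumentation, io) seconds for an offline trace run.

    total_slowdown=70 is an assumed mid-range of the quoted 50x-100x.
    """
    total = native_secs * total_slowdown
    instrumentation = native_secs * instr_slowdown
    io = total - instrumentation
    return total, instrumentation, io

total, instr, io = offline_breakdown(20)  # e.g. a 20-second native run
print(total, instr, io)        # 1400 300 1100
print(round(io / total, 2))    # i/o share: 0.79
```

Under these assumptions roughly four-fifths of the time goes to i/o, which is why a lightweight online analysis can beat offline while a heavyweight one may not.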
Offline lets us run multiple heavyweight analyses on the same trace without affecting trace gathering overhead, and run new analyses on old traces.
Hm. Taking a step back: my goal is not really profiling, it's benchmarking. Specifically, I want the equivalent of the numbers `perf stat` spits out, but not tied to the underlying hardware, so that you can get consistent numbers even when running in cloud VMs or cloud CI runners. You can do this with Cachegrind (https://pythonspeed.com/articles/consistent-benchmarking-in-ci/), but that suffers from (A) slowness and (B) lack of realism, hence my interest in drcachesim. My hope is (A) to run faster and (B) to have prefetching modeled for more accurate cache miss metrics.
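The appeal of simulator counts for benchmarking is that they can be combined into a deterministic score, in the spirit of the linked article. The event names and penalty weights below are illustrative assumptions, not drcachesim's or Cachegrind's actual output format:

```python
# Sketch of turning simulated counts into a hardware-independent
# benchmark score. Weights are made-up assumptions for illustration.

def benchmark_score(instructions, l1_misses, ll_misses,
                    l1_penalty=5, ll_penalty=35):
    """Estimated cost: each instruction counts 1, plus cache-miss penalties.

    Because the inputs come from a simulator rather than hardware
    performance counters, the score is reproducible across machines,
    VMs, and CI runners.
    """
    return instructions + l1_penalty * l1_misses + ll_penalty * ll_misses

print(benchmark_score(1_000_000, 10_000, 1_000))  # 1085000
```

Any fixed weighting works for regression detection in CI, since only run-to-run consistency matters, not absolute accuracy.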
So I guess my question is, what would be the best way to achieve speed with drcachesim given this use case?
The good news is that the trace analysis tools for generating the numbers will work in either online or offline mode, as that's how we designed the analysis interface.
For speed with today's code, for the use case of a single end-to-end application-execution-to-analysis-results run, we'd have to add the time for the offline tracing of the run itself to the post-processing step and then to the analysis of the final trace file, and compare that sum to the online analysis. Even if offline's 3 separate steps summed are faster today, offline requires disk space for the trace in its entirety, and a long run may not have enough space. Plus, it could be that there is low-hanging fruit that would speed up online.
If everything were optimized, you would expect online to outperform end-to-end offline for any single analysis, simply because offline does more work across its multiple steps with post-processing reconstruction. Offline can parallelize its recording step, even if the analysis is serial, but will still pay for the serial analysis when it gets to that step. So if the application is not perturbed by the slowdowns of serial analysis, I would think online would be the way to go. If the application is perturbed, then you have to go with offline, with the complication of storage space.
This issue is about improving the overall drcachesim online analysis performance. Xref #1738 on optimizing the cache simulation code. Xref #2001 on optimizing the tracer.
Running SPEC2006 mcf on the test input, we are currently very slow: 8x slower than Cachegrind in fact (!!). I believe this is a big regression, since I recall running this same performance test years ago and being in the 20-second range. Even the basic_counts tool is slow, which makes this separate from #1738. I have not yet profiled this: that would be the first step.
This is a 64-bit release build: