arvindm95 / unladen-swallow

Automatically exported from code.google.com/p/unladen-swallow

Modify benchmarks so that we can compare hotness heuristics #76

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Before we can improve the hotness heuristic, we have to measure its effectiveness. Right now, we have no repeatable, low-noise way to compare two hotness heuristics. Benchmarking with -j whenhot triggers a lot of compilation in the early passes and then stabilizes in the later passes.

Here's how I think we should fix this:

First, we should do a couple of priming runs, wait for any background compilation to finish, disable LLVM compilation, and then do the benchmarking runs. This is a good first step because it makes each benchmarking run independent of the others, which they aren't right now.
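Roughly, the timing loop in the driver could look like the sketch below; the commented-out _llvm calls are placeholder names for hooks we'd have to expose, not existing API.

import time

def benchmark(bench_fn, priming_runs=2, timed_runs=10):
    # Priming passes: let the hotness heuristic trigger background compilation.
    for _ in range(priming_runs):
        bench_fn()

    # Wait for queued compilation to drain, then freeze the JIT so the timed
    # passes only measure steady-state machine code.
    # _llvm.wait_for_pending_compiles()   # placeholder name
    # _llvm.disable_compilation()         # placeholder name

    # Each timed pass is now independent of the others.
    times = []
    for _ in range(timed_runs):
        start = time.time()
        bench_fn()
        times.append(time.time() - start)
    return times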

However, if we just did that and we assume that compilation always improves runtime, then the most effective heuristic would be -j always. We need to balance the execution time against the compilation time. To do this, I think we should introduce a new mode to perf.py called "compile_mode" (-c), which measures the compile time instead of the execution time. To compare two heuristics, we'd do something like this:

perf.py -a '-j heuristic1,-j heuristic2' ./python ./python
... run time stats ...
perf.py -c -a '-j heuristic1,-j heuristic2' ./python ./python
... compile time stats ...

Perhaps passing -c could cause perf.py to run the benchmark scripts twice (once for run time, once for compile time), so that we could get one final analysis without having to invoke perf.py twice ourselves.
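As a rough illustration of what -c would change on the driver side (every name here is hypothetical, not the actual perf.py internals):

import subprocess

def measure(python, jit_args, bench_script, compile_mode=False):
    # One number per line on stdout: per-pass run time normally, or the
    # accumulated JIT compile time when -c is in effect.
    # --report-compile-time is a hypothetical flag the benchmark scripts
    # would have to grow.
    cmd = [python] + jit_args + [bench_script]
    if compile_mode:
        cmd.append("--report-compile-time")
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    return [float(line) for line in out.decode().split()]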

This mode is not very useful for benchmarks like the regexp or micro benchmarks, so not every script should support it. For tuning the heuristic, we really only care about the macro benchmarks.

Original issue reported on code.google.com by reid.kle...@gmail.com on 10 Aug 2009 at 11:48

GoogleCodeExporter commented 8 years ago
I agree with the plan to stop compilation after the priming runs. We'll need some way of gauging how many priming runs is appropriate.

What do you mean by "compilation time"? If you mean only the time it takes us to compile Python bytecode to machine code, I think that's too narrow. Better would be the time from when the work unit is put on the queue to when co_native_function is set (i.e., when we can actually start using the machine code).
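In rough pseudocode (Python stand-ins for what would really be C-level hooks; none of these names exist in the runtime), that measurement would be:

import time

_enqueued = {}      # code object -> when it was put on the compilation queue
compile_times = {}  # code object -> seconds until its native code was usable

def on_enqueue(code):
    _enqueued[code] = time.time()

def on_native_code_ready(code):
    # Would be called at the point co_native_function is set, i.e. when the
    # machine code can actually start being used.
    compile_times[code] = time.time() - _enqueued.pop(code)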

Original comment by collinw on 11 Aug 2009 at 5:54

GoogleCodeExporter commented 8 years ago
So maybe we could turn the hot code set into a map from code to time spent on the queue and then expose that information via _llvm? The numbers would be noisier, since they would only be measured once over the course of the priming runs, but we don't need too much precision for tuning this metric. Time spent on the queue will be an important metric if we make the queue a priority queue that dynamically moves hot code to the front, but for just improving the hotness metric I think the compile time will be a good measure.
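If that map were exposed, the tuning harness could collapse it into a single compile-cost number after the priming runs; _llvm.compile_times() below is an invented accessor for the data described above, not something that exists today.

import _llvm  # unladen-swallow's extension module

def total_compile_cost():
    # Maps each code object to the seconds between enqueue and
    # co_native_function being set, collected once over the priming runs.
    times = _llvm.compile_times()  # invented accessor
    return sum(times.values())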

Original comment by reid.kle...@gmail.com on 11 Aug 2009 at 11:56

GoogleCodeExporter commented 8 years ago
Our hotness function seems to be good enough for now, so this isn't very high priority. I'm linking the patches I've been working on for this in the tracker and dropping this to low priority.

http://codereview.appspot.com/105058/show
http://codereview.appspot.com/157081/show

Original comment by reid.kle...@gmail.com on 22 Feb 2010 at 5:56