airspeed-velocity / asv

Airspeed Velocity: A simple Python benchmarking tool with web-based reporting
https://asv.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Stagger benchmarks to smooth background noise #595

Closed: jbrockmendel closed this issue 6 years ago

jbrockmendel commented 6 years ago

When running asv continuous to compare commits A and B, all the benchmarks from A run followed by all the benchmarks from B, e.g. "A.foo, A.bar, A.baz, B.foo, B.bar, B.baz". Would it be feasible to instead run "A.foo, B.foo, A.bar, B.bar, A.baz, B.baz"?

(In fact, because each benchmark is run multiple times, ideally I would like the staggering to be even finer-grained.)

The thought here is that background processes make the noise auto-correlated, so running comparisons back-to-back may give more informative ratios. (Based on amateur speculation.)
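To make the ordering concrete, here is a toy sketch (plain Python, not asv code) of the current schedule versus the staggered one:

```python
# Toy illustration of the ordering being proposed -- not asv's scheduler.
from itertools import chain

benchmarks = ["foo", "bar", "baz"]

# Current behaviour: all of revision A, then all of revision B.
sequential = [f"{rev}.{b}" for rev in ("A", "B") for b in benchmarks]
# -> ['A.foo', 'A.bar', 'A.baz', 'B.foo', 'B.bar', 'B.baz']

# Proposed: stagger so each benchmark runs on both revisions back-to-back,
# so slowly varying background noise hits both measurements alike.
staggered = list(chain.from_iterable((f"A.{b}", f"B.{b}") for b in benchmarks))
# -> ['A.foo', 'B.foo', 'A.bar', 'B.bar', 'A.baz', 'B.baz']
```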

pv commented 6 years ago

See here https://github.com/pv/asv/commits/many-proc

Full staggering may not be so sensible, because you need to uninstall/install the project in between.

The general issue is also probably not so much background processes, but hardware causes of CPU performance fluctuations. These occur on laptops, but I guess desktop CPUs have similar behavior with respect to thermal throttling etc.

jbrockmendel commented 6 years ago

The hardware issue sounds tough, but also sounds like you've put a lot of time and thought into this, which is reassuring.

What am I looking at with many-proc? Is the idea to save+aggregate results across multiple runs? That seems like a reasonable approach (and possibly easier to implement than staggering?).

Is the "not so sensible" an indication that I should give up on this? The status-quo of running asv continuous repeatedly and hunting for which ratios are similar across many runs is... not ideal.

pv commented 6 years ago

Staggering can be implemented (in the same way as multiple runs + aggregating results), but I expect it will be slower because between each benchmark you need to uninstall/reinstall the project in the environment.

jbrockmendel commented 6 years ago

> it will be slower because between each benchmark you need to uninstall/reinstall the project in the environment.

Can you expand on that a bit? Could the parent process spawn two env-specific processes which in turn spawn processes for each benchmark without switching in between?

Even if not, slower but more accurate is a tradeoff I'd be happy with. I implemented asv for dateutil, but the PR stalled because at the moment the results are just way too noisy.

pv commented 6 years ago

@jbrockmendel: you can perhaps try the asv master branch now that the multi-process benchmarking is in there. E.g. asv run ... -a processes=4 to override benchmark attributes from the command line.
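For reference, the same attribute can also be set directly in the benchmark file. A minimal sketch, assuming processes behaves like the other per-benchmark attributes such as repeat and number:

```python
# Sketch of an asv benchmark file; `processes` is assumed to work like the
# other per-benchmark attributes (repeat, number), i.e. it can be set here
# or overridden with `-a processes=4` on the command line.

class TimeSuite:
    processes = 4   # number of sequential interpreter runs to sample from
    repeat = 10     # timing repeats within each run

    def setup(self):
        self.data = list(range(10_000))

    def time_sum(self):
        sum(self.data)
```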

jbrockmendel commented 6 years ago

I'll give it a try. How do I enable the feature where it stores results to calculate statistics over multiple runs? If we can get that working then stability becomes a problem we can throw hardware-hours at.

pv commented 6 years ago

It's automatically on (processes=2 by default), adjustable on the command line as above.

There's no option to combine results from multiple asv continuous runs, however.

Ultimately, if you want good benchmarking accuracy, you probably need to at least disable CPU frequency tuning for one CPU and then use taskset (on Linux) to run the benchmark processes on that CPU (and maybe isolate it from general allocation).

None of this is particularly necessary if you just accumulate historical data, as done here and by most projects using asv: https://pv.github.io/numpy-bench/ --- sure, the results are somewhat noisy, but unlike with asv continuous (which was not a main intended use case for asv in the beginning), small noise in a time series won't matter much.
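As a rough sketch of the pinning part (Linux only; the shell equivalent is running asv under taskset, and the frequency-scaling setup is a separate system-level step not shown here):

```python
# Rough sketch: pin the benchmarking run to one CPU from Python, similar in
# spirit to `taskset -c 3 asv continuous ...`. Linux only. Disabling
# frequency scaling / turbo for that core is a separate system-level step.
import os
import subprocess

CPU = 3  # assumed: a core you have reserved for benchmarking

# Pin this process to the chosen core; child processes inherit the affinity,
# so asv's benchmark processes end up on the same core.
os.sched_setaffinity(0, {CPU})

subprocess.run(
    ["asv", "continuous", "-E", "virtualenv", "-f", "1.1", "master", "HEAD"],
    check=True,
)
```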

jbrockmendel commented 6 years ago
```
time asv continuous -E virtualenv -f 1.1 master $HEAD -b groupby.Categories
[...]
real    0m44.505s
user    0m41.800s
sys     0m10.168s

time asv continuous -E virtualenv -f 1.1 master $HEAD -b groupby.Categories -a processes=4
[...]
real    1m13.417s
user    1m9.164s
sys     0m18.696s
```

Is this the expected behavior? I thought it would go the other way.

jbrockmendel commented 6 years ago

> There's no option to combine results from multiple asv continuous runs, however.

Darn. Would a PR implementing this be accepted? (or feasible?)

> None of this is particularly necessary if you just accumulate historical data [...]
> Ultimately, if you want good benchmarking accuracy, you probably need to at least disable CPU frequency tuning for one CPU and then use taskset [...]

I have taken to using taskset, but I'm not familiar with the tuning bit.

It sounds like our [pandas, prospectively dateutil] use case is not exactly the intended one, but I'm optimistic/hopeful we can make it work. Generally, when a PR touches a perf-sensitive part of the code, the maintainer asks for an asv comparison. A single run of asv continuous [...] says that unchanged parts of the code have gotten much faster/slower; subsequent runs often flip those results and/or flag a different set of benchmarks as changed.

Cranking the sample size up to 11 seems like the least labor-intensive way to address this. Am I wrong?
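For concreteness, the kind of combining I have in mind is just pooling the timing samples from several runs before comparing, roughly like this (illustrative numbers only; this says nothing about asv's actual result format):

```python
# Illustration of the statistical idea only -- not asv internals: pool the
# timing samples per benchmark across runs and compare medians, so a single
# noisy run cannot flip the ratio on its own.
from statistics import median

def combined_ratio(runs_a, runs_b):
    """runs_a / runs_b: one sample list per run (seconds) for a single benchmark."""
    pooled_a = [t for run in runs_a for t in run]
    pooled_b = [t for run in runs_b for t in run]
    return median(pooled_b) / median(pooled_a)

# e.g. three independent runs of the same benchmark on each revision:
ratio = combined_ratio(
    runs_a=[[1.02e-3, 1.05e-3], [0.99e-3, 1.10e-3], [1.01e-3, 1.00e-3]],
    runs_b=[[1.03e-3, 1.04e-3], [1.00e-3, 1.06e-3], [1.02e-3, 0.98e-3]],
)
print(f"B/A ratio: {ratio:.3f}")
```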

pv commented 6 years ago

> Is this the expected behavior? I thought it would go the other way.

I'm not sure what you mean --- the default is processes=2, and increasing the number to 4 makes it take longer --- as expected?

> Darn. Would a PR implementing this be accepted? (or feasible?)

It's feasible, just needs some plumbing in the right places.

But I'm not sure if it will ultimately solve the problem with the accuracy of the results --- you may be able to somewhat reduce the number of false positives, but not fully. E.g., if there is performance variation on a time scale of ~10 seconds (such as with laptop CPU thermal control), then you still need some luck for all your benchmarks to sample the full variation.

> It sounds like our [pandas, prospectively dateutil] use case is not exactly the intended one, but I'm optimistic/hopeful we can make it work. Generally, when a PR touches a perf-sensitive part of the code, the maintainer asks for an asv comparison. A single run of asv continuous [...] says that unchanged parts of the code have gotten much faster/slower; subsequent runs often flip those results and/or flag a different set of benchmarks as changed.

> Cranking the sample size up to 11 seems like the least labor-intensive way to address this. Am I wrong?

Sure, it should be possible to make it work by measuring for longer (i.e., adjusting repeat and processes). However, I'm not sure it will be feasible to, e.g., run the whole pandas benchmark suite on an unconfigured laptop CPU without false positives.

This is not specific to asv, except in the sense that asv runs many timing benchmarks, and the chance of false positives is multiplied by the number of benchmarks run.

However, I expect the situation is already better with asv 0.3.x, which does the statistics properly, than with asv 0.2.x.

pv commented 6 years ago

cf gh-689

jbrockmendel commented 6 years ago

689 looks like it could be a big help, thanks.

> I'm not sure what you mean --- the default is processes=2, and increasing the number to 4 makes it take longer --- as expected?

I'm back to being confused. I expected more processes to mean faster execution; why is that the wrong intuition?

pv commented 6 years ago

The processes are run sequentially, not at the same time. If they were run at the same time, that would change the load on the machine and affect the results.
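Schematically (this is not asv's actual code), the extra processes just run one after another:

```python
# Schematic only -- not asv's implementation. The extra interpreter processes
# run strictly one after another, so total wall time grows roughly linearly
# with `processes` while the load on the machine stays the same.
import subprocess

def run_benchmark_processes(cmd, processes=2):
    """Run `cmd` sequentially `processes` times and collect the results."""
    results = []
    for _ in range(processes):
        # Runs never overlap, so they don't compete with each other for
        # the CPU and skew the timings.
        results.append(subprocess.run(cmd, capture_output=True, text=True))
    return results
```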

pv commented 6 years ago

> 689 looks like it could be a big help, thanks.

You can try it out, in principle. Also, it's unclear to me whether the issues you mention were with asv 0.2.x or whether they persist with the current master branch.

pv commented 6 years ago

gh-697