HowProgrammingWorks / Benchmark

Performance testing for different techniques
https://www.youtube.com/TimurShemsedinov
MIT License

Statistical significance of benchmark results #8

Open aqrln opened 7 years ago

aqrln commented 7 years ago

For us to be able to make any decisions based on benchmark results, those results must be statistically significant. The details may vary depending on what a benchmark tries to test, but I'll try to describe the common approach here.

First of all, separate the collection of raw data and the analysis of it. These are completely different things.

To collect the data, you'll probably want to write a generic benchmarking framework that warms up V8 (hint: this may be done faster using %OptimizeFunctionOnNextCall(), which requires running Node with the --allow-natives-syntax flag), triggers a sensible number of runs of the benchmarked function and calculates the average number of operations per second. It should be possible to run such a script on its own and see the result, so that one can make some assumptions about the code immediately, without detailed analysis.

If we really want to compare the results of two benchmarks and they don't differ by several orders of magnitude, we have no right to do so unless some statistics are involved. In this case the whole process must be repeated some 100 times or so (maybe less, idk, worth playing with it) so that the benchmark runs for a fair amount of time (20–30 minutes would be okay) and writes a series of results in CSV, JSON or another machine-readable format. Then comes the next part, the most interesting one for the mathematically minded people out there.
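A minimal sketch of the data-collection step might look like the following. The names measureOpsPerSecond, runBenchmark and toCsv, as well as the default counts, are made up for illustration and are not part of any existing framework. It warms up by simply calling the function many times rather than using %OptimizeFunctionOnNextCall(), so it runs without extra flags.

```javascript
'use strict';

// Hypothetical helper: calls `fn` `iterations` times and returns ops/s.
const measureOpsPerSecond = (fn, iterations) => {
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) fn();
  const elapsedNs = Number(process.hrtime.bigint() - start);
  return iterations / (elapsedNs / 1e9);
};

// Hypothetical runner: warms up once, then records `samples` measurements.
// The defaults are assumptions; tune them so a full run takes long enough.
const runBenchmark = (name, fn, options = {}) => {
  const { warmup = 10000, iterations = 100000, samples = 100 } = options;
  // Plain warm-up loop: gives V8 a chance to optimize `fn` before measuring.
  for (let i = 0; i < warmup; i++) fn();
  const results = [];
  for (let sample = 0; sample < samples; sample++) {
    const opsPerSecond = measureOpsPerSecond(fn, iterations);
    results.push({ name, sample, opsPerSecond });
  }
  return results;
};

// Serializes the raw series as CSV so the analysis can happen elsewhere.
const toCsv = (rows) => {
  const header = 'name,sample,opsPerSecond';
  const lines = rows.map((r) => `${r.name},${r.sample},${r.opsPerSecond}`);
  return [header, ...lines].join('\n');
};

// Tiny demo with deliberately small counts; a real run would use far more.
const demo = runBenchmark(
  'template-string',
  () => `${Math.random()}-${Math.random()}`,
  { warmup: 100, iterations: 1000, samples: 3 }
);
console.log(toCsv(demo));
```

Note that a benchmarked function with no observable effect may be partly optimized away by the JIT, so real benchmarks usually need to consume the result somehow. The script only produces raw numbers; comparing two series is left to a separate analysis step.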

The approach to comparing results depends on the nature of the benchmarks and their output. If, e.g., we have two series of "ops/s" results from two comparable benchmarks (i.e., ones testing the performance of two ways of solving a single problem, like comparing ES2015+ features with their ES5 counterparts), and we want to know which one is faster and by how much, it is appropriate to use a two-sample independent Student's t-test with the null hypothesis that the performance is the same. There's also additional neat stuff we can do with the raw data, like plotting graphs and charts. I'd recommend using R or Python with scipy for all of this, but if one wants to do the whole thing in JavaScript and Node, it'll be a great project on its own.
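For anyone curious about the JavaScript route, the core of the test statistic fits in a few lines. This is only a sketch: the function names are hypothetical, it assumes both series have roughly equal variance (the classic pooled-variance Student's t-test; Welch's variant is the usual alternative when they don't), and it stops at the t statistic and degrees of freedom. Turning those into a p-value needs the CDF of the t distribution, which is omitted here.

```javascript
'use strict';

const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;

// Unbiased sample variance (divides by n - 1).
const variance = (xs) => {
  const m = mean(xs);
  return xs.reduce((acc, x) => acc + (x - m) ** 2, 0) / (xs.length - 1);
};

// Two-sample independent Student's t-test with pooled variance.
// `a` and `b` are arrays of ops/s values, one per benchmark run.
const studentTTest = (a, b) => {
  const n1 = a.length;
  const n2 = b.length;
  const df = n1 + n2 - 2; // degrees of freedom
  const pooledVariance =
    ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df;
  const standardError = Math.sqrt(pooledVariance * (1 / n1 + 1 / n2));
  const meanDifference = mean(a) - mean(b);
  return { t: meanDifference / standardError, df, meanDifference };
};

// Made-up numbers purely to show the call; not real benchmark results.
console.log(studentTTest([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]));
```

With two series of 100 runs each (198 degrees of freedom), an absolute t value above roughly 1.97 would reject the null hypothesis at the two-sided 5% level. The sign of meanDifference tells which benchmark is faster and by how many ops/s on average.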