benchmark-ips prints the SD as an error margin, but if a result falls within that margin it still confidently tells the user that the benchmark is 'slower'. If the SDs (which we're using as the error) of two benchmarks overlap, we should instead say that we can't tell whether it's slower, faster, or the same.
I've seen several benchmarks where people claim the results prove something is slower even though the errors overlap, so I think we could make this clearer.
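Roughly, the check I have in mind treats mean ± SD as an interval for each benchmark and only claims faster/slower when the two intervals are disjoint. Here is a minimal sketch in plain Ruby; the method name is made up and the SD values are back-computed from the percentages in the output below, so this is just an illustration, not the gem's internals:

# Treat mean ± SD as an interval; only claim a difference when the
# intervals of the two benchmarks don't overlap.
def distinguishable?(mean_a, sd_a, mean_b, sd_b)
  low_a, high_a = mean_a - sd_a, mean_a + sd_a
  low_b, high_b = mean_b - sd_b, mean_b + sd_b
  high_a < low_b || high_b < low_a # true only when the intervals are disjoint
end

distinguishable?(202.060, 46.1, 196.970, 35.1) # => false, report "can't tell"
distinguishable?(202.060, 46.1, 122.529, 4.0)  # => true,  report "1.65x slower"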
I wrote this benchmark:
require 'benchmark/ips'

Benchmark.ips do |x|
  x.report("a") do
    sleep rand / 100
  end

  x.report("b") do
    sleep rand / 100
  end

  x.report("c") do
    sleep 0.75 / 100
  end

  x.compare!
end
This has some limited random variation due to rand, but neither a nor b is really faster than the other. c, however, is slower. With this patch this is what you see:
Warming up --------------------------------------
                   a    17.000 i/100ms
                   b    18.000 i/100ms
                   c    12.000 i/100ms
Calculating -------------------------------------
                   a    202.060 (± 22.8%) i/s -    816.000
                   b    196.970 (± 17.8%) i/s -    972.000
                   c    122.529 (±  3.3%) i/s -    612.000

Comparison:
                   a:      202.1 i/s
                   b:      197.0 i/s - can't tell if faster, slower, or the same
                   c:      122.5 i/s - 1.65x slower
Maybe we want an option to turn this off, but it should be on by default, to stop people from accidentally taking overlapping results as proof.
Using the SD as the error in the first place may not be ideal. I'm far from an expert in statistics, but I'm not sure it's really the correct measure, and what we probably want for this kind of data is a bootstrap confidence interval. There's some Ruby code for this produced by some of the people working on PyPy and studying warmup, but it isn't released as a gem.
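For illustration only (this is not that PyPy/warmup code, just a plain-Ruby sketch of the idea), a percentile bootstrap confidence interval for the mean of a set of per-iteration timings could look something like:

# Illustrative only: a percentile bootstrap confidence interval for the mean
# of a set of per-iteration timing samples (in seconds).
def bootstrap_ci(samples, resamples: 10_000, alpha: 0.05)
  means = Array.new(resamples) do
    resample = Array.new(samples.size) { samples.sample }
    resample.sum / resample.size.to_f
  end
  means.sort!
  lower = means[(alpha / 2 * resamples).floor]
  upper = means[((1 - alpha / 2) * resamples).ceil - 1]
  [lower, upper]
end

a_ci = bootstrap_ci([0.0051, 0.0049, 0.0056, 0.0044, 0.0050])
b_ci = bootstrap_ci([0.0052, 0.0048, 0.0055, 0.0047, 0.0051])
# Only claim a difference when the intervals don't overlap.
overlap = a_ci[0] <= b_ci[1] && b_ci[0] <= a_ci[1]
puts overlap ? "can't tell" : "one is faster"

The comparison would then claim faster/slower only when the two intervals are disjoint, which seems like a more defensible threshold than mean ± one SD.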