evanphx / benchmark-ips

Provides iteration per second benchmarking for Ruby
MIT License

Don't tell people something is slower if it's within the error #60

Closed by chrisseaton 8 years ago

chrisseaton commented 8 years ago

benchmark-ips prints the SD as an error margin, but if a result falls within this error margin it still confidently tells the user that the benchmark is 'slower'. If the error ranges (mean ± SD, which we're using as the error) of two benchmarks overlap, then we should say that we can't tell whether it's slower, faster, or the same.

I've seen several benchmarks where people say the results prove something is slower when the errors overlap, so I think we could make this clearer.

I wrote this benchmark:

require 'benchmark/ips'

Benchmark.ips do |x|

  x.report("a") do
    sleep rand / 100
  end

  x.report("b") do
    sleep rand / 100
  end

  x.report("c") do
    sleep 0.75 / 100
  end

  x.compare!

end

This has some limited random variation due to rand, but neither a nor b is really any faster than the other. c is slower, however. With this patch, this is what you see:

Warming up --------------------------------------
                   a    17.000  i/100ms
                   b    18.000  i/100ms
                   c    12.000  i/100ms
Calculating -------------------------------------
                   a    202.060  (± 22.8%) i/s -    816.000 
                   b    196.970  (± 17.8%) i/s -    972.000 
                   c    122.529  (± 3.3%) i/s -    612.000 

Comparison:
                   a:      202.1 i/s
                   b:      197.0 i/s - can't tell if faster, slower, or the same
                   c:      122.5 i/s - 1.65x slower

Maybe we want an option to turn this off, but keep it on by default to stop people accidentally taking something as proof.
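
The rule behind that output can be sketched roughly as follows. This is a minimal illustration, not the actual patch: `Result`, `overlaps?`, and `compare` are hypothetical names, and it assumes the printed percentage is treated as a symmetric error around the mean i/s. The numbers are the ones from the run above.

```ruby
# Hypothetical sketch, not the benchmark-ips API: a result is a mean i/s
# plus an error margin expressed as a percentage of that mean.
Result = Struct.new(:name, :ips, :error_pct) do
  # Lower bound of the error range (mean - error).
  def low
    ips * (1 - error_pct / 100.0)
  end

  # Upper bound of the error range (mean + error).
  def high
    ips * (1 + error_pct / 100.0)
  end

  # Two closed intervals overlap when each starts before the other ends.
  def overlaps?(other)
    low <= other.high && other.low <= high
  end
end

# Compare every result against the fastest one and return report lines.
def compare(results)
  sorted = results.sort_by { |r| -r.ips }
  best = sorted.first
  lines = [format("%s: %.1f i/s", best.name, best.ips)]

  sorted.drop(1).each do |r|
    verdict =
      if r.overlaps?(best)
        "can't tell if faster, slower, or the same"
      else
        format("%.2fx slower", best.ips / r.ips)
      end
    lines << format("%s: %.1f i/s - %s", r.name, r.ips, verdict)
  end

  lines
end

# Figures copied from the run shown above.
RESULTS = [
  Result.new("a", 202.060, 22.8),
  Result.new("b", 196.970, 17.8),
  Result.new("c", 122.529, 3.3)
]

puts compare(RESULTS)
```

Here a spans roughly 156 to 248 i/s and b roughly 162 to 232 i/s, so they overlap and no verdict is given; c tops out near 127 i/s, well below a's lower bound, so it is still reported as 1.65x slower.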

Using the SD as the error in the first place may not be ideal. I'm far from an expert in statistics, but I'm not sure it's really the correct thing; what we probably want for this kind of data is a bootstrap confidence interval. There's some Ruby code for this, produced by some people working on PyPy and studying warmup, but it isn't released as a gem.
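
For reference, a percentile bootstrap confidence interval for the mean could look something like the sketch below. This is not the PyPy-related code mentioned above, nor anything in benchmark-ips; `bootstrap_ci`, its parameters, and the sample values are all made up for illustration, and a simple percentile method is assumed.

```ruby
# Hypothetical sketch of a percentile bootstrap confidence interval for the
# mean. Not taken from benchmark-ips or from the PyPy warmup tooling.
def bootstrap_ci(samples, resamples: 2_000, confidence: 0.95, rng: Random.new(42))
  n = samples.size

  # Resample with replacement and record the mean of each resample.
  means = Array.new(resamples) do
    sum = 0.0
    n.times { sum += samples[rng.rand(n)] }
    sum / n
  end
  means.sort!

  # Take the matching percentiles of the resampled means as the bounds.
  alpha = (1.0 - confidence) / 2.0
  low = means[(alpha * (resamples - 1)).floor]
  high = means[((1.0 - alpha) * (resamples - 1)).ceil]
  [low, high]
end

# Made-up iterations-per-second measurements, for illustration only.
samples = [198.0, 205.5, 187.2, 221.9, 176.4, 210.3, 193.8, 202.7]
low, high = bootstrap_ci(samples)
mean = samples.sum / samples.size
puts format("mean %.1f i/s, 95%% bootstrap CI [%.1f, %.1f]", mean, low, high)
```

Two results would then be reported as indistinguishable when their intervals overlap, in the same way as with the SD-based margin.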

evanphx commented 8 years ago

Change the text of the new output to "same-ish: difference falls within error" and I'll commit it.

chrisseaton commented 8 years ago

Done.

kbrock commented 8 years ago

@chrisseaton thanks for this change