evanphx / benchmark-ips

Provides iteration per second benchmarking for Ruby
MIT License

Configurable sampling frequency/duration #68

Closed: thedarkone closed this 8 years ago

thedarkone commented 8 years ago

Grrrr, alright, I've had enough :cry: :rage: :neckbeard:.

Don't know if I'm going to be able to get this merged, but let's at least try (I'm open to all kinds of suggestions and will amend the PR as necessary).

Here's the deal: benchmark-ips has been immensely beneficial to the state of accurate Ruby benchmarking, but I feel that in some cases it leads its users astray. Because it measures and reports SD, and now (#60) rightfully refuses to decide benchmarks on flimsy statistical evidence, some developers inaccurately conclude that two versions of the code being benchmarked are equal in performance, when in fact that is demonstrably false. It is not that the benchmark is undecidable; it is just their configuration of benchmark-ips that prevents them from arriving at the correct result.

In my experience (subjective anecdotal evidence for the win :innocent:), in the absence of hard time limits on benchmark duration, it is absolutely the case that noisy benchmarks can be tamed and decided. Ignoring variance while benchmarking is of course a fool's errand, but sometimes some noise is acceptable and we want to know which version of the code is faster. (It is also usually the case that both versions have comparable SD, so this is not about plumping for the average while ignoring the latency/SD.)
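To make the "difference falls within error" criterion concrete, here is a minimal Ruby sketch of the kind of interval-overlap check involved (illustrative names only, not benchmark-ips internals), using the SDs from the run further below:

```ruby
# Two results are treated as indistinguishable when their
# mean ± SD intervals overlap.
Measurement = Struct.new(:ips, :sd) do
  def low;  ips - sd; end
  def high; ips + sd; end

  def overlaps?(other)
    low <= other.high && other.low <= high
  end
end

a = Measurement.new(155.2, 155.2 * 0.026) # 155.2 i/s ± 2.6%
b = Measurement.new(162.1, 162.1 * 0.031) # 162.1 i/s ± 3.1%
a.overlaps?(b) # => true: reported as "same-ish"
```

Shrinking the SDs (longer run, longer samples) is exactly what makes the two intervals stop overlapping.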

In this PR I propose to make the sample duration/interval (in other words, the sampling frequency) configurable. Additionally, in the case of statistically ambiguous results, benchmark-ips will now suggest that the user tweak the configuration and re-run the experiment with a longer duration and a decreased sampling frequency.

The PR is a spiritual successor to @chrisseaton's PR #60, let me know what you guys think.


Tested with MRI:

# bench.rb
require 'benchmark/ips'

# I like to make my VMs work instead of sleeping ;)
def work(num)
  nnum = num * 300 # magic const to make the example work
  i = 0
  while i < nnum
    i += 1
  end
  i
end

Benchmark.ips do |x|
  x.report("work(1000 + r(500))") do
    work(1000 + rand(500))
  end

  x.report("work(1000 + r(400))") do
    work(1000 + rand(400))
  end

  x.compare!
end
> ruby -Ilib bench.rb
Warming up --------------------------------------
 work(1000 + r(500))    15.000  i/100ms
 work(1000 + r(400))    15.000  i/100ms
Calculating -------------------------------------
 work(1000 + r(500))    155.226  (± 2.6%) i/s -    780.000  in   5.029015s
 work(1000 + r(400))    162.102  (± 3.1%) i/s -    810.000  in   5.000809s

Comparison:
 work(1000 + r(400)):      162.1 i/s
 work(1000 + r(500)):      155.2 i/s - same-ish: difference falls within error

-------------------------------------------------------------------------

Some reports were within standard deviation of each other. Because of that
benchmark-ips was unable to decide which of them is faster. It is quite
possible that with a slightly tweaked configuration benchmark-ips will
be able to provide more accurate results.

Please try re-running your benchmark with the following additional
configuration:

  Benchmark.ips do |x|
    x.sample_duration = 0.4 # <--- new config
    x.time = 10 # <--- new config
    # ...

Please note that running the benchmark with the new configuration
will take longer and might not result in a more accurate measurement.
In that case benchmark-ips will re-print this warning with even
more aggressive configuration suggestions. It is a good idea
to follow benchmark-ips's advice a couple of times (each time
re-running the benchmark with the new options), letting it escalate
its configuration repeatedly (a good rule is to give up after 3-4
iterations).

Implementing the suggested changes:

# bench.rb
require 'benchmark/ips'

# I like to make my VMs work instead of sleeping ;)
def work(num)
  nnum = num * 300 # magic const to make the example work
  i = 0
  while i < nnum
    i += 1
  end
  i
end

Benchmark.ips do |x|
  x.sample_duration = 0.4
  x.time = 10

  x.report("work(1000 + r(500))") do
    work(1000 + rand(500))
  end

  x.report("work(1000 + r(400))") do
    work(1000 + rand(400))
  end

  x.compare!
end
> ruby -Ilib bench.rb
Warming up --------------------------------------
 work(1000 + r(500))    61.000  i/100ms
 work(1000 + r(400))    64.000  i/100ms
Calculating -------------------------------------
 work(1000 + r(500))    154.372  (± 1.3%) i/s -      1.586k in  10.276211s
 work(1000 + r(400))    161.165  (± 1.2%) i/s -      1.664k in  10.326133s

Comparison:
 work(1000 + r(400)):      161.2 i/s
 work(1000 + r(500)):      154.4 i/s - 1.04x slower

✨ 🎉

chrisseaton commented 8 years ago

I have a PR in progress to introduce bootstrap confidence intervals. In practice these are much smaller than the SD, while also being arguably more mathematically rigorous and actionable.

They will give us a result such as '1.5x faster (± 0.1) with 95% confidence'.

It introduces functionality different from this PR's, so there's no conflict, but it will also work to solve the same problem: the intervals will be much less likely to overlap.
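For reference, the bootstrap part of that idea can be sketched in a few lines of Ruby (an illustrative sketch, not the code from that PR): resample the collected per-interval measurements with replacement and take percentiles of the resampled means:

```ruby
# Percentile-bootstrap 95% confidence interval for the mean (sketch).
def bootstrap_ci(samples, iterations: 1_000, rng: Random.new(42))
  means = Array.new(iterations) do
    # Resample with replacement and record the resample's mean.
    resample = Array.new(samples.size) { samples[rng.rand(samples.size)] }
    resample.sum / resample.size.to_f
  end
  means.sort!
  [means[(iterations * 0.025).floor], means[(iterations * 0.975).floor]]
end

# Fake per-interval i/s samples, roughly like the noisy runs above.
rng = Random.new(1)
samples = Array.new(50) { 150.0 + rng.rand(20.0) }
low, high = bootstrap_ci(samples)
```

With more (or longer) samples the interval tightens, which is why such intervals end up much narrower than the raw SD band.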

evanphx commented 8 years ago

I don't mind exposing how this works, but:

  • What is the unit of sample_duration?
  • What is the unit of time?
  • This breaks the public API of Benchmark.compare, which we probably should not do.

thedarkone commented 8 years ago

time's unit is seconds (it is an existing option, unrelated to this PR); sample_duration has the same unit: seconds. There is a slight difference: time says it wants an Integer, but afaik it works with floats just fine, whereas sample_duration says it wants a Float.

I considered going with milliseconds for sample_duration, or with Hz, but in the end decided on seconds (mainly for consistency with time).

As for breaking the public API of Benchmark.compare: will fix.
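Back on the units: with both options in seconds, the derived quantities are straightforward; the sampling frequency is 1 / sample_duration and the sample count is roughly time / sample_duration. A tiny illustrative helper (sampling_plan is hypothetical, not part of benchmark-ips):

```ruby
# Hypothetical helper: derive sampling frequency and approximate
# sample count from the two seconds-based options.
def sampling_plan(time:, sample_duration:)
  {
    frequency_hz: 1.0 / sample_duration,
    samples:      (time / sample_duration).round
  }
end

sampling_plan(time: 5,  sample_duration: 0.1) # 10 Hz, ~50 samples
sampling_plan(time: 10, sample_duration: 0.4) # 2.5 Hz, ~25 samples (the suggested config)
```

This also shows the trade-off the warning text describes: a lower frequency means fewer but longer, less noisy samples within the same total time.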


What do you think about doing this automatically (without printing a lengthy warning) for benchmarks that don't have an explicit time and sample_duration configured (which probably covers 99.9% of benchmark-ips usage)?

Have 2 iterations of attempts to fix the SD overlap, each time printing a 1-line warning about the inconclusive result and then proceeding to re-run the experiment. By default, for a run with 2 code blocks this would look something like this:
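As a rough sketch, such an automatic escalation loop might look like the following Ruby; the defaults, the escalation factors, and the conclusive flag are all illustrative stand-ins, not code from this PR:

```ruby
# Sketch: re-run an inconclusive benchmark up to 2 more times,
# escalating the duration and sample length on each attempt.
def run_with_escalation(max_retries: 2, time: 5, sample_duration: 0.1)
  result = yield(time, sample_duration)
  max_retries.times do
    break if result[:conclusive]
    time *= 2
    sample_duration *= 4
    warn "inconclusive result, re-running with time=#{time}, sample_duration=#{sample_duration}"
    result = yield(time, sample_duration)
  end
  result
end
```

Here the block stands in for one full benchmark-ips run; the made-up factors are chosen only so the second attempt matches the time = 10 / sample_duration = 0.4 suggestion above.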

evanphx commented 8 years ago

#69 solves this instead!