Closed: thedarkone closed this 8 years ago
I have a PR to introduce bootstrap confidence intervals, which is in progress. In practice, these are much smaller than the SD, while also being arguably more mathematically rigorous and actionable.
They will give us a result such as '1.5x faster (± 0.1) with 95% confidence'.
It introduces different functionality from this PR, so there's no conflict, but it will also work to solve the same problem: the intervals will be much less likely to overlap.
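For illustration, here is one way a percentile-bootstrap interval for the speedup ratio could be computed. Everything below (the function names, the resampling scheme, the seeded RNG) is an assumption for this sketch, not the actual code of the confidence-interval PR:

```ruby
# Percentile bootstrap sketch: resample the raw iterations-per-second
# measurements of each benchmark with replacement, compute the speedup
# ratio of the resampled means, and take percentiles of those ratios.

# Mean of one bootstrap resample (with replacement) of `samples`.
def resampled_mean(samples, rng)
  n = samples.size
  n.times.sum { samples[rng.rand(n)] } / n.to_f
end

# Returns [low, high] bounds of the speedup ratio fast/slow.
def bootstrap_speedup_ci(fast, slow, rounds: 2_000, confidence: 0.95, seed: 42)
  rng = Random.new(seed) # seeded only so this sketch is reproducible
  ratios = Array.new(rounds) { resampled_mean(fast, rng) / resampled_mean(slow, rng) }.sort
  alpha = 1.0 - confidence
  [ratios[(alpha / 2 * rounds).floor], ratios[((1 - alpha / 2) * rounds).ceil - 1]]
end

# Hypothetical per-sample ips measurements for two code blocks:
fast = [148.0, 151.0, 150.0, 152.0, 149.0]
slow = [99.0, 101.0, 100.0, 102.0, 98.0]
lo, hi = bootstrap_speedup_ci(fast, slow)
puts format("%.2fx faster (%.2f..%.2f) with 95%% confidence", 150.0 / 100.0, lo, hi)
```

Because the interval shrinks as more samples accumulate, two such intervals are far less likely to overlap than the raw mean ± SD ranges.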
I don't mind exposing how this works, but:

- What is the unit of `sample_duration`?
- What is the unit of `time`?
- This breaks the public API of `Benchmark.compare`, which we probably should not do.
`time`'s unit is seconds (it is an existing option, unrelated to this PR); `sample_duration` would have the same unit: seconds. There is a slight difference: `time` says it wants an `Integer`, but afaik will work with floats just fine, whereas `sample_duration` says it wants a `Float`.

I considered going with milliseconds for `sample_duration`, or with Hz, but in the end decided on seconds (mainly for consistency with `time`).
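Purely illustrative arithmetic relating the three unit choices mentioned above (seconds, milliseconds, Hz); the variable names are made up for this sketch:

```ruby
# The PR settled on seconds as a Float, for consistency with `time`.
sample_duration_s  = 0.1                       # proposed unit: seconds (100 ms samples)
sample_duration_ms = sample_duration_s * 1000  # the milliseconds alternative
sampling_freq_hz   = 1.0 / sample_duration_s   # the Hz (sampling frequency) alternative
puts "#{sample_duration_s} s = #{sample_duration_ms} ms = #{sampling_freq_hz} Hz"
```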
- This breaks the public API of
Benchmark.compare
, which we probably should not do.
Will fix.
What do you think about doing this automatically (without printing a lengthy warning) for benchmarks that don't have an explicit `time` and `sample_duration` configured (which probably covers 99.9% of benchmark-ips usage)?
Have 2 iterations of attempts to fix SD overlap, each time printing a 1-line warning about an inconclusive result and then proceeding with re-running the experiment? By default, for a run with 2 code blocks this would look something like this:

bench-ips gives up.
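A rough sketch of what that retry flow could look like; the names (`Report`, `compare_with_retries`, the overlap test) are all hypothetical stand-ins, not the PR's actual code:

```ruby
# Treat two results as inconclusive when their mean ± SD ranges overlap;
# warn, re-run with a longer sample duration, and give up after 2 retries.
Report = Struct.new(:label, :ips, :sd) do
  def overlaps?(other)
    (ips - sd) <= (other.ips + other.sd) && (other.ips - other.sd) <= (ips + sd)
  end
end

MAX_RETRIES = 2

# `run` is a callable standing in for one benchmark execution;
# it receives the sample duration and returns the two reports to compare.
def compare_with_retries(run, sample_duration: 0.1)
  (MAX_RETRIES + 1).times do |attempt|
    a, b = run.call(sample_duration)
    return [a, b] unless a.overlaps?(b)
    if attempt == MAX_RETRIES
      warn "bench-ips gives up."
      return nil
    end
    warn "Inconclusive result (SD ranges overlap), re-running with a longer sample duration..."
    sample_duration *= 2
  end
end
```

The warning stays to one line per retry, and an explicit `time`/`sample_duration` configuration would bypass the loop entirely, matching the "automatic by default" idea above.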
Grrrr, alright, I've had enough :cry: :rage: :neckbeard:.
Don't know if I'm going to be able to get this merged, but let's at least try (I'm open to all kinds of suggestions and will amend the PR as necessary).
Here's the deal:
benchmark-ips has been immensely beneficial to the state of accurate Ruby benchmarking; however, I feel that in some cases it does lead its users astray. Namely, because it measures and reports SD and newly (#60) rightfully refuses to decide benchmarks on flimsy statistical evidence, this leads some developers to inaccurately conclude that 2 versions of the code being benchmarked are equal in performance, when in fact this is demonstrably false. It is not that the benchmark is undecidable, it is just their configuration of benchmark-ips that prevents them from arriving at the correct result.

In my experience (subjective anecdotal evidence for the win :innocent:), in the absence of hard time limits on benchmark duration, it is absolutely the case that noisy benchmarks can be tamed and decided. Ignoring variance while benchmarking is of course a fool's errand, but sometimes some noise is acceptable and we want to know which version of the code is faster (it is also usually the case that both versions have comparable SD, so it is not about plunging for the average while ignoring the latency/SD).
In this PR I propose to make the sample duration/interval (or, in other words, the sampling frequency) configurable. Additionally, in the case of statistically ambiguous results, benchmark-ips would now propose that the user tweak the configuration and attempt to re-run the experiment with a longer duration and decreased sampling frequency.

The PR is a spiritual successor to @chrisseaton's PR #60; let me know what you guys think.
Tested with MRI:
Implement suggested changes:
✨ 🎉