Add warnings, errors, and tips to benchmark report

jwoudenberg commented 7 years ago

In the same vein as the Elm compiler, it would be really nice if elm-benchmark gave us warnings, errors, and tips to help us write better benchmarks. From working with Brian a bit, I know he has tons of context on this, part of which could be automatically distributed in the benchmark report.

Below is an outline of some of Brian's tips I remember, to give an idea of the type of helpful messages that could be displayed.

Standard deviations are large:
- Run counts are low
  - "Try to make the benchmark run faster by narrowing down the code it runs to the part you're actually trying to benchmark."
  - "Perhaps you're not actually writing a micro-benchmark? Consider using a different tool. Here's a good one: <link>."
- Run counts are not low
  - "These programs are known to interfere with benchmark performance. Try closing them and running the benchmark again."
  - "Did none of the above work? Please report an issue to help us make elm-benchmark better! <link>"

Standard deviation is larger than delta:
- "The benchmark is inconclusive."
- Should the delta even be shown in this scenario?
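
To make the outline above a bit more concrete, here is a rough sketch of how those rules could map statistics to messages. It is written in Python for brevity, not Elm, and it is not how elm-benchmark works: the `Stats` record, the `advice` function, and both thresholds are hypothetical names and numbers picked purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Both cutoffs are assumptions; the thread does not define either of them.
LOW_RUN_COUNT = 100          # assumed cutoff for "run counts are low"
LARGE_RELATIVE_STDDEV = 0.1  # assumed cutoff for "standard deviations are large"


@dataclass
class Stats:
    """Hypothetical summary of one benchmark."""
    runs: int
    mean: float    # mean time per run
    stddev: float  # standard deviation of time per run


def advice(a: Stats, b: Optional[Stats] = None) -> List[str]:
    """Return report messages for benchmark a, or for a comparison of a and b."""
    messages = []
    for stats in (a, b):
        if stats is None or stats.mean <= 0:
            continue
        if stats.stddev / stats.mean > LARGE_RELATIVE_STDDEV:
            if stats.runs < LOW_RUN_COUNT:
                messages.append(
                    "Try to make the benchmark run faster by narrowing down "
                    "the code it runs to the part you're actually trying to "
                    "benchmark."
                )
            else:
                messages.append(
                    "Other programs may be interfering with the benchmark. "
                    "Try closing them and running the benchmark again."
                )
    # For a comparison: noise larger than the difference in means is inconclusive.
    if b is not None and max(a.stddev, b.stddev) > abs(a.mean - b.mean):
        messages.append("The benchmark is inconclusive.")
    return messages
```

For example, `advice(Stats(runs=1000, mean=1.0, stddev=0.05), Stats(runs=1000, mean=1.02, stddev=0.05))` returns only the "inconclusive" message, because the spread of each benchmark is larger than the difference between them.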

BrianHicks commented 6 years ago

FWIW I'm reducing these numbers down to two in the next version: runs per second and goodness of fit. Runs per second is pretty self-descriptive, but goodness of fit is not. In the new version, we vary sample size in order to generate a trend line, and goodness of fit is a measure of errors in the trend. It's expressed as a percentage, and higher is better. So this advice will end up close to:

- total samples are low (number TBD but related to samples/bucket): same advice as "run counts are low" above.
- 5% of buckets have points outside 2 sigma (exact numbers TBD): high outlier count, try re-running (just reloading the page will keep the JIT hot enough to avoid these, usually.)
- goodness of fit is less than 95%: there may be interference on the system. Try closing programs or tabs that are consuming significant system resources (Slack, Spotify are typical candidates) and re-running.
- goodness of fit is less than 85%: There's something really wrong, don't trust these results.
  1. same advice on closing heavy tabs or programs
  2. if that doesn't solve it, try increasing the sample time
  3. if that doesn't solve it, show up in #elm-benchmark on the Elm Slack and we'll try to get you sorted out. There's probably some error this tool can't detect, or we need to account for your system setup in the sampling approach.
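
As a rough illustration of the goodness-of-fit advice above, here is a Python sketch. It assumes goodness of fit means the coefficient of determination (R²) of an ordinary least-squares line through (sample size, elapsed time) points. The thread only says it is "a measure of errors in the trend", so that choice, along with the names `linear_fit`, `goodness_of_fit`, and `fit_advice`, is an assumption rather than the library's actual method. The 95% and 85% cutoffs come from the list above.

```python
from typing import List, Sequence, Tuple


def linear_fit(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares line through the points; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def goodness_of_fit(xs: Sequence[float], ys: Sequence[float]) -> float:
    """R² of the fitted line, as a fraction (assumed definition, see lead-in)."""
    slope, intercept = linear_fit(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot


def fit_advice(r_squared: float) -> List[str]:
    """Map goodness of fit to the advice from the list above."""
    if r_squared < 0.85:
        return [
            "There's something really wrong, don't trust these results.",
            "Try closing heavy tabs or programs and re-running.",
            "If that doesn't help, try increasing the sample time.",
        ]
    if r_squared < 0.95:
        return [
            "There may be interference on the system. Try closing programs "
            "or tabs that consume significant resources and re-running."
        ]
    return []
```

With `xs` as the number of runs per sample and `ys` as the elapsed time for each sample, a clean benchmark should give a value close to 1.0 (100%), and `fit_advice` returns an empty list.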

BrianHicks commented 6 years ago

Also, the new approach solves these in the following ways:

- Standard deviations are large: trends on linear data are about equally susceptible to this problem, but goodness of fit is a single intuitive metric that we can easily generate advice from.
- Standard deviation is larger than delta: varying sample size means we just don't have this problem… but we do have other ones. Goodness of fit measures this, too.
- Run counts are low: we're going to take more, smaller samples. This allows for larger benchmarks. It still has some problems, but hopefully in fewer cases.
- System interference: a few outliers should not throw everything off.

In addition, I'm adding lots of charts. Just looking at the data shows problems more often than you'd suspect: humans are very good at noticing "hey, that's weird..." and then not trusting the results. For example, I can show the points, which makes outliers easy to spot, as well as jags due to system spikes. If I show the trend line, it'll be obvious whether it's a good or bad fit (it's kinda susceptible to outliers).
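
For the "show the points" idea, a chart could also highlight points that sit far from the trend line. The sketch below, in Python, flags points whose residual from a least-squares line is more than two standard deviations of all the residuals. This is a simplified stand-in, not the per-bucket 2-sigma rule mentioned earlier in the thread (whose exact numbers are still TBD), and `outlier_indices` is a hypothetical name.

```python
from typing import List, Sequence


def outlier_indices(
    xs: Sequence[float], ys: Sequence[float], sigmas: float = 2.0
) -> List[int]:
    """Indices of points whose residual from the fitted line exceeds sigmas * sigma."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    # Residuals of a least-squares fit with an intercept have mean zero,
    # so their standard deviation is the root of the mean squared residual.
    sigma = (sum(r * r for r in residuals) / n) ** 0.5
    if sigma == 0:
        return []  # perfect fit, nothing to flag
    return [i for i, r in enumerate(residuals) if abs(r) > sigmas * sigma]
```

Because a single spike also drags the fitted line toward itself, this check inherits the outlier sensitivity of the trend line, so it is best treated as a hint for where to look on the chart and not as a verdict.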

BrianHicks commented 6 years ago

moved to elm-explorations/benchmark#4
