Approximate ratio masks significant benchmark changes

h-vetinari commented 6 years ago

ASV sometimes adds a ~ to ratios if there are many sample (with high variance, I'm guessing), but there's a bug somewhere how these results are evaluated when the significant changes are given out.

As an example from running the ASV in a pandas PR:

(pandas-dev) C:\[...]\pddev\asv_bench>asv continuous -f 1.1 upstream/master 0d3c6d21 -b ^strings.Cat
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Building 0d3c6d21 <cln_str_cat~2> for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Installing 0d3c6d21 <cln_str_cat~2> into conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
· Running 2 total benchmarks (2 commits * 1 environments * 1 benchmarks)
[  0.00%] · For pandas commit b28cf5aa <master> (round 1/2):
[  0.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  0.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ··· Running (strings.Cat.time_cat--).
[ 25.00%] · For pandas commit 0d3c6d21 <cln_str_cat~2> (round 1/2):
[ 25.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 25.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 50.00%] ··· Running (strings.Cat.time_cat--).
[ 50.00%] · For pandas commit 0d3c6d21 <cln_str_cat~2> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ··· strings.Cat.time_cat                                                                                                                                                ok
[ 75.00%] ··· ============ ====== ============ ============== ============= ========== ============ ===========
              --                                                 na_rep / na_frac
              ------------------- -----------------------------------------------------------------------------
               other_cols   sep    None / 0.0   None / 0.001   None / 0.15   - / 0.0    - / 0.001     - / 0.15
              ============ ====== ============ ============== ============= ========== ============ ===========
                   0        None    7.81±2ms     10.9±0.8ms     11.7±0.8ms   9.38±0ms   10.9±0.6ms    12.5±0ms
                   0         ,      9.38±0ms      10.9±0ms      11.7±0.8ms   9.38±0ms    10.9±0ms     12.5±0ms
                   3        None    46.9±8ms      62.5±0ms       62.5±6ms    54.7±8ms    62.5±8ms    78.1±10ms
                   3         ,      78.1±0ms      78.1±0ms       62.5±8ms    78.1±8ms    78.1±0ms     93.8±0ms
              ============ ====== ============ ============== ============= ========== ============ ===========

[ 75.00%] · For pandas commit b28cf5aa <master> (round 2/2):
[ 75.00%] ·· Building for conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 75.00%] ·· Benchmarking conda-py3.6-Cython-matplotlib-numexpr-numpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[100.00%] ··· strings.Cat.time_cat                                                                                                                                                ok
[100.00%] ··· ============ ====== ============ ============== ============= ========== ============ ==========
              --                                                na_rep / na_frac
              ------------------- ----------------------------------------------------------------------------
               other_cols   sep    None / 0.0   None / 0.001   None / 0.15   - / 0.0    - / 0.001    - / 0.15
              ============ ====== ============ ============== ============= ========== ============ ==========
                   0        None    10.9±0ms      9.38±0ms      10.9±0.8ms   9.38±0ms   9.38±0.6ms   10.9±0ms
                   0         ,     9.38±0.6ms    14.1±0.8ms     14.1±0.6ms   9.38±0ms    9.38±0ms    10.9±0ms
                   3        None    62.5±6ms      46.9±6ms       46.9±0ms    54.7±8ms    46.9±8ms    62.5±0ms
                   3         ,      46.9±6ms      46.9±8ms       46.9±6ms    78.1±0ms    70.3±8ms    93.8±8ms
              ============ ====== ============ ============== ============= ========== ============ ==========

BENCHMARKS NOT SIGNIFICANTLY CHANGED.

Capital letters telling me everything is fine? Not so much unfortunately... (stars added manually)

(pandas-dev) C:\[...]\pddev\asv_bench>asv compare b28cf5aa 0d3c6d21

All benchmarks:

       before           after         ratio
     [b28cf5aa]       [0d3c6d21]
     <master>         <cln_str_cat~2>
         9.38±0ms         9.38±0ms       1.00    strings.Cat.time_cat(0, ',', '-', 0.0)
         9.38±0ms         10.9±0ms    **~1.17**  strings.Cat.time_cat(0, ',', '-', 0.001)
         10.9±0ms         12.5±0ms    **~1.14**  strings.Cat.time_cat(0, ',', '-', 0.15)
       9.38±0.6ms         9.38±0ms       1.00    strings.Cat.time_cat(0, ',', None, 0.0)
       14.1±0.8ms         10.9±0ms    **~0.78**  strings.Cat.time_cat(0, ',', None, 0.001)
       14.1±0.6ms       11.7±0.8ms    **~0.83**  strings.Cat.time_cat(0, ',', None, 0.15)
         9.38±0ms         9.38±0ms       1.00    strings.Cat.time_cat(0, None, '-', 0.0)
       9.38±0.6ms       10.9±0.6ms    **~1.17**  strings.Cat.time_cat(0, None, '-', 0.001)
         10.9±0ms         12.5±0ms    **~1.14**  strings.Cat.time_cat(0, None, '-', 0.15)
         10.9±0ms         7.81±2ms    **~0.71**  strings.Cat.time_cat(0, None, None, 0.0)
         9.38±0ms       10.9±0.8ms    **~1.17**  strings.Cat.time_cat(0, None, None, 0.001)
       10.9±0.8ms       11.7±0.8ms       1.07    strings.Cat.time_cat(0, None, None, 0.15)
         78.1±0ms         78.1±8ms       1.00    strings.Cat.time_cat(3, ',', '-', 0.0)
         70.3±8ms         78.1±0ms    **~1.11**  strings.Cat.time_cat(3, ',', '-', 0.001)
         93.8±8ms         93.8±0ms       1.00    strings.Cat.time_cat(3, ',', '-', 0.15)
         46.9±6ms         78.1±0ms    **~1.67**  strings.Cat.time_cat(3, ',', None, 0.0)
         46.9±8ms         78.1±0ms    **~1.67**  strings.Cat.time_cat(3, ',', None, 0.001)
         46.9±6ms         62.5±8ms    **~1.33**  strings.Cat.time_cat(3, ',', None, 0.15)
         54.7±8ms         54.7±8ms       1.00    strings.Cat.time_cat(3, None, '-', 0.0)
         46.9±8ms         62.5±8ms    **~1.33**  strings.Cat.time_cat(3, None, '-', 0.001)
         62.5±0ms        78.1±10ms    **~1.25**  strings.Cat.time_cat(3, None, '-', 0.15)
         62.5±6ms         46.9±8ms    **~0.75**  strings.Cat.time_cat(3, None, None, 0.0)
         46.9±6ms         62.5±0ms    **~1.33**  strings.Cat.time_cat(3, None, None, 0.001)
         46.9±0ms         62.5±6ms    **~1.33**  strings.Cat.time_cat(3, None, None, 0.15)

It just so happened that all the significant changes were hiding behind a tilde...

PS. I just installed from github a few days ago, I'm on 0.4.dev1533+5f9be296.

pv commented 6 years ago

You can do a run with --record-samplesand then use asv show --details to inspect the actually measured values. Without those, hard to see there's a bug... More likely, the measurement results are just too noisy to claim there has been a significant change in the result --- i.e. not changed significantly.

h-vetinari commented 6 years ago

OK, it could of course be that it was intentional - forget the word "bug" then.

But if something increases dramatically (noisy or not), most people likely really would like to know (and would be able to determine themselves if the noise is relevant or not), rather than falsely be placated by reading BENCHMARKS NOT SIGNIFICANTLY CHANGED.

I mean, isn't that the whole point of the cut-off (e.g.) -f 1.1? To throw away small small differences mostly due to noise? If something is suddenly 67% slower, I think that's relevant to know.

pv commented 6 years ago

Because of the noise, you don't know if something increased dramatically: the values printed are only the medians. If there's too much spread in the results to even state with confidence the two measurements gave different results, it's not possible to state there was a change larger than some ratio.

There can be fine tuning of the statistics, but for that specific results with the measured values would be needed.

h-vetinari commented 6 years ago

Sorry for the late reply.

Because of the noise, you don't know if something increased dramatically: the values printed are only the medians. If there's too much spread in the results to even state with confidence the two measurements gave different results, it's not possible to state there was a change larger than some ratio.

My point is that - despite the spread - the increase in the medians might be a real regression, and hiding those cases by default, and without an option to show them gives a false sense of security to the people using ASV. Maybe someone wants to be strict on those noisy cases as well...?

How about a parameter that allows to include these tests, either through a boolean, or through setting some threshold for the variance etc.?

pv commented 6 years ago

If the increase of median is not statistically significant, it is not statistically significant, and most likely just spurious reading. You will also get false positives this way.

The option to show everything already exists: asv continuous --no-only-changed.

What you probably are asking for is an option to provide a three-state report, change/no change/unsure.

An option to simply ignore statistics can also be added in principle.

airspeed-velocity / asv

Approximate ratio masks significant benchmark changes #752