catapult-project / catapult

Deprecated Catapult GitHub. Please instead use the http://crbug.com "Speed>Benchmarks" component for bugs and https://chromium.googlesource.com/catapult for downloading and editing source code.
https://chromium.googlesource.com/catapult
BSD 3-Clause "New" or "Revised" License

[📍] Adjust significance thresholds #4508

Closed dave-2 closed 6 years ago

dave-2 commented 6 years ago

In http://go/pinpoint-data, I have this chart showing how our two significance thresholds divide the results into three sections. From top to bottom: the two distributions are the same, we need more data, and the two distributions are different.

[image: chart of the two significance thresholds]

One thing that has bothered me about this diagram is that the top threshold is a straight line, which doesn't really match the shape of the curves below it.

Instead, I ran 100k simulations of Mann-Whitney U (MWU) p-values, drawing from two normal distributions whose means differ by 1σ, and took the 90th, 99th, and 99.9th percentiles of the p-values at each sample size. That is, 99.9% of results (assuming the above distributions) will fall under the top curve in the chart below. The higher the percentile, the more the curve differs from a straight line.

The old threshold is shown here as a dashed line for comparison. (The bottom threshold can remain the same. As before, 99.9% of results will fall above the straight line at 0.001.)

[image: simulated percentile curves, with the old threshold as a dashed line]
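As a rough illustration of how two thresholds split results into the three sections described above, here is a minimal sketch. The function name `classify`, the labels, and the `upper_thresholds` lookup table are hypothetical and not taken from Pinpoint's code; only the lower threshold of 0.001 comes from this issue. The upper values in the usage lines are placeholders, not simulation results.

```python
LOWER_THRESHOLD = 0.001  # From the issue: the bottom threshold stays a straight line.


def classify(p_value, sample_size, upper_thresholds):
  """Returns 'different', 'same', or 'unknown' for one comparison.

  upper_thresholds is a hypothetical dict mapping a sample size to the upper
  p-value threshold for that size, e.g. read off a simulated percentile curve.
  """
  if p_value < LOWER_THRESHOLD:
    return 'different'
  if p_value > upper_thresholds[sample_size]:
    return 'same'
  return 'unknown'  # Between the thresholds: we need more data.


# Placeholder thresholds for illustration only (not real simulation output).
placeholder_curve = {10: 0.9, 20: 0.5}
for p_value, sample_size in [(0.0005, 10), (0.95, 10), (0.6, 10), (0.6, 20)]:
  print(p_value, sample_size, classify(p_value, sample_size, placeholder_curve))
```

Keying the upper threshold on sample size is what lets it follow a curve rather than a straight line: the same p-value can mean "need more data" at a small sample size and "same" at a larger one.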

Here is the simulation code I ran:

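import sys

import numpy
from scipy import stats

from dashboard.pinpoint.models import mann_whitney_u

# One (a, b) pair of growing samples per simulation.
data = [([], []) for _ in range(100000)]

for repeat_count in range(1, 121):
  p_values = []
  for a, b in data:
    # Add one draw to each sample: a ~ N(0, 1) and b ~ N(1, 1),
    # so the two distributions differ by 1 sigma.
    a.append(stats.norm.rvs())
    b.append(stats.norm.rvs(1))
    p_values.append(mann_whitney_u.MannWhitneyU(a, b))

  # Percentiles of the p-values at this sample size.
  p900 = str(numpy.percentile(p_values, 90))
  p990 = str(numpy.percentile(p_values, 99))
  p999 = str(numpy.percentile(p_values, 99.9))
  # One tab-separated row per sample size; the parenthesized print
  # works on both Python 2 and Python 3.
  print('\t'.join((str(repeat_count), p900, p990, p999)))
  sys.stdout.flush()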
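
For anyone without a Catapult checkout, a smaller Python 3 sketch of the same simulation might look like the following. It swaps in `scipy.stats.mannwhitneyu` for Pinpoint's `mann_whitney_u.MannWhitneyU` and assumes a two-sided test, so the p-values may differ somewhat from the ones plotted above. The function name `simulate_percentiles` and its parameters are hypothetical, and the defaults are kept far below 100k simulations so that it finishes quickly.

```python
import numpy as np
from scipy import stats


def simulate_percentiles(num_simulations=200, sample_sizes=(5, 10, 20),
                         shift=1.0, seed=0):
  """Returns (sample_size, p90, p99, p99.9) rows of simulated MWU p-values."""
  rng = np.random.default_rng(seed)
  max_size = max(sample_sizes)
  # Draw every sample up front: a ~ N(0, 1), b ~ N(shift, 1).
  a = rng.normal(0.0, 1.0, size=(num_simulations, max_size))
  b = rng.normal(shift, 1.0, size=(num_simulations, max_size))

  rows = []
  for n in sample_sizes:
    # Use the first n draws of each simulation, so samples grow with n.
    p_values = [
        stats.mannwhitneyu(a[i, :n], b[i, :n], alternative='two-sided').pvalue
        for i in range(num_simulations)
    ]
    p900, p990, p999 = np.percentile(p_values, [90, 99, 99.9])
    rows.append((n, p900, p990, p999))
  return rows


if __name__ == '__main__':
  for row in simulate_percentiles():
    print('\t'.join(str(value) for value in row))
```

Reusing a prefix of each pre-drawn sample mirrors the original script, where every simulation keeps its earlier draws and appends one more per repeat count. With only a few hundred simulations, the 99.9th percentile is a noisy estimate; the original's 100k runs are what make that curve meaningful.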

@perezju @simonhatch @anniesullie

nedn commented 6 years ago

/sub

Discussed offline: this would also help a lot with reducing the runtime of functional bisect.