In http://go/pinpoint-data, I have a chart showing how our two significance thresholds divide the results into three regions. From top to bottom: the two distributions are the same, we need more data, and the two distributions are different.
One thing that's bothered me about this diagram is that the top threshold is a straight line that doesn't really match the shape of the curves below it.
Instead, I ran 100k simulations of MWU p-values, drawing from two normal distributions that differ by 1σ, then took the 90th, 99th, and 99.9th percentiles of the simulated p-values at each repeat count. That is, 99.9% of results (assuming the above distributions) fall under the top curve in the chart below. The higher the percentile, the more the curve deviates from a straight line.
The old threshold is shown here as a dashed line for comparison. (The bottom threshold can remain the same. As before, 99.9% of results will fall above the straight line at 0.001.)
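The decision rule implied by the two thresholds can be sketched as follows. The names and the threshold values are hypothetical (the real upper curve comes from the simulated percentiles); only the three-region logic is from the text above.

```python
def classify(p_value, repeat_count, upper_curve, lower=0.001):
    """Classify a Mann-Whitney U p-value into one of three regions.

    upper_curve: hypothetical mapping from repeat_count to the simulated
    99.9th-percentile p-value (the curved top threshold).
    lower: the straight bottom threshold, unchanged at 0.001.
    """
    if p_value >= upper_curve[repeat_count]:
        return 'same'        # above the top curve: distributions match
    if p_value <= lower:
        return 'different'   # below the bottom line: real difference
    return 'need more data'  # between thresholds: keep collecting samples

# Illustrative threshold for repeat_count == 10.
curve = {10: 0.5}
print(classify(0.9, 10, curve))     # same
print(classify(0.0005, 10, curve))  # different
print(classify(0.05, 10, curve))    # need more data
```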
The simulation code I ran is here.
import sys
import numpy
from scipy import stats
from dashboard.pinpoint.models import mann_whitney_u

# 100k simulated pairs of samples, one drawn from N(0, 1) and one from
# N(1, 1), i.e. two normal distributions that differ by 1 sigma.
data = []
for _ in range(100000):
  data.append(([], []))

# Grow every pair by one value per repeat and record the MWU p-values.
for repeat_count in range(1, 121):
  p_values = []
  for a, b in data:
    a.append(stats.norm.rvs())   # N(0, 1)
    b.append(stats.norm.rvs(1))  # N(1, 1)
    p_values.append(mann_whitney_u.MannWhitneyU(a, b))
  p900 = str(numpy.percentile(p_values, 90))
  p990 = str(numpy.percentile(p_values, 99))
  p999 = str(numpy.percentile(p_values, 99.9))
  print('\t'.join((str(repeat_count), p900, p990, p999)))
  sys.stdout.flush()
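The per-repeat percentiles printed above could then serve as a lookup table for the curved upper threshold, e.g. via linear interpolation. This is a sketch under that assumption; the numbers below are made up for shape and are not the real simulation output.

```python
import numpy as np

# Hypothetical (repeat_count, 99.9th-percentile p-value) pairs; the real
# curve would come from the simulation's printed output.
repeat_counts = np.array([1, 10, 30, 60, 120])
p999_curve = np.array([1.0, 0.9, 0.7, 0.5, 0.3])

def upper_threshold(repeat_count):
    # Linearly interpolate between simulated points; np.interp clamps
    # to the endpoint values outside the simulated range.
    return float(np.interp(repeat_count, repeat_counts, p999_curve))

print(upper_threshold(10))   # 0.9: exactly at a simulated point
print(upper_threshold(20))   # 0.8: halfway between 0.9 and 0.7
print(upper_threshold(200))  # 0.3: clamped past the last point
```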
@perezju @simonhatch @anniesullie