
Use regression size information in Pinpoint #3936

Open dave-2 opened 6 years ago

dave-2 commented 6 years ago

Background

Pinpoint has two parameters that can be used to adjust the sensitivity of its statistics.

Significance level

The threshold probability at which we say "these samples are different". A higher threshold means that we need less evidence to say that two samples are different, thus increasing the sensitivity to regressions that are small compared to the noise.

However, the significance level is also the false positive rate. For this reason, I've fixed it at 0.001 and do not permit users to increase it. As an alternative, we let users increase the repeat count.
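To make the "significance level is the false positive rate" point concrete, here is a small simulation sketch (illustrative only, not Pinpoint code). It assumes scipy is available and uses a looser 0.01 threshold so the effect shows up within a few thousand trials: when we compare many pairs of samples drawn from the same distribution, roughly that fraction of comparisons come out "significant" even though nothing changed.

```python
import random
from scipy import stats

random.seed(0)
ALPHA = 0.01  # looser than Pinpoint's 0.001 so a few thousand trials suffice

trials = 5000
false_positives = 0
for _ in range(trials):
    # Both samples come from the same distribution: any "difference" is spurious.
    a = [random.gauss(0, 1) for _ in range(15)]
    b = [random.gauss(0, 1) for _ in range(15)]
    if stats.mannwhitneyu(a, b, alternative='two-sided').pvalue < ALPHA:
        false_positives += 1

print(false_positives / trials)  # approximately ALPHA
```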

Repeat count

When there is a real difference, increasing the repeat count increases the amount of evidence that the samples are different. That drives the p-value down, makes it more likely we'll fall below the significance level, and so increases the sensitivity to small regressions.

Mann-Whitney U generally doesn't produce p-values less than 0.001 for sample sizes less than ~8, so that's our effective minimum sample size. Currently, the default repeat count is 15.
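The ~8 figure can be illustrated with a quick sketch (not Pinpoint's code; it assumes a recent scipy and the normal-approximation variant of the test, so the exact cutoff may differ by implementation): even for two perfectly separated samples, the p-value cannot drop below 0.001 until the sample size reaches about 8.

```python
# Two perfectly separated samples: the strongest evidence the test can ever see.
from scipy import stats

SIGNIFICANCE_LEVEL = 0.001

for n in range(4, 12):
    low = list(range(n))             # "before" values
    high = [x + 1000 for x in low]   # an enormous, unambiguous regression
    p = stats.mannwhitneyu(low, high, use_continuity=True,
                           alternative='two-sided', method='asymptotic').pvalue
    print(n, round(p, 6), p < SIGNIFICANCE_LEVEL)
```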

Why can't we just always use a high repeat count?

It's not just that it takes longer to run the tests. It can also produce incorrect results.

We've found that metrics can have two kinds of variability.

The first is that one metric can produce a lot of values with a wide range within a test run (e.g. frame_times). I'll call this "intra-task noise".

The other is that a metric can vary a lot between runs. I'll call this "inter-task noise". It could be that the metric is very sensitive to minor changes in Chrome, or the state of the hardware it's running on (which describes all performance tests to some extent).

If we increase the repeat count, we become more robust to "intra-task noise", but we also become more sensitive to "inter-task noise", and therefore identify false culprits more often.
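As an illustration of why this happens, here is a toy model with made-up numbers (not real benchmark data): suppose a commit that is not the real culprit nudges the metric slightly through inter-task effects. At a small repeat count the nudge is lost in the noise; at a large repeat count the comparison typically has enough power to flag it, and we would report a false culprit.

```python
import random
from scipy import stats

random.seed(0)

def run_benchmark(base, shift, repeats, noise_sd=2.0):
    # One value per repeat: the per-run mean, i.e. what survives after
    # intra-task averaging; 'shift' models an inter-task effect.
    return [base + shift + random.gauss(0, noise_sd) for _ in range(repeats)]

for repeats in (8, 50, 200):
    before = run_benchmark(base=100.0, shift=0.0, repeats=repeats)
    # An *innocent* commit that nudges the metric via inter-task effects:
    after = run_benchmark(base=100.0, shift=1.0, repeats=repeats)
    p = stats.mannwhitneyu(before, after, alternative='two-sided').pvalue
    print(repeats, 'p =', round(p, 5), 'flagged:', p < 0.001)
```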

What we want

The dashboard can provide information to Pinpoint about the expected size of the regression. Pinpoint runs the minimum number of iterations (8) and estimates the amount of intra-task noise the metric has. If the size of the regression is small compared to the noise, it can automatically increase the repeat count to increase the sensitivity to small regressions.

This is particularly valuable for functional bisects, where the repeat count needed varies greatly with how flaky the test is.
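A rough sketch of what that logic could look like follows. The function name, the cap, and the (noise / effect)^2 scaling rule are all assumptions made purely to make the idea concrete; the real design would need proper power calculations.

```python
import statistics

MIN_REPEATS = 8    # effective minimum, per the Mann-Whitney discussion above
MAX_REPEATS = 100  # made-up cap

def choose_repeat_count(initial_values, expected_regression):
    """initial_values: metric values from the first MIN_REPEATS iterations.
    expected_regression: size of the regression reported by the dashboard."""
    noise = statistics.stdev(initial_values)
    if noise == 0 or abs(expected_regression) >= noise:
        return MIN_REPEATS
    # Untuned heuristic: repeats needed grows roughly with (noise / effect)^2.
    ratio = noise / abs(expected_regression)
    return min(int(MIN_REPEATS * ratio ** 2), MAX_REPEATS)

# A regression of 2 units on a metric whose first 8 runs spread by ~3 units
# asks for more than the minimum number of repeats.
print(choose_repeat_count([100, 98, 103, 97, 105, 99, 101, 96],
                          expected_regression=2.0))
```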

@nedn @zeptonaut @randalnephew Some background on statistics in bisect.

nedn commented 6 years ago

Wow, thanks for the great write-up, Dave!

If we increase the repeat count, we become more robust to "intra-task noise", but we also become more sensitive to "inter-task noise", and therefore identify false culprits more often.

I wonder whether this is because we use Mann-Whitney? Otherwise, if we just used the average, the more repeats we have, the smaller the variance of the average value would be, right?

dave-2 commented 6 years ago

@nedn That's theoretically true whether we use MWU or some other comparison of the averages, so it shouldn't make a difference there.

The results seem to show that sometimes the "inter-task noise" is stronger than the averaging. I'm not sure if that's still true if we do dozens or hundreds of repeats across many different devices, though. We'll need to do some more research. I also look forward to getting more devices so we can shard the iterations across more devices.

I do think we have some metrics that improve and regress with even minor, seemingly unrelated changes to Chrome. In that case, no amount of repeats would average out the changes. Microbenchmarks tend to show this effect more, but again, need more research to confirm.
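To make that last point concrete, a toy example with made-up numbers: if a seemingly unrelated change adds a fixed offset to the metric, more repeats shrink the sampling error of each mean, but the difference of means converges to the offset itself rather than to zero, so no repeat count averages it away.

```python
import random

random.seed(1)
DELTA = 3.0  # fixed metric shift caused by a seemingly unrelated change (made up)

for repeats in (10, 1000, 100000):
    before = [100 + random.gauss(0, 10) for _ in range(repeats)]
    after = [100 + DELTA + random.gauss(0, 10) for _ in range(repeats)]
    diff = sum(after) / repeats - sum(before) / repeats
    print(repeats, round(diff, 2))  # converges to DELTA, never to 0
```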

zeptonaut commented 6 years ago

Inter-task noise seems to be what I'm most familiar with, and I'm not 100% sure I understand what intra-task noise is (possibly because I'm not familiar with the frame_times metric other than having heard the name a few times).

Is there any chance you could try to explain intra-task noise more to me?