[📍] Incorrect results with metrics that produce hundreds of values

dave-2 commented 6 years ago

Problem

https://pinpoint-dot-chromeperf.appspot.com/job/15601101f80000 https://pinpoint-dot-chromeperf.appspot.com/job/14c618e9f80000

Here, we have two bisects on frame_times with culprits that were likely incorrect, producing more culprits than expected.

Diagnosis

frame_times produces 400 values per story run. With 15 repeats, we have 6000 values in each sample. The metric is a little noisy across runs, and it looks like the high number of samples gives the statistical comparison a false sense of confidence. While it's "technically correct" that the distributions ARE different (due to hardware or background differences), that's not the way we want to interpret the data.

Solutions

Maybe we should fudge the data a little bit when doing the statistical comparison. I'm hesitant to take the average of the 400-value samples, leading to an aggregate sample of 15 values, like recipe bisect did. I can think of two alternative approaches that trim the data while still retaining the shape.

In both cases, rather than just taking the first N values or N random values, I think it's safer to take a uniform, deterministic sample.

Option 1

We can trim the samples down to 100 values per story run. This means that there would be 1500 samples in aggregate. We could do that in read_value._ReadHistogramJsonValueExecution._Poll():

    result_values = tuple(result_values[i * len(result_values) / _MAX_SAMPLE_SIZE]
                          for i in xrange(_MAX_SAMPLE_SIZE))
    self._Complete(result_values=result_values)

Option 2

We can trim the 6000 samples to 1000 after aggregating across story runs. We could do that in job.Job._Compare():

      if len(values_a) > 1000:
        values_a = [values_a[i * len(values_a) / _MAX_SAMPLE_SIZE]
                    for i in xrange(_MAX_SAMPLE_SIZE)]
      if len(values_b) > 1000:
        values_b = [values_b[i * len(values_b) / _MAX_SAMPLE_SIZE]
                    for i in xrange(_MAX_SAMPLE_SIZE)]
      if _CompareValues(values_a, values_b) == _DIFFERENT:
        return _DIFFERENT

Pros and cons

In option 1, we only store the trimmed values. So the con is that the user would not be able to get the full data set. The trimming is invisible to them. The pro is that the Job has a lot less data, so it will be a lot smaller in the datastore and load significantly faster. In option 2, we only trim for the comparison, and store the full data set.

With "automatically increasing repeat count" (#3964) implemented: In option 1, the sample sizes continue to increase as the repeat count increases, whereas in option 2 they never exceed 1000 (for comparison). It's not clear to me which one will give better results.

Currently leaning towards option 1.

@anniesullie @simonhatch @perezju

perezju commented 6 years ago

Hmm, I'm not very happy with either solution, specially no. 1 because it leads to invisible data loss. I also think there are two different problems we're trying to solve here:

When we have too many data points; our current test is too sensitive and declares "different" even for distributions which we would like to characterize as "equal" nevertheless. Maybe the solution here would be to adjust our p-value thresholds instead?
When we have too many data points, they take up more memory become more costly to store and slower to load. Here maybe the solution is to store them as histograms instead? The whole point of histograms was to help in these particular cases.

On a semi-related note: on the little value histograms shown when clicking on a point, can we make them use fixed x/y-axes? When looking at your example jobs I was clicking back and forth to compare between two of them and thought: "hey, they are different! they even have a different mean!" but then noted that the x-axis was moving as well, so they we'rent that different after all.

Also would be awesome if we could somehow select two points and see an overlap of their two histograms.

dave-2 commented 6 years ago

https://pinpoint-dot-chromeperf.appspot.com/job/12a5bbd8040000

dave-2 commented 6 years ago

Adjusting the p-value thresholds sounds significantly more complicated because we'd then have to figure out how to tune the p-value thresholds to the number of data points. I'm really not sure what thresholds would be appropriate.

The histogram axes are issue #3888. My plan is to align them with the middle chart (which requires the ability to fully zoom out, #3877) and always show a comparison with the previous point when you have a point selected.

dave-2 commented 6 years ago

https://pinpoint-dot-chromeperf.appspot.com/job/14a634ba040000 https://pinpoint-dot-chromeperf.appspot.com/job/15b053ca040000 https://pinpoint-dot-chromeperf.appspot.com/job/17c0b0ea040000 https://pinpoint-dot-chromeperf.appspot.com/job/11d5d80a040000

dave-2 commented 6 years ago

Here's some on decode-lossless-webp, which produces something like 20 values per run. https://pinpoint-dot-chromeperf.appspot.com/job/169a2b17040000 https://pinpoint-dot-chromeperf.appspot.com/job/14c2e0f7040000

catapult-project / catapult