jeffbcross opened this issue 10 years ago (status: Open)
I think what would best represent the reliability of a test is the Coefficient of Variation, whose value is standard deviation / mean, expressed as a percentage.
Reasoning: the standard deviation by itself is only useful if you also know the mean. What does a standard deviation of 10 tell me if I don't know that the mean is 1500? The coefficient of variation gives us the standard deviation relative to the mean, i.e. a single number we can compare between versions of the code under test to check that results fall within the same range of reliability, and thus provide an appropriate level of confidence that the recorded sample metrics can reliably indicate changes in performance from one version to another.
Margin of error and confidence intervals aren't as important to the reliability metric as previously thought, as those are more useful for determining how well the sample represents the population.
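For concreteness, here is a minimal standalone Dart sketch of the proposed calculation (not the existing reporter code; the use of the sample (n-1) standard deviation is an assumption, though it matches the figures reported later in this thread):

```dart
import 'dart:math' as math;

/// Sample (n-1) standard deviation of [values].
double standardDeviation(List<double> values) {
  var mean = values.reduce((a, b) => a + b) / values.length;
  var sumOfSquares = 0.0;
  for (var v in values) {
    sumOfSquares += (v - mean) * (v - mean);
  }
  return math.sqrt(sumOfSquares / (values.length - 1));
}

/// Coefficient of variation: standard deviation relative to the mean,
/// expressed as a percentage.
double coefficientOfVariation(List<double> values) {
  var mean = values.reduce((a, b) => a + b) / values.length;
  return standardDeviation(values) / mean * 100;
}
```

A run could then be treated as comparable to a previous one when its coefficient of variation stays within some agreed threshold.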
Thank you for the explanation; sounds good.
@chirayuk could this be included in the metric server, or the benchmarks, or both?
(UPDATED RESULTS WITH CORRECTED RUN) After applying the coefficient of variation calculation to the report, here are the results from running the test 100+ times with a sample size of 100.
Averages:
- time: 656.96ms
- standard deviation: 48.58ms
- coefficient of variation: 7%
- gc: 80.718ms
- combined time+gc: 737.68ms
times: 632.32, 649.62, 659.68, 662.57, 662.32, 700.70, 666.90, 646.59, 643.66, 659.44, 647.97, 634.66, 625.29, 627.72, 629.50, 631.51, 638.58, 646.11, 650.14, 656.34, 662.45, 634.22, 642.01, 632.80, 629.86, 635.91, 626.16, 647.75, 627.33, 652.98, 647.17, 655.16, 644.71, 644.65, 749.35, 769.33, 982.54, 787.93, 842.00, 781.58, 694.50, 653.37, 675.30, 660.52, 677.45, 639.80, 639.40, 643.95, 624.15, 665.29, 643.05, 643.20, 633.13, 643.67, 645.02, 632.88, 641.18, 638.21, 656.90, 635.00, 650.10, 674.31, 693.53, 641.88, 644.86, 628.90, 606.66, 623.25, 621.96, 626.54, 687.60, 663.27, 665.25, 668.79, 633.55, 636.40, 640.38, 655.00, 636.89, 650.56, 617.80, 661.20, 626.28, 645.04, 646.89, 651.86, 641.59, 631.92, 642.30, 646.93, 646.43, 694.97, 672.86, 634.29, 625.54, 640.86, 640.70, 638.13, 654.68, 632.63
And here's a graph showing the times, the mean, and +/- coefficient of variation.
Noteworthy findings:
To show how results change with a different test configuration, here is the same test as in the previous comment run 25 times, sampling the last 20 runs.
Averages:
- time: 622.01ms
- standard deviation: 17.46ms
- coefficient of variation: 3%
- gc: 81.489ms
- combined time+gc: 703.50ms
times: 618.13, 650.21, 637.64, 636.38, 639.67, 606.36, 603.95, 603.82, 607.36, 626.80, 628.79, 665.70, 617.41, 617.57, 625.40, 622.12, 595.49, 617.82, 604.48, 615.14
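As a sanity check, the reported averages can be recomputed from the 20 sampled times above with a small standalone Dart snippet (again assuming the sample (n-1) standard deviation); it reproduces the 622.01ms mean, the ~17.46ms standard deviation, and a CV of ~2.8%, which rounds to the reported 3%:

```dart
import 'dart:math' as math;

void main() {
  var times = [
    618.13, 650.21, 637.64, 636.38, 639.67, 606.36, 603.95, 603.82, 607.36,
    626.80, 628.79, 665.70, 617.41, 617.57, 625.40, 622.12, 595.49, 617.82,
    604.48, 615.14,
  ];
  var mean = times.reduce((a, b) => a + b) / times.length;
  var sumOfSquares = 0.0;
  for (var t in times) {
    sumOfSquares += (t - mean) * (t - mean);
  }
  var stdDev = math.sqrt(sumOfSquares / (times.length - 1));
  var cv = stdDev / mean * 100;
  // Prints: mean: 622.01ms  stddev: 17.46ms  cv: 2.8%
  print('mean: ${mean.toStringAsFixed(2)}ms  '
      'stddev: ${stdDev.toStringAsFixed(2)}ms  '
      'cv: ${cv.toStringAsFixed(1)}%');
}
```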
My takeaways from the example test runs:
@jbdeboer suggests that Statistical Hypothesis Testing should be part of the overall strategy to determine if a test run is admissible.
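To make that concrete, one possible shape for such a check is a two-sample z-test comparing a candidate run against a baseline run. The sketch below is illustrative Dart, not existing reporter code, and the 95% threshold (|z| > 1.96) is an arbitrary choice:

```dart
import 'dart:math' as math;

/// Two-sample z-test: returns true if the difference between the means of
/// [a] and [b] is statistically significant at the 95% level (|z| > 1.96).
/// Assumes both samples are reasonably large and roughly normal.
bool meansDifferSignificantly(List<double> a, List<double> b) {
  double mean(List<double> xs) => xs.reduce((x, y) => x + y) / xs.length;
  double variance(List<double> xs) {
    var m = mean(xs);
    var sum = 0.0;
    for (var x in xs) {
      sum += (x - m) * (x - m);
    }
    return sum / (xs.length - 1);
  }

  var z = (mean(a) - mean(b)) /
      math.sqrt(variance(a) / a.length + variance(b) / b.length);
  return z.abs() > 1.96;
}
```

A comparison would then only report a regression (or improvement) when the difference between runs clears this bar.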
My two cents on how many runs there should be to get a significant result.
I think it is reasonable to assume that each of the performance tests follows a normal distribution, centered about some mean and dispersed with some deviation. Also, all of the tests can be executed independently, so a z-test could probably be used here. For n measurements at 95% confidence we get this expression: 1.96 = (sample_mean - real_mean) / (sample_deviation / sqrt(number_of_measurements)). Rearranging, and rounding 1.96 up to 2, we can express the required number of measurements as number_of_measurements ≈ (2 * sample_deviation / (sample_mean - real_mean))^2.
Now the problem is how to get real_mean. I don't think there is a way to know it short of doing infinitely many trials, but it should be reasonable to exclude outliers and treat the resulting mean as the real mean. So my take would be real_mean = median of the samples. What does the resulting expression mean? We need fewer trials when there is a bigger difference between the sample mean and its median (that is, when there are big outliers). I guess that makes sense, since if there are big jumps or drops in performance, we can detect them faster with higher confidence.
Also, it could be that this strategy is a poor choice (especially the real-mean estimation) :).
http://en.wikipedia.org/wiki/Z-test https://www.statstodo.com/ZTest_Tab.php
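A minimal Dart sketch of that estimate, using the median-as-real-mean heuristic proposed above; the function name and the guard for the degenerate case where mean and median coincide are my own additions:

```dart
import 'dart:math' as math;

/// Rough estimate of the number of measurements needed, following
/// n ≈ (2 * sample_deviation / (sample_mean - real_mean))^2,
/// with real_mean approximated by the sample median.
int estimateRequiredRuns(List<double> samples) {
  var mean = samples.reduce((a, b) => a + b) / samples.length;

  var sorted = List<double>.from(samples)..sort();
  var mid = sorted.length ~/ 2;
  var median = sorted.length.isOdd
      ? sorted[mid]
      : (sorted[mid - 1] + sorted[mid]) / 2;

  var sumOfSquares = 0.0;
  for (var s in samples) {
    sumOfSquares += (s - mean) * (s - mean);
  }
  var deviation = math.sqrt(sumOfSquares / (samples.length - 1));

  var delta = (mean - median).abs();
  // Degenerate case: mean equals median, so the formula gives no signal;
  // fall back to the current sample size.
  if (delta == 0) return samples.length;

  return math.pow(2 * deviation / delta, 2).ceil();
}
```

As noted above, the weakest link is treating the median as the real mean, so the resulting n is best read as a heuristic rather than a guarantee.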
Kind Regards, Tadas Šubonis
The benchmark reporter in benchmark/ currently runs and times a step n times, then samples a subset of those runs to get the mean time to run the step (as well as the mean time spent in gc, how much garbage is generated, and how much memory is retained). In order for tests to be used confidently to measure performance differences between test runs, more data is needed for the report: