jeffbcross opened this issue 10 years ago (status: Open)
I think what would best represent the reliability of a test is the Coefficient of Variation, whose value is standard deviation / mean, expressed as a percentage.
Reasoning: the standard deviation by itself is only useful if you also know the mean. What does a standard deviation of 10 tell me if I don't know that the mean is 1500? The coefficient of variation gives us the standard deviation relative to the mean, i.e. a single number we can compare between versions of the code under test to check that results fall within the same range of reliability, and thus provide an appropriate level of confidence that the recorded sample metrics can reliably indicate changes in performance from one version to another.
Margin of error and confidence intervals aren't as important to the reliability metric as previously thought, as those are more useful for determining how well the sample represents the population.
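For concreteness, here is a minimal standalone Dart sketch of the proposed calculation (not the existing reporter code; the use of the sample (n-1) standard deviation is an assumption, though it matches the figures reported later in this thread):

```dart
import 'dart:math' as math;

/// Sample (n-1) standard deviation of [values].
double standardDeviation(List<double> values) {
  var mean = values.reduce((a, b) => a + b) / values.length;
  var sumOfSquares = 0.0;
  for (var v in values) {
    sumOfSquares += (v - mean) * (v - mean);
  }
  return math.sqrt(sumOfSquares / (values.length - 1));
}

/// Coefficient of variation: standard deviation relative to the mean,
/// expressed as a percentage.
double coefficientOfVariation(List<double> values) {
  var mean = values.reduce((a, b) => a + b) / values.length;
  return standardDeviation(values) / mean * 100;
}
```

A run could then be treated as comparable to a previous one when its coefficient of variation stays within some agreed threshold.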
Thank you for the explanation; sounds good.
@chirayuk could this be included in the metric server, or the benchmarks, or both?
(UPDATED RESULTS WITH CORRECTED RUN) After applying the coefficient of variation calculation to the report, here are the results from running the test 100+ times with a sample size of 100.
Averages:
- time: 656.96ms
- standard deviation: 48.58ms
- coefficient of variation: 7%
- gc: 80.718ms
- combined time+gc: 737.68ms
times: 632.32, 649.62, 659.68, 662.57, 662.32, 700.70, 666.90, 646.59, 643.66, 659.44, 647.97, 634.66, 625.29, 627.72, 629.50, 631.51, 638.58, 646.11, 650.14, 656.34, 662.45, 634.22, 642.01, 632.80, 629.86, 635.91, 626.16, 647.75, 627.33, 652.98, 647.17, 655.16, 644.71, 644.65, 749.35, 769.33, 982.54, 787.93, 842.00, 781.58, 694.50, 653.37, 675.30, 660.52, 677.45, 639.80, 639.40, 643.95, 624.15, 665.29, 643.05, 643.20, 633.13, 643.67, 645.02, 632.88, 641.18, 638.21, 656.90, 635.00, 650.10, 674.31, 693.53, 641.88, 644.86, 628.90, 606.66, 623.25, 621.96, 626.54, 687.60, 663.27, 665.25, 668.79, 633.55, 636.40, 640.38, 655.00, 636.89, 650.56, 617.80, 661.20, 626.28, 645.04, 646.89, 651.86, 641.59, 631.92, 642.30, 646.93, 646.43, 694.97, 672.86, 634.29, 625.54, 640.86, 640.70, 638.13, 654.68, 632.63
And here's a graph showing the times, the mean, and +/- coefficient of variation.
Noteworthy findings:
To show how results change with a different test configuration, here is the same test as in the previous comment run 25 times, sampling the last 20 runs.
Averages:
- time: 622.01ms
- standard deviation: 17.46ms
- coefficient of variation: 3%
- gc: 81.489ms
- combined time+gc: 703.50ms
times: 618.13, 650.21, 637.64, 636.38, 639.67, 606.36, 603.95, 603.82, 607.36, 626.80, 628.79, 665.70, 617.41, 617.57, 625.40, 622.12, 595.49, 617.82, 604.48, 615.14
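As a sanity check, the reported averages can be recomputed from the 20 sampled times above with a small standalone Dart snippet (again assuming the sample (n-1) standard deviation); it reproduces the 622.01ms mean, the ~17.46ms standard deviation, and a CV of ~2.8%, which rounds to the reported 3%:

```dart
import 'dart:math' as math;

void main() {
  var times = [
    618.13, 650.21, 637.64, 636.38, 639.67, 606.36, 603.95, 603.82, 607.36,
    626.80, 628.79, 665.70, 617.41, 617.57, 625.40, 622.12, 595.49, 617.82,
    604.48, 615.14,
  ];
  var mean = times.reduce((a, b) => a + b) / times.length;
  var sumOfSquares = 0.0;
  for (var t in times) {
    sumOfSquares += (t - mean) * (t - mean);
  }
  var stdDev = math.sqrt(sumOfSquares / (times.length - 1));
  var cv = stdDev / mean * 100;
  // Prints: mean: 622.01ms  stddev: 17.46ms  cv: 2.8%
  print('mean: ${mean.toStringAsFixed(2)}ms  '
      'stddev: ${stdDev.toStringAsFixed(2)}ms  '
      'cv: ${cv.toStringAsFixed(1)}%');
}
```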
My takeaways from the example test runs:
@jbdeboer suggests that Statistical Hypothesis Testing should be part of the overall strategy to determine if a test run is admissible.
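To make that concrete, one possible shape for such a check is a two-sample z-test comparing a candidate run against a baseline run. The sketch below is illustrative Dart, not existing reporter code, and the 95% threshold (|z| > 1.96) is an arbitrary choice:

```dart
import 'dart:math' as math;

/// Two-sample z-test: returns true if the difference between the means of
/// [a] and [b] is statistically significant at the 95% level (|z| > 1.96).
/// Assumes both samples are reasonably large and roughly normal.
bool meansDifferSignificantly(List<double> a, List<double> b) {
  double mean(List<double> xs) => xs.reduce((x, y) => x + y) / xs.length;
  double variance(List<double> xs) {
    var m = mean(xs);
    var sum = 0.0;
    for (var x in xs) {
      sum += (x - m) * (x - m);
    }
    return sum / (xs.length - 1);
  }

  var z = (mean(a) - mean(b)) /
      math.sqrt(variance(a) / a.length + variance(b) / b.length);
  return z.abs() > 1.96;
}
```

A comparison would then only report a regression (or improvement) when the difference between runs clears this bar.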
My two cents on how many runs there should be to get a significant result.
I think it is reasonable to assume that each of the performance tests follows a normal distribution, centered about some mean and dispersed with some deviation. Also, all of the tests can be executed independently, so a z-test could probably be used here. For n measurements at 95% confidence we get this expression: 1.96 = (sample_mean - real_mean) / (sample_deviation / sqrt(number_of_measurements)). Rearranging, and rounding 1.96 up to 2, we can express the required number of measurements as number_of_measurements ≈ (2 * sample_deviation / (sample_mean - real_mean))^2.
Now the problem is how to get real_mean. I don't think there is a way to know it short of doing infinitely many trials, but it should be reasonable to exclude outliers and treat the resulting mean as the real mean. So my take would be real_mean = median of the samples. What does the resulting expression mean? We need fewer trials when there is a bigger difference between the sample mean and its median (that is, when there are big outliers). I guess that makes sense, since if there are big jumps or drops in performance, we can detect them faster with higher confidence.
Also, it could be that this strategy is a poor choice (especially the real-mean estimation) :).
http://en.wikipedia.org/wiki/Z-test https://www.statstodo.com/ZTest_Tab.php
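A minimal Dart sketch of that estimate, using the median-as-real-mean heuristic proposed above; the function name and the guard for the degenerate case where mean and median coincide are my own additions:

```dart
import 'dart:math' as math;

/// Rough estimate of the number of measurements needed, following
/// n ≈ (2 * sample_deviation / (sample_mean - real_mean))^2,
/// with real_mean approximated by the sample median.
int estimateRequiredRuns(List<double> samples) {
  var mean = samples.reduce((a, b) => a + b) / samples.length;

  var sorted = List<double>.from(samples)..sort();
  var mid = sorted.length ~/ 2;
  var median = sorted.length.isOdd
      ? sorted[mid]
      : (sorted[mid - 1] + sorted[mid]) / 2;

  var sumOfSquares = 0.0;
  for (var s in samples) {
    sumOfSquares += (s - mean) * (s - mean);
  }
  var deviation = math.sqrt(sumOfSquares / (samples.length - 1));

  var delta = (mean - median).abs();
  // Degenerate case: mean equals median, so the formula gives no signal;
  // fall back to the current sample size.
  if (delta == 0) return samples.length;

  return math.pow(2 * deviation / delta, 2).ceil();
}
```

As noted above, the weakest link is treating the median as the real mean, so the resulting n is best read as a heuristic rather than a guarantee.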
Kind Regards, Tadas Šubonis
The benchmark reporter in benchmark/ currently runs and times a step n times, then samples a subset of those runs to get the mean time to run the step (as well as the mean time spent in gc, how much garbage is generated, and how much memory is retained). In order for tests to be used confidently to measure performance differences between test runs, more data is needed for the report: