cloud-bulldozer / benchmark-comparison

Project to help compare benchmark result data and system metric data produced by benchmark runs.

account for variance in samples #10

Open bengland2 opened 4 years ago

bengland2 commented 4 years ago

The current implementation of touchstone calculates averages and then compares them. This approach does not take into account the variation in the baseline samples or the variation in the new SUT's samples, so you cannot tell whether the change in the average is statistically significant. There are established statistical methods for incorporating variance into the comparison, as described here:

https://mojo.redhat.com/docs/DOC-1089994

which basically describes how to use the scipy.stats.ttest_ind() function. It would also be good to monitor the % deviation of the baseline and new-run samples so we can determine whether a regression has occurred. This kind of analysis can prevent false positives and false negatives and avoid wasting time on unnecessary investigations.
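For illustration, here is a minimal sketch of that kind of comparison, assuming scipy is available; the sample values, the 0.05 threshold, and the pct_deviation helper are invented and are not touchstone code:

```python
# Minimal sketch (not touchstone code): compare baseline vs. new-SUT samples
# with a two-sample t-test instead of only comparing averages.
from scipy import stats

# Invented sample values for illustration
baseline = [105.2, 98.7, 101.4]
new_run = [91.3, 94.8, 89.9]

# Welch's t-test (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(baseline, new_run, equal_var=False)

def pct_deviation(samples):
    """% deviation of a sample set: sample std. deviation as a % of the mean."""
    mean = sum(samples) / len(samples)
    stdev = (sum((x - mean) ** 2 for x in samples) / (len(samples) - 1)) ** 0.5
    return 100.0 * stdev / mean

alpha = 0.05  # invented significance threshold for illustration
if p_value < alpha:
    print(f"significant difference: p={p_value:.3f}, "
          f"baseline %dev={pct_deviation(baseline):.1f}, "
          f"new %dev={pct_deviation(new_run):.1f}")
else:
    print(f"difference not statistically significant: p={p_value:.3f}")
```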

inevity commented 2 years ago

Cannot access the link above. Can you post it here?

bengland2 commented 2 years ago

@inevity sorry, that link is not available anymore; Mojo is gone and wasn't accessible outside Red Hat anyway. Here is the article:

simple performance regression passfail script Mojo.pdf

inevity commented 2 years ago

See also: example of how to do a statistical performance regression test (#16)

inevity commented 2 years ago

The t-test's assumptions are:

  1. Normally distributed data
  2. IID samples
  3. Homogeneity of variance
    see https://stanford-cs329s.github.io/slides/cs329s_13_slides_monitoring.pdf

So the sample data should be produced by a stable workload, correct?
As for the average-comparison implementation in the current master, what are its assumptions?
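For what it's worth, assumptions 1 and 3 can be sanity-checked with scipy before trusting a t-test; a minimal sketch with invented sample values, not code from this repo:

```python
# Hypothetical sketch: check t-test assumptions on two sample sets.
from scipy import stats

baseline = [105.2, 98.7, 101.4, 99.8, 102.1]  # invented values
new_run = [91.3, 94.8, 89.9, 93.2, 90.5]

# Assumption 1, normality: Shapiro-Wilk test (null hypothesis: data is normal)
_, p_base = stats.shapiro(baseline)
_, p_new = stats.shapiro(new_run)

# Assumption 3, homogeneity of variance: Levene's test (null: equal variances)
_, p_var = stats.levene(baseline, new_run)

print(f"normality p-values: baseline={p_base:.3f}, new={p_new:.3f}")
print(f"equal-variance p-value: {p_var:.3f}")
# Assumption 2 (IID samples) cannot be tested this way; it depends on the
# workload being stable and the runs being independent of each other.
```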

bengland2 commented 2 years ago

@inevity ,

So if you don't use a T-test, what's an alternative method for comparing 2 sets of samples to see if they are truly different from a statistical perspective? Just comparing averages is useless.

Here's a better online link about the T-test (my original reference was Raj Jain's classic text "The Art of Computer Systems Performance Analysis", which is about 30 years old, but statistics hasn't changed that much in this area AFAICT).

inevity commented 2 years ago

Z-tests are closely related to t-tests, but t-tests are best performed when an experiment has a small sample size, less than 30. https://www.investopedia.com/terms/z/z-test.asp Do we need to consider this case?

bengland2 commented 2 years ago

@inevity I don't think z-test sounds useful. Why? It is usually expensive in time and resources to generate a single sample, and we have many data points to cover, so in my experience we typically limit them to 3 samples for each data point. The standard deviation is barely meaningful with such a small set of samples but it's better than nothing (i.e. just comparing averages). The T-test at least takes the variance in samples into account and gives you some idea of whether you can be confident in saying that the two sets of tests have a significant difference in result. The script I linked to in the initial post makes it easy to try it out and see for yourself how well it works. See if you agree with its conclusions.

inevity commented 2 years ago

I just saw that Google's benchmark library uses the U-test. More specifically, a t-test is only useful when we have normally distributed results to compare. Do you have any reason to assume a priori that the distributions of benchmark repetition results are normally distributed? It's true that means of samples from a distribution (which is what we're talking about) tend towards a normal distribution (thanks, central limit theorem!), but how quickly, how large each sample needs to be, and how many samples you need depend on how skewed the original data is, iirc.

And the U-test makes no assumptions about the distribution's shape or variance. So maybe it is more appropriate?

bengland2 commented 2 years ago

@inevity sorry, I don't understand your last reply. Which benchmark from Google? And if you are saying I'm assuming a normally distributed set of results, I think I'm guilty of that. Perhaps I'll have to put this to the test. But still, I think it's better than comparing averages of samples without regard for std. deviation. Don't let the best be the enemy of the better.

inevity commented 2 years ago

https://github.com/google/benchmark/pull/593 is the PR that uses the U-test to compare two samples.
The debate there was: "More specifically still, a t-test is only useful when we have normally distributed results to compare. Do you have any reason to assume a priori that the distributions of benchmark repetition results are normally distributed? It's true that means of samples from a distribution (which is what we're talking about) tend towards normal distribution (thanks, central limit theorem!), but how quickly, and how large each sample needs to be, and how many samples you need, depends on how skewed the original data is, iirc." So they use the U-test for the comparison. The U-test does not compare averages; it makes no assumptions about the distribution's shape or variance.

bengland2 commented 2 years ago

@inevity Now I understand what you are talking about. I've never heard of a U-test; that's something new for me. Here is the scipy package that you are referring to. This is an interesting proposal; I'll have to read about it and think about it a little more. I'm not that attached to using a T-test. I was just shopping for a statistical test that accounts for variance when comparing two sample sets, and python scipy implemented a T-test, but if the U-test does the same thing without assumptions about the distribution of test results, then it sounds useful to me.
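For reference, a minimal sketch of what that swap might look like with scipy.stats.mannwhitneyu(), using invented sample values:

```python
# Hypothetical sketch: Mann-Whitney U test on the same kind of sample sets.
# It compares rank ordering rather than means and does not assume normality.
from scipy import stats

baseline = [105.2, 98.7, 101.4]  # invented values
new_run = [91.3, 94.8, 89.9]

u_stat, p_value = stats.mannwhitneyu(baseline, new_run, alternative="two-sided")
print(f"U={u_stat}, p={p_value:.3f}")
```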

mfleader commented 2 years ago

  1. Hypothesis testing for statistical significance is one of the main sources of the statistical crisis in science.
  2. You can use a general linear model to replace the t-test comparison of means (see the sketch after this list).
  3. You can use a generalized linear model to change the normality assumption.
  4. The Mann-Whitney U test is a special case of a proportional odds model (which is to say it is still a generalized linear model).
  5. If you don't use hypothesis testing and statistical significance, you have to come up with a decision function that you're optimizing, with a model parameter that you've estimated from your data sample.
  6. A lot of data related to computers is multimodal, and most out-of-the-box statistical models that we have access to assume unimodal data (though I think we can still glean some insight if we're careful).
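A small illustration of point 2, assuming statsmodels were used (it is not currently part of this project, and the sample values are invented): the equal-variance two-sample t-test is equivalent to a linear model with a binary group indicator.

```python
# Hypothetical sketch: the two-sample t-test expressed as a linear model
# (value ~ intercept + group), using statsmodels for illustration.
import numpy as np
import statsmodels.api as sm

baseline = [105.2, 98.7, 101.4]  # invented values
new_run = [91.3, 94.8, 89.9]

y = np.array(baseline + new_run)
group = np.array([0] * len(baseline) + [1] * len(new_run))  # 0 = baseline, 1 = new run
X = sm.add_constant(group)

fit = sm.OLS(y, X).fit()
# The coefficient on `group` is the estimated difference in means, and its
# p-value matches the equal-variance two-sample t-test.
print(fit.params)
print(fit.pvalues)
```
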
mfleader commented 2 years ago

I prefer Bayesian methods for estimating generalized linear models of computer performance data, but frequentist non-parametric and semi-parametric models, like the U test, seem to have their use cases for recovering parameters of interest.

bengland2 commented 2 years ago

@mfleader I'm not sure what you mean by "generalized linear model", sorry. What's the simplest yet most reliable way of doing this? I thought the U-test doesn't assume anything about the distribution, unlike the t-test, which assumes a normal distribution? So what decision function would you use? That seems extremely difficult to come up with, since you have to estimate it from your data sample, while we are trying to write code that is known to work without regard for the data sample.

@baul, I tried your mannwhitneyu() and I just replaced ttest_ind() with it, which means I already have a way to experiment with it and compare the two. They don't give the same answers, interestingly. When I get more time I'll try to run both against some real experimental data and see what happens.
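Roughly what that experiment looks like as a sketch (not the actual touchstone code; the sample values are invented):

```python
# Hypothetical sketch: run both tests on the same data and compare p-values.
from scipy import stats

baseline = [105.2, 98.7, 101.4]  # invented values
new_run = [91.3, 94.8, 89.9]

_, p_t = stats.ttest_ind(baseline, new_run, equal_var=False)
_, p_u = stats.mannwhitneyu(baseline, new_run, alternative="two-sided")

print(f"t-test p-value:       {p_t:.3f}")
print(f"Mann-Whitney p-value: {p_u:.3f}")
# With only 3 samples per side, the U statistic has few possible values, so its
# p-value is coarsely quantized and the two tests can disagree near a threshold.
```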

mfleader commented 2 years ago

The decision function would be a function of the parameter we're testing, and it would compute something either we care about or the business cares about, like if there were some function that computed how much money it would cost the user for each potential microsecond increase in latency. I don't know enough about estimating costs in performance for a cloud platform to actually write that function, so I have ignored it by using the identity function or a negative identity function as a decision, or cost function. For example, I would use the identity function for estimating differences in latency because larger values are worse which translates to the group with the higher latency costing more because the cost function outputs a higher value for it. In general, you already do some of this when you're thinking about the cost to performance given a software change.
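To make that concrete, here is a hypothetical sketch of a decision (cost) function applied to an estimated parameter; the function names, the dollars-per-ms rate, and the estimated delta are all invented for illustration:

```python
# Hypothetical sketch: decision (cost) functions over an estimated parameter,
# here the latency difference between the new run and the baseline.
def identity_cost(latency_delta_ms: float) -> float:
    # Larger latency is worse, so cost rises one-for-one with the estimated delta.
    return latency_delta_ms

def dollars_per_ms_cost(latency_delta_ms: float, dollars_per_ms: float = 0.01) -> float:
    # Made-up business cost: each extra millisecond of latency costs a fixed amount.
    return latency_delta_ms * dollars_per_ms

estimated_delta_ms = 4.2  # invented estimate of the latency difference
print(identity_cost(estimated_delta_ms))
print(dollars_per_ms_cost(estimated_delta_ms))
```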

mfleader commented 2 years ago

With regards to the Mann-Whitney U test, I was pointing out that it is arguably a special case of a proportional odds model, which is to say that we cannot entirely avoid assumptions about our data. We just need to understand and clearly communicate the consequences of the models that we choose to use. Given the opportunity, I believe we would want to use the more powerful statistical model, such as a general linear model, instead of a t-test or a Mann-Whitney U test.

mfleader commented 2 years ago

The t-test and the underlying model.