JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

var(), sd() and cov() definitions #5023

Closed nalimilan closed 10 years ago

nalimilan commented 10 years ago

Currently var() only allows one definition of the variance: the unbiased estimate of the distribution variance based on an IID sample (dividing by N-1). This is also what R and Matlab do for var() [1], cov() and std() [3].

OTOH, Numpy has chosen the more basic mathematical definition dividing by N, and offers an option to use N-1 instead [2]. It also uses this behavior for std(), but not for cov() [4].

Are there any plans to add such an option to Julia? Scaling values using the basic mathematical (N) definition is a relatively common operation. I would support using this definition by default at least for var() and std(), but that may be a little too late. ;-)

[1]: http://www.mathworks.fr/fr/help/matlab/ref/var.html
[2]: http://docs.scipy.org/doc/numpy/reference/generated/numpy.var.html
[3]: http://www.mathworks.fr/fr/help/matlab/ref/std.html
[4]: http://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html

EDIT: Fix mistake about Matlab.
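For concreteness, here is a minimal Julia sketch of the two definitions being discussed (written against the modern Statistics stdlib interface; at the time of this issue these functions lived in Base):

```julia
using Statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = length(x)

# The "N" definition: the mean squared deviation of this data set itself.
v_n = sum(abs2, x .- mean(x)) / n          # 4.0

# The "N-1" definition: the unbiased estimator of the population variance,
# and what var() returns by default.
v_n1 = sum(abs2, x .- mean(x)) / (n - 1)   # ≈ 4.5714
```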

johnmyleswhite commented 10 years ago

We used to offer this and got rid of it. I'd be happy to add back a keyword argument called population that allows one to produce descriptive rather than inferential statistics.

nalimilan commented 10 years ago

Thanks. So you'd follow R and Matlab's default behaviors?

Regarding the name and form of the keyword argument: population might do it, but it would feel slightly weird to say I have "population" values when I'm merely centering and scaling a set of scores. In general there is no reference population when you want to use the N definition of the standard deviation. So I'd rather have a parameter corrected=true (that's the term used by e.g. Wikipedia [1]) that we turn to false when needed.

Also, do you think it would make sense to add an optional weights argument? This is often useful and many languages feel lame by not providing such elementary statistics easily.

[1]: http://en.wikipedia.org/wiki/Standard_deviation#Uncorrected_sample_standard_deviation
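For reference, this corrected keyword is essentially the interface that Julia's Statistics standard library ended up with; a short sketch of how it reads in modern releases:

```julia
using Statistics

x = [1.0, 2.0, 3.0, 4.0]

v_default = var(x)                    # divides by N-1 (corrected=true is the default)
v_uncorr  = var(x; corrected=false)   # divides by N
s_uncorr  = std(x; corrected=false)   # square root of the uncorrected variance
```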

andreasnoack commented 10 years ago

I like the name corrected.

aviks commented 10 years ago

corrected is what it was called earlier in base.

https://github.com/JuliaLang/julia/commit/d74ad6663cc94e3a41d8915aca3202a4dc1e7362#diff-b85f81409a2a21f2174ccb078b0d7030

+1 for weighted versions as well.

simonster commented 10 years ago

There is a wmean function in Stats that computes a weighted mean, although it doesn't accept a region argument.
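As a point of comparison, a basic weighted mean is a one-liner; the sketch below uses an illustrative name (wmean_sketch), not the Stats function mentioned above (StatsBase later standardized this kind of thing as mean(x, weights(w))):

```julia
# Illustrative sketch only; `wmean_sketch` is a made-up name, not the
# Stats/StatsBase API.
wmean_sketch(x, w) = sum(x .* w) / sum(w)

wmean_sketch([1.0, 2.0, 3.0], [1.0, 1.0, 2.0])  # 2.25
```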

RauliRuohonen commented 10 years ago

How about "unbiased" instead of "corrected"? The computations using N are perfectly correct, so one cannot really "correct" them - one can only compute something different, for a different purpose. The purpose of using N-1 is to obtain an unbiased estimator for the corresponding population quantity. Basically the only reason you'd want that is that you are averaging a lot of such estimators, and want the result to be more accurate. That requires the base estimator to be unbiased.
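A quick simulation (a sketch, assuming the modern Statistics API with its corrected keyword) illustrates the averaging point: over many IID samples, the N-1 estimator centers on the true variance, while the N estimator centers on (n-1)/n times it.

```julia
using Statistics, Random

Random.seed!(1234)
n, trials = 5, 100_000

# randn draws have true variance 1.0. Averaging many per-sample estimates:
m_corr   = mean(var(randn(n)) for _ in 1:trials)                   # ≈ 1.0
m_uncorr = mean(var(randn(n); corrected=false) for _ in 1:trials)  # ≈ (n-1)/n = 0.8
```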

The clearest situation would be estimating the covariance matrix. The usual "N-1" formula gives you a horrible estimator in high dimensions, and you'll want to use a shrinkage estimator (e.g. Ledoit-Wolf) unless you are doing averaging as explained above. Should the cov() function use a shrinkage estimator when one asks for a "corrected" result? If not, then clearly it is the unbiasedness that is the defining quality of the result, hence "unbiased" is a better term. If yes, then there still ought to be a more descriptive term ("corrected" how? there are many reasonable covariance matrix estimators, all different). Even "estimator" is more descriptive (the case with N isn't really intended as an estimator; it is more a property of the data set itself, divorced from any notion of populations or sampling, much like its length, sum or mean is).

andreasnoack commented 10 years ago

My preference for corrected was to avoid unbiased. I guess it is impossible (or at least not desirable) with an exhaustive description as keyword. It might always be slightly misleading in some understanding of the function, but I prefer corrected, not because the result is correct, but because it is corrected. It is impossible to say if a quantity is unbiased unless you have complete knowledge about the device that produced the sample, but you can always say that the quantity is corrected (relative to another quantity).

nalimilan commented 10 years ago

Yeah, "unbiased" really implies that the other formula gives biased results, which is not the case when you are not estimating the population quantities. While the opposite to "unbiased" is "biased", the opposite to "corrected" is just "uncorrected", which means either "wrong" or "does not need correction to be right". So I see it as more neutral.

"estimator" would also make sense to me.

dmbates commented 10 years ago

And even if the term unbiased is used in the sense of the result of var() being an estimate of the variance in the mythical "random sample from a Gaussian distribution", the term only applies to var, not to sd.

In general unbiased is far too loaded a term. It does have a mathematical definition (i.e. a denotation), but it is the result of the old "get to the end of the proof, discover that you need another condition, and then define that condition as the 'regular' case or the 'normal' case" approach to nomenclature. You give a mathematical definition to a name that carries a lot of emotional connotation, thereby making it seem virtuous.

I mean, doesn't everyone want their estimator to be "unbiased"? How could that not be a good thing? How could we possibly think that a "biased" estimator would be acceptable? It is just a mathematical property, and not a particularly important one at that.

Sorry for getting on a soap box about this. My research is related to estimating "variance components" and people do all sorts of stupid things in this area because they haven't thought through the consequences of their assumptions and just go by gut instinct influenced by loaded terminology.

carlobaldassi commented 10 years ago

The original corrected was short for "Bessel corrected", which is also by far the most common correction one would apply in those cases. So I support using that as a keyword for lack of better, concise alternatives. BTW I must have missed the discussion where it was decided to get rid of that functionality, I've been wondering what was the reason for that.

dmbates commented 10 years ago

Whether you say it is a corrected estimate or an unbiased estimator the motivation still only applies to the result of var applied to a random (i.e. i.i.d.) sample from a Gaussian distribution. You could just as well use the term MLE (maximum likelihood estimator) for the formula with n in the denominator.

Having said all this, I am in favor of having the version with n - 1 in the denominator being the default. I'm just objecting to using an emotionally loaded name for the optional argument, if indeed the alternative calculation should be allowed.

By the way, the use of n - 1 actually occurs in varm which is called by the var methods for AbstractArray. It may be better to use n in the denominator for a call to varm, where the centering value is explicitly input and leave var using n - 1.

I'm beginning to understand the term "bikeshedding".

johnmyleswhite commented 10 years ago

FWIW, the term population doesn't have the normative overtones of either corrected or unbiased.

RauliRuohonen commented 10 years ago

I suppose you could call it "bias_corrected" or "bessel_corrected" or just "bessel" and still convey what exactly is being "corrected", or what correction is being applied (i.e. the most common correction for bias used in certain circumstances, as opposed to some other correction).

Of course the claim of unbiasedness requires assumptions (but not a normality assumption, @dmbates). The whole claim that the modification is a correction at all (as opposed to making things worse) requires pretty much the same assumptions, so "correction" is not any better on that front. Also, for Gaussians the estimate with N is the maximum likelihood one, so one could argue that the N-1 estimate is worse as it is less likely, and being worse it is not a "correction" at all, and calling it a "correction" is more loaded than calling it "unbiased", as the latter is simply a fact.

As for sd(), one could call it "bessel_corrected" (or just "bessel"?), or simply leave the option out altogether. The N-1 tweak for that one is pretty much useless, as it does not correct the bias. People use it simply because it is the square root of an unbiased variance estimator, not because it is a good estimator of standard deviation as such. One can really correct the bias if one uses e.g. a normality assumption (in which case "unbiased" would work as a term), but nobody bothers, because nobody cares about standard deviations except as more human-friendly "pretty-printings" of variances. One almost always averages variances, not standard deviations, and unbiasedness really only matters when averaging (who cares about a factor of N/(N-1)≈1 when it doesn't accumulate anyway?). Leaving the whole correction out for this and requiring the user to use e.g. sqrt(var(x, unbiased=true)) wouldn't even be a bad choice if Bessel is never applied by default.
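The sd bias is easy to see numerically (a sketch assuming the modern Statistics API): for normal samples of size n = 5, the Bessel-corrected sd averages to about 0.94σ, not σ.

```julia
using Statistics, Random

Random.seed!(42)
n, trials = 5, 100_000

# std() uses the N-1 denominator, yet over many samples its mean is
# E[s] = c4*σ, with c4 = sqrt(2/(n-1)) * Γ(n/2) / Γ((n-1)/2) ≈ 0.94 at
# n = 5. So the "corrected" sd is still a biased estimator of σ = 1.
m_sd = mean(std(randn(n)) for _ in 1:trials)   # ≈ 0.94
```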

carlobaldassi commented 10 years ago

So, it seems to me that one very simple option here would be to basically revert https://github.com/JuliaLang/julia/commit/d74ad6663cc94e3a41d8915aca3202a4dc1e7362#diff-b85f81409a2a21f2174ccb078b0d7030 and call the argument e.g. bessel_correction. This way it would show up nicely and unambiguously in the help, but nobody would need to type it, and we would also be free to change the name at any time without breaking anything, so the functionality would be there and the bikeshedding could continue forever ;)

(Also, a positional argument would be faster than a keyword, even though that is likely not particularly relevant here. And it would be trivial to add a deprecation if we decided on a keyword argument later.)

The only possible drawback with this plan, if I'm not mistaken, is that one would need to be careful when dispatching, since a Bool would apply to bessel_correction and anything else would be the region (or m in varm/stdm). There is no other ambiguity in the statistics functions that I can see. Anyway, I hardly see this as a serious problem, as I don't see a use for passing a Bool as a region.
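A minimal sketch of the positional dispatch scheme being proposed (the _myvar name and its naive implementation are illustrative only, not the actual Base code):

```julia
using Statistics  # for mean

# A Bool positional argument selects the correction; anything else would
# be interpreted as a region (region methods not shown in this sketch).
_myvar(a::AbstractVector, bessel_correction::Bool) =
    sum(abs2, a .- mean(a)) / (length(a) - (bessel_correction ? 1 : 0))
_myvar(a::AbstractVector) = _myvar(a, true)

_myvar([1.0, 2.0, 3.0])         # 1.0  (divide by n-1)
_myvar([1.0, 2.0, 3.0], false)  # 2/3  (divide by n)
```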

nalimilan commented 10 years ago

So, there would be a version with only region, a version with only bessel_correction (or any other name), and a version with both (in that order)?

Should weights be a keyword or a positional argument? A keyword would make things easier, but if we want to be consistent with e.g. a weighted mean function, and given that we care about performance (I think we do), wouldn't a positional argument be more efficient? In particular, mean() would easily be inlined by the compiler, but would that happen if weights were passed as a keyword argument? It would be silly to have to write the weighted sum and its division by hand, which is not as nice to read, just for efficiency.

stevengj commented 10 years ago

For the (probable) minority of users who need the 1/n scaling, why can't they just multiply by (n-1)/n? Unless n is really small, the cost of this scaling should be negligible compared to the computation of var etc., and if n is extremely small and the computation is performance-critical then you might want to inline the whole thing anyway. Is it really worth cluttering the API for a trivial scale-factor variation?
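The rescaling suggested here is indeed a one-liner (a sketch assuming the modern Statistics API):

```julia
using Statistics

x = [1.0, 2.0, 3.0, 4.0]
n = length(x)

# Recover the 1/n definition from the default 1/(n-1) result.
v_rescaled = var(x) * (n - 1) / n   # same value as var(x; corrected=false)
```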

nalimilan commented 10 years ago

Is it so rare to center and scale a set of values that are absolutely not drawn from any population? Of course it's easy to get the correct value, but I just find it nicer to have code calling the stock function with a documented parameter than multiplying the result by a quantity somebody reading the code for the first time will wonder where it comes from. The API cluttering is minimal, since anyway there aren't so many parameters to pass to these functions.

Also, this may be useful if std() is further used by a convenience centering and scaling function, which couldn't support scaling without the N-1 correction otherwise. To me, making simple operations work with short and clean code is a plus.

carlobaldassi commented 10 years ago

(probable minority)[citation needed]

;)

Jokes aside, personally I can only provide anecdotal evidence, but I think it's somewhat telling that both numpy and Matlab give the option (and numpy goes as far as providing a generalization, via the ddof keyword). Others have already pointed out that there are fairly common situations where dividing by n is the desired behaviour. About using (n-1)/n, that has a number of problems I can think of, off the top of my head:

1. it's ugly; even more so for std, where you would use sqrt((n-1)/n)
2. it needs special-casing for n=1, which is bad for generic programming
3. when the functions return arrays (e.g. var(a, 1), or cov with a a matrix), simple scaling means introducing temporary arrays; mapping (to avoid temporaries) means using anonymous functions (read: slow, verbose and ugly: map!(x->x*(size(a,1)-1)/size(a,1), var(a,1)))

So I'd say that is a terrible option to offer.
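The n=1 special-casing in point 2 is visible directly in the current API (a sketch against the modern Statistics stdlib): the N-1 denominator degenerates to 0/0, while the N definition stays well-defined.

```julia
using Statistics

x = [3.0]

v_corr   = var(x)                   # NaN: 0/0 with the n-1 denominator
v_uncorr = var(x; corrected=false)  # 0.0: a single point has zero spread
```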

stevengj commented 10 years ago

Well, you can always do scale!(var(a,1), (size(a,1)-1)/size(a,1)) to avoid the overhead of map. Or you can write a one-line function if you want it to look a bit cleaner. Saving the user from writing a trivial multiplication or a one-line function really doesn't seem worth the API complication to me, but if people have strong feelings about this then I won't argue further.

simonster commented 10 years ago

I believe that everything discussed here was fixed by @lindahua's recent work (8f4d9da6fc0e4d446e6fe06f15aa8a8221ca0bc6 and #6273). If not, let me know and I will reopen.

nalimilan commented 10 years ago

Yeah, I think his new functions cover everything here.