Closed: nalimilan closed this issue 10 years ago
We used to offer this and got rid of it. I'd be happy to add back a keyword argument called `population` that allows one to produce descriptive rather than inferential statistics.
Thanks. So you'd follow R and Matlab's default behaviors?
Wondering about the name and the form of the keyword argument: `population` might do it, but it would feel slightly weird to say I have "population" values when I'm merely centering and scaling a set of scores. In general there is no reference population when you want to use the N definition of standard deviation. So I'd rather have a parameter `corrected=true` (that's the term used by e.g. Wikipedia [1]) that we turn to `false` when needed.
Also, do you think it would make sense to add an optional weights argument? This is often useful and many languages feel lame by not providing such elementary statistics easily.
1: http://en.wikipedia.org/wiki/Standard_deviation#Uncorrected_sample_standard_deviation
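To make the weights suggestion concrete, a weighted variant could be sketched along these lines. The names `wmean`/`wvar` and the frequency-weight style of correction are illustrative assumptions, not an existing API:

```julia
# Sketch of weighted mean and variance.
# Assumes "frequency weight" semantics; other weight types
# (probability, analytic) would need different corrections.
wmean(x, w) = sum(x .* w) / sum(w)

function wvar(x, w; corrected::Bool=true)
    m = wmean(x, w)
    s = sum(w .* (x .- m) .^ 2)
    # Bessel-style correction relative to the total weight.
    corrected ? s / (sum(w) - 1) : s / sum(w)
end
```

With unit weights this reduces to the usual unweighted definitions, which is a convenient sanity check.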
I like the name `corrected`.
`corrected` is what it was called earlier in base.
+1 for weighted versions as well.
There is a `wmean` function in Stats that computes a weighted mean, although it doesn't accept a region argument.
How about "unbiased" instead of "corrected"? The computations using N are perfectly correct, so one cannot really "correct" them - one can only compute something different, for a different purpose. The purpose of using N-1 is to obtain an unbiased estimator for the corresponding population quantity. Basically the only reason you'd want that is that you are averaging a lot of such estimators, and want the result to be more accurate. That requires the base estimator to be unbiased.
The clearest situation would be estimating the covariance matrix. The usual "N-1" formula gives you a horrible estimator in high dimensions, and you'll want to use a shrinkage estimator (e.g. Ledoit-Wolf) unless you are doing averaging as explained above. Should the `cov()` function use a shrinkage estimator when one asks for a "corrected" result? If not, then clearly it is the unbiasedness that is the defining quality of the result, hence "unbiased" is a better term. If yes, then there still ought to be a more descriptive term ("corrected" how? there are many reasonable covariance matrix estimators, all different). Even "estimator" is more descriptive (the case with N isn't really intended as an estimator; it is more a property of the data set itself, divorced from any notion of populations or sampling, much like its length, sum or mean is).
My preference for `corrected` was to avoid `unbiased`. I guess it is impossible (or at least not desirable) to use an exhaustive description as the keyword. It might always be slightly misleading under some understanding of the function, but I prefer `corrected`, not because the result is correct, but because it is corrected. It is impossible to say whether a quantity is unbiased unless you have complete knowledge about the device that produced the sample, but you can always say that the quantity is corrected (relative to another quantity).
Yeah, "unbiased" really implies that the other formula gives biased results, which is not the case when you are not estimating the population quantities. While the opposite to "unbiased" is "biased", the opposite to "corrected" is just "uncorrected", which means either "wrong" or "does not need correction to be right". So I see it as more neutral.
"estimator" would also make sense to me.
And even if the term `unbiased` is used in the sense of the result of `var()` being an estimate of the variance in the mythical "random sample from a Gaussian distribution", the term only applies to `var`, not to `sd`.
In general `unbiased` is far too loaded a term. It does have a mathematical definition (i.e. a denotation), but it is the result of the old "get to the end of the proof, discover that you need another condition, and then define that condition as the 'regular' case or the 'normal' case" approach to nomenclature. You give a mathematical definition to a name that brings a lot of emotional connotation, thereby making it seem virtuous.
I mean, doesn't everyone want their estimator to be "unbiased"? How could that not be a good thing? How could we possibly think that a "biased" estimator would be acceptable? It is just a mathematical property, and not a particularly important one at that.
Sorry for getting on a soap box about this. My research is related to estimating "variance components" and people do all sorts of stupid things in this area because they haven't thought through the consequences of their assumptions and just go by gut instinct influenced by loaded terminology.
The original `corrected` was short for "Bessel corrected", which is also by far the most common correction one would apply in those cases. So I support using that as a keyword for lack of better, concise alternatives. BTW, I must have missed the discussion where it was decided to get rid of that functionality; I've been wondering what the reason for that was.
Whether you say it is a `corrected` estimate or an `unbiased` estimator, the motivation still only applies to the result of `var` applied to a random (i.e. i.i.d.) sample from a Gaussian distribution. You could just as well use the term MLE (maximum likelihood estimator) for the formula with `n` in the denominator.
Having said all this, I am in favor of having the version with `n - 1` in the denominator be the default. I'm just objecting to using an emotionally loaded name for the optional argument, if indeed the alternative calculation should be allowed.
By the way, the use of `n - 1` actually occurs in `varm`, which is called by the `var` methods for AbstractArray. It may be better to use `n` in the denominator for a call to `varm`, where the centering value is explicitly input, and leave `var` using `n - 1`.
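That split could be sketched as two stand-alone definitions. The `my_varm`/`my_var` names are hypothetical stand-ins, not the actual Base code:

```julia
using Statistics  # for mean

# Hypothetical sketch of the suggestion: divide by n when the centering
# value m is supplied explicitly, and keep n - 1 in the plain method.
my_varm(x, m) = sum(abs2, x .- m) / length(x)             # divide by n
my_var(x)    = sum(abs2, x .- mean(x)) / (length(x) - 1)  # divide by n - 1
```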
I'm beginning to understand the term "bikeshedding".
FWIW, the term `population` doesn't have the normative overtones of either `corrected` or `unbiased`.
I suppose you could call it "bias_corrected" or "bessel_corrected" or just "bessel" and still convey what exactly is being "corrected", or what correction is being applied (i.e. the most common correction for bias used in certain circumstances, as opposed to some other correction).
Of course the claim of unbiasedness requires assumptions (but not a normality assumption, @dmbates). The whole claim that the modification is a correction at all (as opposed to making things worse) requires pretty much the same assumptions, so "correction" is not any better on that front. Also, for Gaussians the estimate with N is the maximum likelihood one, so one could argue that the N-1 estimate is worse as it is less likely, and being worse it is not a "correction" at all; calling it a "correction" is more loaded than calling it "unbiased", as the latter is simply a fact.
As for `sd()`, one could call it "bessel_corrected" (or just "bessel"?), or simply leave the option out altogether. The N-1 tweak for that one is pretty much useless, as it does not correct the bias. People use it simply because it is the square root of an unbiased variance estimator, not because it is a good estimator of the standard deviation as such. One can really correct the bias if one uses e.g. a normality assumption (in which case "unbiased" would work as a term), but nobody bothers, because nobody cares about standard deviations except as more human-friendly "pretty-printings" of variances. One almost always averages variances, not standard deviations, and unbiasedness really only matters when averaging (who cares about a factor of N/(N-1) ≈ 1 when it doesn't accumulate anyway?). Leaving the whole correction out for this and requiring the user to write e.g. `sqrt(var(x, unbiased=true))` wouldn't even be a bad choice if Bessel is never applied by default.
So, it seems to me that one very simple option here would be to basically revert https://github.com/JuliaLang/julia/commit/d74ad6663cc94e3a41d8915aca3202a4dc1e7362#diff-b85f81409a2a21f2174ccb078b0d7030 and call the argument e.g. `bessel_correction`; in this way it would show up nicely and unambiguously in the help, but nobody would need to type it, and we would also be free to change the name at any time without breaking anything, such that the functionality would be there and the bikeshedding could continue forever ;)
(Also, a positional argument would be faster than a keyword, even though this is likely not a particularly relevant issue in this case. Also, it would be trivial to add a deprecation if we were to decide for a keyword argument at a later time.)
The only possible drawback with this plan, if I'm not mistaken, is that one would need to be careful when dispatching, since a `Bool` would apply to the `bessel_correction` and anything else would be the `region` (or `m` in `varm`/`stdm`). There is no other ambiguity in the statistics functions that I can see. Anyway, I hardly see this as a serious problem, as I don't see a use for passing a `Bool` as a region.
So, there would be a version with only `region`, a version with only `bessel_correction` (or any other name), and a version with both (in that order)?
Should `weights` be a keyword or a positional argument? A keyword would make things easier, but if we want to be consistent with e.g. a weighted mean function, and if we care about performance (I think we do), would a positional argument really be more efficient? In particular, `mean()` would easily be inlined by the compiler, but would that happen if weights are passed as a keyword argument? It would be silly to have to write the weighted sum and its division by hand, which is not as nice to read, just for efficiency.
For the (probable minority) of users who need the 1/n scaling, why can't they just multiply by (n-1)/n? Unless n is really small, the cost of this scaling should be negligible compared to the computation of `var` etc., and if `n` is extremely small and the computation is performance critical then you might want to just inline the whole thing anyway. Is it really worth cluttering the API for a trivial scale-factor variation?
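For concreteness, the suggested workaround is just a one-line rescale of the stock result (a sketch using today's `Statistics.var`):

```julia
using Statistics

# Compute the usual n-1 result, then rescale it to the 1/n definition.
x = [1.0, 2.0, 3.0, 4.0]
n = length(x)
v_uncorrected = var(x) * (n - 1) / n
```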
Is it so rare to center and scale a set of values that are absolutely not drawn from any population? Of course it's easy to get the correct value, but I just find it nicer to have code calling the stock function with a documented parameter than multiplying the result by a quantity whose origin somebody reading the code for the first time will have to puzzle out. The API cluttering is minimal, since there aren't so many parameters to pass to these functions anyway.
Also, this may be useful if `std()` is further used by a convenience centering-and-scaling function, which couldn't support scaling without the N-1 correction otherwise. To me, making simple operations work with short and clean code is a plus.
(probable minority)[citation needed]
;)
Jokes aside, personally I can only provide anecdotal evidence, but I think it's somewhat telling that both numpy and Matlab give the option (and numpy goes as far as providing a generalization, via the `ddof` keyword). Others have already pointed out that there are fairly common situations where dividing by `n` is the desired behaviour.
About using `(n-1)/n`, that has a number of problems I can think of, off the top of my head:

1. it's ugly; even more so for `std`, where you would use `sqrt((n-1)/n)`
2. it needs special-casing for `n=1`, which is bad for generic programming
3. when the functions return arrays (e.g. `var(a, 1)` or `cov` with `a` a matrix), simple scaling means introducing temporary arrays; mapping (to avoid temporaries) means using anonymous functions (read: slow, verbose and ugly: `map!(x->x*(size(a,1)-1)/size(a,1), var(a,1))`)

So I'd say that is a terrible option to offer.
Well, you can always do `scale!(var(a,1), (size(a,1)-1)/size(a,1))` to avoid the overhead of `map`. Or you can write a one-line function if you want it to look a bit cleaner. Saving the user from writing a trivial multiplication or a one-line function really doesn't seem worth the API complication to me, but if people have strong feelings about this then I won't argue further.
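In current Julia, where `scale!` is no longer available, the same temporary-free rescaling can be sketched with in-place broadcasting:

```julia
using Statistics

# Rescale an array result in place, without allocating a temporary.
A = [1.0 2.0; 3.0 4.0; 5.0 6.0]
v = var(A; dims=1)        # n-1 definition, per column
n = size(A, 1)
v .*= (n - 1) / n         # rescale in place to the 1/n definition
```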
I believe that everything discussed here was fixed by @lindahua's recent work (8f4d9da6fc0e4d446e6fe06f15aa8a8221ca0bc6 and #6273). If not, let me know and I will reopen.
Yeah, I think his new functions cover everything here.
Currently `var()` only allows using one definition of the variance, the unbiased estimate of the distribution variance based on an IID sample (dividing by N-1). This is also what R and Matlab do for `var()` [1], `cov()` and `std()` [3].

OTOH, Numpy has chosen the more basic mathematical definition dividing by N, and offers an option to use N-1 instead [2]. It also uses this behavior for `std()`, but not for `cov()` [4].

Are there any plans to add such an option to Julia? Scaling values using the basic mathematical (N) definition is a relatively common operation. I would support using this definition by default at least for `var()` and `std()`, but that may be a little too late. ;-)

1: http://www.mathworks.fr/fr/help/matlab/ref/var.html
2: http://docs.scipy.org/doc/numpy/reference/generated/numpy.var.html
3: http://www.mathworks.fr/fr/help/matlab/ref/std.html
4: http://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html

EDIT: Fix mistake about Matlab.