brycebaril / node-stats-lite

A light statistical package that operates on Arrays.
MIT License

Variance calculation gives biased results for samples #2

Open ORBAT opened 9 years ago

ORBAT commented 9 years ago

The current method of calculating variance (and, by extension, standard deviation) is intended for sets that form the whole population. When dealing with a sample, i.e. you pick n elements out of k and you don't know the mean of the whole population, you need to apply Bessel's correction: divide the sum of squared deviations by n-1 instead of n.
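
To make the difference concrete, here is a minimal sketch with made-up numbers. The helper `sumSquaredDeviations` is hypothetical and not part of stats-lite; it only exists to show the two divisors side by side.

```javascript
// Hypothetical helper (not part of stats-lite): sum of squared deviations from the mean.
function sumSquaredDeviations(vals) {
  var avg = vals.reduce(function (a, b) { return a + b }, 0) / vals.length
  return vals.reduce(function (acc, v) { return acc + Math.pow(v - avg, 2) }, 0)
}

var data = [2, 4, 4, 4, 5, 5, 7, 9] // made-up values; their mean is 5
var n = data.length
var ss = sumSquaredDeviations(data) // 9 + 1 + 1 + 1 + 0 + 0 + 4 + 16 = 32

var populationVariance = ss / n   // 32 / 8 = 4
var sampleVariance = ss / (n - 1) // 32 / 7, about 4.571

console.log(populationVariance, sampleVariance)
```

The gap between the two shrinks as n grows, so the correction matters most for small samples.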

brycebaril commented 9 years ago

I wasn't familiar with Bessel's correction, but reading the Wikipedia article I saw:

  1. it looks like it only applies to a subset of variance calculations (i.e. those dealing with samples), and the input isn't necessarily a sample
  2. it comes with three caveats

How do you suggest we handle this? By adding additional methods for subsets, or perhaps by creating a subset-only version of this library?

ORBAT commented 9 years ago

Yeah, you only need to apply the correction if you're dealing with a sample out of a larger population and you don't know the mean.

One of the caveats is that Bessel's correction will give you an unbiased variance when you have samples, but it won't give you an unbiased standard deviation: there is no general method for calculating an unbiased sd in the first place. It does, however, correct some of the bias. There's also the question of which correction factor to use, but n-1 is good enough for most cases (and if someone needs something more sophisticated, it'll probably fall out of scope for stats-lite anyhow.)
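
A small sketch of that caveat, assuming a toy population of `[0, 1]` and every equally likely sample of size 2 drawn with replacement. The `mean` and `variance` functions here are stand-ins written for this example, not the library's own code.

```javascript
// Stand-in helpers for illustration; not the stats-lite implementations.
function mean(vals) {
  var sum = 0
  for (var i = 0; i < vals.length; i++) sum += vals[i]
  return sum / vals.length
}

// Variance with a selectable divisor: n (population) or n - 1 (Bessel's correction).
function variance(vals, sample) {
  var avg = mean(vals)
  var ss = 0
  for (var i = 0; i < vals.length; i++) ss += Math.pow(vals[i] - avg, 2)
  return ss / (sample ? vals.length - 1 : vals.length)
}

var population = [0, 1] // true variance 0.25, true standard deviation 0.5
var samples = []        // all four ordered samples of size 2, drawn with replacement
for (var a = 0; a < population.length; a++) {
  for (var b = 0; b < population.length; b++) {
    samples.push([population[a], population[b]])
  }
}

var avgUncorrected = mean(samples.map(function (s) { return variance(s, false) }))
var avgCorrected = mean(samples.map(function (s) { return variance(s, true) }))
var avgCorrectedSd = mean(samples.map(function (s) { return Math.sqrt(variance(s, true)) }))

console.log(avgUncorrected) // 0.125: underestimates the true variance of 0.25
console.log(avgCorrected)   // 0.25: matches the true variance
console.log(avgCorrectedSd) // about 0.354: still below the true sd of 0.5
```

Averaged over all samples, dividing by n-1 recovers the population variance exactly, while the square root of the corrected variance still underestimates the population standard deviation.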

A simple, backwards-compatible way of implementing this could be to have variance and stdev take an optional parameter sample (or bessel or whatever):

// Variance = average squared deviation from mean.
// If sample is true, vals represents a sample of a population, so Bessel's correction will be applied 
function variance(vals, sample) {
  vals = numbers(vals)
  var avg = mean(vals)
  var diffs = []
  for (var i = 0; i < vals.length; i++) {
    diffs.push(Math.pow((vals[i] - avg), 2))
  }
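  var res = mean(diffs);
  if (sample) {
    // Rescaling the mean by n / (n - 1) is equivalent to dividing the sum by n - 1.
    // Note: a single-element sample gives 0 * Infinity, i.e. NaN.
    res *= vals.length / (vals.length - 1);
  }
  return res;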
}

// Standard Deviation = sqrt of variance.
// If sample is true, vals represents a sample of a population, so Bessel's correction will be applied
function stdev(vals, sample) {
  return Math.sqrt(variance(vals, sample))
}

brycebaril commented 9 years ago

I'm usually not a huge fan of polymorphic functions in Node where optimization matters, because of the way V8 deoptimizes them.

That said, I don't know how much of a concern it is in this case, because the same application would have to call it as both variance(vals) and variance(vals, true) to cause a deopt. I don't know how likely that is to happen, and even then that user could avoid the penalties by calling variance(vals, false) in the first case...
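
One way to keep every call site consistent would be thin wrappers that always pass an explicit boolean. This is only a sketch: `populationVariance` and `sampleVariance` are hypothetical names, `mean` is a stand-in for the library helper, and whether this makes any measurable difference in V8 is an untested assumption.

```javascript
// Stand-in for the library's mean helper; an assumption for this sketch only.
function mean(vals) {
  var sum = 0
  for (var i = 0; i < vals.length; i++) sum += vals[i]
  return sum / vals.length
}

// The proposed two-argument form from earlier in the thread (input cleaning omitted).
function variance(vals, sample) {
  var avg = mean(vals)
  var diffs = []
  for (var i = 0; i < vals.length; i++) {
    diffs.push(Math.pow(vals[i] - avg, 2))
  }
  var res = mean(diffs)
  if (sample) {
    res *= vals.length / (vals.length - 1)
  }
  return res
}

// Hypothetical wrappers: each always calls variance with two arguments,
// so callers never mix variance(vals) with variance(vals, true).
function populationVariance(vals) {
  return variance(vals, false)
}

function sampleVariance(vals) {
  return variance(vals, true)
}

var vals = [2, 4, 4, 4, 5, 5, 7, 9] // made-up values
console.log(populationVariance(vals)) // 4
console.log(sampleVariance(vals))     // about 4.571
```

Callers that only ever need one flavor would then use a single function with a fixed argument shape.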

Will think about it.

In other news, I just published v2.0.0 of this module with support for multi-modal mode distributions, but at the same time made it require Node.js v4.0.0+ (for ES6 Sets), so that might impact your ability to immediately use a modified variance function.