HenrikBengtsson / matrixStats

R package: Methods that Apply to Rows and Columns of Matrices (and to Vectors)
https://cran.r-project.org/package=matrixStats

Error in zero variance in colVars() #228

Closed · myles-lewis closed this issue 1 year ago

myles-lewis commented 1 year ago

I have noticed an odd bug in colVars() which can occur with a column of certain non-integer values that are all identical and should therefore give var = 0. I was hoping to use the function to detect var = 0 across large numbers of columns.

Reproducible example:

x <- matrix(rep(2.897823, 60), ncol=1)
var(c(x))
## 0
matrixStats::colVars(x)
## 3.208926e-30
Rfast::colVars(x)
## 1.830551e-14

Any suggestions why?

yaccos commented 1 year ago

This is not a bug in colVars() per se, but rather a demonstration of a fundamental limitation in how computers process numbers. Any interval on the real number line contains infinitely many numbers, but computers have only a limited amount of space to store each number, in our case 64 bits. This means that most numbers must be rounded to the closest machine number, which introduces a round-off error. In your example, the variance is zero, but round-off errors make the computed answer slightly off from the true mathematical value. I would assume that an error of approximately $3\cdot10^{-30}$ is acceptable for most practical purposes.
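For example, the same kind of round-off shows up with plain doubles, entirely independent of matrixStats:

(0.1 + 0.2) == 0.3
## FALSE
(0.1 + 0.2) - 0.3
## 5.551115e-17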

For the reason mentioned, testing floating point numbers for equality is dangerous, as even the slightest round-off error will make two otherwise equal numbers evaluate as unequal. Therefore, it is usually a good idea to include a small tolerance in the comparison. If you want to evaluate whether a vector of numbers contains identical elements, I would advocate an approach such as:

library(matrixStats)
x <- matrix(rep(2.897823, 60), ncol=3)
tol <- .Machine$double.eps ^ 0.5 # double.eps: the smallest number eps such that 1 + eps != 1 in 64-bit arithmetic
range_matrix <- colRanges(x) # Creates a matrix with two columns (min and max element of each column)
diff_ranges <- c(rowDiffs(range_matrix)) # Finds the difference between the max and min elements
diff_ranges < tol
## TRUE TRUE TRUE

myles-lewis commented 1 year ago

Thanks, that's fair enough and makes sense. I was wondering how base R gets the variance or SD exactly right as 0. scale() can be used to rapidly compute column SDs (they can be extracted using attr()) and it gives 0 for these types of data.
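For reference, this is roughly what I mean: with its defaults, scale() stores the column SDs it divides by in the "scaled:scale" attribute, and for this data they come out as exactly 0.

x <- matrix(rep(2.897823, 60), ncol = 1)
attr(scale(x), "scaled:scale")  ## the column SDs used by scale()
## 0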

yaccos commented 1 year ago

After reading the source code of stats::var(), I found that the implementation is more sophisticated in order to reduce numerical inaccuracy. First, the mean is computed with a two-pass method. Second, when computing the variance, the intermediate results are calculated with the type LDOUBLE, typically an 80-bit precision floating point number, before the answer is converted to a 64-bit float at the end. Still, even though stats::var() gave you the exact answer in this case, you cannot trust that it will always give an answer free of round-off error.

EDIT 2023-05-26: stats::var() source code is at https://github.com/wch/r-source/blob/trunk/src/library/stats/src/cov.c#LL467-L492 /HB
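To make the two-pass idea concrete, here is a minimal R sketch of it (my own illustration, not the actual C code in cov.c): compute a first-pass mean, correct it with the mean of the residuals, and only then form the squared deviations.

## Mean computed in a single pass in double precision (roughly the behaviour discussed above)
naive_var <- function(x) {
  n <- length(x)
  mu <- sum(x) / n
  sum((x - mu)^2) / (n - 1)
}
## Same, but with a second pass that refines the mean before the deviations are formed
two_pass_var <- function(x) {
  n <- length(x)
  mu <- sum(x) / n
  mu <- mu + sum(x - mu) / n  ## correct the mean using the mean of the residuals
  sum((x - mu)^2) / (n - 1)
}
y <- rep(2.897823, 60)
naive_var(y)     ## on a typical machine: a tiny non-zero value, like the 3.2e-30 reported above
two_pass_var(y)  ## 0, because the refined mean hits the stored value exactly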

myles-lewis commented 1 year ago

Thank you for looking into this. Accepted, and I will use all of these options with a modicum of caution!

HenrikBengtsson commented 1 year ago

After reading the source code of stats::var(), I found that the implementation is more sophisticated in order to reduce numerical inaccuracy. First, the mean is computed with a two-pass method. Second, when computing the variance, the intermediate results are calculated with the type LDOUBLE, typically an 80-bit precision floating point number, before the answer is converted to a 64-bit float at the end. ...

To achieve the same for matrixStats, we would have to (i) update:

https://github.com/HenrikBengtsson/matrixStats/blob/8a99ea214b07095d902cf87de18e2415568d9a30/src/rowVars_lowlevel_template.h#L99-L104

to do a second pass, and (ii) update:

https://github.com/HenrikBengtsson/matrixStats/blob/8a99ea214b07095d902cf87de18e2415568d9a30/src/rowVars_lowlevel_template.h#L106-L113

and the preceding code to use LDOUBLE instead of double.

To do this, we could introduce an optional refine = TRUE argument, just as we have for mean2(). This means that we should probably do the same for colMeans2() and rowMeans2().
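For comparison, mean2() already has this switch. Assuming its refine argument behaves as documented (extra precision when TRUE), the effect on the example data would look roughly like this:

library(matrixStats)
y <- rep(2.897823, 60)
mean2(y, refine = FALSE) - 2.897823  ## plain double-precision accumulation; expect a tiny non-zero difference
mean2(y, refine = TRUE) - 2.897823   ## extra-precision accumulation; expect 0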

HenrikBengtsson commented 1 year ago

First step in improving on this is now completed. I've added refine = TRUE to colMeans2() and rowMeans2(), e.g.

library(matrixStats)
x <- matrix(rep(2.897823, 60), ncol = 1)
colMeans(x) - colMeans2(x, refine = FALSE) ## the old behavior
# [1] 1.776357e-15
colMeans(x) - colMeans2(x, refine = TRUE)  ## the new default
# [1] 0

Next is to add refine to colVars() and rowVars() too. Until then, one can do:

library(matrixStats)
x <- matrix(rep(2.897823, 60), ncol = 1)
mu <- colMeans(x)
colVars(x, center = mu)
## [1] 0

which gives the same precision as var(), e.g.

> var(x) - colVars(x, center = mu)
     [,1]
[1,]    0

compared to:

> var(x) - colVars(x)
              [,1]
[1,] -3.208926e-30

HenrikBengtsson commented 1 year ago

Added refine = TRUE to rowVars() et al., which addresses the problem reported in this issue:

> library(matrixStats)
> x <- matrix(rep(2.897823, 60), ncol = 1)
> var(x) - colVars(x)
     [,1]
[1,]    0

> var(x) - colVars(x, refine = FALSE)
              [,1]
[1,] -3.208926e-30