Closed myles-lewis closed 1 year ago
This is not a bug in colVars() per se, but rather a demonstration of a fundamental limitation in how computers process numbers. Any interval on the real number line contains an infinite number of values, but computers only have a limited amount of space to store each number, in our case 64 bits. This means that most numbers must be rounded to the closest machine number, which creates a round-off error. In your example, the variance is zero, but round-off errors make the answer deviate slightly from the true mathematical value. I would assume that an error of approximately $3\cdot10^{-30}$ is acceptable for most practical purposes.
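To see round-off in plain R (a standard illustration, not specific to matrixStats):

```r
# 0.1, 0.2, and 0.3 are not exactly representable in binary,
# so the sum picks up a tiny round-off error.
0.1 + 0.2 == 0.3   # FALSE
0.1 + 0.2 - 0.3    # on the order of 1e-16
```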
For the reason mentioned, testing equality with floating point numbers is dangerous, as even the slightest round-off error will make two otherwise equal numbers evaluate as unequal. Therefore, it is usually a good idea to include a small tolerance in the comparison. If you want to evaluate whether a vector of numbers contains identical elements, I would advocate an approach such as:
library(matrixStats)
x <- matrix(rep(2.897823, 60), ncol = 3)
tol <- .Machine$double.eps ^ 0.5          # double.eps: the smallest number eps such that 1 + eps != 1 in 64-bit arithmetic
range_matrix <- colRanges(x)              # a two-column matrix (min and max element of each column)
diff_ranges <- c(rowDiffs(range_matrix))  # the difference between the max and min elements
diff_ranges < tol
## TRUE TRUE TRUE
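For comparing a pair of scalars, base R's all.equal() applies essentially the same default tolerance (sqrt(.Machine$double.eps)), so an equivalent spot check is:

```r
x <- matrix(rep(2.897823, 60), ncol = 3)
# all.equal() returns TRUE when the two values differ by less than the tolerance
isTRUE(all.equal(max(x[, 1]), min(x[, 1])))  # TRUE
```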
Thanks, that's fair enough and makes sense. I was wondering how base R gets the variance or SD correct as 0. scale() can be used to rapidly compute column SDs (they can be extracted using attr()) and it gives 0 for these types of data.
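The scale() observation can be reproduced along these lines (the column SDs are stored in the "scaled:scale" attribute of the result):

```r
x <- matrix(rep(2.897823, 60), ncol = 3)
sds <- attr(scale(x), "scaled:scale")  # column standard deviations
sds  # reported above to come out exactly 0 for constant columns
```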
After reading the source code of stats::var(), I found that the implementation is more sophisticated in order to reduce numerical inaccuracy. First, the mean is computed with a two-pass method. Second, when computing the variance, the intermediate results are calculated with type LDOUBLE, an 80-bit precision floating point number, before the answer is converted to a 64-bit float at the end. Still, even though stats::var() gave you the exact answer in this case, you can't trust that it will always give you an answer without round-off error.
EDIT 2023-05-26: The stats::var() source code is at https://github.com/wch/r-source/blob/trunk/src/library/stats/src/cov.c#LL467-L492 /HB
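The two-pass idea can be sketched in plain R (the LDOUBLE accumulation in the C code cannot be reproduced at the R level, so this sketch shows only the two-pass part; the function name is just for illustration):

```r
# Two-pass variance: compute the mean, refine it with a correction pass,
# then sum the squared deviations. This avoids the catastrophic
# cancellation of the one-pass formula sum(x^2)/n - mean^2.
two_pass_var <- function(x) {
  n <- length(x)
  mu <- sum(x) / n
  mu <- mu + sum(x - mu) / n   # second pass corrects round-off in the mean
  sum((x - mu)^2) / (n - 1)
}

two_pass_var(rep(2.897823, 20))  # 0, matching stats::var()
```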
Thank you for looking into this. Accepted, and I will use all of these options with a modicum of caution!
After reading the source code of stats::var(), I found that the implementation is more sophisticated in order to reduce numerical inaccuracy. First, the mean is computed with a two-pass method. Second, when computing the variance, the intermediate results are calculated with type LDOUBLE, an 80-bit precision floating point number, before the answer is converted to a 64-bit float at the end. ...
To achieve the same for matrixStats, we would have to (i) update:
to do a second pass, and (ii) update:
and the preceding code to use LDOUBLE instead of double.
To do this, we could introduce an optional refine = TRUE argument, just as we have for mean2(). This means that we should probably do the same for colMeans2() and rowMeans2().
First step in improving on this completed. I've added refine = TRUE to colMeans2() and rowMeans2(), e.g.
library(matrixStats)
x <- matrix(rep(2.897823, 60), ncol = 1)
colMeans(x) - colMeans2(x, refine = FALSE) ## the old behavior
# [1] 1.776357e-15
colMeans(x) - colMeans2(x, refine = TRUE) ## the new default
# [1] 0
Next is to add refine to colVars() and rowVars() too. Until then, one can do:
library(matrixStats)
x <- matrix(rep(2.897823, 60), ncol = 1)
mu <- colMeans(x)
colVars(x, center = mu)
## [1] 0
which gives the same precision as var(), e.g.
> var(x) - colVars(x, center = mu)
[,1]
[1,] 0
compared to:
> var(x) - colVars(x)
[,1]
[1,] -3.208926e-30
Added refine = TRUE to rowVars() et al., which addresses the problem reported in this issue:
> library(matrixStats)
> x <- matrix(rep(2.897823, 60), ncol = 1)
> var(x) - colVars(x)
[,1]
[1,] 0
> var(x) - colVars(x, refine = FALSE)
[,1]
[1,] -3.208926e-30
I have noticed an odd bug in colVars() which can occur with a column of certain non-integers which are identical and thus should give var = 0. I was hoping to use the function to detect var = 0 in large numbers of columns. Reproducible example:
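A minimal example, reconstructed from the code used later in this thread (at the time of the report, before refine = TRUE was added, colVars() returned a tiny non-zero value here):

```r
library(matrixStats)
x <- matrix(rep(2.897823, 60), ncol = 3)
var(x[, 1])  # base R: exactly 0
colVars(x)   # at the time: a tiny non-zero value, e.g. -3.2e-30
```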
Any suggestions why?