emmanuelparadis / ape

Analysis of Phylogenetics and Evolution
https://emmanuelparadis.github.io/
GNU General Public License v2.0
54 stars 11 forks source link

Rounding error deep in 'pcoa' #17

Closed cjcarlson closed 3 years ago

cjcarlson commented 3 years ago

Hi there!

I love the package, and am still learning some of the functionality.

I think I've found a rounding error. It addresses the underlying issue in

The centre() function internal to ape::pcoa() returns non-symmetric matrices due to rounding errors. centre() re-centers the matrix but the sigfigs aren't quite conserved through the whole matrix somehow? As a result, you can still get complex eigenvalues from a matrix that, when it enters the function, is symmetric and non-negative (which shouldn't be possible in principle but is a result of the transformation that calls centre).

Attached is a zip with a reproducible R script and .RDS so you can check this out. I think fixing the rounding error is probably pretty easy? I'm very excited to dig deeper into this.

Thanks so much!

APE.zip

emmanuelparadis commented 3 years ago

Hi Colin. Thanks for your appreciation!

The double-centering operation in cmdscale is done by a C code which can be called directly with the ::: operator (not usable in a package!). Here's a simple example with 5 random variables (after loading the code of centre in R since it is internal to pcoa):

R> n <- 5L
R> x <- rnorm(n)
R> d <- as.matrix(dist(x))
R> ddc1 <- .Call(stats:::C_DoubleCentre, d)
R> ddc2 <- centre(d, n)
R> ddc1 - ddc2
              1             2            3             4             5
1 -1.110223e-16 -5.551115e-17 0.000000e+00  0.000000e+00 -2.775558e-17
2 -5.551115e-17  0.000000e+00 5.204170e-18 -7.806256e-18  2.775558e-17
3  0.000000e+00 -1.214306e-17 0.000000e+00 -1.387779e-17 -1.734723e-17
4  0.000000e+00  1.387779e-17 2.775558e-17  1.387779e-17  1.127570e-17
5 -2.775558e-17  4.163336e-17 1.214306e-17  1.301043e-18  2.775558e-17

I repeated this code with n = 500 and the matrix returned by centre is more symmetric than the other one:

R> isSymmetric(ddc1, tol = 1e-15)
[1] FALSE
R> isSymmetric(ddc2, tol = 1e-15)

So I'm not sure the code wriiten by Pierre is an issue. I tried with the data posted by M. Riera on StackOverflow (713 x 713 matrix) and found no error, but cmdscalereturned zero colums, and pcoa returned a single column with all coordinates zero.

With the data you posted here: it worked for me with both functions. Here is the output from the last lines of your script:

R> isSymmetric(c)
[1] TRUE
R> c[3,1] == c[1,3]
[1] FALSE
R> c[3,1] - c[1,3]
[1] 9.325873e-15

So I'm not sure where's the problem. From your files' encoding it seems you both are under Windows (I'm under Linux), which may be an explanation.

cjcarlson commented 3 years ago

I am indeed on a Windows machine! But I'm pretty sure this is why we're getting complex eigenvalues. If you want to prove it to yourself (on a Windows machine), this is what convinced me (tacks onto the code I posted)

> D.eig <- eigen(c)
> is.complex(D.eig$vectors)
[1] TRUE
> is.complex(D.eig$values)
[1] TRUE
> 
> c[lower.tri(c)] = t(c)[lower.tri(c)]
> D.eig <- eigen(c)
> is.complex(D.eig$vectors)
[1] FALSE
> is.complex(D.eig$values)
[1] FALSE

Maybe it's just that the tolerance on my machine is different? But it's doing the exact calculation far enough out that this happens

cjcarlson commented 3 years ago

If I add this to the pcoa function:

  centre <- function(D, n) {
    One <- matrix(1, n, n)
    mat <- diag(n) - One/n
    mat.cen <- mat %*% D %*% mat
    mat.cen[lower.tri(mat.cen)] <- t(mat.cen)[lower.tri(mat.cen)]
    return(mat.cen)
  }

I can run the function correctly! So I'm 98% sure this is it

emmanuelparadis commented 3 years ago

I think this has something to do with LAPACK (used by eigen() and maybe %*%). Under Ubuntu, R depends on pre-installed LAPACK libraries, whereas under Windows these are delivered with R. This can explain some tiny differences. Your suggestion is a very good idea: I'm pushing a new version here in a moment. Thanks!