emmanuelparadis / ape

analysis of phylogenetics and evolution
http://ape-package.ird.fr/
GNU General Public License v2.0
52 stars 11 forks source link

is.ultrametric unexpected behavior #122

Open pedrosenna opened 1 month ago

pedrosenna commented 1 month ago

Dear Emmanuel,

first of all, thanks for this amazing package.

I just found an unexpected behavior when i was trying to run splits::gmyc() over a set of post-burnin files from BEAST2 using purrr::map(), which is the tidyverse's lapply() function.

The problem seems to be in the is.ultrametric() evaluation made prior to the analysis. For example, running over a set of 1000 ultrametric trees generated using BEAST2:

Evaluating a multiPhylo object using is.ultrametric():

all(is.ultrametric(trees))

[1] TRUE

Evaluating a multiPhylo object using is.ultrametric() inside lapply():

suppressWarnings(all(lapply(trees, is.ultrametric)))

[1] TRUE

Evaluating a multiPhylo object using is.ultrametric() inside purrr::map():

suppressWarnings(all(purrr::map(trees, is.ultrametric)))

[1] FALSE

I really don't understand what's going on here. Any ideas?

emmanuelparadis commented 1 month ago

Dear Pedro,

Thanks for your appreciation (always appreciated!)

I run the following code (using sapply and unlist to avoid warnings) a couple of minutes and it run fine even with 100 tips (though it's a bit slow...):

i <- 0
repeat {
    i <- i + 1
    trees <- replicate(1000, rcoal(10), FALSE)
    class(trees) <- "multiPhylo"
    stopifnot(all(is.ultrametric(trees)))
    stopifnot(all(sapply(trees, is.ultrametric)))
    stopifnot(all(unlist(purrr::map(trees, is.ultrametric))))
}

Just in case: you can change the argument option of is.ultrametric(). From ?is.ultrametric:

The test is based on the distances from each tip to the root and a criterion: if ‘option = 1’, the criterion is the scaled range ((max - min/max)), if ‘option = 2’, the variance is used (this was the method used until ape 3.5). The default criterion is invariant to linear changes of the branch lengths.

Best,

Emmanuel

pedrosenna commented 1 month ago

Dear Emmanuel,

Thanks for the quick response! After running the same code you provided on my notebook, i wondered if that could be somehow related to how i am importing the tree files into R. I am importing BEAST2 trees using ape::read.nexus(). I did a test converting the BEAST2 files into newick format using FigTree and purrr:map() worked as intended.

Here's a reproducible example using your code.

trees <- replicate(1000, rcoal(10), FALSE)

class(trees) <- "multiPhylo"

write.nexus(trees, file = "data/test/random_trees.nex")

trees2 <- read.nexus(file= "data/test/random_trees.nex")

identical(trees, trees2) # FALSE

all(is.ultrametric(trees2)) # TRUE

all(sapply(trees2, is.ultrametric)) # TRUE

all(unlist(purrr::map(trees2, is.ultrametric))) # FALSE

I wonder why trees and trees2 fails to pass through identical() testing. File sizes also differ a lot (2.3 MB vs 1.5 MB).

pedrosenna commented 1 month ago

Dear Emmanuel,

here's another example of the same behavior after saving trees in nexus format and then reading back to R.

library(ape)
library(tidyverse)

set.seed(123)
trees <- replicate(1000, rcoal(10), FALSE)

write.nexus(trees, file = "random_trees.nex")
write.tree(trees, file = "random_trees.nwk")

trees_nex <- read.nexus("random_trees.nex")
trees_nwk <- read.tree("random_trees.nwk")

all(unlist(map(trees_nex, is.ultrametric))) # FALSE
all(unlist(map(trees_nwk, is.ultrametric))) # TRUE

Also, when comparing different str() levels between trees, i found this:

identical(trees_nex, trees_nwk)

[1] FALSE

identical(trees_nex[1], trees_nwk[1])

[1] FALSE

identical(trees_nex[[1]], trees_nwk[[1]])

[1] TRUE

str(trees_nex[1])

Class "multiPhylo" List of 1 $ :List of 3 ..$ edge : int [1:18, 1:2] 11 12 12 11 13 13 14 15 15 16 ... ..$ edge.length: num [1:18] 2.827 0.119 0.119 2.726 0.22 ... ..$ Nnode : int 9 ..- attr(, "class")= chr "phylo" ..- attr(, "order")= chr "cladewise"

  • attr(*, "TipLabel")= chr [1:10] "t10" "t7" "t1" "t8" ...
str(trees_nwk[1])

Class "multiPhylo" List of 1 $ :List of 4 ..$ edge : int [1:18, 1:2] 11 12 12 11 13 13 14 15 15 16 ... ..$ edge.length: num [1:18] 2.827 0.119 0.119 2.726 0.22 ... ..$ Nnode : int 9 ..$ tip.label : chr [1:10] "t10" "t7" "t1" "t8" ... ..- attr(, "class")= chr "phylo" ..- attr(, "order")= chr "cladewise"

Edit: Just found a workaround checking tracerer::parse_beast_trees.R here

# Use c to strip the state names and convert it to a pure multiPhylo
  trees_nex <- c(trees_nex)

all(unlist(map(trees_nex, is.ultrametric)))

[1] TRUE

emmanuelparadis commented 1 month ago

Dear Pedro,

I think there are several issues behind these results.

1) The order of the elements in "phylo" objects is not mandatory, so you cannot always rely on identical() for comparing trees: use all.equal() instead (see ?all.equal.phylo for details). Of course, if identical() returns TRUE, so does all.equal().

2) When printing numerical (ie, real, not integers) values in a file, there is (often) the issue of how many digits to print which may "distort" the values, eg, printing 16 digits is OK but not 10 digits:

R> as.numeric(sprintf("%.16f", pi)) == pi
[1] TRUE
R> as.numeric(sprintf("%.10f", pi)) == pi
[1] FALSE

write.tree() has the option digits = 10 (for the branch lengths) which can be modified. However, write.nexus() has no such option and 10 digits are printed (unless you modify the code yourself).

If you just need to save trees to use them later in R/ape, my recommendation is to use saveRDS(): this is faster, the file is compressed, and there is no loss in numerical precision for the branch lengths.

3) Because assessing ultrametricity is a statistical question, you may have encountered something like the followings:

R> tr <- rcoal(10)
R> is.ultrametric(tr, options = 1)
[1] TRUE
R> is.ultrametric(tr, options = 2)
[1] TRUE
R> ts <- read.tree(text = write.tree(tr, digits = 6))
R> is.ultrametric(ts, option = 1)
[1] FALSE
R> is.ultrametric(ts, option = 2)
[1] TRUE

Best,

Emmanuel