KlausVigo / phangorn

Phylogenetic analysis in R
http://klausvigo.github.io/phangorn/
202 stars 38 forks source link

Duplicated sites are ignored when computing parsimony using "site" option #171

Closed cwbcm closed 1 month ago

cwbcm commented 1 month ago

Hi,

Thank you for making this great package! I am trying to compute parsimony for a list of traits using the parsimony() function. I found that the overall parsimony score doesn't match the sum of by site parsimony score.

library(phangorn)

set.seed(3)
data(Laurasiatherian)
dm <- dist.hamming(Laurasiatherian)
tree <- NJ(dm)
# compute overall parsimony and by site parsimony
parsimony(tree, Laurasiatherian) 
parsimony(tree, Laurasiatherian, site = "site") |> sum()

Here is the output

> parsimony(tree, Laurasiatherian) 
[1] 9796
> parsimony(tree, Laurasiatherian, site = "site") |> sum()
[1] 9566

When I dig deeper, I found that duplicated traits are not computed with the site="site" option.

# generate a duplicated trait
trait_mat <- as.character(Laurasiatherian)
trait_mat_test <- cbind(trait_mat[,1], trait_mat[,1])
trait_test <- as.phyDat(trait_mat_test)

parsimony(tree, trait_test) 
parsimony(tree, trait_test, site = "site") 

When computing overall parsimony, both sites are included. But only one parsimony score is returned when using the site = "site" option.

> parsimony(tree, trait_test) 
[1] 32
> parsimony(tree, trait_test, site = "site") 
[1] 16

I am not sure if this is an intended feature or a bug.

Here is my session info

> sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Chicago
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] phangorn_2.11.1 ape_5.8        

loaded via a namespace (and not attached):
 [1] digest_0.6.34     igraph_2.0.3      codetools_0.2-19  fastmatch_1.1-4   ranger_0.16.0     Matrix_1.6-5     
 [7] lattice_0.22-5    magrittr_2.0.3    glue_1.7.0        tictoc_1.2.1      parallel_4.3.3    pkgconfig_2.0.3  
[13] generics_0.1.3    lifecycle_1.0.4   cli_3.6.2         grid_4.3.3        compiler_4.3.3    rstudioapi_0.16.0
[19] tools_4.3.3       nlme_3.1-164      Rcpp_1.0.12       quadprog_1.5-8    rlang_1.1.3      
KlausVigo commented 1 month ago

Hi @cwbcm,

it is a feature, but it should be documented better. The phyDat object stores site patterns only once avoiding unnecessary computations, but has usually two additional attributes weight and index:

weight <- attr(Laurasiatherian, "weight")
index <- attr(Laurasiatherian, "index")  

storing the weight and position of each site pattern.

Hi,

Thank you for making this great package! I am trying to compute parsimony for a list of traits using the parsimony() function. I found that the overall parsimony score doesn't match the sum of by site parsimony score.

library(phangorn)

set.seed(3)
data(Laurasiatherian)
dm <- dist.hamming(Laurasiatherian)
tree <- NJ(dm)
# compute overall parsimony and by site parsimony
parsimony(tree, Laurasiatherian) 
parsimony(tree, Laurasiatherian, site = "site") |> sum()

So you could do either

> (parsimony(tree, Laurasiatherian, site = "site") * weight) |> sum()
[1] 9796

or

> parsimony(tree, Laurasiatherian, site = "site")[index] |> sum()
[1] 9796

Kind regards, Klaus

cwbcm commented 1 month ago

Hi Klaus,

Thank you so much for the clarification!

Thanks, Chen