Phylogenetic analysis in R
Duplicated sites are ignored when computing parsimony using "site" option #171

Thank you for making this great package! I am trying to compute parsimony for a list of traits using the parsimony() function, and I found that the overall parsimony score doesn't match the sum of the per-site parsimony scores.


library(phangorn)
data(Laurasiatherian)
dm <- dist.hamming(Laurasiatherian)
tree <- NJ(dm)
# compute overall parsimony and by site parsimony
parsimony(tree, Laurasiatherian)
parsimony(tree, Laurasiatherian, site = "site") |> sum()

Here is the output

> parsimony(tree, Laurasiatherian) 
[1] 9796
> parsimony(tree, Laurasiatherian, site = "site") |> sum()
[1] 9566

When I dug deeper, I found that duplicated traits are not computed with the site = "site" option.

# generate a duplicated trait
trait_mat <- as.character(Laurasiatherian)
trait_mat_test <- cbind(trait_mat[,1], trait_mat[,1])
trait_test <- as.phyDat(trait_mat_test)

parsimony(tree, trait_test) 
parsimony(tree, trait_test, site = "site") 

When computing overall parsimony, both sites are included. But only one parsimony score is returned when using the site = "site" option.

> parsimony(tree, trait_test) 
[1] 32
> parsimony(tree, trait_test, site = "site") 
[1] 16

I am not sure if this is an intended feature or a bug.

Here is my session info

Hi @cwbcm,

It is a feature, but it should be documented better. To avoid unnecessary computation, the phyDat object stores each site pattern only once, but it usually has two additional attributes, weight and index:

weight <- attr(Laurasiatherian, "weight")
index <- attr(Laurasiatherian, "index")  

weight stores how many times each site pattern occurs, and index stores, for each site of the original alignment, the position of its pattern.
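
A quick way to see how the two attributes relate, as a minimal sketch on the bundled Laurasiatherian data (the checks below are only illustrative, and assume the object carries both attributes, which is the usual case):

```r
library(phangorn)
data(Laurasiatherian)

weight <- attr(Laurasiatherian, "weight")
index <- attr(Laurasiatherian, "index")

# number of unique site patterns stored in the object
n_patterns <- length(weight)
# number of sites in the original alignment
n_sites <- length(index)

# the weights add up to the number of original sites
weights_cover_sites <- sum(weight) == n_sites
# each weight equals how often that pattern is referenced by index
weights_match_index <- all(tabulate(index, nbins = n_patterns) == weight)

c(n_patterns = n_patterns, n_sites = n_sites)
weights_cover_sites
weights_match_index
```

Since there are fewer patterns than sites, a per-pattern result is shorter than the alignment, which is why the plain sum comes out too small.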


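So to recover the overall score you could do either of the following: multiply the per-pattern scores by their weights, or use index to expand them back to one score per site.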

> (parsimony(tree, Laurasiatherian, site = "site") * weight) |> sum()
[1] 9796


> parsimony(tree, Laurasiatherian, site = "site")[index] |> sum()
[1] 9796
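
The same two approaches can be applied to the duplicated-trait example from the question. This is a sketch that reuses the names from the question; per_pattern and total are just illustrative variable names:

```r
library(phangorn)
data(Laurasiatherian)

tree <- NJ(dist.hamming(Laurasiatherian))

# build a two-site alignment in which both sites are identical
trait_mat <- as.character(Laurasiatherian)
trait_test <- as.phyDat(cbind(trait_mat[, 1], trait_mat[, 1]))

weight <- attr(trait_test, "weight")
index <- attr(trait_test, "index")

# one score per unique site pattern, not per original site
per_pattern <- parsimony(tree, trait_test, site = "site")
# overall score, which counts every original site
total <- parsimony(tree, trait_test)

# both corrections should agree with the overall score
isTRUE(all.equal(sum(per_pattern * weight), total))
isTRUE(all.equal(sum(per_pattern[index]), total))

# one score per original site, in alignment order
per_pattern[index]
```

Indexing with index is the more convenient form when you need the score of each original site rather than only the total.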

Kind regards, Klaus

Hi Klaus,

Thank you so much for the clarification!

Thanks, Chen