emmanuelparadis / ape

analysis of phylogenetics and evolution
http://ape-package.ird.fr/
GNU General Public License v2.0
52 stars 11 forks source link

Unexpected value in dist.dna on sequences with gaps #126

Open samcwiley opened 1 month ago

samcwiley commented 1 month ago

Hi, I really love using ape and have been using lots of the functions for calculating distance between sequences; thank you for this. I've noticed the nucleotide substitution models that include nucleotide base frequencies (namely Felsenstein-81 and Tamura-Nei 93) are giving me unexpected values for sequences with gaps when pairwise deletion is turned on. In testing, both dist.dna(dna_sequences, model = "F81", pairwise.deletion=FALSE) and dist.dna(dna_sequences, model = "F81", pairwise.deletion=TRUE) return the same value for pairs of sequences with gaps, and I believe this is due to the base frequency tabulations are still counting bases corresponding to a base with a gap (they are not getting deleted after all.)

emmanuelparadis commented 1 month ago

Hi, Thanks for the appreciation!

and I believe this is due to the base frequency tabulations are still counting bases corresponding to a base with a gap (they are not getting deleted after all.)

You're correct: the base frequencies are calculated with all the data, whatever the value of pairwise.deletion. The logic behind this is that in these models base frequencies are assumed to be constant over all sequences, so it's better to use all possible observed bases to estimate them.

You can see what are the possible impacts of this by computing the distances dropping all sequences but a pair, something like this:

n <- nrow(dna_sequences)
D <- numeric(n * (n - 1) / 2)
k <- 1
for (i in 1:(n - 1)) {
   for (j in (i + 1):n) {
      D[k] <- dist.dna(dna_sequences[c(i, j), ], "F81", ......>)
      k <- k + 1
    }
}

(This could be a bit slow if you have a very big data set.) D can be compared directly with the output of D0 <- dist.dna(dna_sequences, "F81", ......) for instance:

plot(D, D0); abline(0, 1, lty = 3)

Cheers, Emmanuel