emmanuelparadis / pegas

Population and Evolutionary Genetics Analysis System
GNU General Public License v2.0
27 stars 10 forks source link

How to get the same result as DnaSp in haplotype extraction? #75

Closed jaum20 closed 1 year ago

jaum20 commented 1 year ago

The same matrix (73 DNA sequences) produces very different results in the two programs. In DNAsp, 4 haplotypes are generated (gaps not considered and excluding sites with gaps / missing data) and with pegas I got 67 haplotypes!

I tried all sort of things: replace gaps with Ns, question marks, and all combinations of the strict and trailingGapsAsN arguments, all give me dozens of haplotypes but none gives me the 4 haplotypes wich works well for my data.

How can I exclude the sites with gaps and missing data to get the same result as DNAsp?

emmanuelparadis commented 1 year ago

It's hard to answer your question since I don't know the algorithm used by DNAsp to extract haplotypes. You may have a look at the function del.colgapsonly in ape but it considers only gaps. You can a finer control of which sites to delete by applying the function base.freq to each column of the data. More simply, you can just export the haplotypes from DNAsp into a FASTA file.

jaum20 commented 1 year ago

Thank you for your answer.

More simply, you can just export the haplotypes from DNAsp into a FASTA file.

The export file options in DNAsp are nexus, arlequin and Roehl (network). How can I import the nexus file generated by DNAsp into R as an 'haplotype' object?

emmanuelparadis commented 1 year ago

ape has the function read.nexus.data, see its help page. So you may try something like:

library(pegas) # will load ape too
x <- read.nexus.data("myfilefromDNAsp.nex")
x <- nexus2DNAbin(x)
h <- haplotype(x)
jaum20 commented 1 year ago
h <- haplotype(x)

Unfortunately that not worked:

In haplotype.DNAbin(x) : no segregating site detected with these options

emmanuelparadis commented 1 year ago

If you can share your data, I'll have a look

jaum20 commented 1 year ago

Sorry for the late response. Can you provide an e-mail address? I do not feel comfortable sharing a public link with my sequences

emmanuelparadis commented 1 year ago

My adress is in the DESCRIPTION file as maintainer of the package :)

jaum20 commented 1 year ago

I just sent you an email with the data. Sorry for the delay

emmanuelparadis commented 1 year ago

No worry for the delay :) I had a look at your data. Here's what you can do:

x <- read.dna("ITS_pruned.DnaSP.fas", "f")
image(x)

It shows that several positions (around 80-180) are polymorphic but some sequences have long stretches of gaps at these positions. So it makes sense that DNAsp misses these. Just to be clear on the way pegas::haplotype() handles these N's which are inserted in place of the leading/trailing gaps, in the following alignment the 1st position is considered as polymorphic but not the 2nd one:

NN
GA
AA

(If you set the option strict = TRUE both positions are treated as polymorphic). This generalizes to other ambiguous bases; in the next alignment, the 1st position is treated as polymorphic but not the 2nd one:

YR
GG
AA

If you want to reproduce the results from DNAsp, the function ape::del.colgapsonly can be used:

haplotype(del.colgapsonly(x, threshold = 0.01))

which outputs 4 haplotypes (but considers 390 positions instead of the 858 in the oiriginal data set).

jaum20 commented 1 year ago

Thank you very much for your help and explanation 🤝