emmanuelparadis / pegas

Population and Evolutionary Genetics Analysis System
GNU General Public License v2.0
27 stars 10 forks source link

Duplicated Sequence for MJN #87

Open Chatchamew opened 6 months ago

Chatchamew commented 6 months ago

Excuse me, Mr. Emmanuel. I am trying to your function mjn() but it always tells me “maybe there are duplicated sequences in your data”. My data is a set of 108 sequences; some of which are identical to each other. Does this mean that I have to make each haplotype have only one sequence? Or is there something I miss? Thank you in advance.

emmanuelparadis commented 6 months ago

Hello, Try the function haplotype() (in pegas too) on your 108 sequences. Once you did it, you can check all sequences are distinct with:

h <- haplotype(<<you sequence data name>>)
all(dist.dna(h, "n") > 0)

If the result is TRUE, you should be able to use mjn(h). Best, E.

Chatchamew commented 6 months ago

It came out as “FALSE”. I have already used “strict = TRUE” in the haplotype() function. By the way, some haplotypes are different through only deletions. Is this also why mjn() didn’t work?

emmanuelparadis commented 6 months ago

I suggest you try:

all(dist.dna(h, "n", pairwise.deletion = TRUE) > 0)

And also:

image(h)

It seems you read the help page ?haplotype so that you understand that trailing/leading gaps are a problem when identifying haplotypes. The above commands should help you to assess the situation with your data.

Chatchamew commented 6 months ago

I have tried all() and the result came up as “FALSE”, unfortunately. I have also tried image() and there are several gaps and degenerate bases (namely N and Y). I also tried trailingGapAsN = TRUE, and the all() still came up as “FALSE”. Any advice? I’m so sorry for wasting your time.

emmanuelparadis commented 5 months ago

It seems you have a very difficult data set, so your inferences will be necessarily limited.

Chatchamew commented 5 months ago

Roger that. I have dived into the data and found out that all pairs of haplotypes that have dist.dna = 0 differ only in either deletion or insertion. I really wish that you might try giving us an option to use mjn() despite the deletion, because the haplotype() function can separate them nicely.

emmanuelparadis commented 5 months ago

Have a look at this function in ape: DNAbin2indel. With it, you can then create a binary matrix indicating presence/absence of indels. pegas::mjn() can also analyse binary (0/1) data.

Chatchamew commented 5 months ago

I have tried a subset of my data with only sixteen sequences. I have checked the all(dist.dna()) function. If I used pairwise.deletion = TRUE, the all() function came up as TRUE. If I used pairwise.deletion = FALSE, the all() function came up as FALSE. When I used mjn(), it said “duplicate” again. Is there anything I can do?

By the way, can mjn() consider both base difference and deletion/insertion at the same time? I have tried DNAbin2indel() and the matrix I got is concerned only on deletion/insertion? Can I mix this matrix with base difference matrix and make it go through mjn()?

emmanuelparadis commented 5 months ago

Two other functions from ape that could help you with your data: latag2n() and solveAmbiguousBases() (maybe you already found them in the meantime).

You can try rmst() (also in pegas): it requires distances (unlike mjn()). It's not the same algorithm of course but it can sometimes give the same network (see the last example in ?rmst).

Chatchamew commented 5 months ago

Thank you so much for your response. I am now using rmst(). The only problem I have is that it doesn't generate median vectors. I will try to tackle with more sequences in the future. By the way, I would love to try mjn() that considers both base difference and base deletion/insertion at the same time. That would be revolutionary!

emmanuelparadis commented 5 months ago

By the way, I would love to try mjn() that considers both base difference and base deletion/insertion at the same time. That would be revolutionary!

There is the difficulty that indels within or on the head/tail of sequences should be treated differently. Another difficulty is that indels may also have substitutions (either before the deletion, or after the insertion). This could affect the time-reversibility of the model but I'm not sure how this is critical to the median-vectors. The revolution has to wait a bit!