PoonLab / vindels

Developing an empirical model of sequence insertion and deletion in virus genomes

Handling superinfection cases #82

Closed ArtPoon closed 5 years ago

ArtPoon commented 5 years ago

Large data sets like 111848 can be split into two or more subsets that each correspond to a different transmitted founder variant.

jpalmer37 commented 5 years ago

I can't analyze the superinfection data set (111848.fasta) with RAPR because RAPR doesn't accept more than 800 sequences at once.

" Sorry! We could not process your request. The number of sequences in you input (=1033) exceeded the allowed maximum number of sequences (=800). "

Do you know of any alternative programs I could use for recombination detection?

ArtPoon commented 5 years ago

Try running it with 800 and see how the output looks

jpalmer37 commented 5 years ago

I finally got RAPR to work after running into problems. I used a subset of 600 randomly sampled sequences from patient 111848.
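For reference, a random subsample like this can be drawn with a few lines of Python. This is only a sketch, not the script actually used here: the function names, the seed, and the file names in the trailing comments are placeholders, and it assumes a plain FASTA file with one `>` header line per record.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) tuples from an open FASTA file or any iterable of lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def subsample(records, n, seed=None):
    """Return n records drawn without replacement, or all records if there are n or fewer."""
    records = list(records)
    if len(records) <= n:
        return records
    return random.Random(seed).sample(records, n)


def write_fasta(records, handle):
    """Write (header, sequence) tuples to an open file, one sequence per line."""
    for header, seq in records:
        handle.write(f">{header}\n{seq}\n")


# Hypothetical usage (file names are placeholders):
#   with open("111848.fasta") as infile:
#       subset = subsample(read_fasta(infile), n=600, seed=1)
#   with open("111848.sub600.fasta", "w") as outfile:
#       write_fasta(subset, outfile)
```

Fixing the seed makes the subset reproducible, so the same 600 sequences can be resubmitted if the run needs to be repeated.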

For context, I specified two consensus sequences, one generated from each of the two populations. I separated the populations by manually selecting sequences in the alignment and verified the split by checking the populations on a test tree.
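As an illustration of how a per-population consensus could be built, here is a minimal majority-rule sketch. It assumes the sequences in each group are already aligned to the same length; the function name is hypothetical, and the consensus sequences used for RAPR may have been generated differently (for example with an alignment viewer).

```python
from collections import Counter


def consensus(seqs):
    """Majority-rule consensus of aligned sequences of equal length.

    Gaps ('-') are counted like any other character, and ties are not
    handled specially.
    """
    seqs = [s.upper() for s in seqs]
    if not seqs:
        raise ValueError("no sequences given")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    result = []
    for column in zip(*seqs):
        # Take the most frequent character in this alignment column.
        result.append(Counter(column).most_common(1)[0][0])
    return "".join(result)
```

Calling `consensus` once on each manually selected group gives the two reference sequences to include with the RAPR input.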

According to RAPR, just over half of the sequences (307/600, about 51%) were recombinant hits, which seems too high. This is the link to the result on LANL.

And this shows the two lineages in the tree: [image: sample combined]

What are your thoughts? I downloaded and installed RDP4 while I was struggling to get RAPR working. Let me know if you'd like me to try that instead.

Just a side note: when I examined the paper further, 111848 was reported to have 7 transmitted/founder (T/F) viruses. Not sure if this might be affecting the result.

jpalmer37 commented 5 years ago

Out of curiosity, I installed and ran RDP4 on the full alignment file (1031 sequences + 2 consensus) and got this as a result:

[image: Screenshot from 2019-10-02 18-08-42]

However, I'm unsure whether I formatted the data in the analysis correctly.

jpalmer37 commented 5 years ago

RDP4 analysis was used as a general guideline to find sequences worth investigating. I still relied on manual screening of the phylogenetic tree containing all sequences. I removed all tips falling along the longest branch that separated the two distinct populations and recorded them in a filtering document.
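Once the tips to remove are recorded, applying the filter to the sequence data can be scripted. The sketch below assumes the filtering document is a plain-text list with one tip name per line and `#` starting a comment; that format and the function names are assumptions for illustration, not a description of the actual file in the repository.

```python
def load_excluded(lines):
    """Collect tip names from a filter list: one name per line, '#' starts a comment."""
    names = set()
    for line in lines:
        name = line.split("#", 1)[0].strip()
        if name:
            names.add(name)
    return names


def drop_excluded(records, excluded):
    """Split (header, sequence) records into kept and removed lists by exact header match."""
    kept, removed = [], []
    for header, seq in records:
        if header in excluded:
            removed.append((header, seq))
        else:
            kept.append((header, seq))
    return kept, removed
```

Returning the removed records as well makes it easy to confirm that every name in the filter list matched a sequence in the alignment.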