ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

Regional blacklist #79

Closed rcedgar closed 4 years ago

rcedgar commented 4 years ago

In #47 we discussed how to implement the blacklist. @ababaian suggested (is already using?) a BED file to mask regions in valid CoV sequences that give false-positive (FP) host hits. I would recommend a different approach, as follows.

If a region is bad, other similar regions are certainly also bad. There are many very similar sequences in the pan-genome. Therefore, the region blacklist should be sequence fragments, not their coordinates.

When a problem fragment is identified, its sequence (not its coordinates) should be added to the blacklist database, and all hits to that sequence in the pan-genome reference should be masked. This accomplishes the same goal as a BED file (it masks the instance of the fragment that was found by hand), and is almost certainly better because it also masks all similar fragments, both those currently present and those added as the reference is updated. With a coordinate-based blacklist, those other instances must be identified by hand.
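A minimal sketch of the sequence-based masking idea, under stated assumptions: the function names (`revcomp`, `hamming_hits`, `mask_reference`) and the toy sequences are hypothetical and not part of Serratus, and "hit" is approximated here by an ungapped match with a bounded number of mismatches on either strand. A real implementation would more likely use an aligner to find hits.

```python
def revcomp(seq):
    """Reverse-complement a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def hamming_hits(ref, frag, max_mismatches=0):
    """Yield start offsets where frag matches ref with <= max_mismatches."""
    for start in range(len(ref) - len(frag) + 1):
        mismatches = 0
        for a, b in zip(ref[start:start + len(frag)], frag):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break
        else:
            # Loop finished without exceeding the mismatch budget.
            yield start


def mask_reference(ref, blacklist, max_mismatches=0, mask_char="N"):
    """Return ref with every hit to a blacklisted fragment replaced by mask_char."""
    ref_upper = ref.upper()
    masked = list(ref)
    for frag in blacklist:
        frag = frag.upper()
        if not frag:
            continue
        # Search both strands, since a fragment may occur in either orientation.
        for query in {frag, revcomp(frag)}:
            for start in hamming_hits(ref_upper, query, max_mismatches):
                masked[start:start + len(query)] = mask_char * len(query)
    return "".join(masked)


# Toy example: one blacklisted fragment, applied to every reference sequence.
blacklist = ["CGTACGG"]
references = {"refA": "AAACGTACGGTTT", "refB": "CCCACGTACGGCC"}
masked_refs = {name: mask_reference(seq, blacklist) for name, seq in references.items()}
for name, seq in masked_refs.items():
    print(name, seq)
```

Because the blacklist stores sequences rather than coordinates, re-running `mask_reference` after the reference is updated masks new copies of a known problem fragment without any manual step.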

Reviewing the summarizer reports, I see signs of problem fragments distributed across several reference sequences. Consider SRR11119416 as a typical example (pasted at end of this comment). The coverage of MG013973.1 looks very convincing: it's uniform across the entire sequence (len=1797, 1,407 hits) except for a bump at the right-hand end. There is also a bump at the end of KX285004.1 and at the beginning of KU664332.1, which otherwise have zero coverage. There are also a few bumps in the middle of other sequences, but end-bumps seem to be favored. This pattern suggests that the bumps in different sequences are similar to each other and are likely explained by something like a poly-A tail, meaning an mRNA segment ("doohickey") which may not be part of the CDS and is found in both host and viral transcripts. Poly-As as such should be dust-masked, but I'll bet there are some other mRNA doohickies causing similar problems.

acc=KU664332.1;hits=57335;len=2098;depth=4.1e+03;pctid=93.2;tax=28295;cov=0.0938;coverage=OOo_____________________________;desc=Porcine epidemic diarrhea virus;
acc=MG013973.1;hits=1407;len=1797;depth=117;pctid=90.7;tax=11120;cov=1.0000;coverage=...............................O;desc=Infectious bronchitis virus;
acc=KX285004.1;hits=302;len=316;depth=143;pctid=99.4;tax=983929;cov=0.0312;coverage=_______________________________O;desc=Chaerephon bat coronavirus/Kenya/KY22/2006;
acc=MK000573.1;hits=13;len=1147;depth=1.7;pctid=95.1;tax=28295;cov=0.2188;coverage=____O_____O_____________...O.___;desc=Porcine epidemic diarrhea virus;
acc=LC506915.1;hits=4;len=2328;depth=0.258;pctid=99.8;tax=31631;cov=0.0625;coverage=__________o_o___________________;desc=Human coronavirus OC43;
acc=MH921428.1;hits=2;len=416;depth=0.721;pctid=96.0;tax=1508220;cov=0.0312;coverage=_______________________________o;desc=Bat coronavirus;
acc=KX285218.1;hits=2;len=349;depth=0.86;pctid=98.7;tax=393045;cov=0.0312;coverage=_______________________________o;desc=Bat coronavirus HKU6;
acc=pan_genome;hits=2;len=30000;depth=0.01;pctid=100.0;tax=?;cov=0.0312;coverage=_________________o______________;desc=Pan-genome;
acc=MT246461.1;hits=2;len=30375;depth=0.00988;pctid=100.0;tax=2697049;cov=0.0312;coverage=_________________o______________;desc=Severe acute respiratory syndrome coronavirus 2;
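The end-bump pattern in reports like this one could be flagged automatically. The following is a rough sketch, not the summarizer's own logic: `parse_record` and `classify_coverage` are hypothetical helpers, and it assumes that `_` in the `coverage` cartoon means an empty bin while any other character means some coverage. The `edge_bins=3` threshold is an arbitrary illustrative choice.

```python
def parse_record(record):
    """Split one 'key=value;key=value;...' summary record into a dict."""
    fields = {}
    for item in record.strip().strip(";").split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


def classify_coverage(cartoon, edge_bins=3, empty="_"):
    """Label a coverage cartoon by where its non-empty bins fall."""
    covered = [i for i, ch in enumerate(cartoon) if ch != empty]
    if not covered:
        return "none"
    if all(i < edge_bins for i in covered):
        return "start-only"
    if all(i >= len(cartoon) - edge_bins for i in covered):
        return "end-only"
    return "other"


# Three records abbreviated from the report (only acc and coverage kept).
records = [
    "acc=KU664332.1;coverage=OOo_____________________________;",
    "acc=MG013973.1;coverage=...............................O;",
    "acc=KX285004.1;coverage=_______________________________O;",
]

labels = {}
for record in records:
    fields = parse_record(record)
    labels[fields["acc"]] = classify_coverage(fields["coverage"])

for acc, label in labels.items():
    print(acc, label)
```

References labeled `start-only` or `end-only` are candidates for harboring a blacklistable fragment; a fully covered reference such as MG013973.1 falls under `other` and would need depth information to detect its right-hand bump.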

ababaian commented 4 years ago

This is now "solved" with current blacklist annotation, suggest closing.

rcedgar commented 4 years ago

Agreed, this is solved until proven otherwise in which case new issue. Closing.