ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0

cov3a Manual Blacklisting #124

Closed ababaian closed 4 years ago

ababaian commented 4 years ago

Blacklist Accessions

Blacklist Regions

ababaian commented 4 years ago

@taltman / @JustinChu / @rcedgar When you get a chance, could you take a look at SRR8421353?

The very 3' ends of many PEDV genomes are being hit here. In theory that is where coverage of the genome should be highest, but there are almost no reads elsewhere. The fact that multiple genomes are hit means it is a conserved sequence. An example read (with 100% identity to PEDV) is below. There are 2735 reads hitting the pan-genome.

Hap name: null
Dist: 0
Read name = SRR8421353.9733618
Sample = SAMN10716021
Library = SRR8421353
Read group = SRR8421353
Read length = 55bp
----------------------
Mapping = Primary @ MAPQ 1
Reference span = KM609206.1:27,983-28,037 (-) = 55bp
Cigar = 55M
Clipping = None
----------------------
XG = 0
NM = 0
XM = 0
XN = 0
XO = 0
AS = 110
XS = 110
YT = UU
Hidden tags: MD, RG
----------------------
Location = KM609206.1:27,989
Base = C @ QV 39

Alignment start position = KM609206.1:27983
ATTTGACTCAAGGACTGTTAGTAACTGAAGACCTGACGGTGTTGATATGGATACG

In a "non-virus" BLAST I could not find any homology to this sequence. Perhaps if we assemble that library and see where this sequence falls in the contigs, we can get a better understanding of whether it's a false positive (FP) or not.
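The tags in the alignment record above can be sanity-checked with a few lines of arithmetic. This is only a sketch: the `record` dictionary and the `summarize` helper are hypothetical names, and the check on AS assumes Bowtie2 was run in local mode with its default match bonus of 2, which the thread does not state. Under that assumption a perfect 55 nt match scores 2 × 55 = 110, and AS equal to XS means a second alignment scored just as well, which would be consistent with the low MAPQ.

```python
import re


def cigar_ref_length(cigar: str) -> int:
    """Count reference bases consumed by a CIGAR string (M, D, N, =, X)."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in ops if op in "MDN=X")


def summarize(record: dict, match_bonus: int = 2) -> dict:
    """Flag basic properties of one alignment record.

    match_bonus=2 is an assumption (Bowtie2 local-mode default),
    not something confirmed in the thread.
    """
    span = record["ref_end"] - record["ref_start"] + 1
    return {
        # Reference span should agree with what the CIGAR consumes.
        "span_matches_cigar": span == cigar_ref_length(record["cigar"]),
        # No edits and the maximum possible local score.
        "perfect_local_score": (
            record["NM"] == 0
            and record["AS"] == match_bonus * record["read_length"]
        ),
        # An equally good second alignment exists elsewhere.
        "tied_with_second_best": record["XS"] >= record["AS"],
    }


# Values copied from the record for SRR8421353.9733618.
record = {
    "read_length": 55,
    "ref_start": 27983,
    "ref_end": 28037,
    "cigar": "55M",
    "AS": 110,
    "XS": 110,
    "NM": 0,
    "mapq": 1,
}

flags = summarize(record)
print(flags)
```

All three flags come out true for this read, so the hit is exact but not unique to one reference.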

rcedgar commented 4 years ago

There are very short hits to mammals. Bowtie2 can see these if there is a ~20nt seed. This would be automatically masked by my proposed screening procedure :-)
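To illustrate how a ~20 nt exact match is enough to seed a hit, here is a minimal sketch that lists exact k-mers shared between a read and a candidate sequence on both strands. It is not Bowtie2's seeding algorithm, only a plain k-mer lookup; `shared_seeds` and `toy_host` are hypothetical, and the toy host sequence is constructed by hand rather than taken from any genome.

```python
def reverse_complement(seq: str) -> str:
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def shared_seeds(read: str, target: str, k: int = 20) -> list:
    """Return (read_offset, target_offset, strand) for exact k-mer matches."""
    # Index every k-mer of the target by its start position.
    index = {}
    for i in range(len(target) - k + 1):
        index.setdefault(target[i:i + k], []).append(i)

    hits = []
    # Look up k-mers from the read and from its reverse complement.
    for strand, query in (("+", read), ("-", reverse_complement(read))):
        for j in range(len(query) - k + 1):
            for i in index.get(query[j:j + k], []):
                hits.append((j, i, strand))
    return hits


# The 55 nt read shown earlier in the thread.
read = "ATTTGACTCAAGGACTGTTAGTAACTGAAGACCTGACGGTGTTGATATGGATACG"

# Hypothetical host fragment: 22 nt copied from the read, padded on both sides.
toy_host = "CCCC" + read[10:32] + "GGGG"

hits = shared_seeds(read, toy_host, k=20)
plus_hits = [h for h in hits if h[2] == "+"]
print(plus_hits)
```

A shared stretch of 22 nt yields three overlapping 20-mers, so even a very short region of identity produces several seed candidates.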


ababaian commented 4 years ago

The CIGAR on the alignment to the PEDV genomes is 55M. Until we know why thousands of those reads are there, I would want to investigate.

rcedgar commented 4 years ago

I didn't check pig. We can't address every biological question that comes up -- we can safely mask ANY 55nt sequence that causes trouble whether we understand it or not because we need much more than this to do any useful virus biology. The goal of the first pass is detection, not providing useful alignments for downstream analysis, that's a separate problem.
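Masking a troublesome region can be sketched as overwriting it with N in the reference before alignment. This is an illustrative assumption about what masking could look like, not the procedure used in serratus; `mask_regions` is a hypothetical helper, coordinates are taken to be 1-based and inclusive (as in `KM609206.1:27,983-28,037`), and the reference here is a toy string.

```python
def mask_regions(sequence: str, regions, mask_char: str = "N") -> str:
    """Replace each 1-based inclusive (start, end) region with mask_char."""
    masked = list(sequence)
    for start, end in regions:
        if start < 1 or end > len(sequence) or start > end:
            raise ValueError(f"region {start}-{end} is outside the sequence")
        # Convert 1-based inclusive coordinates to a 0-based half-open slice.
        masked[start - 1:end] = mask_char * (end - start + 1)
    return "".join(masked)


# Toy 20 nt reference; a real blacklist entry would use genome coordinates.
toy_reference = "ACGT" * 5
masked = mask_regions(toy_reference, [(5, 12)])
print(masked)
```

Because the masked sequence keeps its original length, coordinates of everything outside the blacklisted region are unchanged.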

ababaian commented 4 years ago

This is in fact a PEDV experiment, but why it's not matching the remaining bits of the genome is something we should find out. Perhaps they were using a PEDV that is not well characterized/deposited in GenBank.

Edit: It appears this library also uses a non-standard chemistry (IVT-SAPAS) on a cell-line infection, which focuses on 3' ends, so that explains what we're seeing.
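One way to spot a 3'-focused library like this automatically is to measure what fraction of aligned reads end near the 3' end of the reference. The sketch below is a rough heuristic under stated assumptions: `three_prime_fraction`, the 500 nt window, the genome length, and the read positions are all hypothetical values chosen for illustration, not data from SRR8421353.

```python
def three_prime_fraction(read_ends, genome_length: int, window: int = 500) -> float:
    """Fraction of reads whose alignment ends within `window` nt of the 3' end."""
    if not read_ends:
        return 0.0
    cutoff = genome_length - window
    inside = sum(1 for end in read_ends if end > cutoff)
    return inside / len(read_ends)


# Hypothetical alignment end positions on a hypothetical 28,000 nt genome.
toy_read_ends = [27990, 27950, 27800, 27600, 1200]
fraction = three_prime_fraction(toy_read_ends, genome_length=28000)
print(fraction)
```

A fraction close to 1 with almost no reads elsewhere would point to a 3'-end protocol rather than a whole-genome library.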

rcedgar commented 4 years ago

SRR8421353 shows hits to circovirus and the host genome, not PEDV. Pig virus, pig host -- not a coincidence, there are surely PEDV insertions into pig. The vets mis-diagnosed PEDV in that piglet. Very helpful having a mega-genome...

ababaian commented 4 years ago

Update complete. The next set of blacklists is for cov4.