Closed ababaian closed 4 years ago
@taltman / @JustinChu / @rcedgar When you get a chance, could you take a look at SRR8421353?
The very 3' end of many PEDV genomes is being hit here, which in theory is where coverage of the genome should be highest, but there are almost no reads elsewhere. The fact that multiple genomes are hit means it is a conserved sequence. An example of a read (with 100% identity to PEDV) is below. There are 2735 reads hitting the pan-genome.
Hap name: null
Dist: 0
Read name = SRR8421353.9733618
Sample = SAMN10716021
Library = SRR8421353
Read group = SRR8421353
Read length = 55bp
----------------------
Mapping = Primary @ MAPQ 1
Reference span = KM609206.1:27,983-28,037 (-) = 55bp
Cigar = 55M
Clipping = None
----------------------
XG = 0
NM = 0
XM = 0
XN = 0
XO = 0
AS = 110
XS = 110
YT = UU
Hidden tags: MD, RG
----------------------
Location = KM609206.1:27,989
Base = C @ QV 39
Alignment start position = KM609206.1:27983
ATTTGACTCAAGGACTGTTAGTAACTGAAGACCTGACGGTGTTGATATGGATACG
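The tags in this read already hint at why the MAPQ is so low. In Bowtie2 output, AS is the score of the reported alignment and XS is the score of the best other alignment found for the read. Here both are 110, so at least one other location matches equally well, which fits a sequence shared by several PEDV genomes. A minimal sketch of that check, assuming the tags have been collected into a plain dict; the function name `is_ambiguous` is hypothetical:

```python
def is_ambiguous(tags):
    """Return True when the best alternative alignment (XS) scores at least
    as well as the reported alignment (AS)."""
    xs = tags.get("XS")
    if xs is None:
        # No XS tag: the aligner found no other alignment for this read.
        return False
    return xs >= tags["AS"]


# Tag values copied from the read shown above.
read_tags = {"AS": 110, "XS": 110, "NM": 0}
print(is_ambiguous(read_tags))  # True
```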
In a "non-virus" BLAST I could not find any homology to this sequence. Perhaps if we assemble that library and see where this sequence falls in the contigs, we can get a better understanding of whether it's a false positive (FP) or not.
There are very short hits to mammals. Bowtie2 can see these if there is a ~20nt seed. This would be automatically masked by my proposed screening procedure :-)
The CIGAR on the alignment to the PEDV genomes is 55M. Until we know why those thousands of reads are there, I would want to investigate.
I didn't check pig. We can't address every biological question that comes up -- we can safely mask ANY 55nt sequence that causes trouble whether we understand it or not because we need much more than this to do any useful virus biology. The goal of the first pass is detection, not providing useful alignments for downstream analysis, that's a separate problem.
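Masking a short troublesome stretch like this comes down to overwriting an interval of the reference with N. A minimal sketch, assuming 1-based inclusive coordinates as in the region strings used in this thread; `mask_region` is a hypothetical helper, not part of the actual pipeline, and the sequence is a toy example:

```python
def mask_region(seq, start, end, mask_char="N"):
    """Replace the 1-based, inclusive interval [start, end] with mask_char."""
    if not (1 <= start <= end <= len(seq)):
        raise ValueError("interval lies outside the sequence")
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return seq[:start - 1] + mask_char * (end - start + 1) + seq[end:]


# Toy sequence; a real reference would be read from a FASTA file.
print(mask_region("ACGTACGT", 3, 5))  # ACNNNCGT
```

The masked sequence keeps its original length, so coordinates of everything outside the interval stay valid.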
This is in fact a PEDV experiment, but why it's not matching the remaining bits of the genome is something we should find out. Perhaps they were using a PEDV that is not well characterized/deposited in GenBank.
Edit: Appears it's also a non-standard chemistry of a cell-line infection that focuses on 3' ends so that explains what we're seeing. IVT-SAPAS.
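A 3'-focused library like this can be spotted from the alignment positions alone. The sketch below computes the fraction of alignment starts that fall in the last `window` bases of a genome. The function name, the window size, and the numbers in the usage line are all made up for illustration, not taken from SRR8421353:

```python
def three_prime_fraction(starts, genome_length, window=500):
    """Fraction of 1-based alignment start positions that fall within
    the last `window` bases of the genome."""
    if not starts:
        return 0.0
    cutoff = genome_length - window
    in_window = sum(1 for s in starts if s > cutoff)
    return in_window / len(starts)


# Toy data: three of the four alignments start in the final 100 bases.
print(three_prime_fraction([10, 960, 980, 990], genome_length=1000, window=100))  # 0.75
```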
SRR8421353 shows hits to circovirus and the host genome, not PEDV. Pig virus, pig host -- not a coincidence, there are surely PEDV insertions into pig. The vets mis-diagnosed PEDV in that piglet. Very helpful having a mega-genome...
Update complete. Next set of blacklists are for cov4
Blacklist Accessions
DL231478.1: Recombinant raccoon pox viruses and their use as an effective vaccine. Picks up a lot of junk
Blacklist Regions
MK562374.1:472-561
JB181528.1:11-200: Patent sequence with plasmid
JB181528.1:3650-4300: Patent sequence with some junk vector
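One way entries like these could be applied is to drop any hit that lands on a blacklisted accession or overlaps a blacklisted region. This is a minimal sketch, assuming 1-based inclusive coordinates in the `ACCESSION:start-end` form used above; `parse_region` and `is_blacklisted` are hypothetical helpers, not the project's actual code:

```python
import re

# Matches strings such as "JB181528.1:11-200".
REGION_RE = re.compile(r"^(?P<acc>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(text):
    """Parse 'ACC:start-end' into (accession, start, end)."""
    m = REGION_RE.match(text.strip())
    if m is None:
        raise ValueError(f"not a region string: {text!r}")
    return m["acc"], int(m["start"]), int(m["end"])


def is_blacklisted(acc, start, end, accessions, regions):
    """True if the whole accession is blacklisted or the hit interval
    [start, end] overlaps any blacklisted region on that accession."""
    if acc in accessions:
        return True
    return any(
        acc == r_acc and start <= r_end and end >= r_start
        for r_acc, r_start, r_end in regions
    )


# Entries copied from the lists above.
BLACKLIST_ACCESSIONS = {"DL231478.1"}
BLACKLIST_REGIONS = [
    parse_region(r)
    for r in ("MK562374.1:472-561", "JB181528.1:11-200", "JB181528.1:3650-4300")
]

print(is_blacklisted("JB181528.1", 150, 250, BLACKLIST_ACCESSIONS, BLACKLIST_REGIONS))  # True
```

Any overlap counts as a match here; requiring a hit to lie fully inside a region would be a stricter alternative.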