Closed rcedgar closed 4 years ago
I think adding these as particular meta-data terms to #45 is certainly a good idea.
It might take a bit more data runs for us to finish the blacklist. Also what you're doing is a great way of identifying these outliers. The sequence you mention FV537213 looks to be some mammalian related sequence and can be blacklisted.
I think removing all of the patent sequences is throwing the baby out with the bathwater. I'm sure there are many researchers who have a lot of CoV (SARS etc..) sequences in private databases and only release little bits of it when required for a patent and that is what we're seeing. It would be good to see what the distribution of these are across the phylogenetic tree but it might be an important source of variation in distal sequences with otherwise no public data available.
@taltman suggested adding low-complexity mask to the pan-genome which might be able to take care of some mis-alignment / error-prone areas. We can release soft/hard masked versions of the pan-genome with these as we do currently.
Agreed with low-complexity masking. I think using dust (the default nt masker in BLAST) should catch poly-A and poly-T plus less obvious cases. From memory, it's hard / impossible to find the dust code, usearch has an identical implementation if you need an alternative (fastx_mask command).
It's actually easy. The dustmasker executable is bundled with the NCBI-Blast+ package in Ubuntu (and I'm sure it's the same for the RPM).
Confirmed. Moral: don't trust my memory.
might be useful to include outgroups in the tree analysis. Something in Torovirinae would be useful to identify coronas within coronaviridae
The optimal split for Serratus will be guided by whether there are mammal-associated Torovirinae. Or for a more specific criterion, pathogenic mammal-associated Torovirinae.
We're doing family-level already (Coronaviridae
) so 3-4 representative members from Arteri/Mesoni/Roni- each would be the move to root us. Anything more dissimilar we can can consider outlier and/or non-CoV.
Edit: My mistake. As per this page we do not have Toroviruses and should include a sampling of them.
The tree should be seeded with reference seqs too. Maybe build a reference tree first then map the mined reads onto the reference tree with something like EPA or pplacer. Maybe this should be a separate issue?
Looks to me like this is digressing into a discussion of how to predict taxonomy when we find virus in a dataset. If so, then I think this should be a separate issue. Building a good tree is very difficult; the starting point should be trees for protein-coding genes noting that they may be discordant because of horizontal transfer and the fact that there are always many branching errors in predicted trees. Tree placement does not work well for taxonomy prediction; see these references for 16S which is surely the best-studied tree: (https://doi.org/10.7717/peerj.5030, https://peerj.com/articles/4652/).
KY419108.1 is "Porcine hemagglutinating encephalomyelitis virus strain PHEV CoV USA-15TOSU24992, complete genome."
but is only 751 nt compared to 30k for other genomes. Is this actually (a) complete and (b) Cov as claimed by the description?
It looks like a CoV and certainly not a complete genome. The smallest mammallian virus I know of is Circoviruses and they are ~1050 nt genomes.
Add KC786228.1:49-191
which has cryptic homology to E.coli / rRNA
Example read that was blast'd with soft-clip overhangs:
TCTATTCGTCGGCAGCGTCAGATGTGTATAAGAGACAGGTGTAGCGCGCGTGCAGCCCCGGACATCTAAGGGCAC
From #76:
rce AX191449.1 is an exact nt match to AJ295749.2 Rattus norvegicus mRNA for xylosyltransferase II. AX191447.1 is an exact nt match to AJ295748.1Rattus norvegicus mRNA for xylosyltransferase I. HV449436.1 is an exact nt match to NM_022255.1Rattus norvegicus G-protein coupled receptor 173.
KC786228.1
which matches fungal rRNA and gives a very strong false
I think I see one of the sources of weird entries in the reference database.
Check out this NCBI Entrez query: Coronaviridae NOT Coronaviridae[PORGN]
https://www.ncbi.nlm.nih.gov/nuccore/?term=Coronaviridae+NOT+Coronaviridae%5BPORGN%5D
So, the correct query is Coronaviridae[PORGN]
, not just a free-text Coronaviridae
.
I would advocate for restricting the queries to complete viral genomes, as most of the single-gene entries are redundant: Coronaviridae[PORGN] and "complete genome"
: https://www.ncbi.nlm.nih.gov/nuccore/?term=Coronaviridae%5BPORGN%5D+and+%22complete+genome%22
I can make this reproducible using BioPython or EDirect (once I find some spare time…)
Redundant entries are handled by the usearch clustering step -- any gene which matches a full length genome will be removed automatically. This leaves a few problematic genes / fragments including some patent sequences which have to be identified manually (i.e., visually). To avoid these problems, I would also have preferred to use full-length genomes, but @ababaian is adamant that we should be all-inclusive, presumably with the goal of finding the most distant possible relatives. I would appreciate a more detailed explanation when he has the time, but in the mean time I'm happy to follow his lead because the process of weeding out the junk seems to be going well enough.
Regarding the Entrez query, "complete genome" is not reliable, some are called "whole genome" and (from memory) there are a few other variants (?whole segment). Also from memory, I think there is an assembly_summary.txt file somewhere which gives the status of all the virus assemblies (for sure there is one for bacteria).
JB181528.1:3692-4258
is from a patent vaccine for Spike protein with a perfect BLAST match to a cloning "Vector pMRS101, complete sequence"DL231478.1:8,567-8,706
is at the end of the sequence, also likely to be a cloning artifact.NC_001461:4,993-5,267
in BVDV is a near perfect match to heat shock proteinNC_003391:9,229-9,922
in cowpox matches TMBIB4NC_001405:19,281-19,320
in adenovirus 4 has cryptic homology to cloning vectorMK204388.1:1-64
, specifically 35-55 is giving a seed match to Canis lupus familiaris breed Labrador retriever chromosome 08a
or LncRNA 2063.5 in human. This came up in library rabbit ear: SRR5786563
. Also.JB181528.1:3,675-4,258
shows match to a cloning vector pMD19.CS480537.1:30,130-30,332
and CS480537.1:37,234-37,315
E.coliDL231478.1:1-42
DL231478.1:8,568-8,687
end of the fragment are cloning vectorsLR721664.1:31,112-31,189
is a close match to a canine X2 lncRNA.KF530130.1:4,852-4,896
in Cat CoV gives thousands of FP hits to ferret scRNAseq (SRR6662917).MH002340.1:30,202-30,242
shows multi-library cryptic homology to DDRGK gene in pig as in SRR4344100
LQ338105.1:14-59
has a match in X2/antibody gene in CHO SRR9621030
KY967725.1:17,220-17,250
matches NOV gene in pig. SRR8573571
DI086074.1:1-168
match to E.coli.FV537211.1
and FV537213.1
have a very high AT content and are picking up all polyT signals.MG600026.1:3,926-3,981
is already partiall masked for TTTT, extend to this regionThese are now included in cov3m blacklisting.
I found a patent sequence (FV537213.1) in cov2.fa which is obviously non-biological (no Cs!) and causes problems for sequence search and comparison algorithms because it is low complexity. Perhaps we should delete all patent sequences? I found a total of 1,683 cov2.fa sequences with "patent" somewhere in the Genbank record. A counter-example might be A22378.1 which I don't understand very well -- it's a patented vaccine sequence. Looks like @ababaian already deleted many of the vaccine sequences. Maybe "vaccine" and "patent" should be added to #45?