Updates for cov3r pan-genome

rcedgar commented 4 years ago

I found a patent sequence (FV537213.1) in cov2.fa which is obviously non-biological (no Cs!) and causes problems for sequence search and comparison algorithms because it is low complexity. Perhaps we should delete all patent sequences? I found a total of 1,683 cov2.fa sequences with "patent" somewhere in the Genbank record. A counter-example might be A22378.1 which I don't understand very well -- it's a patented vaccine sequence. Looks like @ababaian already deleted many of the vaccine sequences. Maybe "vaccine" and "patent" should be added to #45?

ababaian commented 4 years ago

I think adding these as particular meta-data terms to #45 is certainly a good idea.

It might take a bit more data runs for us to finish the blacklist. Also what you're doing is a great way of identifying these outliers. The sequence you mention FV537213 looks to be some mammalian related sequence and can be blacklisted.

I think removing all of the patent sequences is throwing the baby out with the bathwater. I'm sure there are many researchers who have a lot of CoV (SARS etc..) sequences in private databases and only release little bits of it when required for a patent and that is what we're seeing. It would be good to see what the distribution of these are across the phylogenetic tree but it might be an important source of variation in distal sequences with otherwise no public data available.

ababaian commented 4 years ago

@taltman suggested adding low-complexity mask to the pan-genome which might be able to take care of some mis-alignment / error-prone areas. We can release soft/hard masked versions of the pan-genome with these as we do currently.

rcedgar commented 4 years ago

Agreed with low-complexity masking. I think using dust (the default nt masker in BLAST) should catch poly-A and poly-T plus less obvious cases. From memory, it's hard / impossible to find the dust code, usearch has an identical implementation if you need an alternative (fastx_mask command).

taltman commented 4 years ago

It's actually easy. The dustmasker executable is bundled with the NCBI-Blast+ package in Ubuntu (and I'm sure it's the same for the RPM).

rcedgar commented 4 years ago

Confirmed. Moral: don't trust my memory.

charlescongxu commented 4 years ago

might be useful to include outgroups in the tree analysis. Something in Torovirinae would be useful to identify coronas within coronaviridae

taltman commented 4 years ago

The optimal split for Serratus will be guided by whether there are mammal-associated Torovirinae. Or for a more specific criterion, pathogenic mammal-associated Torovirinae.

ababaian commented 4 years ago

We're doing family-level already (Coronaviridae) so 3-4 representative members from Arteri/Mesoni/Roni- each would be the move to root us. Anything more dissimilar we can can consider outlier and/or non-CoV.

Edit: My mistake. As per this page we do not have Toroviruses and should include a sampling of them.

charlescongxu commented 4 years ago

The tree should be seeded with reference seqs too. Maybe build a reference tree first then map the mined reads onto the reference tree with something like EPA or pplacer. Maybe this should be a separate issue?

rcedgar commented 4 years ago

Looks to me like this is digressing into a discussion of how to predict taxonomy when we find virus in a dataset. If so, then I think this should be a separate issue. Building a good tree is very difficult; the starting point should be trees for protein-coding genes noting that they may be discordant because of horizontal transfer and the fact that there are always many branching errors in predicted trees. Tree placement does not work well for taxonomy prediction; see these references for 16S which is surely the best-studied tree: (https://doi.org/10.7717/peerj.5030, https://peerj.com/articles/4652/).

rcedgar commented 4 years ago

KY419108.1 is "Porcine hemagglutinating encephalomyelitis virus strain PHEV CoV USA-15TOSU24992, complete genome." but is only 751 nt compared to 30k for other genomes. Is this actually (a) complete and (b) Cov as claimed by the description?

ababaian commented 4 years ago

It looks like a CoV and certainly not a complete genome. The smallest mammallian virus I know of is Circoviruses and they are ~1050 nt genomes.

ababaian commented 4 years ago

Add KC786228.1:49-191 which has cryptic homology to E.coli / rRNA

Screenshot from 2020-05-06 22-35-11

Example read that was blast'd with soft-clip overhangs:

TCTATTCGTCGGCAGCGTCAGATGTGTATAAGAGACAGGTGTAGCGCGCGTGCAGCCCCGGACATCTAAGGGCAC

ababaian commented 4 years ago

From #76:

rce AX191449.1 is an exact nt match to AJ295749.2 Rattus norvegicus mRNA for xylosyltransferase II. AX191447.1 is an exact nt match to AJ295748.1Rattus norvegicus mRNA for xylosyltransferase I. HV449436.1 is an exact nt match to NM_022255.1Rattus norvegicus G-protein coupled receptor 173.

KC786228.1 which matches fungal rRNA and gives a very strong false

taltman commented 4 years ago

I think I see one of the sources of weird entries in the reference database.

Check out this NCBI Entrez query: Coronaviridae NOT Coronaviridae[PORGN]

https://www.ncbi.nlm.nih.gov/nuccore/?term=Coronaviridae+NOT+Coronaviridae%5BPORGN%5D

So, the correct query is Coronaviridae[PORGN], not just a free-text Coronaviridae.

I would advocate for restricting the queries to complete viral genomes, as most of the single-gene entries are redundant: Coronaviridae[PORGN] and "complete genome" : https://www.ncbi.nlm.nih.gov/nuccore/?term=Coronaviridae%5BPORGN%5D+and+%22complete+genome%22

I can make this reproducible using BioPython or EDirect (once I find some spare time…)

rcedgar commented 4 years ago

Redundant entries are handled by the usearch clustering step -- any gene which matches a full length genome will be removed automatically. This leaves a few problematic genes / fragments including some patent sequences which have to be identified manually (i.e., visually). To avoid these problems, I would also have preferred to use full-length genomes, but @ababaian is adamant that we should be all-inclusive, presumably with the goal of finding the most distant possible relatives. I would appreciate a more detailed explanation when he has the time, but in the mean time I'm happy to follow his lead because the process of weeding out the junk seems to be going well enough.

rcedgar commented 4 years ago

Regarding the Entrez query, "complete genome" is not reliable, some are called "whole genome" and (from memory) there are a few other variants (?whole segment). Also from memory, I think there is an assembly_summary.txt file somewhere which gives the status of all the virus assemblies (for sure there is one for bacteria).

ababaian commented 4 years ago

CoV

JB181528.1:3692-4258 is from a patent vaccine for Spike protein with a perfect BLAST match to a cloning "Vector pMRS101, complete sequence"
DL231478.1:8,567-8,706 is at the end of the sequence, also likely to be a cloning artifact.

Mega

NC_001461:4,993-5,267 in BVDV is a near perfect match to heat shock protein
NC_003391:9,229-9,922 in cowpox matches TMBIB4
NC_001405:19,281-19,320 in adenovirus 4 has cryptic homology to cloning vector

ababaian commented 4 years ago

Cryptic Homologies

MK204388.1:1-64, specifically 35-55 is giving a seed match to Canis lupus familiaris breed Labrador retriever chromosome 08a or LncRNA 2063.5 in human. This came up in library rabbit ear: SRR5786563. Also.
JB181528.1:3,675-4,258 shows match to a cloning vector pMD19.
CS480537.1:30,130-30,332 and CS480537.1:37,234-37,315 E.coli
Start and DL231478.1:1-42 DL231478.1:8,568-8,687 end of the fragment are cloning vectors
LR721664.1:31,112-31,189 is a close match to a canine X2 lncRNA.
KF530130.1:4,852-4,896 in Cat CoV gives thousands of FP hits to ferret scRNAseq (SRR6662917).
MH002340.1:30,202-30,242 shows multi-library cryptic homology to DDRGK gene in pig as in SRR4344100
LQ338105.1:14-59 has a match in X2/antibody gene in CHO SRR9621030
KY967725.1:17,220-17,250 matches NOV gene in pig. SRR8573571
DI086074.1:1-168 match to E.coli.

Crazy AT Conent

FV537211.1 and FV537213.1 have a very high AT content and are picking up all polyT signals.
MG600026.1:3,926-3,981 is already partiall masked for TTTT, extend to this region

ababaian commented 4 years ago

These are now included in cov3m blacklisting.

ababaian / serratus