ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0
250 stars 32 forks source link

Assemblies with >1 Cov #212

Open rcedgar opened 3 years ago

rcedgar commented 3 years ago

We appear to have an unsolved problem with assemblies that have multiple Covs.

From @rchikhi in an earlier issue: "Among the 10,816 datasets of the master table, 272 (2.4%) of them have CoV contigs of total size longer than 50 kbp (arbitrary threshold at which there's likely >=2 genomes). Yet among those 272, 208 (76%) accessions have >= 2 contigs longer than 20kbp. So if we decided to try to separate genome, maybe taking the contigs longer than 20kbp is a viable strategy."

If there are 272, then IMO we do need to split them into two assemblies because there are several downstream analyses that assume there is only one virus per SRA. Serratax is one of them, and if I understood correctly then darth is another. I will need PFAM alignments separately if there are two good viruses in one SRA, at a minimum the RdRps if there are two. Regardless of what else we do, I think it would be a good idea to check if there are two good RdRp alignments to ensure we don't lose good novel Covs. Maybe the CS output can tell us if there are two RdRps.

@ababaian suggest you offer guidance here.

asl commented 3 years ago

We do have information about which contigs do contain RdRp in bgc_statistics.txt.

rchikhi commented 3 years ago

For the record, I vote on dropping those accessions from the initial preprint. Viral contig binning sounds like a can a worms

rcedgar commented 3 years ago

Disagree. If we have two global RdRp HMM alignments that are <97% identical to each other, we have two species. This seems pretty solid to me. This scenario is not unlikely in something a virome or environmental bat poop sample.

ababaian commented 3 years ago

I'd say for immediacy we go with one, then we swing back and pick up the stragglers in a second pass. I would wager 99% of the time this is going to be the same CoV just a strain variant. Leave one of these issues open and we will return to it in a second pass.

rcedgar commented 3 years ago

I would agree, except this is a closely related issue to Frank's missing RdRp. Why not kill two pols with one stone and capture all the RdRps in the Cov contigs.

ababaian commented 3 years ago

I mean we cluster by OTU later so we can get rid of duplicates that way so it could work yes. We'll just need to decide if we report how many unique SRA we have or how many unique contigs for count data.

asl commented 3 years ago

So, there are few issues here. Actually the Frank problem is due to the subtle problem in the way how assemblies were run, so scaffolder got turned off. We might want to re-run of of the assemblies including Frank to obtain a single scaffold in the results. I believe I saw at least one other dataset that would benefit from it.

Certainly, using all RdRps is another story and is useful for tree building, etc.

taltman commented 3 years ago

I wanted to share a specific example of an assembly with multiple contigs, seemingly coming from different genomes.

Here is the FTR table output of VADR for SRR8389793:

(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/SRR8389793/SRR8389793$ awk -v OFS="\t" '{ print $1, $2, $3, $5, $6, $7 }' SRR8389793.vadr.ftr 
#       seq     seq     ftr     ftr     ftr
#idx    name    len     model   type    name
#---    ---------------------------------------------------     ----    ---------       -----------     -----------------------------------------------
1.1     NODE_9_length_984_cluster_9_candidate_1_domains_2       984     NC_035191       gene    gene.1
1.2     NODE_9_length_984_cluster_9_candidate_1_domains_2       984     NC_035191       CDS     ORF1ab_polyprotein
#
2.1     NODE_18_length_279_cluster_18_candidate_1_domains_1     279     NC_009019       gene    orf1ab
2.2     NODE_18_length_279_cluster_18_candidate_1_domains_1     279     NC_009019       CDS     orf1ab_polyprotein
#
3.1     NODE_19_length_275_cluster_19_candidate_1_domains_1     275     NC_016994       gene    orf1ab
3.2     NODE_19_length_275_cluster_19_candidate_1_domains_1     275     NC_016994       CDS     replicase_polyprotein
#
4.1     NODE_6_length_1483_cluster_6_candidate_1_domains_2      1483    NC_016996       gene    orf1ab
4.2     NODE_6_length_1483_cluster_6_candidate_1_domains_2      1483    NC_016996       CDS     replicase_polyprotein
#
5.1     NODE_8_length_1253_cluster_8_candidate_1_domains_2      1253    NC_045512       gene    ORF1ab
5.2     NODE_8_length_1253_cluster_8_candidate_1_domains_2      1253    NC_045512       CDS     ORF1ab_polyprotein
5.3     NODE_8_length_1253_cluster_8_candidate_1_domains_2      1253    NC_045512       mat_peptide     endoRNAse
5.4     NODE_8_length_1253_cluster_8_candidate_1_domains_2      1253    NC_045512       mat_peptide     2'-O-ribose_methyltransferase
#
7.1     NODE_16_length_275_cluster_16_candidate_1_domains_1     275     NC_003045       gene    orf1ab
7.2     NODE_16_length_275_cluster_16_candidate_1_domains_1     275     NC_003045       CDS     orf1ab_polyprotein
7.3     NODE_16_length_275_cluster_16_candidate_1_domains_1     275     NC_003045       CDS     orf1a_polyprotein
7.4     NODE_16_length_275_cluster_16_candidate_1_domains_1     275     NC_003045       mat_peptide     coronavirus_nsp3_(HD2):GBSEP:hydrophobic_domain

If you look at the 4th column, it is the "model" column. This is the NCBI Nucleotide accession for the closest-matching RefSeq model (a VADR model is an organized set of CMs). You'll notice that there are six contigs with ORF 1ab hits, all of which have different models as the closest-matching.