Open rcedgar opened 4 years ago
We do have information about which contigs do contain RdRp in bgc_statistics.txt.
For the record, I vote on dropping those accessions from the initial preprint. Viral contig binning sounds like a can a worms
Disagree. If we have two global RdRp HMM alignments that are <97% identical to each other, we have two species. This seems pretty solid to me. This scenario is not unlikely in something a virome or environmental bat poop sample.
I'd say for immediacy we go with one, then we swing back and pick up the stragglers in a second pass. I would wager 99% of the time this is going to be the same CoV just a strain variant. Leave one of these issues open and we will return to it in a second pass.
I would agree, except this is a closely related issue to Frank's missing RdRp. Why not kill two pols with one stone and capture all the RdRps in the Cov contigs.
I mean we cluster by OTU later so we can get rid of duplicates that way so it could work yes. We'll just need to decide if we report how many unique SRA we have or how many unique contigs for count data.
So, there are few issues here. Actually the Frank problem is due to the subtle problem in the way how assemblies were run, so scaffolder got turned off. We might want to re-run of of the assemblies including Frank to obtain a single scaffold in the results. I believe I saw at least one other dataset that would benefit from it.
Certainly, using all RdRps is another story and is useful for tree building, etc.
I wanted to share a specific example of an assembly with multiple contigs, seemingly coming from different genomes.
Here is the FTR table output of VADR for SRR8389793
:
(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/SRR8389793/SRR8389793$ awk -v OFS="\t" '{ print $1, $2, $3, $5, $6, $7 }' SRR8389793.vadr.ftr
# seq seq ftr ftr ftr
#idx name len model type name
#--- --------------------------------------------------- ---- --------- ----------- -----------------------------------------------
1.1 NODE_9_length_984_cluster_9_candidate_1_domains_2 984 NC_035191 gene gene.1
1.2 NODE_9_length_984_cluster_9_candidate_1_domains_2 984 NC_035191 CDS ORF1ab_polyprotein
#
2.1 NODE_18_length_279_cluster_18_candidate_1_domains_1 279 NC_009019 gene orf1ab
2.2 NODE_18_length_279_cluster_18_candidate_1_domains_1 279 NC_009019 CDS orf1ab_polyprotein
#
3.1 NODE_19_length_275_cluster_19_candidate_1_domains_1 275 NC_016994 gene orf1ab
3.2 NODE_19_length_275_cluster_19_candidate_1_domains_1 275 NC_016994 CDS replicase_polyprotein
#
4.1 NODE_6_length_1483_cluster_6_candidate_1_domains_2 1483 NC_016996 gene orf1ab
4.2 NODE_6_length_1483_cluster_6_candidate_1_domains_2 1483 NC_016996 CDS replicase_polyprotein
#
5.1 NODE_8_length_1253_cluster_8_candidate_1_domains_2 1253 NC_045512 gene ORF1ab
5.2 NODE_8_length_1253_cluster_8_candidate_1_domains_2 1253 NC_045512 CDS ORF1ab_polyprotein
5.3 NODE_8_length_1253_cluster_8_candidate_1_domains_2 1253 NC_045512 mat_peptide endoRNAse
5.4 NODE_8_length_1253_cluster_8_candidate_1_domains_2 1253 NC_045512 mat_peptide 2'-O-ribose_methyltransferase
#
7.1 NODE_16_length_275_cluster_16_candidate_1_domains_1 275 NC_003045 gene orf1ab
7.2 NODE_16_length_275_cluster_16_candidate_1_domains_1 275 NC_003045 CDS orf1ab_polyprotein
7.3 NODE_16_length_275_cluster_16_candidate_1_domains_1 275 NC_003045 CDS orf1a_polyprotein
7.4 NODE_16_length_275_cluster_16_candidate_1_domains_1 275 NC_003045 mat_peptide coronavirus_nsp3_(HD2):GBSEP:hydrophobic_domain
If you look at the 4th column, it is the "model" column. This is the NCBI Nucleotide accession for the closest-matching RefSeq model (a VADR model is an organized set of CMs). You'll notice that there are six contigs with ORF 1ab hits, all of which have different models as the closest-matching.
We appear to have an unsolved problem with assemblies that have multiple Covs.
From @rchikhi in an earlier issue: "Among the 10,816 datasets of the master table, 272 (2.4%) of them have CoV contigs of total size longer than 50 kbp (arbitrary threshold at which there's likely >=2 genomes). Yet among those 272, 208 (76%) accessions have >= 2 contigs longer than 20kbp. So if we decided to try to separate genome, maybe taking the contigs longer than 20kbp is a viable strategy."
If there are 272, then IMO we do need to split them into two assemblies because there are several downstream analyses that assume there is only one virus per SRA. Serratax is one of them, and if I understood correctly then darth is another. I will need PFAM alignments separately if there are two good viruses in one SRA, at a minimum the RdRps if there are two. Regardless of what else we do, I think it would be a good idea to check if there are two good RdRp alignments to ensure we don't lose good novel Covs. Maybe the CS output can tell us if there are two RdRps.
@ababaian suggest you offer guidance here.