Closed rcedgar closed 3 years ago
I see that the assembly file is very large. In fact, if you look at the deflines:
root@5c364609a0eb:/output# ls -lh SRR9156994.coronaspades.gene_clusters.checkv_filtered.fa
-rw-rw-r-- 1 1000 1000 59K Jul 14 23:23 SRR9156994.coronaspades.gene_clusters.checkv_filtered.fa
root@5c364609a0eb:/output# egrep "^>" SRR9156994.fna
>SRR9156994.coronaspades.NODE_1_length_30093_cluster_1_candidate_1_domains_34
>SRR9156994.coronaspades.NODE_7_length_30093_cluster_7_candidate_1_domains_1
Looks like some kind of duplicate? Seems unlikely that two unique CoVs would have the exact same length. Perhaps this is an interesting case for Anton and team? @rchikhi, can you make sure this isn't a CheckV filtering error?
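A quick way to screen for this kind of suspected duplicate is to parse the deflines and flag contigs that report the same length. This is a minimal sketch: the regular expression is inferred only from the two deflines quoted above, and the helper names `parse_defline` and `same_length_groups` are hypothetical, not part of coronaSPAdes or CheckV. Equal lengths are only a hint; confirming a duplicate would still require comparing the sequences themselves.

```python
import re
from collections import defaultdict

# Pattern inferred from the two deflines quoted in this thread (an assumption);
# other assembler outputs may use a different field layout.
DEFLINE_RE = re.compile(
    r"NODE_(?P<node>\d+)_length_(?P<length>\d+)_cluster_(?P<cluster>\d+)"
    r"_candidate_(?P<candidate>\d+)_domains_(?P<domains>\d+)"
)


def parse_defline(defline):
    """Return the numeric defline fields as a dict, or None if the line does not match."""
    match = DEFLINE_RE.search(defline)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


def same_length_groups(deflines):
    """Map each contig length that occurs more than once to the node IDs reporting it."""
    by_length = defaultdict(list)
    for defline in deflines:
        fields = parse_defline(defline)
        if fields is not None:
            by_length[fields["length"]].append(fields["node"])
    return {length: nodes for length, nodes in by_length.items() if len(nodes) > 1}


deflines = [
    ">SRR9156994.coronaspades.NODE_1_length_30093_cluster_1_candidate_1_domains_34",
    ">SRR9156994.coronaspades.NODE_7_length_30093_cluster_7_candidate_1_domains_1",
]
print(same_length_groups(deflines))  # {30093: [1, 7]}
```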
This particular case looks a bit strange; we will investigate it. However, note that several multivirome datasets do contain several species. SRR2010686 is an example.
To be clear: this is not a CheckV filtering error.
VADR is unable to handle weird cases like this, when the "genome" really contains two viral genomes. This is something that needs to be fixed upstream of DARTH, so that it only receives one genome at a time. Reassigning to @rchikhi.
I'd recommend dropping those multi-CoV-genome samples for now and looking at them in a later pass (after initial submission). Any thoughts? @taltman @asl @ababaian @rcedgar
If we don't drop them, we'll need to have a discussion on how to separate genomes. When a ~29 kbp contig is present in a sample (among other contigs), it's clear this one can be annotated on its own. But many samples are a collection of smaller contigs, e.g. 1-2 kbp. I don't think it makes sense to run VADR on each of these contigs separately. Ideally we'd need some sort of 'viral binning'.
To give some perspective: among the 10,816 datasets of the master table, 272 (2.4%) of them have CoV contigs of total size longer than 50 kbp (an arbitrary threshold above which there are likely >= 2 genomes). Yet among those 272, 208 (76%) accessions have >= 2 contigs longer than 20 kbp. So if we decide to try to separate genomes, taking the contigs longer than 20 kbp may be a viable strategy.
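The heuristic described above can be sketched as a small classifier over per-sample contig lengths. The 50 kbp and 20 kbp thresholds come from the comment; the function name `classify_sample` and the three category labels are hypothetical, chosen only for illustration, and this is not the code that produced the counts above.

```python
# Thresholds taken from the comment above; both were described as arbitrary.
TOTAL_THRESHOLD_BP = 50_000  # total CoV contig length suggesting >= 2 genomes
LONG_CONTIG_BP = 20_000      # a contig this long can plausibly be annotated alone


def classify_sample(contig_lengths):
    """Classify one sample from the lengths (in bp) of its CoV contigs.

    Returns one of three hypothetical labels:
      "single"           total length at or below the 50 kbp threshold
      "multi_separable"  above the threshold with >= 2 contigs longer than 20 kbp
      "multi_fragmented" above the threshold but made up of shorter contigs
    """
    total = sum(contig_lengths)
    long_contigs = [n for n in contig_lengths if n > LONG_CONTIG_BP]
    if total <= TOTAL_THRESHOLD_BP:
        return "single"
    if len(long_contigs) >= 2:
        return "multi_separable"
    return "multi_fragmented"


# The two 30,093 bp contigs from SRR9156994 would land in the separable group.
print(classify_sample([30093, 30093]))  # multi_separable
```

Samples labelled "multi_fragmented" are the ones that would still need some form of viral binning before annotation.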
If there are 272, then IMO we do need to split them into two assemblies because there are several downstream analyses that assume there is only one virus per SRA. Serratax is one of them, and if I understood correctly then DARTH is another. I will need PFAM alignments separately if there are two good viruses in one SRA, or at a minimum the RdRps if there are two. Regardless of what else we do, I think it would be a good idea to check whether there are two good RdRp alignments to ensure we don't lose good novel CoVs. Maybe the CS output can tell us if there are two RdRps.
@taltman I think this problem is fixed now, can I close the issue?
The immediate issue is fixed with #216. The side-bar about multi-CoV assemblies has been moved to #212. Closing.
Rayan posted this example output from darth:
s3://serratus-public/assemblies/other/SRR9156994.coronaspades/
This seems to be lacking fasta alignments from pfam, while other sets included them. The Stockholm file does have alignments.
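Since the Stockholm file does contain the alignments, an aligned FASTA could in principle be regenerated from it. The sketch below is a plain-Python converter for simple Stockholm input, written only to illustrate the idea; it is not how DARTH produces its FASTA output, and the function name `stockholm_to_fasta` is hypothetical. It ignores all markup lines (`#=GF`, `#=GC`, and so on) and keeps gap characters unchanged.

```python
def stockholm_to_fasta(text):
    """Convert Stockholm-formatted alignment text to aligned FASTA text.

    Lines starting with '#' (header and markup) and the '//' terminator are
    skipped. Interleaved blocks are handled by concatenating the fragments
    seen for each sequence name, in order of first appearance.
    """
    sequences = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or line == "//":
            continue
        parts = line.split()
        if len(parts) != 2:
            # Not a "name  aligned_sequence" line; skip rather than guess.
            continue
        name, fragment = parts
        sequences[name] = sequences.get(name, "") + fragment
    return "".join(f">{name}\n{seq}\n" for name, seq in sequences.items())


example = """# STOCKHOLM 1.0
#=GF ID example
seqA  ACG-T
seqB  AC.GT

seqA  TTA
seqB  T-A
//
"""
print(stockholm_to_fasta(example), end="")
```

For the example input this prints `seqA` with `ACG-TTTA` and `seqB` with `AC.GTT-A`, one FASTA record each.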