Missing fasta alignments in darth output

ababaian / serratus

Ultra-deep search for novel viruses

http://serratus.io

GNU General Public License v3.0


Missing fasta alignments in darth output #201

Closed rcedgar closed 3 years ago

rcedgar commented 3 years ago

Rayan posted this example output from darth:

s3://serratus-public/assemblies/other/SRR9156994.coronaspades/

This output seems to lack the FASTA alignments from Pfam, although other sets included them. The Stockholm file does contain alignments.

taltman commented 3 years ago

I see that the assembly file is very large. In fact, if you look at the deflines:

root@5c364609a0eb:/output# ls -lh SRR9156994.coronaspades.gene_clusters.checkv_filtered.fa
-rw-rw-r-- 1 1000 1000 59K Jul 14 23:23 SRR9156994.coronaspades.gene_clusters.checkv_filtered.fa
root@5c364609a0eb:/output# egrep "^>" SRR9156994.fna
>SRR9156994.coronaspades.NODE_1_length_30093_cluster_1_candidate_1_domains_34
>SRR9156994.coronaspades.NODE_7_length_30093_cluster_7_candidate_1_domains_1

Looks like some kind of duplicate? Seems unlikely that two unique CoVs would have the exact same length. Perhaps this is an interesting case for Anton and team? @rchikhi, can you make sure this isn't a CheckV filtering error?
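As a rough illustration of the check described above, the sketch below groups deflines by the length encoded in their names and reports lengths shared by more than one contig. It assumes the `_length_<N>_` naming pattern seen in the two deflines above; the function name `group_deflines_by_length` is made up for this example and is not part of darth or coronaSPAdes. Equal length is only a hint of duplication, so any hits would still need a sequence-level comparison.

```python
import re
from collections import defaultdict

# Assumption: contig names carry their length as "_length_<N>_",
# as in the two deflines quoted in this thread.
LENGTH_RE = re.compile(r"_length_(\d+)_")


def group_deflines_by_length(deflines):
    """Return {length: [names]} for lengths shared by two or more contigs."""
    groups = defaultdict(list)
    for line in deflines:
        name = line.lstrip(">").strip()
        match = LENGTH_RE.search(name)
        if match is None:
            # Skip deflines that do not follow the assumed naming pattern.
            continue
        groups[int(match.group(1))].append(name)
    # Keep only lengths that occur more than once.
    return {length: names for length, names in groups.items() if len(names) > 1}


deflines = [
    ">SRR9156994.coronaspades.NODE_1_length_30093_cluster_1_candidate_1_domains_34",
    ">SRR9156994.coronaspades.NODE_7_length_30093_cluster_7_candidate_1_domains_1",
]
suspects = group_deflines_by_length(deflines)
for length, names in suspects.items():
    print(length, names)
```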

asl commented 3 years ago

This particular case looks a bit strange; we will investigate it. However, note that several multivirome datasets do contain several species. SRR2010686 is an example.

rchikhi commented 3 years ago

to be clear: this is not a checkv filtering error

taltman commented 3 years ago

VADR is unable to handle weird cases like this, when the "genome" really contains two viral genomes. This is something that needs to be fixed upstream of DARTH, so that it only receives one genome at a time. Reassigning to @rchikhi.

rchikhi commented 3 years ago

I'd recommend dropping those multi-CoV-genome samples for now and looking at them in a later pass (after initial submission). Any thoughts? @taltman @asl @ababaian @rcedgar

If we don't drop them, we'll need to have a discussion on how to separate genomes. When a ~29 kbp contig is present in a sample (among other contigs), it's clear this one can be annotated on its own. But many samples are a collection of smaller contigs, e.g. 1-2 kbp. I don't think it makes sense to run VADR on each of these contigs separately. Ideally we'd need some sort of 'viral binning'.

rchikhi commented 3 years ago

To give some perspective: among the 10,816 datasets of the master table, 272 (2.5%) have CoV contigs with a total size longer than 50 kbp (an arbitrary threshold above which there are likely >= 2 genomes). Yet among those 272, 208 (76%) accessions have >= 2 contigs longer than 20 kbp. So if we decide to try to separate genomes, taking the contigs longer than 20 kbp may be a viable strategy.
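A minimal sketch of the thresholding idea above, assuming a plain FASTA input: it sums contig lengths, flags a sample as likely multi-genome when the total exceeds 50 kbp, and lists the contigs longer than 20 kbp as candidates for separate annotation. The function names, the dictionary keys, and the toy contigs are invented for this illustration; this is not how the master table was actually computed.

```python
def read_fasta_lengths(lines):
    """Map each contig name (first word of the defline) to its sequence length."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def classify_sample(lengths, total_threshold=50_000, contig_threshold=20_000):
    """Apply the two thresholds from the comment above (both are arbitrary)."""
    total = sum(lengths.values())
    long_contigs = sorted(n for n, bp in lengths.items() if bp > contig_threshold)
    return {
        "total_bp": total,
        "likely_multi_genome": total > total_threshold,
        "long_contigs": long_contigs,
        # At least two long contigs: each could be annotated on its own.
        "separable": len(long_contigs) >= 2,
    }


# Toy input with made-up contigs; real input would be an assembly FASTA file.
toy_fasta = [
    ">contig_a", "A" * 30_000,
    ">contig_b", "C" * 25_000,
    ">contig_c", "G" * 1_500,
]
report = classify_sample(read_fasta_lengths(toy_fasta))
print(report)
```

Samples flagged as likely multi-genome but not separable are the ones made up of many short contigs, which would still need some form of binning.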

rcedgar commented 3 years ago

If there are 272, then IMO we do need to split them into two assemblies, because several downstream analyses assume there is only one virus per SRA. Serratax is one of them, and if I understood correctly, darth is another. If there are two good viruses in one SRA, I will need separate PFAM alignments for each, at a minimum the RdRps. Regardless of what else we do, I think it would be a good idea to check whether there are two good RdRp alignments, to ensure we don't lose good novel CoVs. Maybe the CS output can tell us if there are two RdRps.

rcedgar commented 3 years ago

@taltman I think this problem is fixed now, can I close the issue?

taltman commented 3 years ago

The immediate issue is fixed with #216. The side-bar about multi-CoV assemblies has been moved to #212. Closing.