ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0
250 stars 32 forks source link

Missing RdRP domain in assembly annotations #213

Closed rcedgar closed 3 years ago

rcedgar commented 3 years ago

Frank is missing his RdRP:

aws s3 cp s3://serratus-public/assemblies/annotations/ERR2756788.fa.darth.stripped.tar.gz .
tar -zxvf ERR2756788.fa.darth.stripped.tar.gz ERR2756788.darth/pfam/alignments.fasta
grep "^>" ERR2756788.darth/pfam/alignments.fasta
CoV_NSP2_C CoV_NSP2_N
CoV_NSP3_C
CoV_NSP4_N
CoV_peptidase Macro

This problem appears to be quite rare, I spot-checked several other diverged assemblies and they all had RdRP.

asl commented 3 years ago

Here is HMM annotation from bgc_statistics.txt:

CoV_NSP4_C-Peptidase_C30-CoV_M-CoV_NSP7-CoV_NSP8-CoV_NSP9-CoV_NSP10-CoV_RPol_N-RdRP_1-CoV_E-CoV_E-Viral_helicase1-CoV_NSP6-CoV_NSP6-CoV_M-CoV_NSP15_N-CoV_NSP15_M-CoV_NSP15_C-CoV_Methyltr_2-CoV_S1_C-CoV_S2-CoV_E-CoV_NSP6-CoV_E-CoV_M-CoV_nucleocap-CoV_NSP6-CoV_NSP6

There is RdRP_1 there.

Frank contains 2 contigs and pfam/alignments seems to contains matches from the second one, not for the first one. @rchikhi What happens with the alignments if there are multiple contigs?

rcedgar commented 3 years ago

Before running the PFAM HMMs, ORFs should be extracted from ALL contigs, just just the designated "best" (longest?) contig because there may be taxonomic genes in several contigs. For running PFAM, a generic ORF-finder is fine, no need to worry about frameshifts -- this should be simple & fast AFAICS.

EDIT -- I mean all contigs verified as Cov by CS HMMs/checkv, not all contigs found for everything in the dataset.

rchikhi commented 3 years ago

@rchikhi What happens with the alignments if there are multiple contigs?

I didn't have the time yet to analyze the following results, but I re-ran Darth on just the 20kbp contig of Frank, and on just the 8kbp contig of Frank:

https://serratus-public.s3.amazonaws.com/assemblies/other/ERR2756788.coronaspades/ERR2756788.20kbp_contig.darth.tar.gz https://serratus-public.s3.amazonaws.com/assemblies/other/ERR2756788.coronaspades/ERR2756788.8kbp_contig.darth.tar.gz

And here is Darth when given as input both contigs. (alignments.fasta)

rchikhi commented 3 years ago

OK I see here. It's clear that Darth only created alignments.fasta from the 8kbp contig only, as @asl suspected. @taltman, could darth be modified to create an alignments.fasta file for each contig instead?

taltman commented 3 years ago

@asl Feel free to direct annotation questions to me going forward. Thanks!

@rcedgar Interestingly, I get a different result with DARTH when annotating the Fr4NK.fa sequence that @ababaian posted on the wiki:

>RdRP_1
---LSTGMTKQTVKPGHFNQEFYEFLKGKgfFNEGSSLTLKHFFFAQKDDAAIKDFDFYRYNRTTMLDICQARVAYKVvTHYFDCYEGGCISAKDVVVTNLNKSAGYPLNKLG--KAGLYYESLSYDEQDHLYALTKRN--ILPTMTQLNLKYAIS-----------GKERARTVGGVSLLSTMTTRQFHQKHLKSIVNTRNATV-VIGTTKFYGGWDNMLNTLISGveNPCLMGWDYPKCDRALPSMIRMISAMILGSKHVTcCTASDKY\
YRLCNELaqVLTEVVYSNGGFYFKPGGTTSGDATTAYANSVFNIfqavsaninrlltvdsyaihnESVKSLQRQLYDNCYRATSVDATFVSDYYQFlrkhfsMMiLSDDGVVCYNKDYADMgyvadIGAFKAALYYQnNVFMSTAKCWVETDLSKGPHEFCSQHTLQIVDQdGKYYLPYPDPSRIISAGVFVDDVAKTDsvvlLERYVSLAIDAYPLSKHPDPEYQKVFYTMLEWV-

So I expect that the difference here has to do with the assembly. I'll re-run DARTH on the latest assembly, and see if I can reproduce the issue you're seeing.

Let me explain some more background on what DARTH is doing.

Eric created coronavirus models for use with VADR based on RefSeq annotations of exemplar coronavirus species. Some of them for whatever reason didn't have models for the maturation proteins on the polyproteins. The other non-polyprotein CDS regions seem to be fine. So to "patch" this issue expediently I am doing an additional annotation of the ORF 1ab polyprotein using Pfam, which is turned into the "alignments.fasta" file.

rcedgar commented 3 years ago

@taltman Suggest it would be simpler and more robst to use a generic ORF-finder like EMBOSS getorf as a query for the PFAM HMMs, makes no assumptions about the genome. I'm not worried about frame-shifts, they don't occur inside the "domains" by PFAM's definition.

asl commented 3 years ago

@asl Feel free to direct annotation questions to me going forward. Thanks!

Yes sure. I was just not expected to see Frank assembled into 2 contigs. But this is a separate story. We already figured out with @rchikhi the reason.

taltman commented 3 years ago

@rcedgar Let me get to the bottom of this, and then we can figure out what to do. I'd really rather not reinvent VADR if I can avoid it. For consistent annotations it is important for the scans to happen on a per-CDS basis, otherwise there's no clear algorithm for reconciling VADR and gene calls from another tool. I know it would solve the problem that you have, but let's see if we can solve both problems (taxonomic domain alignments and consistent annotations) with one fix.

taltman commented 3 years ago

In coming up with a fix for annotating fragmented genomes (#216), I implemented a fix for this as well.

Now, there is a transeq sub-directory in the annotation output, with an alignments.fasta file that has the aligned hits to the Pfam models, which are now ran against the 6-frame translation of the assembly.

Once @rchikhi has a chance to incorporate into his Batch script and reprocess the assemblies, @rcedgar please confirm the fix and close this issue. Thanks!

rchikhi commented 3 years ago

The PFAM alignments are ready in s3://serratus-public/assemblies/annotations/. They are the files ending with darth.alignments.fasta (note: important to use *.darth.alignments.fasta and not just *.alignments.fasta, see the table below why). There are 4,833 darth.alignments.fasta files (@rcedgar is this more accessions than what you had before?).

S3 folder structure:

File name Explanation
[accession].fa.darth.alignments.fasta ->> USE THIS FILE <<- It is a copy of transeq.alignments.fasta , when it exists, or a copy of pfam.alignments.fasta otherwise.
[accession].fa.darth.input_md5 MD5 sum of the input contigs file given as input to Darth
[accession].fa.darth.stripped.tar.gz Darth output, without read alignments
[accession].fa.darth.tar.gz Darth output, full
[accession].fa.darth.pfam.alignments.fasta (only for debugging purposes:) legacy alignments.fasta file present in the pfam/ folder of Darth output. Present in either single-contig or multi-contig assemblies. Can be used in single-contig assemblies, but not multi-contig assemblies. Prefer transeq.alignments.fasta when it is present.
[accession].fa.darth.transeq.alignments.fasta (only for debugging purposes:) newer alignments.fasta file present in the transeq/ folder of Darth output. Made for multi-contig assemblies. Should always be used when present.
[accession].fa.serraplace.tar.gz Serraplace output, full
[accession].fa.serratax.final Serratax's tax.final file
[accession].fa.serratax.tar.gz Serratax output, full
rcedgar commented 3 years ago

@rchikhi Before there were 10,887 [accession].fa.darth.stripped.tar.gz files of which 4,413 were not empty. Assuming your 4,833 are not empty (I haven't checked), then I have more now.

taltman commented 3 years ago

I believe this is been addressed. Please feel free to reopen if you see more issues in a specific SRA.

taltman commented 3 years ago

Reopening to confirm fix.

taltman commented 3 years ago

Confirming that a single RdRP_1 is found for Frankie, and at the expected length:

(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/ERR2756788/transeq$ grep -A 1 RdRP_1 alignments.fasta 
>RdRP_1 ERR2756788.coronaspades.NODE_1_length_29164_cluster_1_candidate_1_domains_36_4/4498-4985
---LSTGMTKQTVKPGHFNQEFYEFLKGKgfFNEGSSLTLKHFFFAQKDDAAIKDFDFYRYNRTTMLDICQARVAYKVvTHYFDCYEGGCISAKDVVVTNLNKSAGYPLNKLG--KAGLYYESLSYDEQDHLYALTKRN--ILPTMTQLNLKYAIS-----------GKERARTVGGVSLLSTMTTRQFH\
QKHLKSIVNTRNATV-VIGTTKFYGGWDNMLNTLISGveNPCLMGWDYPKCDRALPSMIRMISAMILGSKHVTcCTASDKYYRLCNELaqVLTEVVYSNGGFYFKPGGTTSGDATTAYANSVFNIfqavsaninrlltvdsyaihnESVKSLQRQLYDNCYRATSVDATFVSDYYQFlrkhfsMMiLSDD\
GVVCYNKDYADMgyvadIGAFKAALYYQnNVFMSTAKCWVETDLSKGPHEFCSQHTLQIVDQdGKYYLPYPDPSRIISAGVFVDDVAKTDsvvlLERYVSLAIDAYPLSKHPDPEYQKVFYTMLEWV-
(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/ERR2756788/transeq$ grep -A 1 RdRP_1 alignments.fasta | egrep -v "^>" | tr -d '[a-z]' | awk '{ print length }'
461