ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0
250 stars 32 forks source link

Capture multiple RdRp's from one assembly #223

Closed rcedgar closed 3 years ago

rcedgar commented 3 years ago

Spot-check suggests that at most one RdRp per assembly is captured in the most recent PFAM alignments. If there is >1, we should capture all of them. Punt for now, revisit for next round of analysis after PSI-Serratus. See discussion in #212.

rcedgar commented 3 years ago

Some of these viromes have hundreds of RdRps. If there are multiple species here, which seems likely, then IMO we should tackle these sooner rather than later.

SRR11097768     4323
SRR10854863     1005
SRR10854864     690
ERR3661016      556
SRR11647857     552
SRR11648360     550
taltman commented 3 years ago

The latest version of Darth captures multiple domain hits for a given Pfam model across multiple contigs. Closing for now. Please feel free to reopen if you catch this not working in a specific SRA.

taltman commented 3 years ago

Reopening to confirm fix.

taltman commented 3 years ago

Adding the following accessions as cases to double-check, as reported on Slack by @rcedgar : ERR963087 SRR10252889 ERR3460960 SRR6713876

taltman commented 3 years ago

Testing fix, one comment per accession tested:

taltman commented 3 years ago

ERR963087:

Two RdRP_1 entries found:

(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/ERR963087/transeq$ grep -A 1 RdRP_1 alignments.fasta
>RdRP_1 ERR963087.coronaspades.NODE_7_length_13230_cluster_7_candidate_1_domains_1_3/1884-2373
-AALTTGLTFQTVRPGNFNQDFYDFVVSKgfFKEGSSVTLKHFFFAQDGNAAITDYNYYSYNLPTMCDIKQMLFCMEVvNKYFEIYDGGCLNASEVVVNNLDKSAGHPFNKFG--KARVYYESMSYQEQDELFAMTKRN--VIPTMTQMNLKYAIS-----------AKNRARTVAGVSILSTMTNRQYH\
QKMLKSMAATRGATC-VIGTTKFYGGWDFMLKTLYKDvdNPHLMGWDYPKCDRAMPNMCRIFASLILARKHGTcCTTRDRFYRLANECaqVLSEYVLCGGGYYVKPGGTSSGDATTAYANSVFNIlqattanvsalmgangnK--------IvdkevkdmqfdlYVNVYRSTSPDPKFVDKYYAFlnkhf\
sMMiLSDDGVVCYNSDYAAKgyiagIQNFKETLYYQnNVFMSEAKCWVETDLKKGPHEFCSQHTLYIKDGdDGYFLPYPDPSRILSAGCFVDDIVKTDgtlmVERFVSLAIDAYPLTKHEDIEYQNVFWVYLQYI-
--
>RdRP_1 ERR963087.coronaspades.NODE_1_length_13230_cluster_1_candidate_1_domains_18_3/1884-2373
-AALTTGLTFQTVRPGNFNQDFYDFVVSKgfFKEGSSVTLKHFFFAQDGNAAITDYNYYSYNLPTMCDIKQMLFCMEVvNKYFEIYDGGCLNASEVVVNNLDKSAGHPFNKFG--KARVYYESMSYQEQDELFAMTKRN--VIPTMTQMNLKYAIS-----------AKNRARTVAGVSILSTMTNRQYH\
QKMLKSMAATRGATC-VIGTTKFYGGWDFMLKTLYKDvdNPHLMGWDYPKCDRAMPNMCRIFASLILARKHGTcCTTRDRFYRLANECaqVLSEYVLCGGGYYVKPGGTSSGDATTAYANSVFNIlqattanvsalmgangnK--------IvdkevkdmqfdlYVNVYRSTSPDPKFVDKYYAFlnkhf\
sMMiLSDDGVVCYNSDYAAKgyiagIQNFKETLYYQnNVFMSEAKCWVETDLKKGPHEFCSQHTLYIKDGdDGYFLPYPDPSRILSAGCFVDDIVKTDgtlmVERFVSLAIDAYPLTKHEDIEYQNVFWVYLQYI-

... and they have the expected length:

(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/ERR963087/transeq$ grep -A 1 RdRP_1 alignments.fasta | egrep -v "^>" | tr -d '[a-z]' | awk '{ print length }'
461
2
461
taltman commented 3 years ago

SRR10252889 looks good now:

(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/SRR10252889/transeq$ grep -A 1 RdRP_1 alignments.fasta 
>RdRP_1 SRR10252889.coronaspades.NODE_1_length_29957_cluster_1_candidate_1_domains_35_1/4890-5379
-AALTTGLTFQTVRPGNFNQDFYDFVVSKgfFKEGSSVTLKHFFFAQDGNAAITDYNYYSYNLPTMCDIKQMLFCMEVvNKYFEIYDGGCLNASEVVVNNLDKSAGHPFNKFG--KARVYYESMSYQEQDELFAMTKRN--VIPTMTQMNLKYAIS-----------AKNRARTVAGVSILSTMTNRQYH\
QKMLKSMAATRGATC-VIGTTKFYGGWDFMLKTLYKDvdNPHLMGWDYPKCDRAMPNMCRIFASLILARKHGTcCTTRDRFYRLANECaqVLSEYVLCGGGYYVKPGGTSSGDATTAYANSVFNIlqattanvsalmgangnK--------IvdkevkdmqfdlYVNVYRSTSPDPKFVDKYYAFlnkhf\
sMMiLSDDGVVCYNSDYAAKgyiagIQNFKETLYYQnNVFMSEAKCWVETDLKKGPHEFCSQHTLYIKDGdDGYFLPYPDPSRILSAGCFVDDIVKTDgtlmVERFVSLAIDAYPLTKHEDIEYQNVFWVYLQYI-
--
>RdRP_1 SRR10252889.coronaspades.NODE_6_length_25745_cluster_6_candidate_1_domains_1_1/4524-5013
-AALTTGLTFQTVRPGNFNQDFYDFVVSKgfFKEGSSVTLKHFFFAQDGNAAITDYNYYSYNLPTMCDIKQMLFCMEVvNKYFEIYDGGCLNASEVVVNNLDKSAGHPFNKFG--KARVYYESMSYQEQDELFAMTKRN--VIPTMTQMNLKYAIS-----------AKNRARTVAGVSILSTMTNRQYH\
QKMLKSMAATRGATC-VIGTTKFYGGWDFMLKTLYKDvdNPHLMGWDYPKCDRAMPNMCRIFASLILARKHGTcCTTRDRFYRLANECaqVLSEYVLCGGGYYVKPGGTSSGDATTAYANSVFNIlqattanvsalmgangnK--------IvdkevkdmqfdlYVNVYRSTSPDPKFVDKYYAFlnkhf\
sMMiLSDDGVVCYNSDYAAKgyiagIQNFKETLYYQnNVFMSEAKCWVETDLKKGPHEFCSQHTLYIKDGdDGYFLPYPDPSRILSAGCFVDDIVKTDgtlmVERFVSLAIDAYPLTKHEDIEYQNVFWVYLQYI-
(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/SRR10252889/transeq$ grep -A 1 RdRP_1 alignments.fasta | egrep -v "^>" | tr -d '[a-z]' | awk '{ print length }'
461
2
461
taltman commented 3 years ago

ERR3460960 looks good too:

(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/ERR3460960/transeq$ grep -A 1 RdRP_1 alignments.fasta 
>RdRP_1 ERR3460960.coronaspades.NODE_1_length_28843_cluster_1_candidate_1_domains_34_6/4546-5031
-----TGLTSQTVKPGHFNKEFYDFLRSQgfFDEGSELTLKHFFFTQKGDAAIKDFDYYRYNRPTMLDIGQARVAYQVaARYFDCYEGGCITSREVVVTNLNKSAGWPLNKFG--KAGLYYESISYEEQDAIFSLTKRN--ILPTMTQLNLKYAIS-----------GKERARTVGGVSLLATMTTRQFH\
QKCLKSIVATRNATV-VIGTTKFYGGWDNMLKNLMADvdDPKLMGWDYPKCDRAMPSMIRMLSAMILGSKHVTcCTASDKFYRLSNELaqVLTEVVYSNGGFYFKPGGTTSGDATTAYANSVFNIfqavssnincvlsvnssncnnFNVKKLQRQLYDNCYRNSNVDESFVDDFYGYlqkhfsmMILSDD\
SVVCYNKTYAGLgyiadISAFKATLYYQnGVFMSTAKCWTEEDLSIGPHEFCSQHTMQIVDEnGKYYLPYPDPSRIISAGVFVDDITKTDavilLERYVSLAIDAYPLSKHPKPEYRKVFYALLDWV-
--
>RdRP_1 ERR3460960.coronaspades.NODE_3_length_28843_cluster_3_candidate_1_domains_1_3/4546-5031
-----TGLTSQTVKPGHFNKEFYDFLRSQgfFDEGSELTLKHFFFTQKGDAAIKDFDYYRYNRPTMLDIGQARVAYQVaARYFDCYEGGCITSREVVVTNLNKSAGWPLNKFG--KAGLYYESISYEEQDAIFSLTKRN--ILPTMTQLNLKYAIS-----------GKERARTVGGVSLLATMTTRQFH\
QKCLKSIVATRNATV-VIGTTKFYGGWDNMLKNLMADvdDPKLMGWDYPKCDRAMPSMIRMLSAMILGSKHVTcCTASDKFYRLSNELaqVLTEVVYSNGGFYFKPGGTTSGDATTAYANSVFNIfqavssnincvlsvnssncnnFNVKKLQRQLYDNCYRNSNVDESFVDDFYGYlqkhfsmMILSDD\
SVVCYNKTYAGLgyiadISAFKATLYYQnGVFMSTAKCWTEEDLSIGPHEFCSQHTMQIVDEnGKYYLPYPDPSRIISAGVFVDDITKTDavilLERYVSLAIDAYPLSKHPKPEYRKVFYALLDWV-
(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/ERR3460960/transeq$ grep -A 1 RdRP_1 alignments.fasta | egrep -v "^>" | tr -d '[a-z]' | awk '{ print length }'
461
2
461
taltman commented 3 years ago

SRR6713876 looks good too, three RdRP_1 domains found:

(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/SRR6713876/transeq$ grep -A 1 RdRP_1 alignments.fasta 
>RdRP_1 SRR6713876.coronaspades.NODE_3_length_28164_cluster_3_candidate_1_domains_1_3/4616-5100
------GMTNQTVKPGHFNKEFYDFLLEQgfFSEGSELTLKHFFFAQKGDAAVKDFDYYRYNRPTVLDICQARVVYQIvQRYFDIYEGGCITAKEVVVTNLNKSAGYPLNKFG--KAGLYYESLSYEEQDELYAYTKRN--ILPTMTQLNLKYAIS-----------GKERARTVGGVSLLSTMTTRQYH\
QKHLKSIVNTRGASV-VIGTTKFYGGWDNMLKNLIDGveNPCLMGWDYPKCDRALPNMIRMISAMILGSKHTTcCSSTDRFFRLCNELaqVLTEVVYSNGGFYLKPGGTTSGDATTAYANSVFNIfqavsanvnkllsvdsnvchnLEVKQLQRKLYECCYRSTTVDDQFVVEYYGYlrkhfsMMiLSDD\
GVVCYNNDYASLgyvadLNAFKAVLYYQnNVFMSASKCWIEPDINKGPHEFCSQHTMQIVDKdGTYYLPYPDPSRILSAGVFVDDVVKTDavvlLERYVSLAIDAYPLSKHENPEYKKVFYVLLDWV-
--
>RdRP_1 SRR6713876.coronaspades.NODE_2_length_28164_cluster_2_candidate_1_domains_9_3/4616-5100
------GMTNQTVKPGHFNKEFYDFLLEQgfFSEGSELTLKHFFFAQKGDAAVKDFDYYRYNRPTVLDICQARVVYQIvQRYFDIYEGGCITAKEVVVTNLNKSAGYPLNKFG--KAGLYYESLSYEEQDELYAYTKRN--ILPTMTQLNLKYAIS-----------GKERARTVGGVSLLSTMTTRQYH\
QKHLKSIVNTRGASV-VIGTTKFYGGWDNMLKNLIDGveNPCLMGWDYPKCDRALPNMIRMISAMILGSKHTTcCSSTDRFFRLCNELaqVLTEVVYSNGGFYLKPGGTTSGDATTAYANSVFNIfqavsanvnkllsvdsnvchnLEVKQLQRKLYECCYRSTTVDDQFVVEYYGYlrkhfsMMiLSDD\
GVVCYNNDYASLgyvadLNAFKAVLYYQnNVFMSASKCWIEPDINKGPHEFCSQHTMQIVDKdGTYYLPYPDPSRILSAGVFVDDVVKTDavvlLERYVSLAIDAYPLSKHENPEYKKVFYVLLDWV-
--
>RdRP_1 SRR6713876.coronaspades.NODE_1_length_28164_cluster_1_candidate_1_domains_32_3/4616-5100
------GMTNQTVKPGHFNKEFYDFLLEQgfFSEGSELTLKHFFFAQKGDAAVKDFDYYRYNRPTVLDICQARVVYQIvQRYFDIYEGGCITAKEVVVTNLNKSAGYPLNKFG--KAGLYYESLSYEEQDELYAYTKRN--ILPTMTQLNLKYAIS-----------GKERARTVGGVSLLSTMTTRQYH\
QKHLKSIVNTRGASV-VIGTTKFYGGWDNMLKNLIDGveNPCLMGWDYPKCDRALPNMIRMISAMILGSKHTTcCSSTDRFFRLCNELaqVLTEVVYSNGGFYLKPGGTTSGDATTAYANSVFNIfqavsanvnkllsvdsnvchnLEVKQLQRKLYECCYRSTTVDDQFVVEYYGYlrkhfsMMiLSDD\
GVVCYNNDYASLgyvadLNAFKAVLYYQnNVFMSASKCWIEPDINKGPHEFCSQHTMQIVDKdGTYYLPYPDPSRILSAGVFVDDVVKTDavvlLERYVSLAIDAYPLSKHENPEYKKVFYVLLDWV-
(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/SRR6713876/transeq$ grep -A 1 RdRP_1 alignments.fasta | egrep -v "^>" | tr -d '[a-z]' | awk '{ print length }'
461
2
461
2
461
taltman commented 3 years ago

I spot-checked one of the accessions that @rcedgar reported had 'hundreds' of RdRPs: SRR11648360. It had no RdRPs found. What was the basis for saying that there are hundreds of RdRP domains in the assembly?

asl commented 3 years ago

bgc_statistics.txt of SRR11648360 shows # RdRP_1-domains - 1100

asl commented 3 years ago

I'm having:

Internal pipeline statistics summary:
-------------------------------------
Query model(s):                            1  (461 nodes)
Target sequences:                       3840  (3432786 residues searched)
Passed MSV filter:                       577  (0.15026); expected 76.8 (0.02)
Passed bias filter:                      567  (0.147656); expected 76.8 (0.02)
Passed Vit filter:                       550  (0.143229); expected 3.8 (0.001)
Passed Fwd filter:                       549  (0.142969); expected 0.0 (1e-05)
Initial search space (Z):               3840  [actual number of targets]
Domain search space  (domZ):             548  [number of targets reported over threshold]

foo.sto.txt foo.tblout.txt

taltman commented 3 years ago

Here is what I get from Pfam:

#                                                                            --- full sequence --- -------------- this domain -------------   hmm coord   ali coord   env coord
# target name        accession   tlen query name           accession   qlen   E-value  score  bias   #  of  c-Evalue  i-Evalue  score  bias  from    to  from    to  from    to  acc descripti\
on of target
#------------------- ---------- ----- -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------\
------------
NODE_69_length_335_cluster_69_candidate_1_domains_1_3 -            111 CoV_NSP7             PF08716.11    83   2.1e-14   42.8   0.2   1   1   4.2e-15   2.5e-14   42.5   0.2     3    83    16\
    96    14    96 0.97 -
#
# Program:         hmmsearch
# Version:         3.1b2 (February 2015)
# Pipeline mode:   SEARCH
# Query file:      /root/data/Pfam-A.SARS-CoV-2.hmm
# Target file:     trans-6-frame.fasta
# Option settings: hmmsearch -A match-alignments.sto --domtblout hmmsearch-matches.txt --cut_ga /root/data/Pfam-A.SARS-CoV-2.hmm trans-6-frame.fasta
# Current dir:     /output/transeq
# Date:            Tue Jul 28 06:41:27 2020
# [ok]
taltman commented 3 years ago

@asl, what do you get when you do a Pfam on this assembly? s3://serratus-rayan/master_table_assemblies/SRR11648360.fa

asl commented 3 years ago

Mine is

transeq -frame 6 SRR11648360.coronaspades.gene_clusters.fa SRR11648360.coronaspades.gene_clusters.aa
taltman commented 3 years ago

Can you confirm that SRR11648360.coronaspades.gene_clusters.fa is bit-identical with s3://serratus-rayan/master_table_assemblies/SRR11648360.fa?

taltman commented 3 years ago

I think some terminology might help clarify things. Are you testing the "raw" assembly, or the "filtered" assembly (using checkv or BGC approach)? And is the file that I am copying from S3 using the same filtering approach as the (filtered?) assembly that you are analyzing?

taltman commented 3 years ago

@asl Unless I hear otherwise from you, I believe the point that you are raising is an orthogonal concern to the multiple RdRP issue. I am going to create a separate issue for your point, so that I can commit my code fixes and ask @rchikhi to confirm my fixes before closing the issue.

taltman commented 3 years ago

Adding in analysis of the following accessions, because @rcedgar noticed non-standard alignment length: SRR5912616 SRR5912645 SRR6713708

taltman commented 3 years ago

SRR5912616: The alignments have the correct length if we also take care to remove '.' (period) entries:

(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/SRR5912616/transeq$ grep -A 1 RdRP_1 alignments.fasta 
>RdRP_1 SRR5912616.coronaspades.NODE_2_length_30135_cluster_2_candidate_1_domains_27_2/4875-5364
-AALTTGLTFQTVRPGNFNQDFYDFVVSKgfFKEGSSVTLKHFFFAQDGNAAITDYNYYSYNLPTMCDIKQMLFCMEVvNKYFEIYDGGCLNASEVVVNNLDKSAGHPFNKFG--KARVYYESMSYQEQDELFAMTKRN--VIPTMTQMNLKYAIS-----------AKNRARTVAGVSILSTMTNRQYH\
QKMLKSMAATRGATC-VIGTTKFYGGWDFMLKTLYKDvdNPHLMGWDYPKCDRAMPNMCRIFASLILARKHGTcCTTRDRFYRLANECaqVLSEYVLCGGGYYVKPGGTSSGDATTAYANSVFNIlqattanv....salmgangnK--------IvdkevkdmqfdlYVNVYRSTSPDPKFVDKYYAFl\
nk.hfsMMiLSDDGVVCYNSDYAAKgyiagIQNFKETLYYQnNVFMSEAKCWVETDLKKGPHEFCSQHTLYIKDGdDGYFLPYPDPSRILSAGCFVDDIVKTDgtlmVERFVSLAIDAYPLTKHEDIEYQNVFWVYLQYI-
--
>RdRP_1 SRR5912616.coronaspades.NODE_1_length_27639_cluster_1_candidate_1_domains_36_1/4608-5093
-----TGLTSQTVKPGHFNKEFYDFLRSQgfFDEGSELTLKHFFFTQKGDAAIKDFDYYRYNRPTMLDIGQARVAYQVaSRYFDCYEGGCITSREVVVTNLNKSAGWPLNKFG--KAGLYYESISYEEQDAMFALTKRN--ILPTMTQLNLKYAIS-----------GKERARTVGGVSLLATMTTRQFH\
QKCLKSIVATRNATV-VIGTTKFYGGWDNMLKNLIADvdDPKLMGWDYPKCDRAMPSMIRMLSAMILGSKHVTcCTASDKFYRLSNELaqVLTEVVYSNGGFYFKPGGTTSGDATTAYANSVFNIfqavssninrilsvnssncnnLNVKKLQKQL............YDNCYRNSNVDESFVDDFYGYl\
qkhfsmMI.LSDDGVVCYNKIYAELgyiadISAFKATLYYQnGVFMSTAKCWTEEDLSVGPHEFCSQHTMQIVDEnGKYYLPYPDPSRIISAGVFVDDITKTDavilLERYVSLAIDAYPLSKHPKPEYRKVFYALLDWV-
--
>RdRP_1 SRR5912616.coronaspades.NODE_3_length_30135_cluster_3_candidate_1_domains_11_6/4875-5364
-AALTTGLTFQTVRPGNFNQDFYDFVVSKgfFKEGSSVTLKHFFFAQDGNAAITDYNYYSYNLPTMCDIKQMLFCMEVvNKYFEIYDGGCLNASEVVVNNLDKSAGHPFNKFG--KARVYYESMSYQEQDELFAMTKRN--VIPTMTQMNLKYAIS-----------AKNRARTVAGVSILSTMTNRQYH\
QKMLKSMAATRGATC-VIGTTKFYGGWDFMLKTLYKDvdNPHLMGWDYPKCDRAMPNMCRIFASLILARKHGTcCTTRDRFYRLANECaqVLSEYVLCGGGYYVKPGGTSSGDATTAYANSVFNIlqattanv....salmgangnK--------IvdkevkdmqfdlYVNVYRSTSPDPKFVDKYYAFl\
nk.hfsMMiLSDDGVVCYNSDYAAKgyiagIQNFKETLYYQnNVFMSEAKCWVETDLKKGPHEFCSQHTLYIKDGdDGYFLPYPDPSRILSAGCFVDDIVKTDgtlmVERFVSLAIDAYPLTKHEDIEYQNVFWVYLQYI-
(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/SRR5912616/transeq$ grep -A 1 RdRP_1 alignments.fasta | egrep -v "^>" | tr -d '[a-z]' | awk '{ print length }'
466
2
474
2
466
(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/SRR5912616/transeq$ grep -A 1 RdRP_1 alignments.fasta | egrep -v "^>" | tr -d '[a-z.]' | awk '{ print length }'
461
2
461
2
461
taltman commented 3 years ago

Removal of pesky periods for SRR5912645 seems to do the trick as well:

(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/SRR5912645/transeq$ grep -A 1 RdRP_1 alignments.fasta 
>RdRP_1 SRR5912645.coronaspades.NODE_2_length_29985_cluster_2_candidate_1_domains_34_2/4846-5335
-AALTTGLTFQTVRPGNFNQDFYDFVVSKgfFKEGSSVTLKHFFFAQDGNAAITDYNYYSYNLPTMCDIKQMLFCMEVvNKYFEIYDGGCLNASEVVVNNLDKSAGHPFNKFG--KARVYYESMSYQEQDELFAMTKRN--VIPTMTQMNLKYAIS-----------AKNRARTVAGVSILSTMTNRQYH\
QKMLKSMAATRGATC-VIGTTKFYGGWDFMLKTLYKDvdNPHLMGWDYPKCDRAMPNMCRIFASLILARKHGTcCTTRDRFYRLANECaqVLSEYVLCGGGYYVKPGGTSSGDATTAYANSVFNIlqattanv....salmgangnK--------IvdkevkdmqfdlYVNVYRSTSPDPKFVDKYYAFl\
nk.hfsMMiLSDDGVVCYNSDYAAKgyiagIQNFKETLYYQnNVFMSEAKCWVETDLKKGPHEFCSQHTLYIKDGdDGYFLPYPDPSRILSAGCFVDDIVKTDgtlmVERFVSLAIDAYPLTKHEDIEYQNVFWVYLQYI-
--
>RdRP_1 SRR5912645.coronaspades.NODE_1_length_27457_cluster_1_candidate_1_domains_38_5/4586-5071
-----TGLTSQTVKPGHFNKEFYDFLRSQgfFDEGSELTLKHFFFTQKGDAAIKDFDYYRYNRPTMLDIGQARVAYQVaSRYFDCYEGGCITSREVVVTNLNKSAGWPLNKFG--KAGLYYESISYEEQDAMFALTKRN--ILPTMTQLNLKYAIS-----------GKERARTVGGVSLLATMTTRQFH\
QKCLKSIVATRNATV-VIGTTKFYGGWDNMLKNLIADvdDPKLMGWDYPKCDRAMPSMIRMLSAMILGSKHVTcCTASDKFYRLSNELaqVLTEVVYSNGGFYFKPGGTTSGDATTAYANSVFNIfqavssninrilsvnssncnnLNVKKLQKQL............YDNCYRNSNVDESFVDDFYGYl\
qkhfsmMI.LSDDGVVCYNKIYAELgyiadISAFKATLYYQnGVFMSTAKCWTEEDLSVGPHEFCSQHTMQIVDEnGKYYLPYPDPSRIISAGVFVDDITKTDavilLERYVSLAIDAYPLSKHPKPEYRKVFYALLDWV-
(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/SRR5912645/transeq$ grep -A 1 RdRP_1 alignments.fasta | egrep -v "^>" | tr -d '[a-z.]' | awk '{ print length }'
461
2
461
taltman commented 3 years ago

... and same scenario for SRR6713708:

(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/SRR6713708/transeq$ grep -A 1 RdRP_1 alignments.fasta 
>RdRP_1 SRR6713708.coronaspades.NODE_1_length_28131_cluster_1_candidate_1_domains_36_6/4604-5088
------GMTNQTVKPGHFNKEFYDFLLEQgfFSEGSELTLKHFFFAQKGDAAVKDFDYYRYNRPTVLDICQARVVYQIvQRYFDIYEGGCITAKEVVVTNLNKSAGYPLNKFG--KAGLYYESLSYEEQDELYAYTKRN--ILPTMTQLNLKYAIS-----------GKERARTVGGVSLLSTMTTRQYH\
QKHLKSIVNTRGASV-VIGTTKFYGGWDNMLKNLIDGveNPCLMGWDYPKCDRALPNMIRMISAMILGSKHTTcCSSTDRFFRLCNELaqVLTEVVYSNGGFYLKPGGTTSGDATTAYANSVFNIfqavsanvnkllsvdsnvchnLEVKQLQRKL.....................YECCYRSTTVDDQ\
FVVEYYGYlrk.hfsMMiLSDDGVVCYNNDYASLgyvadLNAFKAVLYYQnNVFMSASKCWIEPDINKGPHEFCSQHTMQIVDKdGTYYLPYPDPSRILSAGVFVDDVVKTDavvlLERYVSLAIDAYPLSKHENPEYKKVFYVLLDWV-
--
>RdRP_1 SRR6713708.coronaspades.NODE_2_length_12004_cluster_2_candidate_1_domains_14_3/881-1342
-----SGIAKQSVKPGHFNQHFYKHLLDS.nLLDQLGIDIRHFYYMQDGEAAITDYSYYRYNTPTMVDIKMFLFCLEVaDKYLEPYEGGCINAQSVVVSNLDKSAGYPFNKLG--KARNYYDMTH-AEQNQLFEYTKRN--VLPTLTQMNLKYAIS-----------AKDRARTVAGVSIISTMTNRQYH\
QKMLKSISLARNQTI-VIGTTKFYGGWDNMLRRLMCNinNPILVGWDYPKCDRSMPNMLRIAASCLLARKHTC.CNQSQRFYRLANECcqVLSEVVVSGNNLYVKPGGTSSGDATTAYANSVFNI.....................LQVVSANVATflststtthlnkdiadlhrslYEDIYRGDSNDIT\
VINRFYQHlqsyfglMI.LSDDGVACIDSAVAKAgavadLDGFRDILFYQnNVYMADSKCWTETDMNVGPHEFCSQHTVLAEHDgKPYYLPYPDVSRILGACIFVDDVNKADpvqnLERYISLAIDAY----------------------
(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/SRR6713708/transeq$ grep -A 1 RdRP_1 alignments.fasta | egrep -v "^>" | tr -d '[a-z.]' | awk '{ print length }'
461
2
461
taltman commented 3 years ago

@rcedgar For your reference, as per this Wikipedia page on Stockholm format, the '.' is another valid character for representing an insertion, like '-'.

taltman commented 3 years ago

@rchikhi I just pushed my code changes to the Darth BitBucket. Reassigning this issue to you to confirm the fix. Please review my commit, incorporate, and test on the assemblies above. Note that I created an optional 7th argument to darth.sh that, when it equals 'canonicalize_only', it will just do the canonicalization step, and then exit, which makes confirming these fixes locally very fast, as you don't have to wait for VADR to complete.

taltman commented 3 years ago

@rchikhi Please try to do a quick test of my code ASAP, so that I have a chance to fix any bugs before 10 AM PDT, Tuesday morning. Thanks!

rchikhi commented 3 years ago

The new code implements the same awk fix as the one I ran yesterday. (I manually reviewed your commits, I don't think I missed any key difference but you never know..) So the data I handed to @rcedgar yesterday should be 'good', execpt for the '.' insertion character thing. To be clear, the tr -d '[a-z.]' part that deletes the '.' characters hasn't been implemented in darth (or canonicalize_contigs), but instead Tomer intends that the '.' characters should be parsed (or replaced by '') downnstream by @rcedgar (as part of #229).

taltman commented 3 years ago

The only tweak is that I make the deflines unique by doing 'print ">" pfam[id] " " id'. This is convenient, as it allows @rcedgar to see where these different RdRP_1's are coming from in the assembly.

There's nothing wrong with the '.' character; it is valid Stockholm format. The OTU pipeline must handle it gracefully, as you mentioned, as part of #229.

rcedgar commented 3 years ago

Dotted gaps are unusual (arguably non-standard) in a2m, but agreed I can delete them, and I should have caught this myself after Artem & Rayan's fix, my bad. Looks to be working now, closing this issue.