Closed rcedgar closed 3 years ago
Some of these viromes have hundreds of RdRps. If there are multiple species here, which seems likely, then IMO we should tackle these sooner rather than later.
SRR11097768 4323
SRR10854863 1005
SRR10854864 690
ERR3661016 556
SRR11647857 552
SRR11648360 550
The latest version of Darth captures multiple domain hits for a given Pfam model across multiple contigs. Closing for now. Please feel free to reopen if you catch this not working in a specific SRA.
Reopening to confirm fix.
Adding the following accessions as cases to double-check, as reported on Slack by @rcedgar : ERR963087 SRR10252889 ERR3460960 SRR6713876
Testing fix, one comment per accession tested:
ERR963087:
Two RdRP_1 entries found:
(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/ERR963087/transeq$ grep -A 1 RdRP_1 alignments.fasta
>RdRP_1 ERR963087.coronaspades.NODE_7_length_13230_cluster_7_candidate_1_domains_1_3/1884-2373
-AALTTGLTFQTVRPGNFNQDFYDFVVSKgfFKEGSSVTLKHFFFAQDGNAAITDYNYYSYNLPTMCDIKQMLFCMEVvNKYFEIYDGGCLNASEVVVNNLDKSAGHPFNKFG--KARVYYESMSYQEQDELFAMTKRN--VIPTMTQMNLKYAIS-----------AKNRARTVAGVSILSTMTNRQYH\
QKMLKSMAATRGATC-VIGTTKFYGGWDFMLKTLYKDvdNPHLMGWDYPKCDRAMPNMCRIFASLILARKHGTcCTTRDRFYRLANECaqVLSEYVLCGGGYYVKPGGTSSGDATTAYANSVFNIlqattanvsalmgangnK--------IvdkevkdmqfdlYVNVYRSTSPDPKFVDKYYAFlnkhf\
sMMiLSDDGVVCYNSDYAAKgyiagIQNFKETLYYQnNVFMSEAKCWVETDLKKGPHEFCSQHTLYIKDGdDGYFLPYPDPSRILSAGCFVDDIVKTDgtlmVERFVSLAIDAYPLTKHEDIEYQNVFWVYLQYI-
--
>RdRP_1 ERR963087.coronaspades.NODE_1_length_13230_cluster_1_candidate_1_domains_18_3/1884-2373
-AALTTGLTFQTVRPGNFNQDFYDFVVSKgfFKEGSSVTLKHFFFAQDGNAAITDYNYYSYNLPTMCDIKQMLFCMEVvNKYFEIYDGGCLNASEVVVNNLDKSAGHPFNKFG--KARVYYESMSYQEQDELFAMTKRN--VIPTMTQMNLKYAIS-----------AKNRARTVAGVSILSTMTNRQYH\
QKMLKSMAATRGATC-VIGTTKFYGGWDFMLKTLYKDvdNPHLMGWDYPKCDRAMPNMCRIFASLILARKHGTcCTTRDRFYRLANECaqVLSEYVLCGGGYYVKPGGTSSGDATTAYANSVFNIlqattanvsalmgangnK--------IvdkevkdmqfdlYVNVYRSTSPDPKFVDKYYAFlnkhf\
sMMiLSDDGVVCYNSDYAAKgyiagIQNFKETLYYQnNVFMSEAKCWVETDLKKGPHEFCSQHTLYIKDGdDGYFLPYPDPSRILSAGCFVDDIVKTDgtlmVERFVSLAIDAYPLTKHEDIEYQNVFWVYLQYI-
... and they have the expected length:
(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/ERR963087/transeq$ grep -A 1 RdRP_1 alignments.fasta | egrep -v "^>" | tr -d '[a-z]' | awk '{ print length }'
461
2
461
SRR10252889 looks good now:
(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/SRR10252889/transeq$ grep -A 1 RdRP_1 alignments.fasta
>RdRP_1 SRR10252889.coronaspades.NODE_1_length_29957_cluster_1_candidate_1_domains_35_1/4890-5379
-AALTTGLTFQTVRPGNFNQDFYDFVVSKgfFKEGSSVTLKHFFFAQDGNAAITDYNYYSYNLPTMCDIKQMLFCMEVvNKYFEIYDGGCLNASEVVVNNLDKSAGHPFNKFG--KARVYYESMSYQEQDELFAMTKRN--VIPTMTQMNLKYAIS-----------AKNRARTVAGVSILSTMTNRQYH\
QKMLKSMAATRGATC-VIGTTKFYGGWDFMLKTLYKDvdNPHLMGWDYPKCDRAMPNMCRIFASLILARKHGTcCTTRDRFYRLANECaqVLSEYVLCGGGYYVKPGGTSSGDATTAYANSVFNIlqattanvsalmgangnK--------IvdkevkdmqfdlYVNVYRSTSPDPKFVDKYYAFlnkhf\
sMMiLSDDGVVCYNSDYAAKgyiagIQNFKETLYYQnNVFMSEAKCWVETDLKKGPHEFCSQHTLYIKDGdDGYFLPYPDPSRILSAGCFVDDIVKTDgtlmVERFVSLAIDAYPLTKHEDIEYQNVFWVYLQYI-
--
>RdRP_1 SRR10252889.coronaspades.NODE_6_length_25745_cluster_6_candidate_1_domains_1_1/4524-5013
-AALTTGLTFQTVRPGNFNQDFYDFVVSKgfFKEGSSVTLKHFFFAQDGNAAITDYNYYSYNLPTMCDIKQMLFCMEVvNKYFEIYDGGCLNASEVVVNNLDKSAGHPFNKFG--KARVYYESMSYQEQDELFAMTKRN--VIPTMTQMNLKYAIS-----------AKNRARTVAGVSILSTMTNRQYH\
QKMLKSMAATRGATC-VIGTTKFYGGWDFMLKTLYKDvdNPHLMGWDYPKCDRAMPNMCRIFASLILARKHGTcCTTRDRFYRLANECaqVLSEYVLCGGGYYVKPGGTSSGDATTAYANSVFNIlqattanvsalmgangnK--------IvdkevkdmqfdlYVNVYRSTSPDPKFVDKYYAFlnkhf\
sMMiLSDDGVVCYNSDYAAKgyiagIQNFKETLYYQnNVFMSEAKCWVETDLKKGPHEFCSQHTLYIKDGdDGYFLPYPDPSRILSAGCFVDDIVKTDgtlmVERFVSLAIDAYPLTKHEDIEYQNVFWVYLQYI-
(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/SRR10252889/transeq$ grep -A 1 RdRP_1 alignments.fasta | egrep -v "^>" | tr -d '[a-z]' | awk '{ print length }'
461
2
461
ERR3460960 looks good too:
(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/ERR3460960/transeq$ grep -A 1 RdRP_1 alignments.fasta
>RdRP_1 ERR3460960.coronaspades.NODE_1_length_28843_cluster_1_candidate_1_domains_34_6/4546-5031
-----TGLTSQTVKPGHFNKEFYDFLRSQgfFDEGSELTLKHFFFTQKGDAAIKDFDYYRYNRPTMLDIGQARVAYQVaARYFDCYEGGCITSREVVVTNLNKSAGWPLNKFG--KAGLYYESISYEEQDAIFSLTKRN--ILPTMTQLNLKYAIS-----------GKERARTVGGVSLLATMTTRQFH\
QKCLKSIVATRNATV-VIGTTKFYGGWDNMLKNLMADvdDPKLMGWDYPKCDRAMPSMIRMLSAMILGSKHVTcCTASDKFYRLSNELaqVLTEVVYSNGGFYFKPGGTTSGDATTAYANSVFNIfqavssnincvlsvnssncnnFNVKKLQRQLYDNCYRNSNVDESFVDDFYGYlqkhfsmMILSDD\
SVVCYNKTYAGLgyiadISAFKATLYYQnGVFMSTAKCWTEEDLSIGPHEFCSQHTMQIVDEnGKYYLPYPDPSRIISAGVFVDDITKTDavilLERYVSLAIDAYPLSKHPKPEYRKVFYALLDWV-
--
>RdRP_1 ERR3460960.coronaspades.NODE_3_length_28843_cluster_3_candidate_1_domains_1_3/4546-5031
-----TGLTSQTVKPGHFNKEFYDFLRSQgfFDEGSELTLKHFFFTQKGDAAIKDFDYYRYNRPTMLDIGQARVAYQVaARYFDCYEGGCITSREVVVTNLNKSAGWPLNKFG--KAGLYYESISYEEQDAIFSLTKRN--ILPTMTQLNLKYAIS-----------GKERARTVGGVSLLATMTTRQFH\
QKCLKSIVATRNATV-VIGTTKFYGGWDNMLKNLMADvdDPKLMGWDYPKCDRAMPSMIRMLSAMILGSKHVTcCTASDKFYRLSNELaqVLTEVVYSNGGFYFKPGGTTSGDATTAYANSVFNIfqavssnincvlsvnssncnnFNVKKLQRQLYDNCYRNSNVDESFVDDFYGYlqkhfsmMILSDD\
SVVCYNKTYAGLgyiadISAFKATLYYQnGVFMSTAKCWTEEDLSIGPHEFCSQHTMQIVDEnGKYYLPYPDPSRIISAGVFVDDITKTDavilLERYVSLAIDAYPLSKHPKPEYRKVFYALLDWV-
(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/ERR3460960/transeq$ grep -A 1 RdRP_1 alignments.fasta | egrep -v "^>" | tr -d '[a-z]' | awk '{ print length }'
461
2
461
SRR6713876 looks good too, three RdRP_1 domains found:
(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/SRR6713876/transeq$ grep -A 1 RdRP_1 alignments.fasta
>RdRP_1 SRR6713876.coronaspades.NODE_3_length_28164_cluster_3_candidate_1_domains_1_3/4616-5100
------GMTNQTVKPGHFNKEFYDFLLEQgfFSEGSELTLKHFFFAQKGDAAVKDFDYYRYNRPTVLDICQARVVYQIvQRYFDIYEGGCITAKEVVVTNLNKSAGYPLNKFG--KAGLYYESLSYEEQDELYAYTKRN--ILPTMTQLNLKYAIS-----------GKERARTVGGVSLLSTMTTRQYH\
QKHLKSIVNTRGASV-VIGTTKFYGGWDNMLKNLIDGveNPCLMGWDYPKCDRALPNMIRMISAMILGSKHTTcCSSTDRFFRLCNELaqVLTEVVYSNGGFYLKPGGTTSGDATTAYANSVFNIfqavsanvnkllsvdsnvchnLEVKQLQRKLYECCYRSTTVDDQFVVEYYGYlrkhfsMMiLSDD\
GVVCYNNDYASLgyvadLNAFKAVLYYQnNVFMSASKCWIEPDINKGPHEFCSQHTMQIVDKdGTYYLPYPDPSRILSAGVFVDDVVKTDavvlLERYVSLAIDAYPLSKHENPEYKKVFYVLLDWV-
--
>RdRP_1 SRR6713876.coronaspades.NODE_2_length_28164_cluster_2_candidate_1_domains_9_3/4616-5100
------GMTNQTVKPGHFNKEFYDFLLEQgfFSEGSELTLKHFFFAQKGDAAVKDFDYYRYNRPTVLDICQARVVYQIvQRYFDIYEGGCITAKEVVVTNLNKSAGYPLNKFG--KAGLYYESLSYEEQDELYAYTKRN--ILPTMTQLNLKYAIS-----------GKERARTVGGVSLLSTMTTRQYH\
QKHLKSIVNTRGASV-VIGTTKFYGGWDNMLKNLIDGveNPCLMGWDYPKCDRALPNMIRMISAMILGSKHTTcCSSTDRFFRLCNELaqVLTEVVYSNGGFYLKPGGTTSGDATTAYANSVFNIfqavsanvnkllsvdsnvchnLEVKQLQRKLYECCYRSTTVDDQFVVEYYGYlrkhfsMMiLSDD\
GVVCYNNDYASLgyvadLNAFKAVLYYQnNVFMSASKCWIEPDINKGPHEFCSQHTMQIVDKdGTYYLPYPDPSRILSAGVFVDDVVKTDavvlLERYVSLAIDAYPLSKHENPEYKKVFYVLLDWV-
--
>RdRP_1 SRR6713876.coronaspades.NODE_1_length_28164_cluster_1_candidate_1_domains_32_3/4616-5100
------GMTNQTVKPGHFNKEFYDFLLEQgfFSEGSELTLKHFFFAQKGDAAVKDFDYYRYNRPTVLDICQARVVYQIvQRYFDIYEGGCITAKEVVVTNLNKSAGYPLNKFG--KAGLYYESLSYEEQDELYAYTKRN--ILPTMTQLNLKYAIS-----------GKERARTVGGVSLLSTMTTRQYH\
QKHLKSIVNTRGASV-VIGTTKFYGGWDNMLKNLIDGveNPCLMGWDYPKCDRALPNMIRMISAMILGSKHTTcCSSTDRFFRLCNELaqVLTEVVYSNGGFYLKPGGTTSGDATTAYANSVFNIfqavsanvnkllsvdsnvchnLEVKQLQRKLYECCYRSTTVDDQFVVEYYGYlrkhfsMMiLSDD\
GVVCYNNDYASLgyvadLNAFKAVLYYQnNVFMSASKCWIEPDINKGPHEFCSQHTMQIVDKdGTYYLPYPDPSRILSAGVFVDDVVKTDavvlLERYVSLAIDAYPLSKHENPEYKKVFYVLLDWV-
(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/SRR6713876/transeq$ grep -A 1 RdRP_1 alignments.fasta | egrep -v "^>" | tr -d '[a-z]' | awk '{ print length }'
461
2
461
2
461
I spot-checked one of the accessions that @rcedgar reported had 'hundreds' of RdRPs: SRR11648360. It had no RdRPs found. What was the basis for saying that there are hundreds of RdRP domains in the assembly?
bgc_statistics.txt of SRR11648360 shows # RdRP_1-domains - 1100
I'm having:
Internal pipeline statistics summary:
-------------------------------------
Query model(s): 1 (461 nodes)
Target sequences: 3840 (3432786 residues searched)
Passed MSV filter: 577 (0.15026); expected 76.8 (0.02)
Passed bias filter: 567 (0.147656); expected 76.8 (0.02)
Passed Vit filter: 550 (0.143229); expected 3.8 (0.001)
Passed Fwd filter: 549 (0.142969); expected 0.0 (1e-05)
Initial search space (Z): 3840 [actual number of targets]
Domain search space (domZ): 548 [number of targets reported over threshold]
Here is what I get from Pfam:
# --- full sequence --- -------------- this domain ------------- hmm coord ali coord env coord
# target name accession tlen query name accession qlen E-value score bias # of c-Evalue i-Evalue score bias from to from to from to acc descripti\
on of target
#------------------- ---------- ----- -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------\
------------
NODE_69_length_335_cluster_69_candidate_1_domains_1_3 - 111 CoV_NSP7 PF08716.11 83 2.1e-14 42.8 0.2 1 1 4.2e-15 2.5e-14 42.5 0.2 3 83 16\
96 14 96 0.97 -
#
# Program: hmmsearch
# Version: 3.1b2 (February 2015)
# Pipeline mode: SEARCH
# Query file: /root/data/Pfam-A.SARS-CoV-2.hmm
# Target file: trans-6-frame.fasta
# Option settings: hmmsearch -A match-alignments.sto --domtblout hmmsearch-matches.txt --cut_ga /root/data/Pfam-A.SARS-CoV-2.hmm trans-6-frame.fasta
# Current dir: /output/transeq
# Date: Tue Jul 28 06:41:27 2020
# [ok]
@asl, what do you get when you do a Pfam on this assembly?
s3://serratus-rayan/master_table_assemblies/SRR11648360.fa
Mine is
transeq -frame 6 SRR11648360.coronaspades.gene_clusters.fa SRR11648360.coronaspades.gene_clusters.aa
Can you confirm that SRR11648360.coronaspades.gene_clusters.fa
is bit-identical with s3://serratus-rayan/master_table_assemblies/SRR11648360.fa
?
I think some terminology might help clarify things. Are you testing the "raw" assembly, or the "filtered" assembly (using checkv or BGC approach)? And is the file that I am copying from S3 using the same filtering approach as the (filtered?) assembly that you are analyzing?
@asl Unless I hear otherwise from you, I believe the point that you are raising is an orthogonal concern to the multiple RdRP issue. I am going to create a separate issue for your point, so that I can commit my code fixes and ask @rchikhi to confirm my fixes before closing the issue.
Adding in analysis of the following accessions, because @rcedgar noticed non-standard alignment length: SRR5912616 SRR5912645 SRR6713708
SRR5912616: The alignments have the correct length if we also take care to remove '.' (period) entries:
(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/SRR5912616/transeq$ grep -A 1 RdRP_1 alignments.fasta
>RdRP_1 SRR5912616.coronaspades.NODE_2_length_30135_cluster_2_candidate_1_domains_27_2/4875-5364
-AALTTGLTFQTVRPGNFNQDFYDFVVSKgfFKEGSSVTLKHFFFAQDGNAAITDYNYYSYNLPTMCDIKQMLFCMEVvNKYFEIYDGGCLNASEVVVNNLDKSAGHPFNKFG--KARVYYESMSYQEQDELFAMTKRN--VIPTMTQMNLKYAIS-----------AKNRARTVAGVSILSTMTNRQYH\
QKMLKSMAATRGATC-VIGTTKFYGGWDFMLKTLYKDvdNPHLMGWDYPKCDRAMPNMCRIFASLILARKHGTcCTTRDRFYRLANECaqVLSEYVLCGGGYYVKPGGTSSGDATTAYANSVFNIlqattanv....salmgangnK--------IvdkevkdmqfdlYVNVYRSTSPDPKFVDKYYAFl\
nk.hfsMMiLSDDGVVCYNSDYAAKgyiagIQNFKETLYYQnNVFMSEAKCWVETDLKKGPHEFCSQHTLYIKDGdDGYFLPYPDPSRILSAGCFVDDIVKTDgtlmVERFVSLAIDAYPLTKHEDIEYQNVFWVYLQYI-
--
>RdRP_1 SRR5912616.coronaspades.NODE_1_length_27639_cluster_1_candidate_1_domains_36_1/4608-5093
-----TGLTSQTVKPGHFNKEFYDFLRSQgfFDEGSELTLKHFFFTQKGDAAIKDFDYYRYNRPTMLDIGQARVAYQVaSRYFDCYEGGCITSREVVVTNLNKSAGWPLNKFG--KAGLYYESISYEEQDAMFALTKRN--ILPTMTQLNLKYAIS-----------GKERARTVGGVSLLATMTTRQFH\
QKCLKSIVATRNATV-VIGTTKFYGGWDNMLKNLIADvdDPKLMGWDYPKCDRAMPSMIRMLSAMILGSKHVTcCTASDKFYRLSNELaqVLTEVVYSNGGFYFKPGGTTSGDATTAYANSVFNIfqavssninrilsvnssncnnLNVKKLQKQL............YDNCYRNSNVDESFVDDFYGYl\
qkhfsmMI.LSDDGVVCYNKIYAELgyiadISAFKATLYYQnGVFMSTAKCWTEEDLSVGPHEFCSQHTMQIVDEnGKYYLPYPDPSRIISAGVFVDDITKTDavilLERYVSLAIDAYPLSKHPKPEYRKVFYALLDWV-
--
>RdRP_1 SRR5912616.coronaspades.NODE_3_length_30135_cluster_3_candidate_1_domains_11_6/4875-5364
-AALTTGLTFQTVRPGNFNQDFYDFVVSKgfFKEGSSVTLKHFFFAQDGNAAITDYNYYSYNLPTMCDIKQMLFCMEVvNKYFEIYDGGCLNASEVVVNNLDKSAGHPFNKFG--KARVYYESMSYQEQDELFAMTKRN--VIPTMTQMNLKYAIS-----------AKNRARTVAGVSILSTMTNRQYH\
QKMLKSMAATRGATC-VIGTTKFYGGWDFMLKTLYKDvdNPHLMGWDYPKCDRAMPNMCRIFASLILARKHGTcCTTRDRFYRLANECaqVLSEYVLCGGGYYVKPGGTSSGDATTAYANSVFNIlqattanv....salmgangnK--------IvdkevkdmqfdlYVNVYRSTSPDPKFVDKYYAFl\
nk.hfsMMiLSDDGVVCYNSDYAAKgyiagIQNFKETLYYQnNVFMSEAKCWVETDLKKGPHEFCSQHTLYIKDGdDGYFLPYPDPSRILSAGCFVDDIVKTDgtlmVERFVSLAIDAYPLTKHEDIEYQNVFWVYLQYI-
(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/SRR5912616/transeq$ grep -A 1 RdRP_1 alignments.fasta | egrep -v "^>" | tr -d '[a-z]' | awk '{ print length }'
466
2
474
2
466
(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/SRR5912616/transeq$ grep -A 1 RdRP_1 alignments.fasta | egrep -v "^>" | tr -d '[a-z.]' | awk '{ print length }'
461
2
461
2
461
Removal of pesky periods for SRR5912645 seems to do the trick as well:
(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/SRR5912645/transeq$ grep -A 1 RdRP_1 alignments.fasta
>RdRP_1 SRR5912645.coronaspades.NODE_2_length_29985_cluster_2_candidate_1_domains_34_2/4846-5335
-AALTTGLTFQTVRPGNFNQDFYDFVVSKgfFKEGSSVTLKHFFFAQDGNAAITDYNYYSYNLPTMCDIKQMLFCMEVvNKYFEIYDGGCLNASEVVVNNLDKSAGHPFNKFG--KARVYYESMSYQEQDELFAMTKRN--VIPTMTQMNLKYAIS-----------AKNRARTVAGVSILSTMTNRQYH\
QKMLKSMAATRGATC-VIGTTKFYGGWDFMLKTLYKDvdNPHLMGWDYPKCDRAMPNMCRIFASLILARKHGTcCTTRDRFYRLANECaqVLSEYVLCGGGYYVKPGGTSSGDATTAYANSVFNIlqattanv....salmgangnK--------IvdkevkdmqfdlYVNVYRSTSPDPKFVDKYYAFl\
nk.hfsMMiLSDDGVVCYNSDYAAKgyiagIQNFKETLYYQnNVFMSEAKCWVETDLKKGPHEFCSQHTLYIKDGdDGYFLPYPDPSRILSAGCFVDDIVKTDgtlmVERFVSLAIDAYPLTKHEDIEYQNVFWVYLQYI-
--
>RdRP_1 SRR5912645.coronaspades.NODE_1_length_27457_cluster_1_candidate_1_domains_38_5/4586-5071
-----TGLTSQTVKPGHFNKEFYDFLRSQgfFDEGSELTLKHFFFTQKGDAAIKDFDYYRYNRPTMLDIGQARVAYQVaSRYFDCYEGGCITSREVVVTNLNKSAGWPLNKFG--KAGLYYESISYEEQDAMFALTKRN--ILPTMTQLNLKYAIS-----------GKERARTVGGVSLLATMTTRQFH\
QKCLKSIVATRNATV-VIGTTKFYGGWDNMLKNLIADvdDPKLMGWDYPKCDRAMPSMIRMLSAMILGSKHVTcCTASDKFYRLSNELaqVLTEVVYSNGGFYFKPGGTTSGDATTAYANSVFNIfqavssninrilsvnssncnnLNVKKLQKQL............YDNCYRNSNVDESFVDDFYGYl\
qkhfsmMI.LSDDGVVCYNKIYAELgyiadISAFKATLYYQnGVFMSTAKCWTEEDLSVGPHEFCSQHTMQIVDEnGKYYLPYPDPSRIISAGVFVDDITKTDavilLERYVSLAIDAYPLSKHPKPEYRKVFYALLDWV-
(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/SRR5912645/transeq$ grep -A 1 RdRP_1 alignments.fasta | egrep -v "^>" | tr -d '[a-z.]' | awk '{ print length }'
461
2
461
... and same scenario for SRR6713708:
(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/SRR6713708/transeq$ grep -A 1 RdRP_1 alignments.fasta
>RdRP_1 SRR6713708.coronaspades.NODE_1_length_28131_cluster_1_candidate_1_domains_36_6/4604-5088
------GMTNQTVKPGHFNKEFYDFLLEQgfFSEGSELTLKHFFFAQKGDAAVKDFDYYRYNRPTVLDICQARVVYQIvQRYFDIYEGGCITAKEVVVTNLNKSAGYPLNKFG--KAGLYYESLSYEEQDELYAYTKRN--ILPTMTQLNLKYAIS-----------GKERARTVGGVSLLSTMTTRQYH\
QKHLKSIVNTRGASV-VIGTTKFYGGWDNMLKNLIDGveNPCLMGWDYPKCDRALPNMIRMISAMILGSKHTTcCSSTDRFFRLCNELaqVLTEVVYSNGGFYLKPGGTTSGDATTAYANSVFNIfqavsanvnkllsvdsnvchnLEVKQLQRKL.....................YECCYRSTTVDDQ\
FVVEYYGYlrk.hfsMMiLSDDGVVCYNNDYASLgyvadLNAFKAVLYYQnNVFMSASKCWIEPDINKGPHEFCSQHTMQIVDKdGTYYLPYPDPSRILSAGVFVDDVVKTDavvlLERYVSLAIDAYPLSKHENPEYKKVFYVLLDWV-
--
>RdRP_1 SRR6713708.coronaspades.NODE_2_length_12004_cluster_2_candidate_1_domains_14_3/881-1342
-----SGIAKQSVKPGHFNQHFYKHLLDS.nLLDQLGIDIRHFYYMQDGEAAITDYSYYRYNTPTMVDIKMFLFCLEVaDKYLEPYEGGCINAQSVVVSNLDKSAGYPFNKLG--KARNYYDMTH-AEQNQLFEYTKRN--VLPTLTQMNLKYAIS-----------AKDRARTVAGVSIISTMTNRQYH\
QKMLKSISLARNQTI-VIGTTKFYGGWDNMLRRLMCNinNPILVGWDYPKCDRSMPNMLRIAASCLLARKHTC.CNQSQRFYRLANECcqVLSEVVVSGNNLYVKPGGTSSGDATTAYANSVFNI.....................LQVVSANVATflststtthlnkdiadlhrslYEDIYRGDSNDIT\
VINRFYQHlqsyfglMI.LSDDGVACIDSAVAKAgavadLDGFRDILFYQnNVYMADSKCWTETDMNVGPHEFCSQHTVLAEHDgKPYYLPYPDVSRILGACIFVDDVNKADpvqnLERYISLAIDAY----------------------
(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/SRR6713708/transeq$ grep -A 1 RdRP_1 alignments.fasta | egrep -v "^>" | tr -d '[a-z.]' | awk '{ print length }'
461
2
461
@rcedgar For your reference, as per this Wikipedia page on Stockholm format, the '.' is another valid character for representing an insertion, like '-'.
@rchikhi I just pushed my code changes to the Darth BitBucket. Reassigning this issue to you to confirm the fix. Please review my commit, incorporate, and test on the assemblies above. Note that I created an optional 7th argument to darth.sh that, when it equals 'canonicalize_only', it will just do the canonicalization step, and then exit, which makes confirming these fixes locally very fast, as you don't have to wait for VADR to complete.
@rchikhi Please try to do a quick test of my code ASAP, so that I have a chance to fix any bugs before 10 AM PDT, Tuesday morning. Thanks!
The new code implements the same awk
fix as the one I ran yesterday. (I manually reviewed your commits, I don't think I missed any key difference but you never know..) So the data I handed to @rcedgar yesterday should be 'good', execpt for the '.' insertion character thing.
To be clear, the tr -d '[a-z.]'
part that deletes the '.' characters hasn't been implemented in darth (or canonicalize_contigs), but instead Tomer intends that the '.' characters should be parsed (or replaced by '') downnstream by @rcedgar (as part of #229).
The only tweak is that I make the deflines unique by doing 'print ">" pfam[id] " " id'. This is convenient, as it allows @rcedgar to see where these different RdRP_1's are coming from in the assembly.
There's nothing wrong with the '.' character; it is valid Stockholm format. The OTU pipeline must handle it gracefully, as you mentioned, as part of #229.
Dotted gaps are unusual (arguably non-standard) in a2m, but agreed I can delete them, and I should have caught this myself after Artem & Rayan's fix, my bad. Looks to be working now, closing this issue.
Spot-check suggests that at most one RdRp per assembly is captured in the most recent PFAM alignments. If there is >1, we should capture all of them. Punt for now, revisit for next round of analysis after PSI-Serratus. See discussion in #212.