ababaian / serratus

Ultra-deep search for novel viruses
http://serratus.io
GNU General Public License v3.0
254 stars 33 forks source link

PFAM alignment bug #224

Closed rcedgar closed 4 years ago

rcedgar commented 4 years ago

Example s3://serratus-public/assemblies/annotations/DRR220591.fa.darth.alignments.fasta. Looks like multiple alignments are concatenated into one RdRP_1 sequence.

taltman commented 4 years ago

I looked into the issue, and updated my AWK script to account for multiple domain matches within an assembly. Here is a sample output, with each alignment sequence being 508 characters long. If it looks good to you, I'll commit the changes.

(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/DRR220591/transeq$ grep -A 1 -i RDRP /tmp/alignments-test.fasta 
>RdRP_1 DRR220591.coronaspades.NODE_2_length_27461_cluster_1_candidate_2_domains_26_1/4512-4998
---LASGITQQTVKPGHFNKDFYDFAEKAgmFKEGSSIPLKHFFYPQTGNAAINDYDYYRYNRPTMFDIRQLLFCLEVtSKYFECYEGGCIPASQVVVNNLDKSAGYPFNKFG--KARLYYEMS-LEEQDQLFESTKKN--VLPTITQMNLKYAIS-----------AKNRARTVAGVSILSTMTNRQFHQK\
ILKSIVNTRNAPV-VIGTTKFYGGWDNMLRNLIQGveDPILMGWDYPKCDRAMPNLLRIAASLVLARKHTNcCTWSERIYRLYNECaqVLSETVLATGGIYVKPGGTSSGDATTAYANSVFNIiqatsanvarllsvitrdiayddiKSLQYELYQQVYRRVNFDPAFVEKFYSYLCKnfslMILSDDGVVC\
YNNTLAKQglvadISGFREVLYYQnNVFMSDSKCWVEPDLEKGPHEFCSQHTMLVEVDgELKYLPYPDPSRILGACVFVDDVDKTEpvavMERYIALAIDAYPLVHHENEEYKKVFFVLLSYI-
--
>RdRP_1 DRR220591.coronaspades.NODE_1_length_27665_cluster_1_candidate_1_domains_26_1/4512-4998
---LASGITQQTVKPGHFNKDFYDFAEKAgmFKEGSSIPLKHFFYPQTGNAAINDYDYYRYNRPTMFDIRQLLFCLEVtSKYFECYEGGCIPASQVVVNNLDKSAGYPFNKFG--KARLYYEMS-LEEQDQLFESTKKN--VLPTITQMNLKYAIS-----------AKNRARTVAGVSILSTMTNRQFHQK\
ILKSIVNTRNAPV-VIGTTKFYGGWDNMLRNLIQGveDPILMGWDYPKCDRAMPNLLRIAASLVLARKHTNcCTWSERIYRLYNECaqVLSETVLATGGIYVKPGGTSSGDATTAYANSVFNIiqatsanvarllsvitrdiayddiKSLQYELYQQVYRRVNFDPAFVEKFYSYLCKnfslMILSDDGVVC\
YNNTLAKQglvadISGFREVLYYQnNVFMSDSKCWVEPDLEKGPHEFCSQHTMLVEVDgELKYLPYPDPSRILGACVFVDDVDKTEpvavMERYIALAIDAYPLVHHENEEYKKVFFVLLSYI-
--
>RdRP_1 DRR220591.coronaspades.NODE_4_length_18477_cluster_2_candidate_2_domains_21_2/1517-2003
---LASGITQQTVKPGHFNKDFYDFAEKAgmFKEGSSIPLKHFFYPQTGNAAINDYDYYRYNRPTMFDIRQLLFCLEVtSKYFECYEGGCIPASQVVVNNLDKSAGYPFNKFG--KARLYYEMS-LEEQDQLFESTKKN--VLPTITQMNLKYAIS-----------AKNRARTVAGVSILSTMTNRQFHQK\
ILKSIVNTRNAPV-VIGTTKFYGGWDNMLRNLIQGveDPILMGWDYPKCDRAMPNLLRIAASLVLARKHTNcCTWSERIYRLYNECaqVLSETVLATGGIYVKPGGTSSGDATTAYANSVFNIiqatsanvarllsvitrdiayddiKSLQYELYQQVYRRVNFDPAFVEKFYSYLCKnfslMILSDDGVVC\
YNNTLAKQglvadISGFREVLYYQnNVFMSDSKCWVEPDLEKGPHEFCSQHTMLVEVDgELKYLPYPDPSRILGACVFVDDVDKTEpvavMERYIALAIDAYPLVHHENEEYKKVFFVLLSYI-
--
>RdRP_1 DRR220591.coronaspades.NODE_3_length_18681_cluster_2_candidate_1_domains_21_2/1517-2003
---LASGITQQTVKPGHFNKDFYDFAEKAgmFKEGSSIPLKHFFYPQTGNAAINDYDYYRYNRPTMFDIRQLLFCLEVtSKYFECYEGGCIPASQVVVNNLDKSAGYPFNKFG--KARLYYEMS-LEEQDQLFESTKKN--VLPTITQMNLKYAIS-----------AKNRARTVAGVSILSTMTNRQFHQK\
ILKSIVNTRNAPV-VIGTTKFYGGWDNMLRNLIQGveDPILMGWDYPKCDRAMPNLLRIAASLVLARKHTNcCTWSERIYRLYNECaqVLSETVLATGGIYVKPGGTSSGDATTAYANSVFNIiqatsanvarllsvitrdiayddiKSLQYELYQQVYRRVNFDPAFVEKFYSYLCKnfslMILSDDGVVC\
YNNTLAKQglvadISGFREVLYYQnNVFMSDSKCWVEPDLEKGPHEFCSQHTMLVEVDgELKYLPYPDPSRILGACVFVDDVDKTEpvavMERYIALAIDAYPLVHHENEEYKKVFFVLLSYI-
taltman commented 4 years ago

If you look at the Pfam-style coordinates which are now included in the deflines, you will see that contigs Node_1 and Node_2 have the same residue coordinates, as do Node_3 and Node_4. @asl are these two contig duplicates due to coronaSPAdes?

Also, all four alignment sequences are exactly the same.

rcedgar commented 4 years ago

This looks good. FYI the total length can vary if the number of insertions varies (lower-case letters). The number of upper case letters plus gaps shown as dashes (-) should be the same for every a2m sequence; 461 for RdRP_1. Sometimes a2m has a second type of gap indicated by periods (.), these also indicate variable-length insertions but I haven't see any in these files.

Will the coordinates be available for all HMM alignments now? That would be perfect for checking domain layout, e.g. for the Epsys.

asl commented 4 years ago

If you look at the Pfam-style coordinates which are now included in the deflines, you will see that contigs Node_1 and Node_2 have the same residue coordinates, as do Node_3 and Node_4. @asl are these two contig duplicates due to coronaSPAdes?

Yes, it might be if we'd output multiple variants. I doubt that they will differ in RdRp, so everything seems as expected.

rcedgar commented 4 years ago

I think Coronaspades is calling variants in this example, the reads contain two different strains of virus. They may have SNPs elsewhere in the contig and it's not surprising if the RdRP is identical in cases like this.

rcedgar commented 4 years ago

Kudos to @asl on this BTW, I was a bit skeptical of CS at first but I shouldn't have been -- it's doing a great job,

taltman commented 4 years ago

Code committed, @rchikhi , could you please integrate the few changed lines in darth.sh into your image? Thanks!

rchikhi commented 4 years ago

done. ah nice. So can you please give me a list of accessions to rerun? (or a way to get such a list?)

taltman commented 4 years ago

I think @rchikhi has re-run some of the assemblies & their annotations using this version of Darth, so I will close this for now.

taltman commented 4 years ago

Reopening to confirm fix.

taltman commented 4 years ago

Confirming that four RdRP_1 domains are found, with the expected length:

(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/DRR220591/transeq$ grep -A 1 RdRP_1 alignments.fasta 
>RdRP_1 DRR220591.coronaspades.NODE_2_length_27461_cluster_1_candidate_2_domains_26_1/4512-4998
---LASGITQQTVKPGHFNKDFYDFAEKAgmFKEGSSIPLKHFFYPQTGNAAINDYDYYRYNRPTMFDIRQLLFCLEVtSKYFECYEGGCIPASQVVVNNLDKSAGYPFNKFG--KARLYYEMS-LEEQDQLFESTKKN--VLPTITQMNLKYAIS-----------AKNRARTVAGVSILSTMTNRQFH\
QKILKSIVNTRNAPV-VIGTTKFYGGWDNMLRNLIQGveDPILMGWDYPKCDRAMPNLLRIAASLVLARKHTNcCTWSERIYRLYNECaqVLSETVLATGGIYVKPGGTSSGDATTAYANSVFNIiqatsanvarllsvitrdiayddiKSLQYELYQQVYRRVNFDPAFVEKFYSYLCKnfslMILSDD\
GVVCYNNTLAKQglvadISGFREVLYYQnNVFMSDSKCWVEPDLEKGPHEFCSQHTMLVEVDgELKYLPYPDPSRILGACVFVDDVDKTEpvavMERYIALAIDAYPLVHHENEEYKKVFFVLLSYI-
--
>RdRP_1 DRR220591.coronaspades.NODE_1_length_27665_cluster_1_candidate_1_domains_26_1/4512-4998
---LASGITQQTVKPGHFNKDFYDFAEKAgmFKEGSSIPLKHFFYPQTGNAAINDYDYYRYNRPTMFDIRQLLFCLEVtSKYFECYEGGCIPASQVVVNNLDKSAGYPFNKFG--KARLYYEMS-LEEQDQLFESTKKN--VLPTITQMNLKYAIS-----------AKNRARTVAGVSILSTMTNRQFH\
QKILKSIVNTRNAPV-VIGTTKFYGGWDNMLRNLIQGveDPILMGWDYPKCDRAMPNLLRIAASLVLARKHTNcCTWSERIYRLYNECaqVLSETVLATGGIYVKPGGTSSGDATTAYANSVFNIiqatsanvarllsvitrdiayddiKSLQYELYQQVYRRVNFDPAFVEKFYSYLCKnfslMILSDD\
GVVCYNNTLAKQglvadISGFREVLYYQnNVFMSDSKCWVEPDLEKGPHEFCSQHTMLVEVDgELKYLPYPDPSRILGACVFVDDVDKTEpvavMERYIALAIDAYPLVHHENEEYKKVFFVLLSYI-
--
>RdRP_1 DRR220591.coronaspades.NODE_4_length_18477_cluster_2_candidate_2_domains_21_2/1517-2003
---LASGITQQTVKPGHFNKDFYDFAEKAgmFKEGSSIPLKHFFYPQTGNAAINDYDYYRYNRPTMFDIRQLLFCLEVtSKYFECYEGGCIPASQVVVNNLDKSAGYPFNKFG--KARLYYEMS-LEEQDQLFESTKKN--VLPTITQMNLKYAIS-----------AKNRARTVAGVSILSTMTNRQFH\
QKILKSIVNTRNAPV-VIGTTKFYGGWDNMLRNLIQGveDPILMGWDYPKCDRAMPNLLRIAASLVLARKHTNcCTWSERIYRLYNECaqVLSETVLATGGIYVKPGGTSSGDATTAYANSVFNIiqatsanvarllsvitrdiayddiKSLQYELYQQVYRRVNFDPAFVEKFYSYLCKnfslMILSDD\
GVVCYNNTLAKQglvadISGFREVLYYQnNVFMSDSKCWVEPDLEKGPHEFCSQHTMLVEVDgELKYLPYPDPSRILGACVFVDDVDKTEpvavMERYIALAIDAYPLVHHENEEYKKVFFVLLSYI-
--
>RdRP_1 DRR220591.coronaspades.NODE_3_length_18681_cluster_2_candidate_1_domains_21_2/1517-2003
---LASGITQQTVKPGHFNKDFYDFAEKAgmFKEGSSIPLKHFFYPQTGNAAINDYDYYRYNRPTMFDIRQLLFCLEVtSKYFECYEGGCIPASQVVVNNLDKSAGYPFNKFG--KARLYYEMS-LEEQDQLFESTKKN--VLPTITQMNLKYAIS-----------AKNRARTVAGVSILSTMTNRQFH\
QKILKSIVNTRNAPV-VIGTTKFYGGWDNMLRNLIQGveDPILMGWDYPKCDRAMPNLLRIAASLVLARKHTNcCTWSERIYRLYNECaqVLSETVLATGGIYVKPGGTSSGDATTAYANSVFNIiqatsanvarllsvitrdiayddiKSLQYELYQQVYRRVNFDPAFVEKFYSYLCKnfslMILSDD\
GVVCYNNTLAKQglvadISGFREVLYYQnNVFMSDSKCWVEPDLEKGPHEFCSQHTMLVEVDgELKYLPYPDPSRILGACVFVDDVDKTEpvavMERYIALAIDAYPLVHHENEEYKKVFFVLLSYI-
(base) ubuntu@ip-172-31-65-128:~/repos/darth/test/assemblies/DRR220591/transeq$ grep -A 1 RdRP_1 alignments.fasta | egrep -v "^>" | tr -d '[a-z]' | awk '{ print length }'
461
2
461
2
461
2
461