Order FASTA sequences and added warnings for CDS-prot mismatches

guignonv commented 2 years ago

FASTA sequences are automatically ordered when loaded through read_FASTA_file_array() (either alphabetically or numerically according to sequence name nature) get_homologues-est.pl warns when it detects CDS and protein sequences that do not appear to be matching. When the sequence names are different (discarding FASTA sequence header comments) between .fna and .faa, the protein file will be discarded. If a sequence appears to be completely different in CDS or protein (protein not being a translation of a CDS since their length differ too much), just a warning is displayed for the sequence (bu the protein .faa is kept).

eead-csic-compbio commented 2 years ago

Hi @guignonv , many thanks for taking the time to write this! As you can see your commits currently conflict with my proposed solution, which involved creating a new subroutine same_sequence_order in PhyTools (please see 14c4b46e21e5b953192258be1e19b5947dd1d094). My preference would be to:

1) ask you to edit your suggested code so that sequence order is actually checked in that sub (same_sequence_order). As you can see, I went all the way and actually translated the sequences, but did not checked their lengths as you did, which could make the test faster, can you please add that bit of logic?

2) I still prefer not to change the order of sequences inferred from the FASTA file from within the code, as I have seen cases where names are not conserved. I therefore prefer to leave that up to the user and don't use the .faa file in that case. I prefer to keep that simple, ok?

Wait for your comments, Bruno

guignonv commented 2 years ago

I need time to process those changes. However, regarding your comment 2. ("names are not conserved"), I can say that I also noticed the same problem in the genomes I use. So, it's true that different names should not be "blocking" the use of the protein file. But to go further, regarding the translation of CDS, I also met genomes that do not provide the exact translation of a CDS. That's why I checked the length rather than the content, with a magnitude of +/- 1 amino acid due to frame reading shifting.

Here is an example of gene from Medicago truncatula (Tang et al., 2014, https://www.jcvi.org/publications/improved-genome-release-version-mt40-model-legume-medicago-truncatula). CDS sequence:

>Medtr8g470560.1 | LRR receptor-like kinase family protein | LC | chr8:25717385-25714796 | 20140818
ACCATCTTAATAACATTGTTTCTTATCCATTTTCAGGTACGATTCCCGAGGAGATTGGCT
ATCTTGATAAACTTGAGGTGTTATACTTGTATAACAATAGCTTAAGTGGATCTATTCCTT
CCAAAATCTTCAACTTGTCATCACTCACTCATTTGGGGGTTGATCAGAATAGCCTCTCAG
GCACACTTCCATCAAATACGGGATATAGCCTTCCTAACCTGCAATATCTATACTTGAATC
ATAACAATTTTGTTGGAAATATTCCAAATAACATATTCAACTCTTCTAATCTTATTATTT
TTCAATTGCATGACAATGCATTCAGTGGAACTCTACCCAATATTGCTTTTGGAGATTTAG
GATTGCTTGAATCGTTTCGCATATATAACAACAATTTGACAATAGAAGATTCTCATCAAT
TCTTTACTTCCTTGACAAATTGTAGATATTTGAAGTATCTAGACTTATCAGGGAATCATA
TATCCAATCTTCCTAAGTCAATTGGAAACATAACATCAGAATTCTTCCGAGCAGCATCAT
GTGGAATTGATGGTAATATTCCCCAAGAAGTTGGAAACATGACCAACTTGCTACTTCTTT
CTATTTTTGGGAATAATATAACTGGACGAATACCTGGTACATTCAAAGAGTTGCAGAAAC
TTCAGTATTTGAATCTTGGCAACAATGGACTACAAGGATCATTTATTGAAGAGTTTTGTG
AAATGAAGAGTTTGGGTGAGTTGTATCTGGAAAATAATAAGCTCTCTGGAGTTTTACCAA
CATGTTTGGGAAATATGACTTCTCTTAGAATATTGAACATTGGATCTAATGATTTGAACT
CTAAGATACCGTCCTCTCTTTGGAGTCTCAAAGATATCTTACTAGTAAATTTGTTCTCTA
ATGCTCTCATTGGTGATCTTCCACCTGAGGTTGGGAATTTGAGACAAATTGTAGTATTAG
ATCTATCAAGAAATCATATTTCAAGAAACATTCCCACAACCATCAGTTCCTTACAAAATT
TGCAGACTCTGTCCTTAGCTCATAACAAATTGAATGGATCAATTCCCTCATCACTTAGTG
AAATGGTAAGCTTGGTCTCTTTGGACTTGTCCCAAAATATGCTAGATGGTGTTATTCCAA
AATCCTTAGAATCACTTTTGTACCTTCAAAACATCAACTTCTCATATAATAGATTACAAG
GAGAGATTCCTGATGGTGGACATTTCAAAAACTTCACAGCTCAATCATTCATGCATAATG
ATGCACTTTGCGGTGATCCTCGACTCATAGTACCTCCATGTGATAAGCAAGTTAAAAAAT
GGTCAATGGAAAAAAAGCTTATATTGAAATGCATACTTCCTATAGTTGTGTCAGTCGTTT
TGATTGTTGCGTGCATCATACTTTTAAAACATAATAAAGGGAAAAAGAACGAAACTACTC
TTGAAAGGGGTTTTTCAACTTTAGGAGCTCCAAGAAGAATATCCTATTATGAAATTGTGC
AAGCAACTAATGGATTCAATGAGAGTAATTTCCTTGGAAGGGGGGGATTTGGCTCTGTTT
ATCAAGGAAAGCTTCATGATGGTGAGATGATTGCAGTAAAAGTAATTGATTTGCAATCAG
AGGCAAAATCAAAGAGCTTTGATGCAGAATGCAATGCGATGAGAAATCTACGACATCGGA
ATCTGGTAAAGATTATCAGAAGTTGCTCAAATCTTGATTTCAAATCATTGGTGATGGAGT
TCATGTCAAATGGAAGTGTAGAAAAATGGTTATATTCAAATAAATATTGTCTAAGTTTCT
TGCAAAGGTTAAATATAATGATAGATGTTGCATCTGCATTGGAATATCTCCATCGTGGTT
CTTCAATACCAGTGGTTCATTGTGATCTAAAGCCTTCCAATGTCTTGTTGGATGAAAATA
TGGTTGCACATGTTAGCGATTTTGGTATTGCAAAGCTCATGGATGAAGGACAATCTCAAA
CTCATACACAAACTTTGGCTACTATTGGATATCTTGCACCAGAGTATGGTTCTAGAGGAA
TTGTTTCTGTCAAAGGAGACGTGTACAGCTATGGGATTATGCTAATGGAAATCTTAACAA
GAAAAAAGCCAACAGATGATATGTTTGTTGCAGAACTAAGCTTGAAGACATGGATCAGTG
AATCATTGCCTAATTCAATTATGGAGGTCATGGATTCAAATTTAGTCCAAATAACTGGGG
ACCAAATTGATGATATATCGACTCACATGTCATCTATTTTTAGTTTAGCCCTGAGTTGTT
GTGAAAATTCACCTGAAGCAAGAATTAATATGGCAGATGTTATTGCGTCGCTAATGAAAA
TAAAGGCTTTGGTTCTTGGCGCAAACAGGGTCTAG

Protein sequence:

>Medtr8g470560.1 | LRR receptor-like kinase family protein | LC | chr8:25717385-25714796 | 20140818
HLNNIVSYPFSGTIPEEIGYLDKLEVLYLYNNSLSGSIPSKIFNLSSLTHLGVDQNSLSG
TLPSNTGYSLPNLQYLYLNHNNFVGNIPNNIFNSSNLIIFQLHDNAFSGTLPNIAFGDLG
LLESFRIYNNNLTIEDSHQFFTSLTNCRYLKYLDLSGNHISNLPKSIGNITSEFFRAASC
GIDGNIPQEVGNMTNLLLLSIFGNNITGRIPGTFKELQKLQYLNLGNNGLQGSFIEEFCE
MKSLGELYLENNKLSGVLPTCLGNMTSLRILNIGSNDLNSKIPSSLWSLKDILLVNLFSN
ALIGDLPPEVGNLRQIVVLDLSRNHISRNIPTTISSLQNLQTLSLAHNKLNGSIPSSLSE
MVSLVSLDLSQNMLDGVIPKSLESLLYLQNINFSYNRLQGEIPDGGHFKNFTAQSFMHND
ALCGDPRLIVPPCDKQVKKWSMEKKLILKCILPIVVSVVLIVACIILLKHNKGKKNETTL
ERGFSTLGAPRRISYYEIVQATNGFNESNFLGRGGFGSVYQGKLHDGEMIAVKVIDLQSE
AKSKSFDAECNAMRNLRHRNLVKIIRSCSNLDFKSLVMEFMSNGSVEKWLYSNKYCLSFL
QRLNIMIDVASALEYLHRGSSIPVVHCDLKPSNVLLDENMVAHVSDFGIAKLMDEGQSQT
HTQTLATIGYLAPEYGSRGIVSVKGDVYSYGIMLMEILTRKKPTDDMFVAELSLKTWISE
SLPNSIMEVMDSNLVQITGDQIDDISTHMSSIFSLALSCCENSPEARINMADVIASLMKI
KALVLGANRV*

That protein has been translated using the 3rd reading frame. If you translate it using the first reading frame, you would get that: TILITLFLIHFQVRFPRRLAILINLRCYTCITIA-VDLFLPKSSTCHHSLIWGLIRIASQAHFHQIRDIAFLTCNIYT-IITILLEIFQITYSTLLILLFFNCMTMHSVELYPILLLEI-DCLNRFAYITTI-Q-KILINSLLP-QIVDI-SI-TYQGIIYPIFLSQLET-HQNSSEQHHVELMVIFPKKLET-PTCYFFLFLGII-LDEYLVHSKSCRNFSI-ILATMDYKDHLLKSFVK-RVWVSCIWKIISSLEFYQHVWEI-LLLEY-TLDLMI-TLRYRPLFGVSKISY--ICSLMLSLVIFHLRLGI-DKL-Y-IYQEIIFQETFPQPSVPYKICRLCP-LITN-MDQFPHHLVKW-AWSLWTCPKIC-MVLFQNP-NHFCTFKTSTSHIIDYKERFLMVDISKTSQLNHSCIMMHFAVILDS-YLHVISKLKNGQWKKSLY-NAYFL-LCQSF-LLRASYF-NIIKGKRTKLLLKGVFQL-ELQEEYPIMKLCKQLMDSMRVISLEGGDLALFIKESFMMVR-LQ-K-LICNQRQNQRALMQNAMR-EIYDIGIW-RLSEVAQILISNHW-WSSCQMEV-KNGYIQINIV-VSCKG-I---MLHLHWNISIVVLQYQWFIVI-SLPMSCWMKIWLHMLAILVLQSSWMKDNLKLIHKLWLLLDILHQSMVLEELFLSKETCTAMGLC-WKS-QEKSQQMICLLQN-A-RHGSVNHCLIQLWRSWIQI-SK-LGTKLMIYRLTCHLFLV-P-VVVKIHLKQELIWQMLLRR--K-RLWFLAQTGS

Which will obviously not match the given protein. So, matching using translation might be a bit too complex. I'll investigate the code on next Monday or Tuesday but maybe we should consider a different approach: how about just warning the user about detected discrepancies between CDS and protein (different names, different size) without discarding the use of the protein file and add a parameter to get_homologues-est that would enable FASTA reordering based on names for people who don't want to struggle with their FASTA file to have genes ordered properly? Still, there is the problem of corresponding names... in my experience, it's often a story of genes using a "G" and protein using a "P" at a given place in the names (sometimes capital letter, sometimes not). Maybe, we could add a function to "smart-guess" that kind of difference of even change the name matching method to ignore differences between a P and a G... I'm in "brainstorming" mode so anybody is welcome to give other (better) ideas! ;-)

eead-csic-compbio commented 2 years ago

I need time to process those changes. However, regarding your comment 2. ("names are not conserved"), I can say that I also noticed the same problem in the genomes I use. So, it's true that different names should not be "blocking" the use of the protein file.

Agree.

But to go further, regarding the translation of CDS, I also met genomes that do not provide the exact translation of a CDS. That's why I checked the length rather than the content, with a magnitude of +/- 1 amino acid due to frame reading shifting.

Checking the length first would save actually the translation in cases with unsorted sequences, so I think it's a good idea.

Here is an example of gene from Medicago truncatula (Tang et al., 2014, https://www.jcvi.org/publications/improved-genome-release-version-mt40-model-legume-medicago-truncatula). CDS sequence: [...] That protein has been translated using the 3rd reading frame. If you translate it using the first reading frame, you would get [...] Which will obviously not match the given protein. So, matching using translation might be a bit too complex.

This is clearly a bad sequence, a nucleotide CDS sequence should have one codon per amino acid residue and a stop codon, right? I think in this case the code should definetely warn the user and don't use it. There's another reason we prefer to make sure translations match. When they do, you can use clusters of nucleotides and peptides to do molecular phylogenies with DNA and codon-based models, respectively, as we do with the GET_PHYLOMARKERS software.

I'll investigate the code on next Monday or Tuesday but maybe we should consider a different approach: how about just warning the user about detected discrepancies between CDS and protein (different names, different size) without discarding the use of the protein file and add a parameter to get_homologues-est that would enable FASTA reordering based on names for people who don't want to struggle with their FASTA file to have genes ordered properly? Still, there is the problem of corresponding names... in my experience, it's often a story of genes using a "G" and protein using a "P" at a given place in the names (sometimes capital letter, sometimes not). Maybe, we could add a function to "smart-guess" that kind of difference of even change the name matching method to ignore differences between a P and a G... I'm in "brainstorming" mode so anybody is welcome to give other (better) ideas! ;-)

To keep the script simple I would prefer to warn the user and suggest him/her to use a separate to-be-written script, with your help, to sort the nt and aa sequences. In that separate script, which could be named sort_CDS_sequences.pl, we can apply some of those ideas. In order for that to work generally I believe we would have to sort entries in the twin FASTA files by their sequences, while conserving their original names. I think that's the most robust approach. Of course you might find CDS with the same nt/aa sequences, you would have to warn the user about them?

Bruno

guignonv commented 2 years ago

Sounds good. I'll see that next week then. :) [UPDATE]: ...well, the week after I guess.

eead-csic-compbio / get_homologues

Order FASTA sequences and added warnings for CDS-prot mismatches #101