Closed guignonv closed 2 years ago
Hi @guignonv , thanks for reporting this. We did warn about this in the GET_HOMOLOGUES manual:
if each input .faa file has a twin .fna file in place, containing the corresponding
nucleotide sequences in the same order, the program will attempt to produce the
corresponding clusters of nucleotide sequences
However, we failed to explain that in the GET_HOMOLOGUES-EST manual, thanks for the catch. I think we can:
I'm about to produce a PR for that. I'm just running tests at the moment. ;)
I am about to add the code that checks the order
Please see my comments at https://github.com/eead-csic-compbio/get_homologues/pull/101
When using .faa and .fna genome files that don't share the same order of genes, get_homologues-est will generate .faa cluster files with the correct protein names but with incorrect protein content. The documentation ("3.1 Input data") states that .fna files must have twin .faa files but it does not state that their gene order must be the same. Furthermore, no check is made on the gene order.
Surprisingly, it is quite common to have different gene order in .faa and .fna for a genome. So, when get_homologues-est is run on such a dataset, it will parse .fna and .faa files and assume that genes in .faa follow the order of the ones in the .fna. When it will generate protein cluster files, it will therefore use the gene names of the .fna files but the sequences of the .faa files that have the same position of the .fna gene, even if their names don't match.
A simple fix could be to sort genes by name when FASTA files are read. Therefore, if names are identical, they will be sorted in the same order in both files.