eead-csic-compbio / get_homologues

GET_HOMOLOGUES: a versatile software package for pan-genome analysis
Other
110 stars 26 forks source link

Invalid protein cluster files generated by get_homologues-est when .faa and .fna have genes in different order. #100

Closed guignonv closed 2 years ago

guignonv commented 2 years ago

When using .faa and .fna genome files that don't share the same order of genes, get_homologues-est will generate .faa cluster files with the correct protein names but with incorrect protein content. The documentation ("3.1 Input data") states that .fna files must have twin .faa files but it does not state that their gene order must be the same. Furthermore, no check is made on the gene order.

Surprisingly, it is quite common to have different gene order in .faa and .fna for a genome. So, when get_homologues-est is run on such a dataset, it will parse .fna and .faa files and assume that genes in .faa follow the order of the ones in the .fna. When it will generate protein cluster files, it will therefore use the gene names of the .fna files but the sequences of the .faa files that have the same position of the .fna gene, even if their names don't match.

A simple fix could be to sort genes by name when FASTA files are read. Therefore, if names are identical, they will be sorted in the same order in both files.

eead-csic-compbio commented 2 years ago

Hi @guignonv , thanks for reporting this. We did warn about this in the GET_HOMOLOGUES manual:

if each input .faa file has a twin .fna file in place, containing the corresponding  
nucleotide sequences in the same order, the program will attempt to produce the 
corresponding clusters of nucleotide sequences

However, we failed to explain that in the GET_HOMOLOGUES-EST manual, thanks for the catch. I think we can:

  1. add a check to the script to make sure CDS sequences are in the same order in .fna & .faa files
  2. explain this properly in the manual
  3. optionally, when we have the time, we could think of a way of sorting the sequences; I guess we never did it as you cannot trust the sequence name will be the same in both files, right?
guignonv commented 2 years ago

I'm about to produce a PR for that. I'm just running tests at the moment. ;)

eead-csic-compbio commented 2 years ago

I am about to add the code that checks the order

eead-csic-compbio commented 2 years ago

Please see my comments at https://github.com/eead-csic-compbio/get_homologues/pull/101