gem-pasteur / macsyfinder

MacSyFinder - Detection of macromolecular systems in protein datasets using systems modelling and similarity search.
GNU General Public License v3.0

FAA files do not always reflect the real order of the genes #64

Open najwataib opened 1 year ago

najwataib commented 1 year ago

Is your feature request related to a problem? Please describe. MSF relies on the order of the protein sequences in the FAA files to identify systems. However, the protein files downloaded from NCBI are sometimes not ordered according to the positions of the genes on the genome. As a result, MSF misses some systems.

Describe the solution you'd like One solution would be to use GFF files to retrieve the order of the proteins on the genome, and then either re-order the FAA files or use this information directly when processing the hits found by the HMM searches.
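To make the re-ordering idea concrete, here is a minimal sketch of how an FAA file could be re-ordered from a GFF file. This is not MacSyFinder code. It assumes a GFF3 file whose CDS rows carry a `protein_id=` attribute matching the first word of each FASTA header (as NCBI annotations usually do); the function names `read_fasta`, `gff_protein_order` and `reorder_fasta` are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def read_fasta(lines: Iterable[str]) -> Dict[str, Tuple[str, List[str]]]:
    """Map each sequence ID (first word of the header) to (header, sequence lines)."""
    records: Dict[str, Tuple[str, List[str]]] = {}
    current: List[str] = []
    seen_header = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            header = line[1:]
            current = []
            records[header.split()[0]] = (header, current)
            seen_header = True
        elif seen_header and line:
            current.append(line)
    return records


def gff_protein_order(lines: Iterable[str]) -> List[str]:
    """Return protein IDs sorted by contig (order of first appearance) and CDS start."""
    contig_rank: Dict[str, int] = {}
    position: Dict[str, Tuple[int, int]] = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        pid = attrs.get("protein_id")  # assumption: NCBI-style attribute name
        if pid is None:
            continue
        rank = contig_rank.setdefault(cols[0], len(contig_rank))
        key = (rank, int(cols[3]))
        # A CDS split over several rows keeps its smallest coordinate.
        if pid not in position or key < position[pid]:
            position[pid] = key
    return sorted(position, key=position.get)


def reorder_fasta(fasta_lines: Iterable[str], gff_lines: Iterable[str]) -> str:
    """Return the FASTA text re-ordered by genomic position.

    Proteins that have no CDS entry in the GFF are dropped.
    """
    records = read_fasta(fasta_lines)
    out: List[str] = []
    for pid in gff_protein_order(gff_lines):
        if pid in records:
            header, seq = records[pid]
            out.append(">" + header)
            out.extend(seq)
    return "\n".join(out) + "\n"
```

It could be used by opening both files and writing the result of `reorder_fasta(faa_handle, gff_handle)` to a new FAA file. Two caveats of this sketch: a protein ID that is annotated at several loci (which can happen with RefSeq WP_ accessions) is placed only at its first position, and contigs are kept in the order in which they first appear in the GFF.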

Describe alternatives you've considered Currently, I systematically re-order all the FAA files I download from NCBI.

Please complete the following information:

OS:

MacSyFinder Version: macsyfinder2.1

jpjarnoux commented 2 months ago

Hi! I ran into the same problem and found a solution: using the translated_cds file from NCBI. Be careful, it contains pseudogenes. If you want to remove them, you can use this line:

awk '/^>/ {if (skip) skip=0; if (/pseudo=true/) skip=1} !skip' input_genome > genome_without_pseudo

saphia commented 1 month ago

Hi Najwa and Jérôme,

Reordering the FASTA files is actually a feature that we may add in a future version of MacSyFinder. But this would mean we would have to (optionally) require more input files than just the FASTA files (e.g. GFF files). Stay tuned!

Best,

Sophie