Open najwataib opened 1 year ago
Hi! I encountered the same problem and found a workaround: use the translated_cds file from NCBI. Be aware that it contains pseudogenes. If you want to remove them, you can use this command:
awk '/^>/ {if (skip) skip=0; if (/pseudo=true/) skip=1} !skip' input_genome > genome_without_pseudo
Hi Najwa and Jérôme,
Reordering the FASTA files is actually a feature that we may add in a future version of MacSyFinder. However, it would mean (optionally) requiring more input files than just the FASTA files (e.g. GFF files). Stay tuned!
Best,
Sophie
Is your feature request related to a problem? Please describe. MSF relies on the order of the protein sequences in the faa files to identify systems. However, the protein files downloaded from NCBI are sometimes not ordered according to the positions of the genes on the genome. As a result, MSF misses some systems.
Describe the solution you'd like One solution would be to use GFF files to retrieve the order of the proteins on the genome, and then either re-order the faa files or use this information directly when processing the hits found by the HMM searches.
Describe alternatives you've considered Currently, I systematically re-order all the faa files I download from NCBI.
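For illustration, the GFF-based re-ordering described above could be sketched as follows. This is not MacSyFinder code, and the function names (`gff_protein_order`, `parse_fasta`, `reorder_fasta`) are hypothetical. The sketch assumes that CDS features in the GFF carry a `protein_id=` attribute (as NCBI GFF3 files generally do) and that the first whitespace-delimited token of each FASTA header is that same protein ID. If a protein ID occurs at several loci, only its first position is kept, and proteins absent from the GFF are appended at the end in their original order.

```python
from typing import Dict, List, Tuple


def gff_protein_order(gff_text: str) -> List[str]:
    """Return protein IDs sorted by replicon (order of first appearance) and CDS start."""
    seqid_rank: Dict[str, int] = {}
    first_pos: Dict[str, Tuple[int, int]] = {}
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        # Column 9 holds ';'-separated key=value attributes.
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        protein_id = attrs.get("protein_id")  # assumption: attribute is present
        if protein_id is None:
            continue
        rank = seqid_rank.setdefault(cols[0], len(seqid_rank))
        key = (rank, int(cols[3]))
        # Keep the earliest position if the same ID appears more than once.
        if protein_id not in first_pos or key < first_pos[protein_id]:
            first_pos[protein_id] = key
    return sorted(first_pos, key=first_pos.__getitem__)


def parse_fasta(fasta_text: str) -> Dict[str, str]:
    """Map each protein ID (first token of the header) to its full FASTA record."""
    records: Dict[str, List[str]] = {}
    current = None
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            records[current] = [line]
        elif current is not None and line.strip():
            records[current].append(line)
    return {pid: "\n".join(lines) for pid, lines in records.items()}


def reorder_fasta(fasta_text: str, gff_text: str) -> str:
    """Return the FASTA content with records sorted by their position in the GFF."""
    records = parse_fasta(fasta_text)
    ordered = [pid for pid in gff_protein_order(gff_text) if pid in records]
    seen = set(ordered)
    leftovers = [pid for pid in records if pid not in seen]
    return "\n".join(records[pid] for pid in ordered + leftovers) + "\n"


if __name__ == "__main__":
    # Toy data: P2 is listed first in both files but lies downstream of P1.
    gff = "\n".join([
        "##gff-version 3",
        "chr1\tRefSeq\tCDS\t900\t1200\t.\t+\t0\tID=cds-P2;protein_id=P2",
        "chr1\tRefSeq\tCDS\t100\t400\t.\t-\t0\tID=cds-P1;protein_id=P1",
    ])
    faa = ">P2 second protein\nMKV\n>P1 first protein\nMAA\n"
    print(reorder_fasta(faa, gff), end="")
```

Headers in other NCBI files (for example translated_cds) are formatted differently, so the ID extraction in `parse_fasta` would need to be adapted for them.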
Please complete the following information:
OS:
MacSyFinder Version: macsyfinder2.1