AuReMe / emapper2gbk

Convert GFF, fastas, annotation table and species name into Genbank.
GNU Lesser General Public License v3.0

emapper2gbk genomes issue #12

Open alsmadin01 opened 2 years ago

alsmadin01 commented 2 years ago

Description

I am using M2M and trying to make gbk files with emapper2gbk for ~1200 bacterial species, to use as input for m2m recon. I have created separate folders with the .faa, .fna, .gff, and eggNOG (.tsv) files, and have also made a .tsv file with the genome ID in column 1 and the bacterial name in column 2 for -nf. However, when I run emapper2gbk in genomes mode on all ~1200 bacterial species, I get the error attached below. Note that running emapper2gbk genomes on each bacterium separately worked; the issue only seems to appear when I run it in bulk.

What I Did

Screen Shot 2022-03-31 at 1 41 05 PM

Here is what the organism name .tsv file looks like:

Screen Shot 2022-03-31 at 2 18 10 PM
ArnaudBelcour commented 2 years ago

Hi @alsmadin01,

I think the issue comes from the empty lines in metagenome-name-final.tsv. If I understand your screenshots correctly, this file looks as follows:

MGYG000000002 Blautia faecis

MGYG000000003 Alistipes shahii

When emapper2gbk reads the file, it encounters the empty line between MGYG000000002 and MGYG000000003. It tries to parse this line, expecting 2 tab-separated columns. Because the line is empty, parsing fails and returns the error you encountered.
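To illustrate the failure mode, here is a minimal sketch of a naive two-column parser. It is not emapper2gbk's actual code: the function name `read_names_strict` and the sample content are assumptions made for the example, and the real error message may differ.

```python
import io

# Hypothetical content mirroring the file above: two entries separated by an empty line.
content = "MGYG000000002\tBlautia faecis\n\nMGYG000000003\tAlistipes shahii\n"

def read_names_strict(handle):
    """Naive parser (illustrative only) assuming 2 tab-separated columns per line."""
    names = {}
    for line in handle:
        # An empty line splits into a single empty field, so unpacking into 2 names fails.
        genome_id, species = line.rstrip("\n").split("\t")
        names[genome_id] = species
    return names

try:
    read_names_strict(io.StringIO(content))
    error = None
except ValueError as exc:
    error = exc  # raised when the empty line is reached
```

The first entry parses fine; the parser only breaks on the second line, which is why a single genome run can succeed while a bulk run fails.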

Removing the empty lines, as in this example, could fix the issue:

MGYG000000002 Blautia faecis
MGYG000000003 Alistipes shahii
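For a file with ~1200 entries, removing the empty lines by hand is tedious. The following is a small sketch of how it could be scripted; the helper names `drop_empty_lines` and `clean_tsv` are made up for this example and are not part of emapper2gbk.

```python
from pathlib import Path

def drop_empty_lines(text):
    """Return text without lines that are empty or contain only whitespace."""
    kept = [line for line in text.splitlines() if line.strip()]
    return "\n".join(kept) + "\n" if kept else ""

def clean_tsv(src, dst):
    """Write a copy of src to dst with the empty lines removed."""
    cleaned_text = drop_empty_lines(Path(src).read_text(encoding="utf-8"))
    Path(dst).write_text(cleaned_text, encoding="utf-8")

# Sample content with an empty line between the two entries.
raw = "MGYG000000002\tBlautia faecis\n\nMGYG000000003\tAlistipes shahii\n"
cleaned = drop_empty_lines(raw)
```

It could then be used as `clean_tsv("metagenome-name-final.tsv", "metagenome-name-final.clean.tsv")`, where the output file name is only a suggestion. Lines containing only tabs or spaces are dropped as well.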

Best Regards, Arnaud Belcour.

alsmadin01 commented 2 years ago

@ArnaudBelcour Yes, I actually figured that out yesterday and it worked. I created the table in Excel and converted it to tsv, and I guess that left blank lines in between that I hadn't noticed or taken into consideration. Thank you!

alsmadin01 commented 2 years ago

Hi @ArnaudBelcour,

As a follow-up, can I run emapper2gbk for species not present in the EBI catalog, like "UBA737 sp900554525" or "RC9"? I have gff, eggNOG annotation, fasta, and faa files for each. I also wanted to ask what kind of taxonomic information is extracted when a species name is provided (as in the .tsv in my case).

ArnaudBelcour commented 2 years ago

Hi @alsmadin01,

When there is no match on the EBI (which uses the NCBI Taxonomy classification), there are often 2 possibilities:

Using the species names, 3 pieces of information will be extracted and used (I use a GenBank file as an example for the output):

These data are extracted because Pathway Tools uses them during the draft metabolic network reconstruction to place the organism inside the taxonomic hierarchy. This affects the taxonomic pruning of the metabolic pathways: each pathway is associated with a range of taxa, and the organism's taxonomic information is used to decide whether or not to keep a pathway.
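As a rough illustration of the idea behind taxonomic pruning, here is a toy sketch. It is not how Pathway Tools implements it: the lineage, the pathway names, and the taxonomic ranges are all invented for the example, and the real procedure takes more evidence into account.

```python
# Toy data, invented for illustration only.
organism_lineage = {"Bacteria", "Firmicutes", "Clostridia", "Blautia"}

# Each pathway lists the taxa in which it is expected to occur (hypothetical values).
pathway_taxonomic_range = {
    "pathway_A": {"Bacteria"},
    "pathway_B": {"Viridiplantae"},
    "pathway_C": {"Firmicutes", "Proteobacteria"},
}

def prune_pathways(lineage, pathway_ranges):
    """Keep pathways whose taxonomic range shares at least one taxon with the lineage."""
    return sorted(name for name, taxa in pathway_ranges.items() if taxa & lineage)

kept = prune_pathways(organism_lineage, pathway_taxonomic_range)
```

In this toy case the plant-only pathway is dropped while the two pathways whose range overlaps the organism's lineage are kept, which is why an accurate species name matters for the reconstruction.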