emapper2gbk genomes issue

alsmadin01 commented 2 years ago

eggnog2gbk version:
Python version:
Operating System:

Description

I am using M2M and trying to build .gbk files with emapper2gbk for ~1200 bacterial species, to use as input for m2m recon. I have created separate folders with the .faa, .fna, .gff, and eggNOG annotation (.tsv) files, and have also made a .tsv file with the genome ID in column 1 and the bacterial name in column 2 for -nf. However, when I run emapper2gbk in genomes mode on all ~1200 species, I get the error attached below. Note that running emapper2gbk genomes on each bacterium separately worked; the issue only seems to appear when I run it in bulk.

What I Did

Here is what the organism name .tsv file looks like:

ArnaudBelcour commented 2 years ago

Hi @alsmadin01,

I think the issue comes from the empty lines in metagenome-name-final.tsv. If I understand your screenshots correctly, the file looks like this:


MGYG000000002	Blautia faecis

MGYG000000003	Alistipes shahii

When emapper2gbk reads the file, it encounters the empty line between MGYG000000002 and MGYG000000003. It tries to parse this line, expecting two tab-separated columns. Because the line is empty, parsing fails and returns the error you encountered.

Removing the empty lines, as in this example, may fix the issue:

MGYG000000002	Blautia faecis
MGYG000000003	Alistipes shahii
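If the file was exported from a spreadsheet, a small script can strip the blank lines before running emapper2gbk. The following is a minimal sketch and not part of emapper2gbk: the function name `clean_name_file` is a placeholder, and it assumes one tab-separated pair (genome ID, organism name) per line.

```python
def clean_name_file(src, dst):
    """Copy a two-column TSV, dropping blank lines and collecting malformed ones.

    Hypothetical helper, not part of emapper2gbk.
    Returns (kept, malformed): the rows written to `dst`, and the
    (line number, text) pairs that did not have exactly two non-empty columns.
    """
    kept, malformed = [], []
    with open(src, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            text = line.rstrip("\r\n")
            cells = [cell.strip() for cell in text.split("\t")]
            # Spreadsheet exports may leave empty lines or lines of empty cells.
            if not any(cells):
                continue
            # Anything that is not "ID<TAB>name" is reported rather than fixed.
            if len(cells) != 2 or not all(cells):
                malformed.append((line_number, text))
                continue
            kept.append(tuple(cells))
    with open(dst, "w", encoding="utf-8", newline="\n") as handle:
        for genome_id, name in kept:
            handle.write(f"{genome_id}\t{name}\n")
    return kept, malformed
```

For instance, `clean_name_file("metagenome-name-final.tsv", "metagenome-name-clean.tsv")` would write a cleaned copy (the second file name is only an example) that can then be passed to -nf. Any entries in `malformed` are worth checking by hand.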

Best Regards, Arnaud Belcour.

alsmadin01 commented 2 years ago

@ArnaudBelcour Yes, I actually figured that out yesterday and it worked. I created the table in Excel and converted it to .tsv, and I guess that left blank lines in between that I hadn't noticed or taken into consideration. Thank you!

alsmadin01 commented 2 years ago

Hi @ArnaudBelcour,

As a follow-up, can I run emapper2gbk for species not present in the EBI catalog, like "UBA737 sp900554525" or "RC9"? I have GFF, eggNOG annotation, FASTA, and .faa files for each. I also wanted to ask what kind of taxonomic information is extracted when a species name is provided (in a .tsv file, in my case).

ArnaudBelcour commented 2 years ago

Hi @alsmadin01,

When there is no match on the EBI (which uses the NCBI Taxonomy classification), there are usually two possibilities:

The corresponding species is not present in the database. When I encounter this, I use a higher taxonomic rank in the affiliation. As an artificial example, take the lineage cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales; Enterobacteriaceae; Escherichia; Escherichia coli. If Escherichia coli is not found, I use a higher rank such as the genus Escherichia or the family Enterobacteriaceae.
The species name comes from a different classification (for example GBIF). In that case you have to find a mapping between that classification and the NCBI Taxonomy. For example, for "UBA737 sp900554525", I took the family associated with this species, Acutalibacteraceae, and searched again with it. I found an article saying that Acutalibacteraceae corresponds to Ruminococcaceae in the NCBI Taxonomy database, and after some more searching I found a Ruminococcaceae bacterium UBA737. This method is quite time-consuming, though. The alternative is the same approach as in the previous case: use a higher taxonomic rank that is present in the EBI. For UBA737 sp900554525 this could have been Clostridia.
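The fallback to a higher rank can be sketched as a walk up the lineage, starting from the most specific name. This is only an illustration: `is_known` stands in for whatever lookup is available (for example a manual check against the EBI taxonomy), and neither it nor `first_known_rank` is part of emapper2gbk.

```python
def first_known_rank(lineage, is_known):
    """Return the most specific name in `lineage` accepted by `is_known`.

    `lineage` is ordered from the broadest rank to the most specific one.
    `is_known` is a hypothetical callable that returns True when a name
    exists in the reference taxonomy. Returns None if nothing matches.
    """
    for name in reversed(lineage):
        if is_known(name):
            return name
    return None


# Illustrative stand-in for a real taxonomy lookup: a fixed set of names
# in which the species itself is missing.
known_names = {
    "Bacteria", "Proteobacteria", "Gammaproteobacteria",
    "Enterobacterales", "Enterobacteriaceae", "Escherichia",
}

lineage_text = (
    "Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales; "
    "Enterobacteriaceae; Escherichia; Escherichia coli"
)
names = [part.strip() for part in lineage_text.split(";")]
print(first_known_rank(names, known_names.__contains__))  # Escherichia
```

The name returned this way is what would go in the second column of the organism name file in place of the unmatched species name.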

From the species name, three pieces of information are extracted and used (I use a GenBank file as an example of the output):

Organism name: Escherichia coli
"Taxonomy": Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales; Enterobacteriaceae; Escherichia.
Taxon ID in the NCBI Taxonomy: 562. It is the Taxonomy ID on this page.

These data are extracted because Pathway Tools uses them during the draft metabolic network reconstruction to place the organism in the taxonomic hierarchy. This affects the taxonomic pruning of metabolic pathways: each pathway is associated with a range of taxa, and the organism's taxonomic information determines whether a pathway is kept.
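As a rough illustration of the pruning idea (not Pathway Tools' actual algorithm, which is not described in this thread), a pathway can be kept when its expected taxonomic range contains at least one taxon from the organism's lineage. The pathway names and ranges below are invented for the example, and `keep_pathway` is a hypothetical helper.

```python
def keep_pathway(organism_lineage, pathway_taxonomic_range):
    """Return True if any taxon of the lineage is in the pathway's range."""
    return not set(organism_lineage).isdisjoint(pathway_taxonomic_range)


organism_lineage = [
    "Bacteria", "Proteobacteria", "Gammaproteobacteria",
    "Enterobacterales", "Enterobacteriaceae", "Escherichia",
]

# Invented pathways, each with the set of taxa it is expected in.
pathways = {
    "hypothetical pathway A": {"Bacteria"},
    "hypothetical pathway B": {"Viridiplantae"},
    "hypothetical pathway C": {"Enterobacteriaceae", "Archaea"},
}

kept = [name for name, taxa in pathways.items()
        if keep_pathway(organism_lineage, taxa)]
print(kept)  # ['hypothetical pathway A', 'hypothetical pathway C']
```

This also shows why a higher rank is a usable fallback: a broader lineage still places the organism in the hierarchy, although a less specific placement may keep or drop different pathways than the exact species would.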

AuReMe / emapper2gbk

emapper2gbk genomes issue #12
