Open alsmadin01 opened 2 years ago
Hi @alsmadin01,
I think the issue comes from the empty lines in the metagenome-name-final.tsv. If I understand your screenshots correctly, this file is as follows:
MGYG000000002 | Blautia faecis |

MGYG000000003 | Alistipes shahii |
When emapper2gbk tries to read the file, it encounters the empty line between MGYG000000002 and MGYG000000003. It tries to parse this line, expecting 2 tab-separated columns. But because the line is empty, the parsing fails and returns the error you encountered.
Removing the empty lines, as in this example, might fix the issue:
MGYG000000002 | Blautia faecis |
MGYG000000003 | Alistipes shahii |
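With ~1200 rows, the blank lines are easy to miss by eye, so a small script can help. The following is a minimal sketch, not part of emapper2gbk: the function name `drop_blank_lines` and the output file name are assumptions made for illustration. It copies the TSV while skipping empty or whitespace-only lines and reports which line numbers were dropped.

```python
from pathlib import Path


def drop_blank_lines(src, dst):
    """Copy a TSV from src to dst, skipping empty or whitespace-only lines.

    Returns the 1-based line numbers of the lines that were dropped.
    """
    dropped = []
    kept = []
    with open(src, encoding="utf-8", newline="") as handle:
        for number, line in enumerate(handle, start=1):
            if line.strip():
                # Keep the row, normalising its line ending.
                kept.append(line.rstrip("\r\n"))
            else:
                dropped.append(number)
    Path(dst).write_text("".join(row + "\n" for row in kept), encoding="utf-8")
    return dropped


# Example call (the output file name is hypothetical):
# drop_blank_lines("metagenome-name-final.tsv", "metagenome-name-clean.tsv")
```

Writing to a new file rather than overwriting the original keeps the Excel export available for comparison.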
Best Regards, Arnaud Belcour.
@ArnaudBelcour Yes, I actually figured that out yesterday and it worked. I created the table in Excel and converted it to tsv, and I guess that left blank lines in between that I hadn’t noticed or taken into consideration. Thank you!
Hi @ArnaudBelcour,
As a follow-up, can I run emapper2gbk for species not present in the EBI catalog like "UBA737 sp900554525" or "RC9"? I have gff, EggNOG annotation, fasta, and faa files for each. I also wanted to ask what kind of taxonomic information is extracted when a species name is provided (like in a .tsv in my case).
Hi @alsmadin01,
When there is no match on the EBI (which uses the NCBI Taxonomy classification), there are often 2 possibilities:
- The corresponding species is not present in the database. When I encounter this issue, I use a higher taxonomic rank in the affiliation. As an artificial example, take the lineage cellular organisms; Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales; Enterobacteriaceae; Escherichia; Escherichia coli. If Escherichia coli is not found, I will use a higher taxonomic rank such as the genus Escherichia or the family Enterobacteriaceae.
- The species name comes from a different classification (for example GBIF). Then you have to search for a mapping between this classification and the classification from the NCBI Taxonomy. For example, for "UBA737 sp900554525", I took the family associated with this species, Acutalibacteraceae, and did a second search with it. I found an article saying that Acutalibacteraceae corresponds to Ruminococcaceae (in the NCBI Taxonomy database), and after some more searching I found a Ruminococcaceae bacterium UBA737. But this method is quite time-consuming. Another way to handle it is the same approach as for the previous case: use a higher taxonomic rank that is present in the EBI. For UBA737 sp900554525, it could have been Clostridia.
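The "use a higher taxonomic rank" fallback can be sketched in a few lines. This is only an illustration of the idea, assuming you already have the lineage as a list and some way to know which names the taxonomy lookup recognises; `first_known_rank` and `known_names` are hypothetical names, not part of emapper2gbk.

```python
def first_known_rank(lineage, known_names):
    """Return the most specific name in lineage that is in known_names.

    lineage is ordered from the broadest rank down to the species.
    Returns None when no name matches.
    """
    for name in reversed(lineage):
        if name in known_names:
            return name
    return None


lineage = [
    "Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales",
    "Enterobacteriaceae", "Escherichia", "Escherichia coli",
]
# Hypothetical set of names recognised by the taxonomy lookup.
known_names = {"Bacteria", "Enterobacteriaceae", "Escherichia"}
print(first_known_rank(lineage, known_names))  # prints "Escherichia"
```

Walking the lineage from the species upwards keeps the most specific rank that still matches, so as little taxonomic information as possible is lost.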
Using the species names, 3 pieces of information will be extracted and used (I use a GenBank file as an example for the output):
- the organism name: Escherichia coli
- the taxonomic lineage: Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacterales; Enterobacteriaceae; Escherichia.
- the taxonomy ID: 562, it is the Taxonomy ID in this page.

These data are extracted because they will be used by Pathway Tools during the draft metabolic network reconstruction. Pathway Tools uses them to place the organism inside the taxonomic hierarchy. This affects the taxonomic pruning of the metabolic pathways: each pathway is associated with a range of taxa, and the taxonomic information is used to decide whether to keep a pathway for the organism.
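To make the idea of taxonomic pruning concrete, here is a toy sketch. It is not Pathway Tools's actual rule, which I have not reproduced here; it only assumes that a pathway is kept when at least one taxon in its expected range appears in the organism's lineage. The function name and the two pathway entries are hypothetical.

```python
def keep_pathway(organism_lineage, pathway_taxa):
    """Keep a pathway when its expected taxa overlap the organism's lineage."""
    return not set(pathway_taxa).isdisjoint(organism_lineage)


organism_lineage = [
    "Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales",
    "Enterobacteriaceae", "Escherichia",
]

# Hypothetical pathways with the taxa they are expected in.
pathways = {
    "pathway A": {"Bacteria", "Archaea"},
    "pathway B": {"Viridiplantae"},
}

kept = [name for name, taxa in pathways.items()
        if keep_pathway(organism_lineage, taxa)]
print(kept)  # prints ['pathway A']
```

This is why a wrong or overly broad affiliation matters: the lineage given to the organism changes which pathways survive the pruning.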
Description
I am using M2M, and I am trying to make gbk files using emapper2gbk for ~1200 bacterial species, for use as input in m2m recon. I have created separate folders with .faa, .fna, .gff, and eggNOG files (in .tsv), and have also made a .tsv file with the genome ID in column 1 and the bacterial name in column 2 for -nf. However, when I try to run emapper2gbk in genomes mode for all of my ~1200 bacterial species, I get the error attached below. Please note that I have tried running emapper2gbk genomes for each bacterium separately and it worked; the issue seems to occur only when I run it in bulk.
What I Did
Here is what the organism name .tsv file looks like: