AuReMe / emapper2gbk

Convert GFF, fastas, annotation table and species name into Genbank.
GNU Lesser General Public License v3.0

Usage with eggnog-mapper2 #2

Open Lucas-Maciel opened 3 years ago

Lucas-Maciel commented 3 years ago

Description

Hi, I'm trying to use your tool with my output from eggnog-mapper v2

What I Did

I used your test data and it worked, but not with mine.

emapper2gbk genomic -fg ../Roseburia_inulinivorans_DSM16841/GCF_000174195.1_ASM17419v1_cds_from_genomic.fna -fp ../Roseburia_inulinivorans_DSM16841/GCF_000174195.1_ASM17419v1_protein.faa -o teste.out -a Roseburia_inulinivorans_DSM16841.emapper.annotations 
The default organism name 'cellular organisms' is used.
Formatting fasta and annotation file for GCF_000174195.1_ASM17419v1_genomic
Traceback (most recent call last):
  File "/raeslab/scratch/lucmac/miniconda3/bin/emapper2gbk", line 8, in <module>
    sys.exit(cli())
  File "/raeslab/scratch/lucmac/miniconda3/lib/python3.8/site-packages/emapper2gbk/__main__.py", line 245, in cli
    gbk_creation(genome=args.fastagenome, proteome=args.fastaprot, annot=args.annotation, gff=args.gff, org=orgnames, gbk=args.out, gobasic=args.gobasic, dirmode=directory_mode, cpu=args.cpu, metagenomic_mode=False)
  File "/raeslab/scratch/lucmac/miniconda3/lib/python3.8/site-packages/emapper2gbk/emapper2gbk.py", line 32, in gbk_creation
    fa_to_gbk.main(genome, proteome, annot, org, gbk, gobasic)
  File "/raeslab/scratch/lucmac/miniconda3/lib/python3.8/site-packages/emapper2gbk/fa_to_gbk.py", line 170, in main
    faa_to_gbk(genome_fasta, prot_fasta, annot_table, species_name, gbk_out, gobasic)
  File "/raeslab/scratch/lucmac/miniconda3/lib/python3.8/site-packages/emapper2gbk/fa_to_gbk.py", line 64, in faa_to_gbk
    annotation_data = dict(read_annotation(annotation_data))
  File "/raeslab/scratch/lucmac/miniconda3/lib/python3.8/site-packages/emapper2gbk/utils.py", line 269, in read_annotation
    annotation_data.columns = headers_row
  File "/home/lucmac/.local/lib/python3.8/site-packages/pandas/core/generic.py", line 5475, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 66, in pandas._libs.properties.AxisProperty.__set__
  File "/home/lucmac/.local/lib/python3.8/site-packages/pandas/core/generic.py", line 669, in _set_axis
    self._mgr.set_axis(axis, labels)
  File "/home/lucmac/.local/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 220, in set_axis
    raise ValueError(
ValueError: Length mismatch: Expected axis has 24 elements, new values have 1 elements
# Fri Feb 12 12:56:02 2021
# emapper-2.0.6
# emapper.py -i Roseburia_inulinivorans_DSM16841/GCF_000174195.1_ASM17419v1_protein.faa --cpu 4 --itype proteins -m diamond --output_dir eggnog --output Roseburia_inulinivorans_DSM16841 
#
#query_name     seed_eggNOG_ortholog    seed_ortholog_evalue    seed_ortholog_score     eggNOG OGs   narr_og_name     narr_og_cat     narr_og_desc    best_og_name    best_og_cat     best_og_desc    Preferred_name        GOs     EC      KEGG_ko KEGG_Pathway    KEGG_Module     KEGG_Reaction   KEGG_rclass  BRITE    KEGG_TC CAZy    BiGG_Reaction   PFAMs

cfrioux commented 3 years ago

Hi Lucas, sorry for the delay.

What you describe is likely due to a bug brought by the changes since the latest release.

Unfortunately, we don't have the latest releases (2.0.x) of emapper installed on our servers yet, and it appears that the online version of eggnog-mapper does not have the latest release in production either.

Is there a way for you to run emapper on our test data and share the .emapper.annotations file with us so we could fix the bug?

Lucas-Maciel commented 3 years ago

Hi @cfrioux. Here is the file you asked for:

betaox.emapper.zip

Thank you very much

ArnaudBelcour commented 3 years ago

Hi @Lucas-Maciel,

I have pushed a commit on the genomic_update branch that should fix this issue:

https://github.com/AuReMe/emapper_to_gbk/tree/genomic_update

Can you test it?

kieft1bp-sys commented 3 years ago

Hello, I'm having the same issue. Was this ever resolved?

kieft1bp-sys commented 3 years ago

Actually I'm getting a different error too, seems to be something related to simplejson? I'm uploading my files now with the call and error message in a txt file.

emapper2gbk_test.zip

ArnaudBelcour commented 3 years ago

Hi @kieft1bp-sys,

> Hello, I'm having the same issue. Was this ever resolved?

This issue has been resolved in the genomic_update branch of emapper2gbk.

> Actually I'm getting a different error too, seems to be something related to simplejson? I'm uploading my files now with the call and error message in a txt file.
>
> emapper2gbk_test.zip

Sorry for this error message, I am currently adding a more user-friendly message in the new version.

This error is caused by the argument '-n "AB48"' in your command line. The '-n' argument expects a complete taxon name. For example, you will get the same error if you put '-n "K-12"' instead of '-n "Escherichia coli K-12"'. If you have no genus or species name, you can give a family name instead, for example '-n "Enterobacteriaceae"'.

If you want to check whether your taxon name is correct, you can query this URL (which is the one used by emapper2gbk): https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/

For example with 'Escherichia coli' (and replacing ' ' by '%20'): https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/Escherichia%20coli

If it returns a "No Results" message, either the taxon name is not in the database or there is a typo in your taxon name.
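As a minimal sketch of the check described above, the lookup URL can be built in Python by percent-encoding the taxon name. The function name `ena_taxon_url` is made up for this illustration and is not part of emapper2gbk; only the base URL comes from this thread.

```python
from urllib.parse import quote

# Base URL quoted in this thread (the ENA taxonomy lookup by scientific name).
ENA_SCIENTIFIC_NAME_URL = "https://www.ebi.ac.uk/ena/taxonomy/rest/scientific-name/"


def ena_taxon_url(taxon_name: str) -> str:
    """Return the lookup URL for a taxon name, with spaces encoded as %20.

    Hypothetical helper for illustration; it only builds the URL and does
    not contact the server.
    """
    return ENA_SCIENTIFIC_NAME_URL + quote(taxon_name.strip())


print(ena_taxon_url("Escherichia coli"))
```

Opening the printed URL in a browser (or fetching it with any HTTP client) then shows whether the name is known to the database.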

Also, it seems that you are using a new version of eggnog-mapper (2.1.2) that changes the output format, so it will not work with the current version of emapper2gbk. I have pushed a new commit on the genomic_update branch that should fix this issue.

But I think there will still be an issue: the nucleotide fasta you provided contains gene sequences, not genome (chromosome) sequences, so it will not work with the genome mode. In addition, the GFF file does not have a format compatible with the one expected by emapper2gbk (presented here).
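To illustrate why a change in the output format matters, here is a rough sketch of a reader that takes the column names from the file itself instead of assuming a fixed list. It is not the code used in emapper2gbk. It assumes the header is the comment line starting with "#query" (as in the 2.0.6 file shown earlier in this thread, where it is "#query_name"); whether every eggnog-mapper version follows that convention is an assumption, and `read_emapper_annotations` is a made-up name.

```python
import io


def read_emapper_annotations(handle):
    """Parse an eggnog-mapper annotation table into (header, rows).

    Assumption: the header is the comment line starting with '#query'.
    All other '#' lines are treated as comments and skipped.
    """
    header = None
    rows = []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#query"):
            # Drop the leading '#' and split the column names on tabs.
            header = line.lstrip("#").split("\t")
        elif line.startswith("#"):
            continue
        else:
            if header is None:
                raise ValueError("data line found before any '#query' header line")
            # Map each value to its column name, whatever the column set is.
            rows.append(dict(zip(header, line.split("\t"))))
    if header is None:
        raise ValueError("no '#query' header line found")
    return header, rows


# Tiny made-up table, only to show the call.
sample = "# emapper-2.0.6\n#\n#query_name\tGOs\tEC\nprot1\tGO:0008150\t1.1.1.1\n# 1 queries scanned\n"
header, rows = read_emapper_annotations(io.StringIO(sample))
print(header, rows)
```

Looking columns up by name (for example "GOs" or "EC") rather than by position keeps such a reader independent of how many columns a given version writes.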

If you use the genomic_update branch, you could obtain a genbank file with your "AB48.fna", "AB48.faa" and "AB48.emapper.annotations.tsv" by using the command:

emapper2gbk genes -fn AB48.fna -fp AB48.faa -a AB48.emapper.annotations.tsv -o AB48.gbk -n "Taxon name"

kieft1bp-sys commented 3 years ago

Thanks for the extensive answer! I'll try out your suggestions today.

kieft1bp-sys commented 3 years ago

I tried using the last command you suggested after installing the new branch and the program runs fine but does not bring in any annotations from the emapper annotations file (see attached .gbk).

AB48.gbk.txt

kieft1bp-sys commented 3 years ago

Also, I modified my .gff file according to the format you linked to (see attached .gff) and tried running in "genomes" mode with my correct genome assembly .fna file (see attached .fna). It ran fine but produced an odd-looking gbk file (attached .gbk), so maybe my reformatting didn't help. (adding .txt to all file extensions because github needs it).

AB48_genomes_mode.gbk.txt AB48_genome.fna.txt AB48_updated.gff.txt

ArnaudBelcour commented 3 years ago

> I tried using the last command you suggested after installing the new branch and the program runs fine but does not bring in any annotations from the emapper annotations file (see attached .gbk).
>
> AB48.gbk.txt

emapper2gbk only extracts GO terms, EC numbers and gene names from the eggnog-mapper file. Genes without any of these annotations will not be annotated in the genbank file. For example, the first 3 genes in the genbank file are not annotated because they have no GO terms, EC numbers or gene name in the eggnog-mapper annotation file.

But if you move down in the file, you can see that the gene "contig_5_1000" is annotated. Or you can search in the file for "go_component", "gene", "go_function", "go_process" or "EC_number" to find annotations from eggnog-mapper.
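The rule above can be sketched as a small filter over annotation rows. This is an illustration only, not the emapper2gbk implementation: the column names "GOs", "EC" and "Preferred_name" are taken from the header shown earlier in this thread, treating "-" or an empty string as "no value" is an assumption about the file, and the gene IDs and values below are invented.

```python
# Columns that, per the explanation above, carry the extracted annotations.
ANNOTATION_COLUMNS = ("GOs", "EC", "Preferred_name")
# Assumption: a missing value is written as an empty field or as "-".
EMPTY_VALUES = {"", "-"}


def has_extractable_annotation(row):
    """Return True if the row has a GO term, an EC number or a gene name."""
    return any(
        row.get(column, "").strip() not in EMPTY_VALUES
        for column in ANNOTATION_COLUMNS
    )


# Invented rows, only to show the behaviour.
rows = [
    {"query_name": "gene_a", "GOs": "-", "EC": "-", "Preferred_name": "-"},
    {"query_name": "gene_b", "GOs": "GO:0005575", "EC": "-", "Preferred_name": "-"},
    {"query_name": "gene_c", "GOs": "", "EC": "", "Preferred_name": "dnaA"},
]
annotated = [row["query_name"] for row in rows if has_extractable_annotation(row)]
print(annotated)
```

Running such a filter on the real annotation table gives a quick idea of how many genes can be expected to carry annotations in the resulting genbank file.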

> Also, I modified my .gff file according to the format you linked to (see attached .gff) and tried running in "genomes" mode with my correct genome assembly .fna file (see attached .fna). It ran fine but produced an odd-looking gbk file (attached .gbk), so maybe my reformatting didn't help. (adding .txt to all file extensions because github needs it).
>
> AB48_genomes_mode.gbk.txt AB48_genome.fna.txt AB48_updated.gff.txt

In this genbank file, there are no annotations and no protein sequences associated with the genes. I think this is because, after you updated the GFF file, the CDS IDs no longer match the IDs in the "AB48.fna" and "AB48.emapper.annotations.tsv" files. For example, in the GFF file "cds-contig_5_1" is the CDS ID for "contig_5_1", so emapper2gbk will search for the ID "cds-contig_5_1" in "AB48.fna" and "AB48.emapper.annotations.tsv". It will not find it, because in these files the gene is still labelled "contig_5_1".

Updating both "AB48.fna" and "AB48.emapper.annotations.tsv" with the "cds-contig" ID should fix this.
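One way to spot this kind of mismatch before running the tool is to compare the ID sets directly. The sketch below is a hypothetical check, not something provided by emapper2gbk: it assumes a tab-separated GFF with "CDS" in the third column and an "ID=" attribute in the ninth, and fasta headers whose first word is the sequence ID. The example lines are made up.

```python
def gff_cds_ids(gff_lines):
    """Collect the ID attribute of every CDS feature in a GFF file."""
    ids = set()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9 or columns[2] != "CDS":
            continue
        # Attributes look like "ID=cds-contig_5_1;Name=..."
        for attribute in columns[8].split(";"):
            key, _, value = attribute.partition("=")
            if key.strip() == "ID":
                ids.add(value.strip())
    return ids


def fasta_ids(fasta_lines):
    """Collect the first word of every fasta header line."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and len(line.strip()) > 1
    }


# Made-up lines reproducing the mismatch described above.
gff = [
    "##gff-version 3\n",
    "contig_5\texample\tCDS\t1\t300\t.\t+\t0\tID=cds-contig_5_1\n",
]
fasta = [">contig_5_1 hypothetical description\n", "ATGAAATAA\n"]

missing = gff_cds_ids(gff) - fasta_ids(fasta)
print(sorted(missing))
```

An empty result would mean every CDS ID in the GFF has a matching fasta record; the same comparison can be made against the first column of the annotation table.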