AuReMe / emapper2gbk

Convert GFF, fastas, annotation table and species name into Genbank.
GNU Lesser General Public License v3.0

issue KeyError for gbk generation #7

Closed. boaty closed this issue 2 years ago

boaty commented 2 years ago

Description

Hello,

We encountered this KeyError while running emapper2gbk. It seems to be caused by an unmatched column in the input annotation file from emapper. We then tested with the GitHub example files (betbox fna, faa and annotation), but the problem persisted.

We used emapper version 2.1.6 with the default output format 6 (--outfmt 6).

We also noticed that the online go-basic.obo file has a missing ":".

Thanks a lot.

What I Did

emapper2gbk genes -fn nucleotide_sequence/ -fp protein_sequence/  -a annotation/ -o gbk/  -go /data/eggnog-mapper_database/eggnog-mapper/data/go-basic.obo 
The default organism name 'metagenome' is used.
Assembling Genbank informations for MAG001
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/anaconda3/envs/m2m/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/anaconda3/envs/m2m/lib/python3.9/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/home/anaconda3/envs/m2m/lib/python3.9/site-packages/emapper2gbk/genes_to_gbk.py", line 103, in faa_to_gbk
    create_genbank(gene_nucleic_seqs, gene_protein_seqs, annot, go_namespaces, go_alternatives, output_path, species_informations)
  File "/home/anaconda3/envs/m2m/lib/python3.9/site-packages/emapper2gbk/genes_to_gbk.py", line 127, in create_genbank
    record = record_info(gene_nucleic_id, gene_nucleic_seqs[gene_nucleic_id], species_informations)
  File "/home/anaconda3/envs/m2m/lib/python3.9/site-packages/emapper2gbk/utils.py", line 298, in record_info
    description=species_informations['description'],
KeyError: 'description'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/anaconda3/envs/m2m/bin/emapper2gbk", line 8, in <module>
    sys.exit(cli())
  File "/home/anaconda3/envs/m2m/lib/python3.9/site-packages/emapper2gbk/__main__.py", line 309, in cli
    gbk_creation(nucleic_fasta=args.fastanucleic, protein_fasta=args.fastaprot, annot=args.annotation, org=orgnames,
  File "/home/anaconda3/envs/m2m/lib/python3.9/site-packages/emapper2gbk/emapper2gbk.py", line 196, in gbk_creation
    gbk_results = gbk_pool.starmap(genes_to_gbk.faa_to_gbk, multiprocess_data)
  File "/home/anaconda3/envs/m2m/lib/python3.9/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/home/anaconda3/envs/m2m/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
KeyError: 'description'

ArnaudBelcour commented 2 years ago

Hello @boaty,

Thank you for the issue.

The KeyError comes from an error in the code that creates the input dictionary holding the metadata associated with the Organism/Genome Name. The commit 113651f73917b8c074737e8b239053ec2f12e39f should fix this issue. Can you use the GitHub version of emapper2gbk and see if this fixes the issue on your end?
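
To illustrate the failure mode, here is a minimal sketch of how a metadata dictionary without a 'description' entry raises this KeyError, and how building the dictionary with defaults avoids it. This is not the code from emapper2gbk or from the commit above: `build_species_info`, `REQUIRED_KEYS` and the 'organism' key are hypothetical names chosen for the example. Only the 'description' key and the default name 'metagenome' come from the log in this issue.

```python
# Hypothetical illustration only; not the actual emapper2gbk code or fix.
# 'description' is the key from the traceback; 'organism' is an assumed key.
REQUIRED_KEYS = ("organism", "description")


def build_species_info(org_name="metagenome", description=None):
    """Return a metadata dict that always contains every required key."""
    info = {
        "organism": org_name,
        # Fall back to the organism name when no description is given.
        "description": description if description is not None else org_name,
    }
    missing = [key for key in REQUIRED_KEYS if key not in info]
    if missing:
        raise ValueError(f"species metadata is missing keys: {missing}")
    return info


# A dictionary built without 'description' fails on lookup, as in the traceback.
incomplete = {"organism": "metagenome"}
try:
    incomplete["description"]
except KeyError as err:
    print(f"KeyError reproduced: {err}")

# The defensive builder always provides the key.
complete = build_species_info()
print(complete["description"])
```

Checking the keys when the dictionary is created makes the error appear in the main process, with a clear message, rather than inside a multiprocessing worker.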

The issue with the missing ":" sometimes happens after an update of the Gene Ontology obo file. It is already fixed in the obo file in the GitHub repository (https://github.com/geneontology/go-ontology/commit/2f630886cf2a1cbf8163d1ddf2cd58b00c927482), so we should wait for the next release to get this fix. In the meantime, I will try to implement a function that queries this obo file when we encounter an issue with the released Gene Ontology file. Maybe this could fix the issue.
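
A rough sketch of what such a check could look like, under stated assumptions: it treats any line that is not blank, not a "!" comment and not a stanza header like [Term] as a "tag: value" line, and flags it when the ":" is absent. The function names, the fallback loader and the sample lines are made up for this example (the real malformed line in go-basic.obo is not shown in this issue), and the check ignores rarer OBO features such as continued lines.

```python
# Hypothetical sketch; function names and sample data are invented for illustration.
def find_malformed_obo_lines(lines):
    """Return (line_number, text) for tag lines that lack the ':' separator."""
    malformed = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        # Skip blank lines, '!' comments and stanza headers such as [Term].
        is_header = line.startswith("[") and line.endswith("]")
        if not line or line.startswith("!") or is_header:
            continue
        if ":" not in line:
            malformed.append((number, line))
    return malformed


def load_obo_lines(primary_lines, fallback_loader):
    """Use the primary file unless it has malformed lines; else call the fallback."""
    if find_malformed_obo_lines(primary_lines):
        # fallback_loader is any callable returning lines from another source,
        # for example a copy of the obo file taken from the GitHub repository.
        return list(fallback_loader())
    return list(primary_lines)


# Made-up sample: the released file has a 'name' line without ':'.
release = [
    "[Term]",
    "id: GO:0000001",
    "name mitochondrion inheritance",
    "namespace: biological_process",
]
repaired = [
    "[Term]",
    "id: GO:0000001",
    "name: mitochondrion inheritance",
    "namespace: biological_process",
]

print(find_malformed_obo_lines(release))
print(load_obo_lines(release, lambda: repaired) == repaired)
```

Passing the fallback as a callable means the second source is only read when the released file actually fails the check.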

Best Regards.

boaty commented 2 years ago

Thank you Arnaud, I tried the GitHub version and it works well, bug fixed!