AuReMe / emapper2gbk

Convert GFF, fastas, annotation table and species name into Genbank.
GNU Lesser General Public License v3.0

Incorrect gbk files when genes identifiers are numbers #10

Open cfrioux opened 2 years ago

cfrioux commented 2 years ago

Description

Running emapper2gbk in genes mode on gene identifiers that consist only of digits does not create all the GBK features (translation, etc.). There is no crash: a gbk file is created, but it lacks important information.

What I Did

emapper2gbk genes -fn bin.fna -fp bin.faa -o bin.gbk -n "Prevotella" -a bin.tsv
LOCUS       _10007119               3225 bp    DNA              BCT 08-MAR-2022
DEFINITION  Prevotella genome.
ACCESSION   10007119
VERSION     10007119
KEYWORDS    Prevotella.
SOURCE      .
  ORGANISM  Prevotella
            Bacteria; Bacteroidetes; Bacteroidia; Bacteroidales; Prevotellaceae.
FEATURES             Location/Qualifiers
     source          1..3225
                     /scaffold="10007119"
                     /db_xref="taxon:838"
     gene            2..3225
                     /locus_tag="gene_10007119"
     CDS             2..3225
                     /locus_tag="gene_10007119"
ORIGIN
        1 atgaaagatc aaaatattaa gaaggtgttg ctcctcggct ccggtgcgtt gaagatcggt
       61 gaggccggcg agttcgacta ttccggttca caggcactca aggcgctgcg tgaggaaggc
      121 gtctacacgg tgctcatcaa tcctaatatc gccaccgtgc agacctccga gggcgtggcc
     [...]
//
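
Since nothing crashes, the problem is easy to miss. Here is a rough sketch of a check that lists CDS features without a /translation qualifier. It is a plain-text scan of the GenBank file, not part of emapper2gbk, and the function name `cds_missing_translation` is hypothetical. It assumes the standard flat-file layout (feature keys at column 6, qualifiers indented further).

```python
def cds_missing_translation(gbk_text):
    """Return the locus tags of CDS features that have no /translation.

    Hypothetical helper: a plain-text scan that assumes the standard
    GenBank layout (feature keys at column 6, qualifiers indented deeper).
    """
    missing = []
    current = None  # state of the CDS feature being read, or None

    for line in gbk_text.splitlines():
        stripped = line.strip()
        # A feature key line has exactly five leading spaces.
        # (Some sequence lines also match, but they are never "CDS".)
        starts_feature = line.startswith("     ") and line[5:6].strip() != ""
        # Section keywords (FEATURES, ORIGIN, //) start at column 1.
        starts_section = line[:1].strip() != ""

        if starts_feature or starts_section:
            # The previous feature ends here: record it if incomplete.
            if current is not None and not current["translation"]:
                missing.append(current["locus_tag"])
            current = None
            if starts_feature and stripped.split()[0] == "CDS":
                current = {"locus_tag": None, "translation": False}
        elif current is not None:
            if stripped.startswith("/locus_tag="):
                current["locus_tag"] = stripped.split("=", 1)[1].strip('"')
            elif stripped.startswith("/translation="):
                current["translation"] = True

    # Handle a file that ends while a CDS feature is still open.
    if current is not None and not current["translation"]:
        missing.append(current["locus_tag"])
    return missing
```

Calling it on the contents of bin.gbk should return an empty list for a complete file, and the affected locus tags otherwise.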

When adding a prefix to all identifiers, a correct gbk is created:

LOCUS       g10007119               3225 bp    DNA              BCT 08-MAR-2022
DEFINITION  Prevotella genome.
ACCESSION   g10007119
VERSION     g10007119
KEYWORDS    Prevotella.
SOURCE      .
  ORGANISM  Prevotella
            Bacteria; Bacteroidetes; Bacteroidia; Bacteroidales; Prevotellaceae.
FEATURES             Location/Qualifiers
     source          1..3225
                     /scaffold="g10007119"
                     /db_xref="taxon:838"
     gene            2..3225
                     /locus_tag="g10007119"
     CDS             2..3225
                     /locus_tag="g10007119"
                     /gene="carB"
                     /EC_number="6.3.5.5"
                     /dbxref="KEGG:R00256"
                     /dbxref="KEGG:R00575"
                     /dbxref="KEGG:R01395"
                     /dbxref="KEGG:R10948"
                     /dbxref="KEGG:R10949"
                     /translation="MKDQNIKKVLLLGSGALKIGEAGEFDYSGSQALKALREEGVYTVL
                     INPNIATVQTSEGVADQIYFLP[...]"
ORIGIN
        1 atgaaagatc aaaatattaa gaaggtgttg ctcctcggct ccggtgcgtt gaagatcggt
       61 gaggccggcg agttcgacta ttccggttca caggcactca aggcgctgcg tgaggaaggc
      121 gtctacacgg tgctcatcaa tcctaatatc gccaccgtgc agacctccga gggcgtggcc
      [...]
//

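
For anyone who needs the workaround before a fix lands, the prefixing can be scripted roughly as below. This is only a sketch with hypothetical helper names. It assumes the annotation table is tab-separated with the gene identifier in the first column and header or comment lines starting with "#", and that the same prefix is applied to both fasta files and the annotation table so the identifiers still match.

```python
def prefix_fasta_ids(fasta_text, prefix="g"):
    """Prepend `prefix` to the identifier on every FASTA header line."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            line = ">" + prefix + line[1:]
        out.append(line)
    return "\n".join(out) + "\n"


def prefix_annotation_ids(tsv_text, prefix="g"):
    """Prepend `prefix` to the first column of every data row.

    Assumption: lines starting with '#' are headers or comments and the
    identifier is the first tab-separated field of each remaining line.
    """
    out = []
    for line in tsv_text.splitlines():
        if line and not line.startswith("#"):
            line = prefix + line
        out.append(line)
    return "\n".join(out) + "\n"
```

The rewritten contents of bin.fna, bin.faa and bin.tsv can then be saved to new files and passed to emapper2gbk genes with the same options as above.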
cfrioux commented 2 years ago

This case is already partly accounted for in https://github.com/AuReMe/emapper2gbk/blob/master/emapper2gbk/genes_to_gbk.py#L131

ArnaudBelcour commented 2 years ago

This should be fixed in commit https://github.com/AuReMe/emapper2gbk/commit/9025e0883f6c9e67df45e0242cf3273e2d67600e for emapper2gbk genes.

There is still work to do to fix it for emapper2gbk genomes.

ArnaudBelcour commented 2 years ago

A first fix for both genomes and genes has been made in 0.2.0.