Wrong GenBank output file

pailloufat-stack commented 2 years ago

eggnog2gbk version: 0.1.0
Python version: 3.7.4
Operating System: CentOS-7

Description

Hello,

I did a annotation with eggnog-mapper on my new Pacbio assembly. The annotation and the output files are (look) OK. What I want to do is convert my GFF annotation file into a GenBank file. That's why I use emapper2gbk.

I have a single chromosome assembly >CP019962.1_RagTag , (I only show the beginning of the files) my gene predictions from proidgal :

>CP019962.1_RagTag_1  2  115  -1  ID=1_1;partial=10;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.465
MTKEQKAVLKRALDHYGIDNQLTKAAEEMAELTKEICK
>CP019962.1_RagTag_2  358  468  -1  ID=1_2;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.342
MSWTMIFKKFEFPVLKVPVGNKVYIWLKKNINLLRV*
>CP019962.1_RagTag_3  446  2329  -1  ID=1_3;partial=00;start_type=GTG;rbs_motif=GGxGG;rbs_spacer=5-10bp;gc_cont=0.503
MQIEKLTKETILEDTTFEEIIDEKDEIYRQRLINDLTDRAAELGVKTKFTSLLKAYQKEE
KKMLQEQKKQLQEQNRARILQNLDRRTEFGSECYPDLRCGNWFADETGIRTFGMFGEVQA

my annotation.tsv file :

CP019962.1_RagTag_3     903814.ELI_1277 0.0     1239.2  COG5519@1|root,COG5519@2|Bacteria,1TRMV@1239|Firmicutes,24BJ9@186801|Clostridia 186801|ClostridiaL       Psort location Cytoplasmic, score       -       -       -       -       -       -       -       -       -       -       -       -       DUF927
CP019962.1_RagTag_4     903814.ELI_1276 3.4e-100        370.9   COG0358@1|root,COG0358@2|Bacteria       2|Bacteria      L       DNA primase activity    --       3.6.4.12        ko:K02316,ko:K17680     ko03030,map03030        -       -       -       ko00000,ko00001,ko01000,ko03029,ko03032 -       -       -DUF3991,DnaB_C,Toprim_2,Toprim_3,Toprim_N,zf-CHC2

and my gff file :

CP019962.1_RagTag       eggNOG-mapper   CDS     1       35      69.7    +       .       ID=CP019962.1_RagTag_1740;em_target=903814.ELI_3509;em_score=69.7;em_evalue=3.2e-10;em_tcov=15.6;em_searcher=diamond;em_OGs=COG1668@1|root,COG1668@2|Bacteria,1V2WV@1239|Firmicutes,24IFP@186801|Clostridia;em_COG_cat=CP;em_desc=transmembrane transport;em_max_annot_lvl=186801|Clostridia;em_Preferred_name=;em_KEGG_ko=ko:K16906;em_KEGG_Pathway=ko02010,map02010;em_KEGG_Module=M00224;em_BRITE=ko00000,ko00001,ko00002,ko02000;em_KEGG_TC=3.A.1;em_PFAMs=
CP019962.1_RagTag       eggNOG-mapper   CDS     1       37      80.9    +       .       ID=CP019962.1_RagTag_1273;em_target=903814.ELI_4010;em_score=80.9;em_evalue=1.4e-13;em_tcov=100.0;em_searcher=diamond;em_OGs=COG0257@1|root,COG0257@2|Bacteria,1VK4F@1239|Firmicutes,24UGF@186801|Clostridia,25XP9@186806|Eubacteriaceae;em_COG_cat=J;em_desc=Belongs to the bacterial ribosomal protein bL36 family;em_max_annot_lvl=186801|Clostridia;em_PFAMs=Ribosomal_L36;em_Preferred_name=rpmJ;em_KEGG_ko=ko:K02919;em_KEGG_Pathway=ko03010,map03010;em_KEGG_Module=M00178;em_BRITE=br01610,ko00000,ko00001,ko00002,ko03011;em_GOs=

I get a GenBank file but without the CDS annotations :

  LOCUS       CP019962.1_RagTag    4424507 bp    DNA              BCT 05-NOV-2021
DEFINITION  Firmicutes genome.
ACCESSION   CP019962.1_RagTag
VERSION     CP019962.1_RagTag
KEYWORDS    Firmicutes.
SOURCE      .
  ORGANISM  Firmicutes
            Bacteria.
FEATURES             Location/Qualifiers
     source          1..4424507
                     /scaffold="CP019962.1_RagTag"
                     /db_xref="taxon:1239"
ORIGIN
        1 gcttgcagat ctcttttgtc aattcagcca tctcttccgc ggccttggtg agctggttgt
       61 caatgccgta atggtcaagt gcccttttta ggaccgcttt ctgttctttg gtcatttcat
      121 cctcccgtag gtctcataaa tttcttgcaa cttatagttt tattttttaa ttgttataaa

What I Did

I ran :

emapper2gbk genomes --fastanucleic EggMapper_Annot_Microbial_Assembly_v2_RagTag_Scaffolded.emapper.fna --fastaprot EggMapper_Annot_Microbial_Assembly_v2_RagTag_Scaffolded.emapper.genepred.faa --out test.gbk --gff EggMapper_Annot_Microbial_Assembly_v2_RagTag_Scaffolded.emapper.gff --annotation EggMapper_Annot_Microbial_Assembly_v2_RagTag_Scaffolded.emapper.annotation.tsv -n "Firmicutes"

Do you have an idea to get a correct Genbank file?

Thanks

ArnaudBelcour commented 2 years ago

Hi @pailloufat-stack,

This error could come from how emapper2gbk uses your GFF file. In default mode, emapper2gbk searches for gene in the GFF using the region then extracts the CDS associated to these genes. Like in this example (where emapper2gbk will extract the cds_1 using the Parent=gene_1 relationship):

##gff file
region_1    RefSeq  region  1       12642   .       +       .       ID=region_1
region_1    RefSeq  gene    1       2445    .       -       .       ID=gene_1
region_1    RefSeq  CDS     1       2445    .       -       0       ID=cds_1;Parent=gene_1

But in your GFF file there is only CDS so it fails to extract them as there is no gene feature.

There is an option -gt cds_only for when you have only CDS inside your CDS. Can you add it to your command and see if it solves this issue?

pailloufat-stack commented 2 years ago

Hi @ArnaudBelcour ,

Thanks for the answer. Unfortunately, I ran the command with -gt cds_only but I got the same genbank file.

ArnaudBelcour commented 2 years ago

Hi @pailloufat-stack,

So it seems that there is another issue. I have tested the example lines of your GFF using gffutils and the package is not able to parse correctly your input file. This is strange because normally GFF files from Prodigal can be read with this package (I have tested it recently).

Have you modified the GFF file created by Prodigal? One possible issue could be that the tab-separated element are not well written. This could prevent gffutils from parsing correctly the input GFF and its columns.

AuReMe / emapper2gbk

Wrong GenBank output file #6

Description

What I Did