AuReMe / emapper2gbk

Convert GFF, fastas, annotation table and species name into Genbank.
GNU Lesser General Public License v3.0

No Corresponding protein ID between GFF and FAA #15

Closed dzolier closed 2 years ago

dzolier commented 2 years ago

Description

I am trying to convert eggnog-mapper output into gbk files. When I run emapper2gbk on the folders eggnog_fnas, eggnog_faas, eggnog_annot, and eggnog_gffs, plus a namefile.txt, I keep getting a "No corresponding protein ID" error and the gbk files are not created.

What I Did

my command:

emapper2gbk genomes -fn /mnt/d/eggnog_fnas -fp /mnt/d/eggnog_faas -o /mnt/d/gbk_files -g /mnt/d/eggnog_gffs -gt cds_only -go /mnt/d/GO_annotations/go-basic.obo -nf /mnt/d/namefile.txt -a /mnt/d/eggnog_annot -c 2 --keep-gff-annotation

The reply:

Creating GFF database (gffutils) for bin.8
Creating GFF database (gffutils) for bin.5
No corresponding protein ID between GFF /mnt/d/eggnog_gffs/bin.5.gff (-g/gff) and Fasta protein /mnt/d/eggnog_faas/bin.5.faa (-fp/protein_fasta) sequence for bin.5
Creating GFF database (gffutils) for bin.6
No corresponding protein ID between GFF /mnt/d/eggnog_gffs/bin.8.gff (-g/gff) and Fasta protein /mnt/d/eggnog_faas/bin.8.faa (-fp/protein_fasta) sequence for bin.8
Creating GFF database (gffutils) for bin.4
No corresponding protein ID between GFF /mnt/d/eggnog_gffs/bin.6.gff (-g/gff) and Fasta protein /mnt/d/eggnog_faas/bin.6.faa (-fp/protein_fasta) sequence for bin.6
Creating GFF database (gffutils) for bin.7
No corresponding protein ID between GFF /mnt/d/eggnog_gffs/bin.4.gff (-g/gff) and Fasta protein /mnt/d/eggnog_faas/bin.4.faa (-fp/protein_fasta) sequence for bin.4
Creating GFF database (gffutils) for bin.2
No corresponding protein ID between GFF /mnt/d/eggnog_gffs/bin.7.gff (-g/gff) and Fasta protein /mnt/d/eggnog_faas/bin.7.faa (-fp/protein_fasta) sequence for bin.7
Creating GFF database (gffutils) for bin.1
No corresponding protein ID between GFF /mnt/d/eggnog_gffs/bin.2.gff (-g/gff) and Fasta protein /mnt/d/eggnog_faas/bin.2.faa (-fp/protein_fasta) sequence for bin.2
Creating GFF database (gffutils) for bin.3
No corresponding protein ID between GFF /mnt/d/eggnog_gffs/bin.1.gff (-g/gff) and Fasta protein /mnt/d/eggnog_faas/bin.1.faa (-fp/protein_fasta) sequence for bin.1
No corresponding protein ID between GFF /mnt/d/eggnog_gffs/bin.3.gff (-g/gff) and Fasta protein /mnt/d/eggnog_faas/bin.3.faa (-fp/protein_fasta) sequence for bin.3
/!\ Only 0 on 8 genbanks have been created, check the logs for error.
--- Total runtime 46.52 seconds ---

I am a little confused. When I look at the .gff file, I see a header like this: NODE_27_length_174714_cov_8.012086

And when I look in the .faa file, I see this kind of header: >NODE_27_length_174714_cov_8.012086_1

Is that the problem? How would I fix this?
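
One way to investigate this kind of mismatch is to compare the IDs in the two files directly. The snippet below is a minimal sketch, not part of emapper2gbk: the helper names `fasta_ids` and `gff_cds_ids` are hypothetical, and the sample records are shortened versions of the eggnog-mapper excerpts quoted later in this thread. It assumes the GFF is tab-separated and that the protein ID is the first token of each FASTA header.

```python
def fasta_ids(lines):
    """Collect the ID (first token after '>') of every FASTA record."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def gff_cds_ids(lines):
    """Collect the ID= attribute of every CDS feature in a GFF3 file."""
    ids = set()
    for line in lines:
        # Skip directives, comments and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        for field in cols[8].split(";"):
            if field.startswith("ID="):
                ids.add(field[3:])
    return ids


# Shortened sample records (hypothetical in-memory stand-ins for the files).
gff_example = [
    "##gff-version  3\n",
    "U00096.3\tProdigal_v2.6.3\tCDS\t3\t98\t5.7\t+\t0\tID=1_1;partial=10;\n",
    "U00096.3\tProdigal_v2.6.3\tCDS\t337\t2799\t397.7\t+\t0\tID=1_2;partial=00;\n",
]
faa_example = [
    ">U00096.3_1 # 3 # 98 # 1 # ID=1_1;partial=10\n",
    "LFILTATGNMSLCGLKKECLIAASELVTCRE*\n",
    ">U00096.3_2 # 337 # 2799 # 1 # ID=1_2;partial=00\n",
    "MRVLKFGGTSVANAERFLRV\n",
]

shared = gff_cds_ids(gff_example) & fasta_ids(faa_example)
print(f"{len(shared)} shared IDs")
```

For real files, the same functions can be called on open file handles, for example `gff_cds_ids(open(path))`. An empty intersection means the GFF and the protein FASTA do not name their proteins the same way.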

Thank you for reading!

ArnaudBelcour commented 2 years ago

Hi @dzolier,

I have looked at the output files from eggnog-mapper and I have detected the issue.

The GFF looks like this:

##gff-version  3
# Sequence Data: seqnum=1;seqlen=62214;seqhdr="U00096.3"
# Model Data: version=Prodigal.v2.6.3;run_type=Single;model="Ab initio";gc_cont=52.15;transl_table=11;uses_sd=1
U00096.3    Prodigal_v2.6.3 CDS 3   98  5.7 +   0   ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.427;conf=78.85;score=5.72;cscore=2.50;sscore=3.22;rscore=0.00;uscore=0.00;tscore=3.22;
U00096.3    Prodigal_v2.6.3 CDS 337 2799    397.7   +   0   ID=1_2;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.531;conf=99.99;score=397.70;cscore=378.47;sscore=19.23;rscore=11.51;uscore=4.52;tscore=3.86;

And the protein fasta looks like this:

>U00096.3_1 # 3 # 98 # 1 # ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.427
LFILTATGNMSLCGLKKECLIAASELVTCRE*
>U00096.3_2 # 337 # 2799 # 1 # ID=1_2;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.531
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDA
LPNISDAERIFAELLTGLAAAQPGFPLAQLKTFVDQEFAQIKHVLHGISLLGQCPDSINA
ALICRGEKMSIAIMAGVLEARGHNVTVIDPVEKLLAVGHYLESTVDIAESTRRIAASRIP
ADHMVLMAGFTAGNEKGELVVLGRNGSDYSAAVLAACLRADCCEIWTDVDGVYTCDPRQV
PDARLLKSMSYQEAMELSYFGAKVLHPRTITPIAQFQIPCLIKNTGNPQAPGTLIGASRD
EDELPVKGISNLNNMAMFSVSGPGMKGMVGMAARVFAAMSRARISVVLITQSSSEYSISF
CVPQSDCVRAERAMQEEFYLELKEGLLEPLAVTERLAIISVVGDGMRTLRGISAKFFAAL
ARANINIVAIAQGSSERSISVVVNNDDATTGVRVTHQMLFNTDQVIEVFVIGVGGVGGAL
LEQLKRQQSWLKNKHIDLRVCGVANSKALLTNVHGLNLENWQEELAQAKEPFNLGRLIRL
VKEYHLLNPVIVDCTSSQAVADQYADFLREGFHVVTPNKKANTSSMDYYHQLRYAAEKSR
RKFLYDTNVGAGLPVIENLQNLLNAGDELMKFSGILSGSLSYIFGKLDEGMSFSEATTLA
REMGYTEPDPRDDLSGMDVARKLLILARETGRELELADIEIEPVLPAEFNAEGDVAAFMA
NLSQLDDLFAARVAKARDEGKVLRYVGNIDEDGVCRVKIAEVDGNDPLFKVKNGENALAF
YSHYYQPLPLVLRGYGAGNDVTAAGVFADLLRTLSWKLGV*

The issue is that in the protein fasta, the ID of the protein sequence is U00096.3_1 or U00096.3_2 but in the GFF it is 1_1 or 1_2 (the number after the ID=).
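
To illustrate the mismatch, here is a minimal sketch of how a Prodigal-style GFF ID could be translated into the matching FASTA ID. It is not taken from the emapper2gbk commit mentioned in this thread; the function name `prodigal_fasta_id` is hypothetical, and the sketch assumes the naming seen in the excerpts above: the FASTA ID is the sequence name from the first GFF column, followed by an underscore and the gene index found after the last underscore of the ID= attribute.

```python
def prodigal_fasta_id(seqid, gff_id):
    """Build the protein FASTA ID from a GFF sequence name and ID= value.

    Assumption: gff_id looks like '<contig number>_<gene index>' and the
    FASTA header uses '<sequence name>_<gene index>'.
    """
    gene_index = gff_id.rsplit("_", 1)[-1]
    return f"{seqid}_{gene_index}"


# (sequence name, ID= attribute) pairs from the GFF excerpt above.
gff_records = [("U00096.3", "1_1"), ("U00096.3", "1_2")]
# Protein IDs from the FASTA excerpt above.
fasta_ids = {"U00096.3_1", "U00096.3_2"}

for seqid, gff_id in gff_records:
    rebuilt = prodigal_fasta_id(seqid, gff_id)
    status = "found" if rebuilt in fasta_ids else "missing"
    print(f"{gff_id} -> {rebuilt} ({status})")
```

The raw ID= values never match the FASTA IDs, but the rebuilt names do, which is why the GFF needs special handling for this output.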

With the default behavior of emapper2gbk, this error is expected. But since these files are the direct output of eggnog-mapper, I have added a way to handle them in emapper2gbk in the commit https://github.com/AuReMe/emapper2gbk/commit/18ec5e83662b4ce5772896a0d2d912996e07aaec.

To use it, you will have to uninstall your current version of emapper2gbk and install the GitHub version instead.

With this version, you can use a new command-line option, -gt eggnog, which should correctly handle the GFF files produced by eggnog-mapper.

If you have the time, could you test it and tell me whether it works?

Best regards, Arnaud.

dzolier commented 2 years ago

There was a warning about extending the name, but it seems to have worked.

the command:

emapper2gbk genomes -fn /mnt/d/eggnog_fnas -fp /mnt/d/eggnog_faas -o /mnt/d/gbk_files -g /mnt/d/eggnog_gffs -gt eggnog -go /mnt/d/GO_annotations/go-basic.obo -nf /mnt/d/namefile.txt -a /mnt/d/eggnog_annot -c 2 --keep-gff-annotation

the reply:

Creating GFF database (gffutils) for bin.7
Creating GFF database (gffutils) for bin.3
Assembling Genbank informations for bin.7
/home/alicopthera/anaconda3/lib/python3.8/site-packages/Bio/SeqIO/InsdcIO.py:683: BiopythonWarning: Increasing length of locus line to allow long name. This will result in fields that are not in usual positions.
  warnings.warn(
Assembling Genbank informations for bin.3
/home/alicopthera/anaconda3/lib/python3.8/site-packages/Bio/SeqIO/InsdcIO.py:683: BiopythonWarning: Increasing length of locus line to allow long name. This will result in fields that are not in usual positions.
  warnings.warn(
Creating GFF database (gffutils) for bin.2
Creating GFF database (gffutils) for bin.5
Assembling Genbank informations for bin.2
Assembling Genbank informations for bin.5
Creating GFF database (gffutils) for bin.1
Creating GFF database (gffutils) for bin.4
Assembling Genbank informations for bin.1
Assembling Genbank informations for bin.4
Creating GFF database (gffutils) for bin.6
Creating GFF database (gffutils) for bin.8
Assembling Genbank informations for bin.8
Assembling Genbank informations for bin.6
All genbanks have been created.
--- Total runtime 22.90 seconds ---

The genbank files look like they have everything, but just in case that warning was for bin.3,

Here's bin.1

LOCUS       NODE_27_length_174714_cov_8.012086 174714 bp    DNA              BCT 20-MAY-2022
DEFINITION  Bacteria genome.
ACCESSION   NODE_27_length_174714_cov_8
VERSION     NODE_27_length_174714_cov_8.12086
KEYWORDS    Bacteria.
SOURCE      .
  ORGANISM  Bacteria
            .
FEATURES             Location/Qualifiers
     source          1..174714
                     /scaffold="NODE_27_length_174714_cov_8.012086"
                     /db_xref="taxon:2"
     gene            638..3424
                     /locus_tag="NODE_27_length_174714_cov_8.012086_1"
     CDS             638..3424
                     /locus_tag="NODE_27_length_174714_cov_8.012086_1"
                     /gene="ppc"
                     /go_component="GO:0005575"
                     /go_component="GO:0005622"
                     /go_component="GO:0005623"
                     /go_component="GO:0005737"
                     /go_component="GO:0005829"
                     /go_component="GO:0044424"
                     /go_component="GO:0044444"
                     /go_component="GO:0044464"
                     /go_function="GO:0003674"
                     /go_function="GO:0003824"
                     /go_function="GO:0004611"
                     /go_function="GO:0008964"
                     /go_function="GO:0016829"
                     /go_function="GO:0016830"
                     /go_function="GO:0016831"
                     /EC_number="4.1.1.31"
                     /dbxref="KEGG:R00345"
                     /translation="MIENVIGLQKQGTNNLLRRDVRFLGHILGEVLVHQGGNDLLDVVE
                     KIREMSKSLRATYVIEIYDDFKQTISSLDPEIRHQVIRAFAIYFQLVNIAEQNHRIRRK
                     RDYERSTGESVQPGSIESIVQELKNNDTPYEEVQEILKSISLELVMTAHPTEAMRRAVL
                     DINLRISQDMMKLDNPMLTAREREQLREKLLGEVLNLWQTDELRDRKPTVIDEVRNGLY
                     YFDETLFDVLPEIYHELERCLNKYYPQEKWHVPSFLKFGSWIGGDRDGNPSVRANVTWE
                     TLGLHRQLALQKYEEVLKQTLEHMSFSKNIVTVSDALLASIQNDRDALGNVQDVWRNEK
                     EPYRIKTTYMIEKVHNTGNAHLPASQKYNSPDEFISDLQIIDASLRSHYADYVADKYIK
                     KLIRQVELFGFHLAALDIRQHSKEHENAMTEILAKMGITSDYSKLSEEEKISLLTDVLN
                     DPRPITSTYLDYSEGTKECLDVYRTVGKAQKEFGRNCINSYLISMTQGASDLLEVVVFA
                     KEAGLYRKESDGTVTSTLQSVPLFETIDDLHAAPGIMSTLLAIPAYKASLDPVTQLHEI
                     MLGYSDSNKDGGVITANWELRMALQDITEAAKKFGVKLKFFHGRGGALGRGGMPLNRSI
                     LAQPVETLGGGIKITEQGEVLSSRYSLQGIAYRSLEQATFALITASKLSRSPQRHPKED
                     KWETIMRGISEQAQTKYQDLIFRDEDFLTFFKESTPLPEIGELNIGSRPSKRKNSDKFE
                     DLRAIPWVFSWTQTRYLLPAWYAAGTGLQSFYQGNSANLETLKEMYEDWSFFRTMIDNL
                     QMALAKADLQIAKEYGNLVKESQIAERIFNLIREEYELTSSIILQITGQQEILDNVPVI
                     QESIRLRNPYVDPLSYMQVELLTELRALRDNNEDDAILLREVLLTINGIAAGLRNTG*"

... and it goes on for quite a while; I'll spare you the entirety of it. For bin.3, we get:

LOCUS       NODE_87_length_82681_cov_5.993707 82681 bp    DNA              BCT 20-MAY-2022
DEFINITION  Bacteroidetes genome.
ACCESSION   NODE_87_length_82681_cov_5
VERSION     NODE_87_length_82681_cov_5.993707
KEYWORDS    Bacteroidetes.
SOURCE      .
  ORGANISM  Bacteroidetes
            Bacteria.
FEATURES             Location/Qualifiers
     source          1..82681
                     /scaffold="NODE_87_length_82681_cov_5.993707"
                     /db_xref="taxon:976"
     gene            2..70
                     /locus_tag="NODE_87_length_82681_cov_5.993707_1"
     CDS             2..70
                     /locus_tag="NODE_87_length_82681_cov_5.993707_1"
                     /translation="YVADLITAGLGNIKGAYDQKLF*"
     gene            51..2384
                     /locus_tag="NODE_87_length_82681_cov_5.993707_2"
     CDS             51..2384
                     /locus_tag="NODE_87_length_82681_cov_5.993707_2"
                     /dbxref="PFAM:FtsX"
                     /dbxref="PFAM:MacB_PCD"
                     /translation="MIKNYFKTAFRNLFKTPLLSFINIAGLALGMAGTGLLLLNIYYMV
                     SIDQFHEKKDRVFKVYNKTSINDRVHCHDHSQAPLGPTLQKEFPRIRQMARIAYTGKQF
                     SYKDKKLQADGYYADAPFLSMFTFPLVTGSKQAVLKDPDAIVLTETMAKKIFGDEDPLH
                     KVIRLDNTRDVTVTGVLKDIPRNSSLKFDYLLSWEDNNNNWDIYFANTFVELNSPEEKG
                     VVDKQIAYIISKHSKNEQHSQVFLHPVGKMSLQRHFDEKGNPEIRSEIYFLSVLAVIML
                     LIGCINFMNLSTAHSGKRGKEVGVRKIMGAVRKSLIIQFITESTLLAFLAGCVGLLIVQ
                     LVWPSFSNMAKVRINIPWHLPVFWISTLAFVLFTGILAGSYPAFYLSSFKPVRVLKGVF
                     SNKGALITPRRILVVVQFVLAIFLMNFAILVRKQTNFTENREMGFAKGGLVFHSMTQDL
                     RKNFDAVQQELVNTGMVEAICKTNSPITRAGGAISGLEWNGREDNKYVSFSLYTTIGDF
                     VKTNGLTLLAGRDIDYSNYKTDNRSCVINESAARELGFANPVGQTVKEDDRKWTIIGVV
                     KDFYQNSPGDLAKPIMIRYGTDFGTINIRMQAGSTSLQGFKKVEEIIKKYNPGYITELQ
                     FADEDVANSFQQRKNASVLINSFTLIAIFIACMGLLGLTAYMVEMRKREVGIRKVLGAS
                     VATVTSLLTKEFVKLVCVSVIIASPIAWFFMNSFLQQFSYRTNLSWWILPASGVIAIIV
                     AVATISFQTIKTAIANPVNALRSE*"
     gene            complement(2430..3341)
                     /locus_tag="NODE_87_length_82681_cov_5.993707_3"
     CDS             complement(2430..3341)
                     /locus_tag="NODE_87_length_82681_cov_5.993707_3"
                     /dbxref="PFAM:HTH_3"
                     /dbxref="PFAM:Peptidase_S24"
                     /translation="MPTFFASNLSFLRKNKGLTQAEVATALGLKRNTFSNYETTHSEPD
                     LDTLEKIASFFDISMDELISLDLSKGGLVELKGGNDEKNDDRDKKNSAVGGGNVTSAVR
                     QYLPPIDEDLPVSVVGSTLYPYRRFQAPKIITIDSQGEENIIYVPVKARAGYMSGYSDP
                     QFIQSLSAYRLPGYTNGTYRIFEVEGHSMFPTLQDADRVIVRWADISEVRDDRVYVLVT
                     RTQGVLIKRLINRHHEGKIIVKSDNNHAGEFPTIVMDVDEVAEIWYVVERWTRQLPGPG
                     EIYKRLVNIEAELAMLKQKMGE*"
     gene            3421..3663
                     /locus_tag="NODE_87_length_82681_cov_5.993707_4"
     CDS             3421..3663
                     /locus_tag="NODE_87_length_82681_cov_5.993707_4"
                     /translation="MHEAKSRKNNDEMMLRRRDVAGLVAEIHGVTADHVRKVVRGDREN
                     EQILATYMHIIENDNMLLRAAKDVVPFKSNLNPEA*"
     gene            3677..4075
                     /locus_tag="NODE_87_length_82681_cov_5.993707_5"
     CDS             3677..4075
                     /locus_tag="NODE_87_length_82681_cov_5.993707_5"
                     /translation="MIHLFIQNDLATHLKAQICHLLNWDELQYGEFQFQCGCLYLQYYI
                     SKDPVAIDEVLLHQLYWKWWKNEWLDRDYVLAGTLMKCDKLSIEEKRRLYRNWHDARVL
                     ADECSPVGSIMSNGYKTMISEIIKTEVL*"
     gene            4072..4521
                     /locus_tag="NODE_87_length_82681_cov_5.993707_6"
     CDS             4072..4521
                     /locus_tag="NODE_87_length_82681_cov_5.993707_6"
                     /translation="MNILTRKELSIVSHVITRAQSEIQLQAGIDVVLVPRYSNKRVEDD
                     VRQLFESMCECWNVQLAWVSDKSRANDRPIMRKLLWMAGKKRFPQMSYCVLANLTGATD
                     HAGVIKGIRSGYDWLRVQDEKFLKYYEPVKSYLMELEEEQVLSAH*"
     gene            4592..5116
                     /locus_tag="NODE_87_length_82681_cov_5.993707_7"
     CDS             4592..5116
                     /locus_tag="NODE_87_length_82681_cov_5.993707_7"
                     /gene="gam"
                     /dbxref="PFAM:Phage_Mu_Gam"
                     /translation="MFREKKRVINNVDYDQAQEASARYAEVAARLGFIEAQMNERINSI
                     KDEFADEIIHLTREKEKQFETLEVYAKEQKDNWGRRKSFDLLHSVIGFRTGTPRVTKDK
                     MFTWDNIVDMVKERFPSLIRVKCELDKEAIIAMRDDKEFLELQKQCYVDVEQGESFFVE
                     TKMLELQRQRA*"
     gene            complement(5189..6259)
                     /locus_tag="NODE_87_length_82681_cov_5.993707_8"
     CDS             complement(5189..6259)
                     /locus_tag="NODE_87_length_82681_cov_5.993707_8"
                     /translation="MHRSNHTTVSSYRILCRMLSLYRQVKKQELYLKNHLPAALAGLAH
                     DFHQTFSPVLIKRVTKYWQLGLNLVCKNLYDLTGKELQPPEHKRIVLLSVFGPLFDDLF
                     DDKILGREQIASLVAKPETYVAVNDTDRLVVKIYLEILQTLPEKQLFIEQLQAVAWWQQ
                     ESLKQLNENISEEELYRITYYKSYYAVLLYCAVLDEYPNSAIREMLFPIAGLMQLTNDA
                     FDVYKDVNNNVYTLPILYRNFEQLQQHFMAEVARINNTLWQLPGTAKAKNNYAITVHSL
                     HAMGWMALEQLKQITTGIPTVAALRSLSRKSLVCDMDSFEQKRKWLGHIRRFTNYSDPS
                     AGNRPTIAMPVLNATL*"
     gene            complement(6271..7113)
                     /locus_tag="NODE_87_length_82681_cov_5.993707_9"
     CDS             complement(6271..7113)
                     /locus_tag="NODE_87_length_82681_cov_5.993707_9"
                     /EC_number="1.1.1.31"
                     /dbxref="KEGG:R05066"
                     /translation="MKAFLGMGLLGSNFVRAMLKRGETVHVWNRTASKAQELEKAGAKA
                     FVQAQDAVKGATEIFLTLKDDAAVDEVLKAAEPALTPGATIIDHTTTSKEGAIKRTRDW
                     KEKGFTYQHAPVFMGPANALDGTGFILLSGDEAVINSLTPALSKMTGKLLNFGSETGKA
                     AAMKLAGNAFLVCLTASLKDTLTLSNSLGVSVDDLLTLFNSWNPGALVPARVQRMTGAD
                     HSQPSWELNMARKDTQLFIDAAQQAGNQLVLMPAIAALMDEFINKGFGNYDWTVIGKQ*
                     "
     gene            7301..7576
                     /locus_tag="NODE_87_length_82681_cov_5.993707_10"
     CDS             7301..7576
                     /locus_tag="NODE_87_length_82681_cov_5.993707_10"
                     /translation="MAKDKRYNTVKNLITGGYIKSFSEILDTVPKTVVAHDLGMHHQTF
                     AKLIKSPERFNFKDAFRIASLIEVDDKHIIDLIYNQYANDRKRRKK*"
     gene            7701..8639
                     /locus_tag="NODE_87_length_82681_cov_5.993707_11"
     CDS             7701..8639
                     /locus_tag="NODE_87_length_82681_cov_5.993707_11"
                     /translation="MNVSIESTLENWVPYKLNSLEDGLHCEWLYTGDTEFTEPFFDETI
                     AKCRQLYYRGRKSISSIDVLPHWSNEIESVPPSAFIFHVSRCGSTLASQLLALDQTNIV
                     LSEVPFFDALLRSKENISPQLLKDAITFYSPVKNHRERLFIKTDSWHIFFYKQLRALFP
                     DTPFILLYRRPDEVMRSQQKRRGMHAIPGLIEPFLFGIENDDVQRMNLDEYLGMVLDKY
                     FQAFLDIREKDTNVFLINYNEGPVSMVEKIAAITKTIIGSDEMEKIKSRAMYHAKYPEQ
                     VFAEETLRDPVPVYCRAAYDKYEALEKIRNS*"
     gene            complement(8597..9130)
                     /locus_tag="NODE_87_length_82681_cov_5.993707_12"
     CDS             complement(8597..9130)
                     /locus_tag="NODE_87_length_82681_cov_5.993707_12"
                     /gene="cysC"
                     /go_function="GO:0003674"
                     /go_function="GO:0003824"
                     /go_function="GO:0004020"
                     /go_function="GO:0016301"
                     /go_function="GO:0016740"
                     /go_function="GO:0016772"
                     /go_function="GO:0016773"
                     /go_process="GO:0006793"
                     /go_process="GO:0006796"
                     /go_process="GO:0008150"
                     /go_process="GO:0008152"
                     /go_process="GO:0009987"
                     /go_process="GO:0016310"
                     /go_process="GO:0044237"
                     /EC_number="2.7.1.25"
                     /dbxref="KEGG:R00509"
                     /dbxref="KEGG:R04928"
                     /translation="MIIQLTGLSGAGKTTLAEGVKYLLEKDALKVVIIDGDVYRKTLCK
                     DLGFSKEDRIENIRRLGAAAFSFKDQADIIMIAAINPFEDIRNELKEKYGTKTVWIRCN
                     MPVLIKRDTKGLYKRALLHDDHPDKIFNLTGVNDTYETPSSPDLIIDTSIETAAESIQK
                     FYEFLIFSRASYLS*"

... and so on. I included this much because that is where the first GO annotations appear in this bin, in case they're out of place somehow.

I think these files are the same thing, but I could be wrong. If they are, feel free to close the issue.
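
One possible sanity check on a generated file is to confirm that every protein ID from the .faa file appears as a locus_tag in the GenBank output. The snippet below is a rough sketch under stated assumptions: the helper names `genbank_locus_tags` and `missing_from_genbank` are hypothetical, the GenBank text is a shortened copy of the bin.3 excerpt above, and the regular expression assumes each locus_tag qualifier fits on a single line, so tags wrapped across lines would be missed.

```python
import re

# Matches qualifiers such as /locus_tag="NODE_87_length_82681_cov_5.993707_1"
LOCUS_TAG = re.compile(r'/locus_tag="([^"]+)"')


def genbank_locus_tags(text):
    """Return the set of locus_tag values found in GenBank-formatted text."""
    return set(LOCUS_TAG.findall(text))


def missing_from_genbank(genbank_text, protein_ids):
    """Return the protein IDs that have no matching locus_tag, sorted."""
    return sorted(set(protein_ids) - genbank_locus_tags(genbank_text))


# Shortened copy of the bin.3 excerpt (translations omitted).
genbank_example = '''
     gene            2..70
                     /locus_tag="NODE_87_length_82681_cov_5.993707_1"
     CDS             2..70
                     /locus_tag="NODE_87_length_82681_cov_5.993707_1"
     gene            51..2384
                     /locus_tag="NODE_87_length_82681_cov_5.993707_2"
     CDS             51..2384
                     /locus_tag="NODE_87_length_82681_cov_5.993707_2"
'''

# IDs as they would appear in the protein FASTA headers; the third one is
# deliberately absent from the shortened excerpt to show a reported miss.
protein_ids = [
    "NODE_87_length_82681_cov_5.993707_1",
    "NODE_87_length_82681_cov_5.993707_2",
    "NODE_87_length_82681_cov_5.993707_3",
]

print(missing_from_genbank(genbank_example, protein_ids))
```

An empty list for a complete file would mean every protein in the FASTA has a corresponding feature in the GenBank output.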

Thank you for your help!