dzolier closed this issue 2 years ago
Hi @dzolier,
I have looked at the output files from eggnog-mapper and I have detected the issue.
The GFF looks like this:
##gff-version 3
# Sequence Data: seqnum=1;seqlen=62214;seqhdr="U00096.3"
# Model Data: version=Prodigal.v2.6.3;run_type=Single;model="Ab initio";gc_cont=52.15;transl_table=11;uses_sd=1
U00096.3 Prodigal_v2.6.3 CDS 3 98 5.7 + 0 ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.427;conf=78.85;score=5.72;cscore=2.50;sscore=3.22;rscore=0.00;uscore=0.00;tscore=3.22;
U00096.3 Prodigal_v2.6.3 CDS 337 2799 397.7 + 0 ID=1_2;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.531;conf=99.99;score=397.70;cscore=378.47;sscore=19.23;rscore=11.51;uscore=4.52;tscore=3.86;
And the protein fasta looks like this:
>U00096.3_1 # 3 # 98 # 1 # ID=1_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.427
LFILTATGNMSLCGLKKECLIAASELVTCRE*
>U00096.3_2 # 337 # 2799 # 1 # ID=1_2;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.531
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDA
LPNISDAERIFAELLTGLAAAQPGFPLAQLKTFVDQEFAQIKHVLHGISLLGQCPDSINA
ALICRGEKMSIAIMAGVLEARGHNVTVIDPVEKLLAVGHYLESTVDIAESTRRIAASRIP
ADHMVLMAGFTAGNEKGELVVLGRNGSDYSAAVLAACLRADCCEIWTDVDGVYTCDPRQV
PDARLLKSMSYQEAMELSYFGAKVLHPRTITPIAQFQIPCLIKNTGNPQAPGTLIGASRD
EDELPVKGISNLNNMAMFSVSGPGMKGMVGMAARVFAAMSRARISVVLITQSSSEYSISF
CVPQSDCVRAERAMQEEFYLELKEGLLEPLAVTERLAIISVVGDGMRTLRGISAKFFAAL
ARANINIVAIAQGSSERSISVVVNNDDATTGVRVTHQMLFNTDQVIEVFVIGVGGVGGAL
LEQLKRQQSWLKNKHIDLRVCGVANSKALLTNVHGLNLENWQEELAQAKEPFNLGRLIRL
VKEYHLLNPVIVDCTSSQAVADQYADFLREGFHVVTPNKKANTSSMDYYHQLRYAAEKSR
RKFLYDTNVGAGLPVIENLQNLLNAGDELMKFSGILSGSLSYIFGKLDEGMSFSEATTLA
REMGYTEPDPRDDLSGMDVARKLLILARETGRELELADIEIEPVLPAEFNAEGDVAAFMA
NLSQLDDLFAARVAKARDEGKVLRYVGNIDEDGVCRVKIAEVDGNDPLFKVKNGENALAF
YSHYYQPLPLVLRGYGAGNDVTAAGVFADLLRTLSWKLGV*
The issue is that in the protein FASTA, the ID of the protein sequence is U00096.3_1 or U00096.3_2, but in the GFF it is 1_1 or 1_2 (the value after ID=).
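To make the mismatch concrete, here is a minimal Python sketch of how a Prodigal GFF line can be mapped to the matching protein FASTA ID: take the sequence name from the first column and append the gene index that follows the underscore in ID=. The function name prodigal_protein_id is hypothetical, the sketch assumes a tab-delimited GFF, and it is not the code from the emapper2gbk commit.

```python
def prodigal_protein_id(gff_line):
    """Build the protein FASTA ID that corresponds to one Prodigal GFF CDS line.

    Hypothetical helper for illustration; assumes tab-delimited GFF3 columns.
    """
    columns = gff_line.rstrip("\n").split("\t")
    seqid, attributes = columns[0], columns[8]
    # Turn "ID=1_1;partial=10;..." into a dictionary of key/value pairs.
    fields = dict(
        field.split("=", 1) for field in attributes.split(";") if "=" in field
    )
    # "1_1" -> "1": the part after the last underscore is the gene index.
    gene_index = fields["ID"].rsplit("_", 1)[1]
    return f"{seqid}_{gene_index}"


line = "U00096.3\tProdigal_v2.6.3\tCDS\t3\t98\t5.7\t+\t0\tID=1_1;partial=10;start_type=Edge;"
protein_id = prodigal_protein_id(line)
print(protein_id)
```

For the first CDS line shown above, this yields U00096.3_1, which is the ID used in the protein FASTA header.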
With the default behavior of emapper2gbk, this error is to be expected. But since these files are the direct output of eggnog-mapper, I have added a way to handle them in emapper2gbk in the commit https://github.com/AuReMe/emapper2gbk/commit/18ec5e83662b4ce5772896a0d2d912996e07aaec.
To use it, you will have to remove your version of emapper2gbk and install the GitHub version instead.
With this version, you can use a new command-line option, -gt eggnog, which should correctly handle the GFF files from eggnog-mapper.
If you have the time, could you test it and let me know whether it works?
Best regards, Arnaud.
There was a warning about extending the name, but it seems to have worked.
the command:
emapper2gbk genomes -fn /mnt/d/eggnog_fnas -fp /mnt/d/eggnog_faas -o /mnt/d/gbk_files -g /mnt/d/eggnog_gffs -gt eggnog -go /mnt/d/GO_annotations/go-basic.obo -nf /mnt/d/namefile.txt -a /mnt/d/eggnog_annot -c 2 --keep-gff-annotation
the reply:
Creating GFF database (gffutils) for bin.7
Creating GFF database (gffutils) for bin.3
Assembling Genbank informations for bin.7
/home/alicopthera/anaconda3/lib/python3.8/site-packages/Bio/SeqIO/InsdcIO.py:683: BiopythonWarning: Increasing length of locus line to allow long name. This will result in fields that are not in usual positions.
warnings.warn(
Assembling Genbank informations for bin.3
/home/alicopthera/anaconda3/lib/python3.8/site-packages/Bio/SeqIO/InsdcIO.py:683: BiopythonWarning: Increasing length of locus line to allow long name. This will result in fields that are not in usual positions.
warnings.warn(
Creating GFF database (gffutils) for bin.2
Creating GFF database (gffutils) for bin.5
Assembling Genbank informations for bin.2
Assembling Genbank informations for bin.5
Creating GFF database (gffutils) for bin.1
Creating GFF database (gffutils) for bin.4
Assembling Genbank informations for bin.1
Assembling Genbank informations for bin.4
Creating GFF database (gffutils) for bin.6
Creating GFF database (gffutils) for bin.8
Assembling Genbank informations for bin.8
Assembling Genbank informations for bin.6
All genbanks have been created.
--- Total runtime 22.90 seconds ---
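The warning appears only twice in this log even though eight bins were processed. One plausible reading, which is an assumption and was not checked against this run, is that Python's default warning filter reports a given warning only once per source location in each process, and that -c 2 started two worker processes. The sketch below shows that standard-library behavior; write_locus_line and its length threshold are hypothetical stand-ins, not Biopython code.

```python
import warnings


def write_locus_line(name):
    # Hypothetical stand-in for a library call that warns about long names.
    # The threshold of 16 characters is illustrative only.
    if len(name) > 16:
        warnings.warn("Increasing length of locus line to allow long name.")


names = [
    "NODE_27_length_174714_cov_8.012086",
    "NODE_87_length_82681_cov_5.993707",
]

with warnings.catch_warnings(record=True) as caught:
    # "default" reports the first occurrence of a warning for each location.
    warnings.simplefilter("default")
    for name in names:
        write_locus_line(name)

shown = len(caught)
print(f"{len(names)} long names, {shown} warning(s) reported")
```

If that reading is right, the warning is not specific to bin.7 or bin.3; those would simply be the first records written by each process.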
The GenBank files look like they have everything, but in case that warning was specific to bin.3, here is bin.1 for comparison:
LOCUS NODE_27_length_174714_cov_8.012086 174714 bp DNA BCT 20-MAY-2022
DEFINITION Bacteria genome.
ACCESSION NODE_27_length_174714_cov_8
VERSION NODE_27_length_174714_cov_8.12086
KEYWORDS Bacteria.
SOURCE .
ORGANISM Bacteria
.
FEATURES Location/Qualifiers
source 1..174714
/scaffold="NODE_27_length_174714_cov_8.012086"
/db_xref="taxon:2"
gene 638..3424
/locus_tag="NODE_27_length_174714_cov_8.012086_1"
CDS 638..3424
/locus_tag="NODE_27_length_174714_cov_8.012086_1"
/gene="ppc"
/go_component="GO:0005575"
/go_component="GO:0005622"
/go_component="GO:0005623"
/go_component="GO:0005737"
/go_component="GO:0005829"
/go_component="GO:0044424"
/go_component="GO:0044444"
/go_component="GO:0044464"
/go_function="GO:0003674"
/go_function="GO:0003824"
/go_function="GO:0004611"
/go_function="GO:0008964"
/go_function="GO:0016829"
/go_function="GO:0016830"
/go_function="GO:0016831"
/EC_number="4.1.1.31"
/dbxref="KEGG:R00345"
/translation="MIENVIGLQKQGTNNLLRRDVRFLGHILGEVLVHQGGNDLLDVVE
KIREMSKSLRATYVIEIYDDFKQTISSLDPEIRHQVIRAFAIYFQLVNIAEQNHRIRRK
RDYERSTGESVQPGSIESIVQELKNNDTPYEEVQEILKSISLELVMTAHPTEAMRRAVL
DINLRISQDMMKLDNPMLTAREREQLREKLLGEVLNLWQTDELRDRKPTVIDEVRNGLY
YFDETLFDVLPEIYHELERCLNKYYPQEKWHVPSFLKFGSWIGGDRDGNPSVRANVTWE
TLGLHRQLALQKYEEVLKQTLEHMSFSKNIVTVSDALLASIQNDRDALGNVQDVWRNEK
EPYRIKTTYMIEKVHNTGNAHLPASQKYNSPDEFISDLQIIDASLRSHYADYVADKYIK
KLIRQVELFGFHLAALDIRQHSKEHENAMTEILAKMGITSDYSKLSEEEKISLLTDVLN
DPRPITSTYLDYSEGTKECLDVYRTVGKAQKEFGRNCINSYLISMTQGASDLLEVVVFA
KEAGLYRKESDGTVTSTLQSVPLFETIDDLHAAPGIMSTLLAIPAYKASLDPVTQLHEI
MLGYSDSNKDGGVITANWELRMALQDITEAAKKFGVKLKFFHGRGGALGRGGMPLNRSI
LAQPVETLGGGIKITEQGEVLSSRYSLQGIAYRSLEQATFALITASKLSRSPQRHPKED
KWETIMRGISEQAQTKYQDLIFRDEDFLTFFKESTPLPEIGELNIGSRPSKRKNSDKFE
DLRAIPWVFSWTQTRYLLPAWYAAGTGLQSFYQGNSANLETLKEMYEDWSFFRTMIDNL
QMALAKADLQIAKEYGNLVKESQIAERIFNLIREEYELTSSIILQITGQQEILDNVPVI
QESIRLRNPYVDPLSYMQVELLTELRALRDNNEDDAILLREVLLTINGIAAGLRNTG*"
... and it goes on for quite a while; I'll spare you the entirety of it. For bin.3, we get:
LOCUS NODE_87_length_82681_cov_5.993707 82681 bp DNA BCT 20-MAY-2022
DEFINITION Bacteroidetes genome.
ACCESSION NODE_87_length_82681_cov_5
VERSION NODE_87_length_82681_cov_5.993707
KEYWORDS Bacteroidetes.
SOURCE .
ORGANISM Bacteroidetes
Bacteria.
FEATURES Location/Qualifiers
source 1..82681
/scaffold="NODE_87_length_82681_cov_5.993707"
/db_xref="taxon:976"
gene 2..70
/locus_tag="NODE_87_length_82681_cov_5.993707_1"
CDS 2..70
/locus_tag="NODE_87_length_82681_cov_5.993707_1"
/translation="YVADLITAGLGNIKGAYDQKLF*"
gene 51..2384
/locus_tag="NODE_87_length_82681_cov_5.993707_2"
CDS 51..2384
/locus_tag="NODE_87_length_82681_cov_5.993707_2"
/dbxref="PFAM:FtsX"
/dbxref="PFAM:MacB_PCD"
/translation="MIKNYFKTAFRNLFKTPLLSFINIAGLALGMAGTGLLLLNIYYMV
SIDQFHEKKDRVFKVYNKTSINDRVHCHDHSQAPLGPTLQKEFPRIRQMARIAYTGKQF
SYKDKKLQADGYYADAPFLSMFTFPLVTGSKQAVLKDPDAIVLTETMAKKIFGDEDPLH
KVIRLDNTRDVTVTGVLKDIPRNSSLKFDYLLSWEDNNNNWDIYFANTFVELNSPEEKG
VVDKQIAYIISKHSKNEQHSQVFLHPVGKMSLQRHFDEKGNPEIRSEIYFLSVLAVIML
LIGCINFMNLSTAHSGKRGKEVGVRKIMGAVRKSLIIQFITESTLLAFLAGCVGLLIVQ
LVWPSFSNMAKVRINIPWHLPVFWISTLAFVLFTGILAGSYPAFYLSSFKPVRVLKGVF
SNKGALITPRRILVVVQFVLAIFLMNFAILVRKQTNFTENREMGFAKGGLVFHSMTQDL
RKNFDAVQQELVNTGMVEAICKTNSPITRAGGAISGLEWNGREDNKYVSFSLYTTIGDF
VKTNGLTLLAGRDIDYSNYKTDNRSCVINESAARELGFANPVGQTVKEDDRKWTIIGVV
KDFYQNSPGDLAKPIMIRYGTDFGTINIRMQAGSTSLQGFKKVEEIIKKYNPGYITELQ
FADEDVANSFQQRKNASVLINSFTLIAIFIACMGLLGLTAYMVEMRKREVGIRKVLGAS
VATVTSLLTKEFVKLVCVSVIIASPIAWFFMNSFLQQFSYRTNLSWWILPASGVIAIIV
AVATISFQTIKTAIANPVNALRSE*"
gene complement(2430..3341)
/locus_tag="NODE_87_length_82681_cov_5.993707_3"
CDS complement(2430..3341)
/locus_tag="NODE_87_length_82681_cov_5.993707_3"
/dbxref="PFAM:HTH_3"
/dbxref="PFAM:Peptidase_S24"
/translation="MPTFFASNLSFLRKNKGLTQAEVATALGLKRNTFSNYETTHSEPD
LDTLEKIASFFDISMDELISLDLSKGGLVELKGGNDEKNDDRDKKNSAVGGGNVTSAVR
QYLPPIDEDLPVSVVGSTLYPYRRFQAPKIITIDSQGEENIIYVPVKARAGYMSGYSDP
QFIQSLSAYRLPGYTNGTYRIFEVEGHSMFPTLQDADRVIVRWADISEVRDDRVYVLVT
RTQGVLIKRLINRHHEGKIIVKSDNNHAGEFPTIVMDVDEVAEIWYVVERWTRQLPGPG
EIYKRLVNIEAELAMLKQKMGE*"
gene 3421..3663
/locus_tag="NODE_87_length_82681_cov_5.993707_4"
CDS 3421..3663
/locus_tag="NODE_87_length_82681_cov_5.993707_4"
/translation="MHEAKSRKNNDEMMLRRRDVAGLVAEIHGVTADHVRKVVRGDREN
EQILATYMHIIENDNMLLRAAKDVVPFKSNLNPEA*"
gene 3677..4075
/locus_tag="NODE_87_length_82681_cov_5.993707_5"
CDS 3677..4075
/locus_tag="NODE_87_length_82681_cov_5.993707_5"
/translation="MIHLFIQNDLATHLKAQICHLLNWDELQYGEFQFQCGCLYLQYYI
SKDPVAIDEVLLHQLYWKWWKNEWLDRDYVLAGTLMKCDKLSIEEKRRLYRNWHDARVL
ADECSPVGSIMSNGYKTMISEIIKTEVL*"
gene 4072..4521
/locus_tag="NODE_87_length_82681_cov_5.993707_6"
CDS 4072..4521
/locus_tag="NODE_87_length_82681_cov_5.993707_6"
/translation="MNILTRKELSIVSHVITRAQSEIQLQAGIDVVLVPRYSNKRVEDD
VRQLFESMCECWNVQLAWVSDKSRANDRPIMRKLLWMAGKKRFPQMSYCVLANLTGATD
HAGVIKGIRSGYDWLRVQDEKFLKYYEPVKSYLMELEEEQVLSAH*"
gene 4592..5116
/locus_tag="NODE_87_length_82681_cov_5.993707_7"
CDS 4592..5116
/locus_tag="NODE_87_length_82681_cov_5.993707_7"
/gene="gam"
/dbxref="PFAM:Phage_Mu_Gam"
/translation="MFREKKRVINNVDYDQAQEASARYAEVAARLGFIEAQMNERINSI
KDEFADEIIHLTREKEKQFETLEVYAKEQKDNWGRRKSFDLLHSVIGFRTGTPRVTKDK
MFTWDNIVDMVKERFPSLIRVKCELDKEAIIAMRDDKEFLELQKQCYVDVEQGESFFVE
TKMLELQRQRA*"
gene complement(5189..6259)
/locus_tag="NODE_87_length_82681_cov_5.993707_8"
CDS complement(5189..6259)
/locus_tag="NODE_87_length_82681_cov_5.993707_8"
/translation="MHRSNHTTVSSYRILCRMLSLYRQVKKQELYLKNHLPAALAGLAH
DFHQTFSPVLIKRVTKYWQLGLNLVCKNLYDLTGKELQPPEHKRIVLLSVFGPLFDDLF
DDKILGREQIASLVAKPETYVAVNDTDRLVVKIYLEILQTLPEKQLFIEQLQAVAWWQQ
ESLKQLNENISEEELYRITYYKSYYAVLLYCAVLDEYPNSAIREMLFPIAGLMQLTNDA
FDVYKDVNNNVYTLPILYRNFEQLQQHFMAEVARINNTLWQLPGTAKAKNNYAITVHSL
HAMGWMALEQLKQITTGIPTVAALRSLSRKSLVCDMDSFEQKRKWLGHIRRFTNYSDPS
AGNRPTIAMPVLNATL*"
gene complement(6271..7113)
/locus_tag="NODE_87_length_82681_cov_5.993707_9"
CDS complement(6271..7113)
/locus_tag="NODE_87_length_82681_cov_5.993707_9"
/EC_number="1.1.1.31"
/dbxref="KEGG:R05066"
/translation="MKAFLGMGLLGSNFVRAMLKRGETVHVWNRTASKAQELEKAGAKA
FVQAQDAVKGATEIFLTLKDDAAVDEVLKAAEPALTPGATIIDHTTTSKEGAIKRTRDW
KEKGFTYQHAPVFMGPANALDGTGFILLSGDEAVINSLTPALSKMTGKLLNFGSETGKA
AAMKLAGNAFLVCLTASLKDTLTLSNSLGVSVDDLLTLFNSWNPGALVPARVQRMTGAD
HSQPSWELNMARKDTQLFIDAAQQAGNQLVLMPAIAALMDEFINKGFGNYDWTVIGKQ*
"
gene 7301..7576
/locus_tag="NODE_87_length_82681_cov_5.993707_10"
CDS 7301..7576
/locus_tag="NODE_87_length_82681_cov_5.993707_10"
/translation="MAKDKRYNTVKNLITGGYIKSFSEILDTVPKTVVAHDLGMHHQTF
AKLIKSPERFNFKDAFRIASLIEVDDKHIIDLIYNQYANDRKRRKK*"
gene 7701..8639
/locus_tag="NODE_87_length_82681_cov_5.993707_11"
CDS 7701..8639
/locus_tag="NODE_87_length_82681_cov_5.993707_11"
/translation="MNVSIESTLENWVPYKLNSLEDGLHCEWLYTGDTEFTEPFFDETI
AKCRQLYYRGRKSISSIDVLPHWSNEIESVPPSAFIFHVSRCGSTLASQLLALDQTNIV
LSEVPFFDALLRSKENISPQLLKDAITFYSPVKNHRERLFIKTDSWHIFFYKQLRALFP
DTPFILLYRRPDEVMRSQQKRRGMHAIPGLIEPFLFGIENDDVQRMNLDEYLGMVLDKY
FQAFLDIREKDTNVFLINYNEGPVSMVEKIAAITKTIIGSDEMEKIKSRAMYHAKYPEQ
VFAEETLRDPVPVYCRAAYDKYEALEKIRNS*"
gene complement(8597..9130)
/locus_tag="NODE_87_length_82681_cov_5.993707_12"
CDS complement(8597..9130)
/locus_tag="NODE_87_length_82681_cov_5.993707_12"
/gene="cysC"
/go_function="GO:0003674"
/go_function="GO:0003824"
/go_function="GO:0004020"
/go_function="GO:0016301"
/go_function="GO:0016740"
/go_function="GO:0016772"
/go_function="GO:0016773"
/go_process="GO:0006793"
/go_process="GO:0006796"
/go_process="GO:0008150"
/go_process="GO:0008152"
/go_process="GO:0009987"
/go_process="GO:0016310"
/go_process="GO:0044237"
/EC_number="2.7.1.25"
/dbxref="KEGG:R00509"
/dbxref="KEGG:R04928"
/translation="MIIQLTGLSGAGKTTLAEGVKYLLEKDALKVVIIDGDVYRKTLCK
DLGFSKEDRIENIRRLGAAAFSFKDQADIIMIAAINPFEDIRNELKEKYGTKTVWIRCN
MPVLIKRDTKGLYKRALLHDDHPDKIFNLTGVNDTYETPSSPDLIIDTSIETAAESIQK
FYEFLIFSRASYLS*"
... and so on. I'm including this much because that's where I first ran into GO annotations in this bin, in case they're out of place somehow.
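A quick way to check which CDS features carry annotations is to tally the qualifiers under each locus_tag. The following is a minimal standard-library sketch, assuming the flat-file layout shown above with /locus_tag as the first qualifier of each CDS; count_qualifiers is a hypothetical helper, and the sample is a trimmed excerpt of the bin.3 record with the longer translation omitted.

```python
import re
from collections import Counter, defaultdict

# A feature line looks like "CDS 2..70" or "gene complement(2430..3341)".
FEATURE = re.compile(r"^(\w+)\s+\S*\d+\.\.\d+\S*$")
# A qualifier line looks like /gene="gam".
QUALIFIER = re.compile(r"^/(\w+)=(.*)$")


def count_qualifiers(genbank_text):
    """Count qualifiers per CDS locus_tag in GenBank flat-file text."""
    counts = defaultdict(Counter)
    feature, tag = None, None
    for raw in genbank_text.splitlines():
        line = raw.strip()
        feature_match = FEATURE.match(line)
        if feature_match:
            feature, tag = feature_match.group(1), None
            continue
        qualifier_match = QUALIFIER.match(line)
        if not qualifier_match or feature != "CDS":
            continue
        key, value = qualifier_match.groups()
        if key == "locus_tag":
            tag = value.strip('"')
        elif tag is not None:
            counts[tag][key] += 1
    return counts


SAMPLE = '''
     gene            2..70
                     /locus_tag="NODE_87_length_82681_cov_5.993707_1"
     CDS             2..70
                     /locus_tag="NODE_87_length_82681_cov_5.993707_1"
                     /translation="YVADLITAGLGNIKGAYDQKLF*"
     gene            4592..5116
                     /locus_tag="NODE_87_length_82681_cov_5.993707_7"
     CDS             4592..5116
                     /locus_tag="NODE_87_length_82681_cov_5.993707_7"
                     /gene="gam"
                     /dbxref="PFAM:Phage_Mu_Gam"
'''

counts = count_qualifiers(SAMPLE)
for locus_tag, qualifiers in counts.items():
    print(locus_tag, dict(qualifiers))
```

Running the same function on a whole file (read with open(...).read()) gives a per-gene overview of where go_function, go_process, EC_number, and dbxref qualifiers ended up.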
I think these files are the same thing, but I could be wrong. If they are, feel free to close the issue.
Thank you for your help!
Description
I am trying to convert the eggnog-mapper output into gbk files. When I run the conversion on the folders eggnog_fnas, eggnog_faas, eggnog_annot, and eggnog_gff, plus a namefile.txt, I keep getting a "no corresponding protein ID" error, and then the gbk file isn't made.
What I Did
my command:
The reply:
I am a little confused. When I look at the .gff file, I see a header like this: NODE_27_length_174714_cov_8.012086
And when I look in the .faa file, I see this kind of header: >NODE_27_length_174714_cov_8.012086_1
Is that the problem? How would I fix this?
Thank you for reading!