eggnogdb / eggnog-mapper

Fast genome-wide functional annotation through orthology assignment
http://eggnog-mapper.embl.de
GNU Affero General Public License v3.0

decorate_gff_files #454

Closed nguyenjn1906 closed 1 year ago

nguyenjn1906 commented 1 year ago

Hi all, I am trying to run eggnog and add the eggnog annotation to an existing .gff output from prokka. When I run the command, it looks like the program cannot recognize the .gff file. How do I fix this?

Thanks,

Here's the code that I used to do that:

emapper.py -i panaroo_6_strains_results/pan_genome_reference.fa --itype CDS --translate -o eggnog_result_6_strains_decorate_prokka --output_dir eggnog_pan_reference_6_strains_decorate --decorate_gff prokka_panaroo_apudapuas_results/prokka_panaroo_apudapuas_annotations.gff --cpu 10 --override

Here's the slurm output error:

Functional annotation of hits...
Decorating gff file prokka_panaroo_apudapuas_results/prokka_panaroo_apudapuas_annotations.gff...
Traceback (most recent call last):
  File "/home/nguyenjn/.conda/envs/emapperinstall/bin/emapper.py", line 708, in <module>
    n, elapsed_time = emapper.run(args, args.input, args.annotate_hits_table, args.cache_file)
  File "/home/nguyenjn/.conda/envs/emapperinstall/lib/python3.7/site-packages/eggnogmapper/emapper.py", line 351, in run
    n, elapsed_time = self.run_generator(annotated_hits)
  File "/home/nguyenjn/.conda/envs/emapperinstall/lib/python3.7/site-packages/eggnogmapper/emapper.py", line 288, in run_generator
    for item in generator:
  File "/home/nguyenjn/.conda/envs/emapperinstall/lib/python3.7/site-packages/eggnogmapper/deco/decoration.py", line 91, in decorate_gff
    g_score, g_strand, g_phase, g_attrs) = list(map(str.strip, line.split("\t")))
ValueError: not enough values to unpack (expected 9, got 1)

emapper-2.1.10

emapper.py -i panaroo_6_strains_results/pan_genome_reference.fa --itype CDS --translate -o eggnog_result_6_strains_decorate_prokka --output_dir eggnog_pan_reference_6_strains_decorate --decorate_gff prokka_panaroo_apudapuas_results/prokka_panaroo_apudapuas_annotations.gff --cpu 10 --override

/home/nguyenjn/.conda/envs/emapperinstall/lib/python3.7/site-packages/eggnogmapper/bin/diamond blastp -d '/gpfs/accounts/epid582w23_class_root/epid582w23_class/shared_data/database/eggnog/eggnog_proteins.dmnd' -q '/home/nguyenjn/balunas_lab/emappertmp_dmdn_f5_96ftz/tmphkfre5_g' --threads 10 -o '/home/nguyenjn/balunas_lab/eggnog_pan_reference_6_strains_decorate/eggnog_result_6_strains_decorate_prokka.emapper.hits' --tmpdir '/home/nguyenjn/balunas_lab/emappertmp_dmdn_f5_96ftz' --sensitive --iterate -e 0.001 --top 3 --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovhsp scovhsp

Cantalapiedra commented 1 year ago

Hi @nguyenjn1906 ,

File "/home/nguyenjn/.conda/envs/emapperinstall/lib/python3.7/site-packages/eggnogmapper/deco/decoration.py", line 91, in decorate_gff
g_score, g_strand, g_phase, g_attrs) = list(map(str.strip, line.split("\t")))
ValueError: not enough values to unpack (expected 9, got 1)

What is the format of your .gff file? Is it tab separated?

nguyenjn1906 commented 1 year ago

I believe it is tab separated. Here are a few sample lines from the .gff file:

mrdA_1  Prodigal:2.6    CDS     1       1893    .       +       0       ID=HODOENEF_00001;eC_number=3.4.16.4;Name=mrdA_1;dbxref=COG:COG0768;gene=mrdA_1;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P0AD65;locus_tag=HODOENEF_00001;product=Peptidoglycan D%2CD-transpeptidase MrdA
cynR_1  Prodigal:2.6    CDS     1       960     .       +       0       ID=HODOENEF_00002;Name=cynR_1;gene=cynR_1;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P27111;locus_tag=HODOENEF_00002;product=HTH-type transcriptional regulator CynR
proP_1  Prodigal:2.6    CDS     1       1302    .       +       0       ID=HODOENEF_00003;Name=proP_1;gene=proP_1;inference=ab initio prediction:Prodigal:2.6,similar to AA sequence:UniProtKB:P0C0L7;locus_tag=HODOENEF_00003;product=Proline/betaine transporter

Cantalapiedra commented 1 year ago

Hi @nguyenjn1906 ,

Just to be sure, could you check the number of tab-separated fields per line with awk?

cat GFF_FILE | awk -F $'\t' '{print NF}' | sort | uniq -c
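
For anyone without awk at hand, roughly the same check can be sketched in Python. This is only an illustrative equivalent, not part of eggnog-mapper; the function name count_field_widths and the example lines are made up for this sketch.

```python
from collections import Counter


def count_field_widths(lines):
    """Count how many lines have each number of tab-separated fields.

    Note: an empty line counts as 1 field here, whereas awk reports NF=0.
    """
    return Counter(len(line.rstrip("\r\n").split("\t")) for line in lines)


# Small in-memory example; for a real file, pass an open file handle instead.
example = [
    "##gff-version 3\n",
    "seq1\tProdigal:2.6\tCDS\t1\t1893\t.\t+\t0\tID=gene_1\n",
    ">seq1\n",
    "ACGT\n",
]
widths = count_field_widths(example)
for n_fields, n_lines in sorted(widths.items()):
    # Same column order as `uniq -c`: line count first, then field count.
    print(n_lines, n_fields)
```

Any count other than 9 points to lines that are not feature records, such as directives or embedded sequence data.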

steven-bioinfo commented 1 year ago

Hello,

I get the same error, also with a Prokka GFF output file.

Here is the result of your awk command:

cat annotation/predicted_genes/prokka/Sample_Name.gff | awk -F $'\t' '{print NF}' | sort | uniq -c
 606887 1
  33056 9

The lines with only one column are the header lines at the start of the file:

 head annotation/predicted_genes/prokka/Sample_Name.gff
##gff-version 3
##sequence-region MGS_0 1 1720
##sequence-region MGS_1 1 3645
##sequence-region MGS_2 1 6843
##sequence-region MGS_3 1 5133
##sequence-region MGS_4 1 6541
##sequence-region MGS_5 1 8044
##sequence-region MGS_6 1 1667
##sequence-region MGS_7 1 2258
##sequence-region MGS_8 1 8431

and the contig sequences at the end of the file:

tail annotation/predicted_genes/prokka/Sample_Name.gff
>MGS_2537
ACAGACTTGCCTTTCCCATTCTTCCCCACTAATACATTAACATCGTCCAAAACCCATTCA
ACATTATATTCATCGAAGAGGTTCTCTATACTTAATTTTTTTATTTTTACGCTCATTAAT
CTACCGACCTGATAATCCATTATGTTTGATGAGTAACACTGTAGCTTGATTCTCGCTTCA
...
CATCGTCCAAAACCCATTCAACATTATATTCATCGAAGAGGTTCTCTATACTTAATTTTT
TTATTTTTACGCTCATTAATC

I read your code, and I think the problem is in this loop: https://github.com/eggnogdb/eggnog-mapper/blob/master/eggnogmapper/deco/decoration.py#L84. You can stop reading when you reach the line "##FASTA"; everything after it is the contig sequences.
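
The idea can be sketched as follows. This is a minimal illustration and not the actual eggnog-mapper code or the fix that was committed: the function name iter_gff_features is hypothetical, and it assumes that header lines start with "#" and that a "##FASTA" line marks the start of the embedded sequences.

```python
def iter_gff_features(lines):
    """Yield the 9 tab-separated fields of each GFF feature line.

    Assumptions (not taken from eggnog-mapper): header/directive lines
    start with '#', and a '##FASTA' line marks the start of the embedded
    sequence section, so reading stops there.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("##FASTA"):
            break  # everything after this is sequence data, not features
        if not line or line.startswith("#"):
            continue  # skip blank lines and header directives
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) != 9:
            raise ValueError(f"expected 9 fields, got {len(fields)}: {line!r}")
        yield fields


# Made-up example mimicking the layout of a Prokka GFF file.
example = [
    "##gff-version 3",
    "##sequence-region MGS_0 1 1720",
    "MGS_0\tProdigal:2.6\tCDS\t1\t960\t.\t+\t0\tID=gene_1",
    "##FASTA",
    ">MGS_0",
    "ACAGACTTGCC",
]
features = list(iter_gff_features(example))
print(len(features), features[0][2])
```

With the early break, the sequence lines after "##FASTA" never reach the 9-field unpacking, which is where the ValueError in the traceback was raised.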

Regards,

Steven

Cantalapiedra commented 1 year ago

Hi @steven-bioinfo ,

Thank you very much for the info and for providing a solution. I will include the fix in the master branch. Hopefully, embedding FASTA sequences within GFF files won't become a trend among other tools.

Best, Carlos