Sequences (contigs) ids should be equal in gff's table and fasta section

iferres commented 1 year ago

pharokka version: 1.3.2
Python version: 3.9.16
Operating System: Linux

Description

I annotated a bunch of viral genomes with pharokka, and it looks like the sequence IDs in the feature table and in the FASTA headers of the GFF file are not the same. For instance:

##gff-version 3
##sequence-region AP017925.1 1 276958
AP017925.1      PHANOTATE       CDS     30      452     -116.87450862992809     -       0       ID=AP017925_CDS_0001;phrog=1198;top_hit=MG720308_p31;locus_tag=AP017925_CDS_0001;function=other;product=MutT/NUDIX hydrolase
AP017925.1      PHANOTATE       CDS     501     2687    -6900141919402.969      -       0       ID=AP017925_CDS_0002;phrog=2927;top_hit=NC_031039_p151;locus_tag=AP017925_CDS_0002;function=DNA, RNA and nucleotide metabolism;product=DNA polymerase
...
##FASTA
>AP017925.1 Ralstonia phage RP31 DNA, complete genome
ACGAGAGAGGAGGCGAATGCCTCCTCTCTCTATGCCGCTATGGTAATGCGGCTGGGTACA
AAACCCTTTTCCACCAGAGATTTCAACGGCGGAAAGAGATTCTCAGGCAACTTATCCCAT
...

In this case, AP017925.1 (first column of the GFF table) does not match AP017925.1 Ralstonia phage RP31 DNA, complete genome (the header in the FASTA section of the GFF file), which may prevent third-party software from reading the file correctly. For comparison, the same genome annotated with prokka outputs:

##gff-version 3
##sequence-region AP017925.1 1 276958
AP017925.1      Prodigal:002006 CDS     30      452     .       -       0       ID=AP017925_00001;inference=ab initio prediction:Prodigal:002006;locus_tag=AP017925_00001;product=hypothetical protein
AP017925.1      Prodigal:002006 CDS     501     2687    .       -       0       ID=AP017925_00002;inference=ab initio prediction:Prodigal:002006;locus_tag=AP017925_00002;product=hypothetical protein
...
##FASTA
>AP017925.1
ACGAGAGAGGAGGCGAATGCCTCCTCTCTCTATGCCGCTATGGTAATGCGGCTGGGTACA
AAACCCTTTTCCACCAGAGATTTCAACGGCGGAAAGAGATTCTCAGGCAACTTATCCCAT
...

In this case, the identifiers match, so the file is easy to parse.
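
Until this is fixed upstream, the mismatch can be detected and worked around with a small post-processing step. The following is only a sketch, not pharokka's own code: it assumes a tab-separated GFF3 file with an embedded `##FASTA` section, and the function names `gff_id_mismatches` and `strip_fasta_descriptions` are hypothetical helpers made up for illustration.

```python
def gff_id_mismatches(gff_text):
    """Return FASTA headers (after ##FASTA) that match no seqid in column 1."""
    seqids = set()
    headers = []
    in_fasta = False
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            in_fasta = True
        elif in_fasta:
            if line.startswith(">"):
                headers.append(line[1:])
        elif line and not line.startswith("#"):
            # Feature line: the seqid is the first tab-separated column.
            seqids.add(line.split("\t", 1)[0])
    return [h for h in headers if h not in seqids]


def strip_fasta_descriptions(gff_text):
    """Cut each FASTA header in the ##FASTA section to its first token."""
    out = []
    in_fasta = False
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            in_fasta = True
        elif in_fasta and line.startswith(">"):
            tokens = line[1:].split()
            line = ">" + (tokens[0] if tokens else "")
        out.append(line)
    return "\n".join(out) + "\n"


# Shortened version of the pharokka output shown above (assumed tab-separated).
example = "\n".join([
    "##gff-version 3",
    "##sequence-region AP017925.1 1 276958",
    "AP017925.1\tPHANOTATE\tCDS\t30\t452\t.\t-\t0\tID=AP017925_CDS_0001",
    "##FASTA",
    ">AP017925.1 Ralstonia phage RP31 DNA, complete genome",
    "ACGAGAGAGG",
]) + "\n"

fixed = strip_fasta_descriptions(example)
```

On the shortened example, `gff_id_mismatches(example)` reports the long header, while `gff_id_mismatches(fixed)` returns an empty list because the header is reduced to `AP017925.1`. Truncating at the first whitespace follows the usual FASTA convention that the ID is the first token of the header line.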

PS. Thanks for this cool software!

gbouras13 commented 1 year ago

Hi @iferres

Thanks for spotting this mate - I am intending to update/refactor the code of pharokka in August so will add this to the changes then!

George

gbouras13 commented 1 year ago

Fixed by #275

gbouras13 / pharokka

Sequences (contigs) ids should be equal in gff's table and fasta section #267
