gbouras13 / pharokka

fast phage annotation program
MIT License

Number-like sequence ID would generate invalid genbank files. #334

Closed shenwei356 closed 6 months ago

shenwei356 commented 7 months ago

Description

Hi, all. This is a very interesting bug, and I can reproduce it. The input FASTA file contains a single sequence record with the ID 01E2.

Command:

pharokka.py -d /home/shenwei/ws/db/pharokka/v1.4.0 \
    -i genomes.annotation.pharokka/01E2/01E2.fasta \
    -o genomes.annotation.pharokka/01E2/01E2 \
    --force -t 20 -p 01E2

The GenBank file looks like the one below. It is split into two records: one with only the sequence, and another with only the annotations, haha.

LOCUS       01E2                   43095 bp    DNA     linear   PHG 06-MAR-2024
DEFINITION  01E2.
ACCESSION   01E2
VERSION     01E2
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
ORIGIN
        1 cccctttagt accatcgcgt acttctttga gtatcgccac gacacgaccc ataccgtcct
 ....

//
LOCUS       100.0                  43095 bp    DNA     linear   PHG 06-MAR-2024
DEFINITION  .
ACCESSION   100
VERSION     100.0
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     CDS             complement(2..1480)
                     /ID="RWRSYBXE_CDS_0001"
   ..
     CDS             complement(42274..43095)
                     /ID="RWRSYBXE_CDS_0074"
                     /transl_table=nan
                     /phrog="3228"
                     /top_hit="NC_023608_p43"
                     /locus_tag="RWRSYBXE_CDS_0074"
                     /function="DNA"
                     /function=" RNA and nucleotide metabolism"
                     /product="replicative helicase-primase"
                     /source="PHANOTATE_1.5.1"
                     /score="-632407.7737456377"
                     /phase="0"
                     /translation="LNQTFGNWIDKFNENSAGQDGMGRVVAILKEVRDGTKGAVGVVHH
                     TPKGGSKARGSGALYAGVDVELTLVRATEKQINVAHTKNKNGMQQKTIGMVLEPVQFRE
                     APPPKEFQAVEFVGGEGYGEIVNLDLPEPHKALVLMPWGFQPFETDEEKERNEGLDGKG
                     KDYVKDTVKRSKDASARESVMSALEDLQQADDTGRGFTQRQIVARAGDHSITNLVLEKM
                     LREGELMLGCDENGEVVTNTYRLPTGIDDRKRPKNRYEPNDNIKATEGDLE"
ORIGIN
//

OK, if I rename the FASTA ID to a string that does not start with 0, everything works.

Possible reason

01E2 might be parsed as a number in scientific notation (1E2 = 100), because the second GenBank record starts with

LOCUS       100.0

If I change the ID to 0102, the problem appears again, with the second GenBank record starting with

LOCUS       102     

So a note should be shown to users: do not use sequence IDs that look like numbers, including scientific notation.
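The suspected coercion can be reproduced with plain Python, and the same check could back the suggested warning. This is only a sketch: `coerce_like_a_table_reader` and `is_number_like` are hypothetical helpers written for illustration, not Pharokka functions, and the int-then-float order is an assumption about how a type-inferring table reader behaves.

```python
def coerce_like_a_table_reader(value: str):
    """Mimic naive type inference: try int, then float, else keep the text."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def is_number_like(seq_id: str) -> bool:
    """Return True if the ID would be turned into a number by the inference above."""
    return not isinstance(coerce_like_a_table_reader(seq_id), str)


# "01E2" and "0102" are the IDs from this report; "phage_01E2" is a made-up safe ID.
for seq_id in ("01E2", "0102", "phage_01E2"):
    print(seq_id, "->", repr(coerce_like_a_table_reader(seq_id)), is_number_like(seq_id))
```

A check like `is_number_like` could be run on every FASTA ID before annotation starts, so that the user gets a warning instead of a malformed GenBank file.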

gbouras13 commented 6 months ago

Hi @shenwei356 ,

Thanks for picking this bug up.

Pharokka needs a refactor now that I am a much better coder than I was when I first wrote it. I just need to find some time!

This bug seems to be caused by bad typing when parsing the gene prediction summary file.

I can reproduce the error in v1.7.0 for scientific notation and for integers that have leading 0s: these were being parsed as numbers, not as str.

I've put in a fix in v1.7.1 that always parses everything as str, and I have tested it locally. It is on the dev branch if you are keen; otherwise it should be available once all the CI checks have passed.
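The thread does not show the actual change, so the following is only a sketch of the "read everything as str" idea using the standard library. The table layout and the column names (`contig`, `start`, `stop`) are hypothetical, not the real format of the gene prediction summary file.

```python
import csv
import io

# Hypothetical tab-separated summary table; the real columns are not given in the thread.
table = "contig\tstart\tstop\n01E2\t2\t1480\n"

# csv.DictReader never infers types, so every field arrives as a str.
rows = list(csv.DictReader(io.StringIO(table), delimiter="\t"))

for row in rows:
    # Convert only the columns that are genuinely numeric; the ID stays as text.
    row["start"] = int(row["start"])
    row["stop"] = int(row["stop"])

print(rows[0])
```

Keeping identifiers as text and converting coordinates explicitly means an ID such as 01E2 cannot silently become 100.0.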

George