Number-like sequence ID would generate invalid genbank files.

pharokka version: 1.7.0
Python version: Python 3.9.18
Operating System: Linux mBio 6.6.16-2-MANJARO SMP PREEMPT_DYNAMIC Sat Feb 10 09:40:02 UTC 2024 x86_64 GNU/Linux

Description

Hi all. This is a very interesting bug, and I can reproduce it. The input FASTA file contains a single sequence record whose ID is 01E2.

Command:

pharokka.py -d /home/shenwei/ws/db/pharokka/v1.4.0 \
    -i genomes.annotation.pharokka/01E2/01E2.fasta \
    -o genomes.annotation.pharokka/01E2/01E2 \
    --force -t 20 -p 01E2

The resulting GenBank file is shown below. It is split into two records: one with only the sequence, and another with only the annotations.

LOCUS       01E2                   43095 bp    DNA     linear   PHG 06-MAR-2024
DEFINITION  01E2.
ACCESSION   01E2
VERSION     01E2
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
ORIGIN
        1 cccctttagt accatcgcgt acttctttga gtatcgccac gacacgaccc ataccgtcct
 ....

//
LOCUS       100.0                  43095 bp    DNA     linear   PHG 06-MAR-2024
DEFINITION  .
ACCESSION   100
VERSION     100.0
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     CDS             complement(2..1480)
                     /ID="RWRSYBXE_CDS_0001"
   ..
     CDS             complement(42274..43095)
                     /ID="RWRSYBXE_CDS_0074"
                     /transl_table=nan
                     /phrog="3228"
                     /top_hit="NC_023608_p43"
                     /locus_tag="RWRSYBXE_CDS_0074"
                     /function="DNA"
                     /function=" RNA and nucleotide metabolism"
                     /product="replicative helicase-primase"
                     /source="PHANOTATE_1.5.1"
                     /score="-632407.7737456377"
                     /phase="0"
                     /translation="LNQTFGNWIDKFNENSAGQDGMGRVVAILKEVRDGTKGAVGVVHH
                     TPKGGSKARGSGALYAGVDVELTLVRATEKQINVAHTKNKNGMQQKTIGMVLEPVQFRE
                     APPPKEFQAVEFVGGEGYGEIVNLDLPEPHKALVLMPWGFQPFETDEEKERNEGLDGKG
                     KDYVKDTVKRSKDASARESVMSALEDLQQADDTGRGFTQRQIVARAGDHSITNLVLEKM
                     LREGELMLGCDENGEVVTNTYRLPTGIDDRKRPKNRYEPNDNIKATEGDLE"
ORIGIN
//

If I rename the FASTA ID to a string that does not start with 0, everything works correctly.
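For anyone hitting the same problem, here is a minimal sketch of that renaming workaround. It is not part of pharokka; the function name `prefix_numeric_ids` and the `seq_` prefix are made up for illustration, and "number-like" is approximated by whatever Python's `float()` accepts, which may be broader or narrower than what actually triggers the bug.

```python
def _looks_numeric(seq_id: str) -> bool:
    """Heuristic: True if Python's float() can parse the ID (e.g. '01E2', '0102')."""
    try:
        float(seq_id)
    except ValueError:
        return False
    return True


def prefix_numeric_ids(fasta_text: str, prefix: str = "seq_") -> str:
    """Return FASTA text in which number-like record IDs get a literal prefix.

    Hypothetical helper: only header lines (starting with '>') are touched,
    and only when the first whitespace-delimited token parses as a number.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            parts = header.split(None, 1)
            if parts and _looks_numeric(parts[0]):
                line = ">" + prefix + header
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(prefix_numeric_ids(">01E2\nACGT\n"), end="")
```

Running the renamed FASTA through the same command should then avoid the split output, based on the observation above that a non-numeric ID works.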

Possible reason

01E2 might be parsed as a number in scientific notation (1 × 10^2 = 100), because the second GenBank record starts with

LOCUS       100.0

If I change the ID to 0102, it fails again, and the second GenBank record starts with

LOCUS       102

So users should be shown a note: do not use sequence IDs that look like numbers, including scientific notation.
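To illustrate the suspected mechanism, here is a small sketch in plain Python. It assumes that some step in the pipeline infers column types and reads the ID as a number; I have not confirmed where (or whether) this happens in the pharokka code. The helper name `numeric_reading` is hypothetical.

```python
from typing import Optional


def numeric_reading(seq_id: str) -> Optional[str]:
    """Return the text a number-inferring parser would produce for seq_id,
    or None if the ID cannot be read as an int or float."""
    for cast in (int, float):
        try:
            return str(cast(seq_id))
        except ValueError:
            continue
    return None


if __name__ == "__main__":
    for seq_id in ("01E2", "0102", "phage_01E2"):
        reading = numeric_reading(seq_id)
        # Warn only when the numeric reading differs from the original text.
        if reading is not None and reading != seq_id:
            print(f"warning: ID {seq_id!r} may be rewritten as {reading!r}")
```

With this reading, 01E2 becomes 100.0 and 0102 becomes 102, which matches the LOCUS names in the broken records. A check along these lines could back the suggested note to users.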

gbouras13 / pharokka

Number-like sequence ID would generate invalid genbank files. #334
