deprekate / PHANOTATE

PHANOTATE: a tool to annotate phage genomes.
GNU General Public License v3.0

Internal stop codons and symbols in version 1.6.5 #41

Closed dcampbdc closed 1 month ago

dcampbdc commented 1 month ago

I'm trying to run Phanotate as follows:

$ phanotate.py -f faa -o 10F-1.faa 10F-1.fna

But the output I get has symbols (#, +, *) inside the sequences, and some sequences end with a symbol other than the expected stop codon (*):

10F_NODE_1_length_1907_cov_3.266739CDS[complement(66..398)] [note=score:-1.931059E-01]
HPCFLILSRKTSPNQKTWVKFL*KRNS#NCIGVFVVSVCPHTTYTHTGVEY+GSPFS#NDDP#KYGKPRHFPPVSVLLLVITLRE*RQAVKKPFDVGVFKQREKDDDD+DI
10F_NODE_1_length_1907_cov_3.266739CDS[complement(476..565)] [note=score:-5.421236E-02]
NCF#NHAW++SNRVPPKQIVRQCRRLKVKT
10F_NODE_1_length_1907_cov_3.266739CDS[579..944] [note=score:-9.828631E+01]
MLYTEKEKHEIERVKEVFAEHLRQSPDFELLWSDKVGYVWLTIGVNPVYVDTGIRIESAADLCGRCLDDVATDVLYTTGNDHALEVADPLELAEIKRRWEPYINQLPDYAYLCKDLLNGKM#
10F_NODE_1_length_1907_cov_3.266739CDS[1004..1846] [note=score:-3.306728E+04]
MKKSLTFRLWQDRKSILISCGARLAPFDIQELRDLTMYDELQLDTLGDKKTALFLIMSDTDSTFNFLISMVYTQLFNLLCDKADDQYGGKLPVHVRCLIDECANIGQIPNLEKLVATIRSREISACLVLQARSQLKAIYKDNADTIVGNMDSQIFLGGSEPTTLKDLSEILGKETIDAFNTSDTRGNSPSYGTTFQKMGHELLSRDELAVLDAGKCILQLRGVRPFLSDKYDLTQHPNYKLTSDYDPKNTFDIEKYLNRKEKIYPDDEFIVVDADSLPPA*

I saw this was an issue in an older version, has it re-emerged?
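For anyone wanting to confirm the same symptom on their own output, here is a minimal sketch of a checker for a protein FASTA file. It is not part of PHANOTATE; `parse_fasta` and `check_protein` are hypothetical helper names, and it assumes each record header starts with `>` and that #, + and * are the only stop symbols.

```python
# Assumption: these three characters are the only stop symbols in the output.
STOP_SYMBOLS = set("#+*")


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def check_protein(seq):
    """Return a list of problems found in one translated CDS."""
    problems = []
    # Any stop symbol before the final residue is an internal stop (0-based index).
    internal = [i for i, aa in enumerate(seq[:-1]) if aa in STOP_SYMBOLS]
    if internal:
        problems.append(f"internal stop symbols at positions {internal}")
    if not seq or seq[-1] not in STOP_SYMBOLS:
        problems.append("does not end in a stop symbol")
    return problems


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as handle:
        for name, protein in parse_fasta(handle.read()):
            for problem in check_protein(protein):
                print(f"{name}: {problem}")
```

Run it as `python check_stops.py 10F-1.faa` (the script name is arbitrary); a clean file prints nothing.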

dcampbdc commented 1 month ago

I think the issue arises at the level of the nucleotide sequence. I ran "phanotate.py -f fna" and then translated the output with transeq. The special characters all come out as stop codons (*) now:

10F_NODE_1_length_1907_cov_3.266739CDS[complement(66..398)]_1 [note=score:-1.931059E-01]
HPCFLILSRKTSPNQKTWVKFL*KRNS*NCIGVFVVSVCPHTTYTHTGVEY*GSPFS*ND
DP*KYGKPRHFPPVSVLLLVITLRE*RQAVKKPFDVGVFKQREKDDDD*DI
10F_NODE_1_length_1907_cov_3.266739CDS[complement(476..565)]_1 [note=score:-5.421236E-02]
NCF*NHAW**SNRVPPKQIVRQCRRLKVKT
10F_NODE_1_length_1907_cov_3.266739CDS[579..944]_1 [note=score:-9.828631E+01]
MLYTEKEKHEIERVKEVFAEHLRQSPDFELLWSDKVGYVWLTIGVNPVYVDTGIRIESAA
DLCGRCLDDVATDVLYTTGNDHALEVADPLELAEIKRRWEPYINQLPDYAYLCKDLLNGK
M*
10F_NODE_1_length_1907_cov_3.266739CDS[1004..1846]_1 [note=score:-3.306728E+04]
MKKSLTFRLWQDRKSILISCGARLAPFDIQELRDLTMYDELQLDTLGDKKTALFLIMSDT
DSTFNFLISMVYTQLFNLLCDKADDQYGGKLPVHVRCLIDECANIGQIPNLEKLVATIRS
REISACLVLQARSQLKAIYKDNADTIVGNMDSQIFLGGSEPTTLKDLSEILGKETIDAFN
TSDTRGNSPSYGTTFQKMGHELLSRDELAVLDAGKCILQLRGVRPFLSDKYDLTQHPNYK
LTSDYDPKNTFDIEKYLNRKEKIYPDDEFIVVDADSLPPA*

deprekate commented 1 month ago

Yeah, it is probably an off-by-one error that somehow re-emerged. The #, +, and * symbols each represent a different stop codon (differentiating them is useful in some cases). I'll try to get it fixed by tomorrow.
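To illustrate why an off-by-one shift would show up as internal stop symbols, here is a small self-contained sketch. It is not the PHANOTATE or genbank code. The mapping of TAA, TAG and TGA to #, + and * is an assumption made for the example (the thread only says each symbol is a different stop codon), and `translate` is a hypothetical helper.

```python
from itertools import product

# Standard genetic code in NCBI "TCAG" order, with every stop written as '*'.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# ASSUMPTION: which symbol goes with which stop codon is not given in the
# thread; this assignment is for illustration only.
TABLE.update({"TAA": "#", "TAG": "+", "TGA": "*"})


def translate(dna, offset=0):
    """Translate complete codons of `dna`, starting at 0-based `offset`."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(TABLE[codon] for codon in codons)


orf = "ATGTTAGCTGACTAA"
in_frame = translate(orf)           # "MLAD#": ends in a single stop symbol
shifted = translate(orf, offset=1)  # "C+LT": internal stop, no terminal stop
print(in_frame, shifted)
```

Reading the same bases one position late puts a stop symbol in the middle of the peptide and drops the terminal one, which matches the pattern in the output above.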

dcampbdc commented 1 month ago

Any update on this? Thank you!

deprekate commented 1 month ago

I was able to replicate it. Working on a fix now

deprekate commented 1 month ago

Whew, I finally tracked it down. It actually came from my genbank dependency: I forgot to push the newest version (0.118) of genbank to PyPI. Doh!
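A quick way to see whether a local install already has the fixed dependency is to compare the installed genbank version against 0.118. This is a sketch, assuming the package is distributed on PyPI under the name `genbank` and uses plain dotted numeric versions; the helper names are hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED = "0.118"  # version named in this thread


def version_tuple(text):
    """Turn '0.118' into (0, 118); non-numeric parts are ignored."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def genbank_is_current(installed, required=REQUIRED):
    """True if the installed version is at least the required one."""
    return version_tuple(installed) >= version_tuple(required)


def installed_genbank_version():
    """Return the installed genbank version string, or None if absent."""
    try:
        return version("genbank")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_genbank_version()
    if found is None:
        print("genbank is not installed")
    elif genbank_is_current(found):
        print(f"genbank {found} is new enough")
    else:
        print(f"genbank {found} is older than {REQUIRED}; try upgrading")
```

If the installed version is older, `pip install --upgrade genbank` should pull the newer release once it is on PyPI.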