AnantharamanLab / VIBRANT

Virus Identification By iteRative ANnoTation
GNU General Public License v3.0

GenBank ValueError with BioPython #71

Open MrTomRod opened 1 year ago

MrTomRod commented 1 year ago

Hey! I ran into trouble with GenBank output. I tried to parse a file that contains these lines with BioPython:

LOCUS       scf_19                 41458 bp    DNA     linear   VRL 2022-09-23
DEFINITION  scf_19.
COMMENT     Annotated using VIBRANT v1.2.1
FEATURES             Location/Qualifiers
     source          /organism="scf_19"

Normally, BioPython uses this code to parse your "invalidly spaced" GenBank files.

But because the LOCUS line is less than 79 characters long, the BioPython parser goes into this code, triggering a ValueError on line 1438:

  File ".../lib64/python3.10/site-packages/Bio/GenBank/Scanner.py", line 1438, in _feed_first_line
    raise ValueError(
ValueError: LOCUS line does not contain - at position 71 in date:
LOCUS       scf_19                 41458 bp    DNA     linear   VRL 2022-09-23

If it's not too much trouble, please fix this issue.

KrisKieft commented 1 year ago

Hi,

I apologize, but I probably will not get to fixing this issue. Please try other methods of building a GenBank file from the source genomes/proteins.

MrTomRod commented 1 year ago

No problem, there are easy workarounds.
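
The thread does not say which workarounds are meant. One possibility is to clean up the LOCUS line before handing the file to the parser. The GenBank format writes the date as DD-MON-YYYY, which is one character longer than the YYYY-MM-DD form in the example above, so rewriting the date may be enough to get past the position check in the traceback. The helper name fix_locus_date is hypothetical, and this is only a sketch under that assumption. It does not touch the malformed source feature or any other problem in the record.

```python
import re

# Month abbreviations as used in GenBank LOCUS dates (DD-MON-YYYY).
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

# An ISO-style date at the very end of the line, plus any trailing whitespace.
_ISO_DATE_AT_END = re.compile(r"(\d{4})-(\d{2})-(\d{2})(\s*)$")


def fix_locus_date(line: str) -> str:
    """Rewrite a trailing YYYY-MM-DD date on a LOCUS line as DD-MON-YYYY.

    Hypothetical helper: lines that are not LOCUS lines, or that do not end
    in an ISO-style date, are returned unchanged.
    """
    if not line.startswith("LOCUS"):
        return line
    match = _ISO_DATE_AT_END.search(line)
    if match is None:
        return line
    year, month, day, trailing = match.groups()
    month_index = int(month) - 1
    if not 0 <= month_index < 12:
        return line  # not a plausible month, leave the line alone
    new_date = f"{day}-{MONTHS[month_index]}-{year}"
    return line[:match.start()] + new_date + trailing


if __name__ == "__main__":
    example = ("LOCUS       scf_19                 41458 bp    DNA     "
               "linear   VRL 2022-09-23")
    print(fix_locus_date(example))
```

Applying the function to every line of the file and passing the result to Bio.SeqIO.parse(handle, "genbank") through an io.StringIO handle would be the obvious next step. Whether the parser then accepts the whole record is untested here.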

peterjc commented 1 year ago

The source feature is also malformed. This used to trigger only a warning in Biopython, but a recent regression turned it into an error; see https://github.com/biopython/biopython/issues/4274

peterjc commented 1 year ago

The Biopython parser flags more GenBank issues in the example from https://github.com/biopython/biopython/issues/4274, and these are probably general issues: