Bug in parsing genbank features spanning the origin for plasmids with long names

rra-coder commented 4 years ago

This is my first bug report so please let me know if I should do this differently.

Setup

I am reporting a problem with Biopython version, Python version, and operating system as follows:

3.7.6 (default, Jan 8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)] CPython Windows-10-10.0.18362-SP0 1.76

Expected behaviour

Genbank files containing features that span the origin should be fixed in the Bio.Genbank.init.py _loc funtion (lines 349-367).

# Attempt to fix features that span the origin
    s_pos = _pos(s, -1)
    e_pos = _pos(e)
    if int(s_pos) > int(e_pos):
        if seq_type is None or "circular" not in seq_type.lower():
            warnings.warn(
                "It appears that %r is a feature that spans "
                "the origin, but the sequence topology is "
                "undefined. Skipping feature." % loc_str,
                BiopythonParserWarning,
            )
            return None
        warnings.warn(
            "Attempting to fix invalid location %r as "
            "it looks like incorrect origin wrapping. "
            "Please fix input file, this could have "
            "unintended behavior." % loc_str,
            BiopythonParserWarning,
        )

This code expect that the seq_type contains "circular" in order to fix the feature location. The value of seq_type is passed in the following method: Bio.Genbank.init.py (lines 1121-1137) _FeatureConsumer.location method:

# Special case handling of the most common cases for speed
        if _re_simple_location.match(location_line):
            # e.g. "123..456"
            s, e = location_line.split("..")
            try:
                cur_feature.location = SeqFeature.FeatureLocation(
                    int(s) - 1, int(e), strand
                )
            except ValueError:
                # Could be non-integers, more likely bad origin wrapping
                cur_feature.location = _loc(
                    location_line,
                    self._expected_size,
                    strand,
                    seq_type=self._seq_type.lower(),
                )
            return

Actual behaviour

For plasmids with "short names" (<26 characters) the value of self._seq_type is "'DNA circular'" which will result in the correct behavior. However, with the new genbank file definitions, the name can be longer (>25 characters). This results in the value of self._seq_type being "DNA". This will result in the _loc function to return None for this feature. When writing these sequence record containing features with None as locations, it will crash the script.

Steps to reproduce

from Bio import SeqIO

with open("pTest.gb") as gb_file:
    for record in SeqIO.parse(gb_file, "genbank"):
        SeqIO.write(record, f"pTest2.gb", "genbank")

Using the attached genbank file. pTest.zip

Potential solution

# Special case handling of the most common cases for speed
        if _re_simple_location.match(location_line):
            # e.g. "123..456"
            s, e = location_line.split("..")
            try:
                cur_feature.location = SeqFeature.FeatureLocation(
                    int(s) - 1, int(e), strand
                )
            except ValueError:
                # Could be non-integers, more likely bad origin wrapping
                cur_feature.location = _loc(
                    location_line,
                    self._expected_size,
                    strand,
                    seq_type=self.data.annotations["topology"].lower(),
                )
            return

peterjc commented 4 years ago

I think we need to review the LOCUS line parsing for this...

Where is your GenBank format plasmid from? Is the tool generating it up to date, as it sounds like it is outputting invalid feature locations?

rra-coder commented 4 years ago

This specific example is something I generated to illustrate the problem. Genbank files generated by Geneious (Biomatters) with long names also have this problem. Geneious can export via a strickt Genbank format which will truncate names longer than 16 characters. In Geneious the name from the locus line is used as the primary name for the sequence entity (in most cases plasmids used in the lab do not follow the genbank naming schema). Thus exporting with the strick genbank definitions can result in loss of critical information in the name. Alternatively, Geneious can generate genbank files with longer names, which is in line with the new genbank file definitions which allows for LOCUS lines longer than 79 characters. The requirement is that there is a space between the name and the length entity. https://ftp.ncbi.nih.gov/genbank/gbrel.txt section 3.4.4

hagen-wende commented 3 years ago

I've been running into the same issue with genbank files from Vector NTI (arguably a fossil unter the sequence editors). But in my case the name length is under 26 letters. The first of the following headers is not parsed properly, while the second is

LOCUS       31_KIR2D_Final_NEW      178872 bp    DNA     circular     29-APR-2021
LOCUS       31_KIR2D_Final_NEW    178872 bp    DNA     circular     29-APR-2021

The second one brings the total line length to 79, but this does not seem to matter, as I can addd further spaces behind "circular". The important point seems to be that the number of letters before the sequence length digits needs to be 34 letters or less.

Since the isssue was raised already a year ago, I was wondering whether any fix is on the horizon?

peterjc commented 3 years ago

As to the new example, the spacing probably routes the code via the legacy column based parsing, the modern field based parsing, or if neither NCBI standard applies, our best effort to guess what the author intended.

Unless you fancy working on it, I'm unlikely to in the short term - no direct funding and growing family demands on my personal time.

biopython / biopython