Open rra-coder opened 4 years ago
I think we need to review the LOCUS line parsing for this...
Where is your GenBank format plasmid from? Is the tool generating it up to date, as it sounds like it is outputting invalid feature locations?
This specific example is something I generated to illustrate the problem. Genbank files generated by Geneious (Biomatters) with long names also have this problem.
Geneious can export via a strickt Genbank format which will truncate names longer than 16 characters. In Geneious the name from the locus line is used as the primary name for the sequence entity (in most cases plasmids used in the lab do not follow the genbank naming schema). Thus exporting with the strick genbank definitions can result in loss of critical information in the name.
Alternatively, Geneious can generate genbank files with longer names, which is in line with the new genbank file definitions which allows for LOCUS lines longer than 79 characters. The requirement is that there is a space between the name and the length entity. https://ftp.ncbi.nih.gov/genbank/gbrel.txt
section 3.4.4
I've been running into the same issue with genbank files from Vector NTI (arguably a fossil unter the sequence editors). But in my case the name length is under 26 letters. The first of the following headers is not parsed properly, while the second is
LOCUS 31_KIR2D_Final_NEW 178872 bp DNA circular 29-APR-2021
LOCUS 31_KIR2D_Final_NEW 178872 bp DNA circular 29-APR-2021
The second one brings the total line length to 79, but this does not seem to matter, as I can addd further spaces behind "circular". The important point seems to be that the number of letters before the sequence length digits needs to be 34 letters or less.
Since the isssue was raised already a year ago, I was wondering whether any fix is on the horizon?
As to the new example, the spacing probably routes the code via the legacy column based parsing, the modern field based parsing, or if neither NCBI standard applies, our best effort to guess what the author intended.
Unless you fancy working on it, I'm unlikely to in the short term - no direct funding and growing family demands on my personal time.
This is my first bug report so please let me know if I should do this differently.
Setup
I am reporting a problem with Biopython version, Python version, and operating system as follows:
3.7.6 (default, Jan 8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)] CPython Windows-10-10.0.18362-SP0 1.76
Expected behaviour
Genbank files containing features that span the origin should be fixed in the Bio.Genbank.init.py _loc funtion (lines 349-367).
This code expect that the seq_type contains "circular" in order to fix the feature location. The value of seq_type is passed in the following method: Bio.Genbank.init.py (lines 1121-1137) _FeatureConsumer.location method:
Actual behaviour
For plasmids with "short names" (<26 characters) the value of self._seq_type is "'DNA circular'" which will result in the correct behavior. However, with the new genbank file definitions, the name can be longer (>25 characters). This results in the value of self._seq_type being "DNA". This will result in the _loc function to return None for this feature. When writing these sequence record containing features with None as locations, it will crash the script.
Steps to reproduce
Using the attached genbank file. pTest.zip
Potential solution