biopython / biopython

Official git repository for Biopython (originally converted from CVS)
http://biopython.org/
Other
4.36k stars 1.75k forks source link

Parsing Genbanks with cross-origin features fails since PR #1497 #1608

Closed Zulko closed 5 years ago

Zulko commented 6 years ago

Hi there and thanks for maintaining the library.

Since version 1.71 and pull request #1497, FeatureLocation will throw an error if its end parameter is lower than it's start. The problem is that other programs are OK with reading or generating such features in genbank files. As a consequence, our programs crash everytime they meet this case, and there is no easy workaround I could think of.

A typical case where this occurs is when the record represents a 7000bp circular plasmid, and a feature goes from coordinate 6800, over the end, to coordinate 100. (therefore, a 300bp feature). I believe this is very common, not against the Genbank standard (as far as I know), and therefore should still be supported without throwing errors.

I understand that the motivation for PR #1497 was that such features may break some of the API, like len(feature), but I would argue the Biopython user could do that check themselves, or maybe there shold be some sort of option to raise an error or not.

peterjc commented 6 years ago

Would it be acceptable in your use case to have a warning here, and a feature with no location?

i.e. What we currently do with other invalid locations?

peterjc commented 6 years ago

P.S. Where did your invalid GenBank file come from? i.e. Which program generated it? It would be worth reporting the bug in their output to them as well.

peterjc commented 6 years ago

Your example should have a location of join(6800..7000,1..100) rather than 6800..100 as per NCBI and EMBL usage.

I agree that as currently written the standard is not explicit enough about this: http://www.insdc.org/files/feature_table.html#3.4

peterjc commented 6 years ago

Another option would be (provided the LOCUS line says this is a circular molecule) to give a warning an convert the location string 6800..100 into join(6800..7000,1..100) but that seems more complicated and fragile, so I am far less keen.

Or, downgrade the new exception to a warning? CC @JamesJeffryes

Zulko commented 6 years ago

At least SnapGene and Benchling (two popular sequence editors) use 6800..100 in their outputs, but I agree that join(6800..7000,1..100) is way more explicit (especially when the feature is detached from the record).

To me the best solution would be to raise a warning on read, and raise an exception with a proper message when for instance using len(feature). When the join syntax gets more commonly used, it could become an exception.

peterjc commented 6 years ago

With that approach, the parser wouldn't have to change - only the SeqFeature location objects must.

First, downgrade the new __init__ exception @JamesJeffryes added in #1497 to a warning. Although we'd be able to give a more explicit warning if the parser handled this...

Second, add a new exception in the location __len__ method when end < start.

peterjc commented 6 years ago

I'd like more opinions, so I tweeted this: https://twitter.com/Biopython/status/984420765444591616

peterjc commented 6 years ago

Paging @kblin for his input (as he deals with a lot of user-submitted GenBank files with anitSMASH).

JamesJeffryes commented 6 years ago

I'm concerned that having start > end also contradicts the documented behavior of start and end: http://biopython.org/DIST/docs/api/Bio.SeqFeature.FeatureLocation-class.html

I'd most prefer the converting the spanning location to a joined location.

peterjc commented 6 years ago

A very positive reply on Twitter from SnapGene:

@djinnome @Biopython @benchling If you think a SnapGene GenBank file violates a format convention, please send an example to feedback@snapgene.com. Thanks!

https://twitter.com/SnapGene/status/984457036783026181

@Zulko could you email them an example and a link to this issue please?

Zulko commented 6 years ago

I will, and I'll follow up here, thanks very much for the effort on your side. I am also starting to think that automatically replacing 6800...100 with join(6800..7000,1..100) would be the best solution, even though I understand it is more efforts.

djinnome commented 6 years ago

Hi folks,

As promised, here is an example genbank file that generates the following error message when parsed by BioPython

from Bio import SeqIO

with open("JPUB_008754.gb", "rU") as input_handle:
    for record in SeqIO.parse(input_handle, "genbank"):
        print(record)
AssertionError: LOCUS line does not contain space at position 64:
LOCUS       pEH010                  5743 bp    DNA     circular

The genbank file was generated from the Public JBEI ICE registry. (you will need to create an account to see the file)

I have also reported this bug to them so that they can fix it in their genbank exporter.

Sincerely,

Jeremy

peterjc commented 6 years ago

@djinnome thank you, but would you mind opening a new issue for that - its the LOCUS line which is a problem in that case, nothing to do with a feature location question as being discussed on this issue.

Zulko commented 6 years ago

Quick update: I had an answer from SnapGene: "Thanks for alerting us to this issue. We are not aware of specific guidelines from NCBI in this regard, but based on a sample record, they do seem to use the join() format for origin-spanning features. (...) We are discussing a quick fix in which we will adopt the join() convention for the GenBank exporter."

They will also contact some customers for opinions. So the join() notation seems to win, but location notations like 6800..100 will still be around for a long time, and I believe Biopython should tolerate these. For now, on our side, I had to freeze Biopython to 1.70.

peterjc commented 6 years ago

That's great from SnapGene, but you are right there will still be plenty of these kinds of origin-wrapping location files out there.

Automatically upgrading these to a join (with a warning) may be the best option...

Paging @kblin one more time for his input.

kblin commented 6 years ago

Sorry, GitHub is being stupid about email notifications, so they keep ending up on an address I don't look at while busy. Having a look now.

kblin commented 6 years ago

So I'm pretty sure I've seen origin-wrapping features from NCBI using the join() notation. Note that this does introduce problems if you just use feature.location.start and feature.location.end, because those will return the lowest and highest coordinate of the CompoundLocation, so you end up with a feature looking like it's spanning the whole genome.

kblin commented 6 years ago

I do think we just raise a warning on other invalid locations, though. So that might be the better way to handle this.

kblin commented 6 years ago

On a quick test, if I have a feature like:

     regulatory      435..
                     /regulatory_class="ribosome_binding_site"

(note the missing end), I'll get this output when parsing it:

>>> from Bio import SeqIO
>>> rec = SeqIO.read('NC_003888.gbk', 'gb')
/home/kai/dev/libs/biopython/Bio/GenBank/__init__.py:1143: BiopythonParserWarning: Couldn't parse feature location: '435..'
  % location_line))

In that case, the FeatureLocation of the corresponding feature will be None, and asking for a len will cause a TypeError:

>>> rec.features[2]
SeqFeature(None, type='regulatory')
>>> len(rec.features[2])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/kai/dev/libs/biopython/Bio/SeqFeature.py", line 383, in __len__
    return len(self.location)
TypeError: object of type 'NoneType' has no len()

I'd handle an origin-wrapping feature the same, I guess.

peterjc commented 6 years ago

Thanks Kai. That's what I was suggesting initially (but you've spelt out the details with an example which is helpful for the others in this discussion). I think I'll make this change, which is a big usability improvement over the current situation, but remain open to considering automatically converting these problematic locations into full origin wrapping joins as a later refinement.

Zulko commented 6 years ago

I am not sure I understand it correctly, but is the plan to convert the feature location to None ? This would silently loose information, and still fail at len(), so what would the advantage be over simply reverting PR #1497 (the feature will be created with an end lower than its start) ?

peterjc commented 6 years ago

My plan right now is to ignore the problematic locations (giving None), but not silently - you get a warning (update - see pull request #1657). This happens already on other kinds of problem location.

The advantage of this is you get an explicit warning and missing data marker, as opposed to silently getting corrupt data.

peterjc commented 6 years ago

As of Biopython 1.72 you will get a warning with these invalid locations, parsing will continue but the problem feature will have None as the location.`

paxelito commented 6 years ago

Dear all,

sorry to read here just now. I am really interested in this issue. I think that this topic should not be related to the current distinct standards or formats (e.g. genbank). Unfortunately, they change in time and it is not possible to stick to all of them. Nevertheless, I think the problem concerns the biological description of a particular context, is an origin wrapping feature possible? yes, for me working in the field of the synthetic biology it can be possible. So, how to deal with that? Actually, I have yet not understood how this issue has been solved, have I to handle it? So, have I to check start and end and use join if necessary, does biopython handle it automatically? Since I have a circular plasmid, the word origin is only functional for us, and a origin-wrapping feature should be exactly the same of any other feature.

Zulko commented 6 years ago

From the discussion, I think everyone agrees that cross-origin features exist, the question is how to represent their location, either 6800..100, or join(6800..7000,1..100) (assuming a 7kb plasmid), and what should Biopython do with 6800..100.

Software like Benchling and SnapGene use 6800..100, which is why in synbio we see a lot of these. The problem with this representation is that it is not self-contained: if you have only the annotation, and no plasmid info, you don't know how many nucleotides are under 6800..100. Plus, some bio-informatics algorithms always assume "direct sense" annotations where (start < end), so 6800..100 makes them crash.

The current solution (from @peterjc' message above) is to drop the data, and replace 6800..100 by None. This is an improvement on 1.71, which was to stop parsing the file and raise an exception. But data gets lost, and there is no easy way to get these locations infos as far as I can see. It is very damaging, in that when customers come with plasmid maps, some features may be missing after processing with biopython.

In conclusion, Biopython 1.70 worked perfectly fine, the 1.71 version broke all our applications. Version 1.72 is better but loses data. We use 1.70 our web apps.

paxelito commented 6 years ago

@Zulko thank you for the summary, now everything is clear. I agree that Biopython 1.70 works fine, nevertheless I think a solution must be found, not only to parse data from a whatever source but also to produce data from analysis. As far as I have understood the current solution now for cross-wrapping features should be using together ´CompoundLocaltion´and ´FeatureLocation´ as in the following

f1 = FeatureLocation(8,10)
f2 = FeatureLocation(0,3)
c1 = CompoundLocation([f1,f2])
s = Seq("ATGCACTGTA")
c1.extract(s)

and manually handle these cases when parsing from sources adopting the format 8..3. Am I right?

peterjc commented 6 years ago

@paxelito yes, for creating origin wrapping features we expect you to explicitly construct the two halves, and combine them into a compound location as you have shown, or equivalently you can use the addition support:

from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation
# Simple example wrapping with a reference of length 10
wrapping_location = FeatureLocation(8,10) +  FeatureLocation(0,3)
s = Seq("ATGCACTGTA")
wrapping_location.extract(s)

P.S. Note that order is very important! The easiest way to confirm you've done what Biopython expects is with wrapping_location.extract(...) or iter(wrapping_location) to confirm the start/end codon placement.

peterjc commented 6 years ago

The best solution from my perspective is to report and fix non-standard tools (with no further action in Biopython).

Snapgene plan to fix their output for follow the NCBI GenBank join convention - are any of you paying customers of Benchling and in a position to ask they do the same?

However, it seems there would be value in going further and handling this in Bioypthon by automatically converting the problematic locations into the explicit join style, logged as #1745.

kblin commented 6 years ago

There's NCBI RefSeq records that have features spanning the origin on circular bacterial genomes. They are always a pain to deal with, even if the GenBank file contains a correct CompoundLocation. Basically, on a circular genome you always need to check if you're working on a regular FeatureLocation or a CompoundLocation that spans the origin. Locations like that also make the logic of "get me all features 5 kbp upstream and downstream of this feature" way more fun.

peterjc commented 6 years ago

@kblin That sounds like good use case for adding feature/location methods for extracting upstream/downstream/flanking sequences - does that seem worth opening an enhancement issue for?

kblin commented 6 years ago

Can do.

jamesabbott commented 6 years ago

I've also just come across this problem with features spanning the origin in some Ensembl bacteria embl format records. Although a parser warning is generated for the invalid location, in this case subsequent exceptions are thrown resulting in the records being unparesable.

For example, parsing the Helicobacter_sp_mit_91_6242 record, which contains:

FT   mRNA            1594494..84
FT                   /gene="BBW65_00005"
FT                   /note="transcript_id=ANV98697"

results in:

BiopythonParserWarning: Ignoring invalid location: '1594494..84'

Traceback (most recent call last): File "/Users/jabbott/miniconda3/envs/PASynth/lib/python3.7/site-packages/Bio/SeqIO/InsdcIO.py", line 269, in _insdc_location_string parts = location.parts AttributeError: 'NoneType' object has no attribute 'parts' During handling of the above exception, another exception occurred: [snip] AttributeError: 'NoneType' object has no attribute 'ref'

This is using biopython 1.72. If the problem occurs with Ensembl data then I suspect that location strings spanning the origin without using a join() shoud be considered valid. I'll raise an issue wiht the Ensembl helpdesk and point them at this issue and see what they say.Interestingly the 'ID' line in these records leaves the field for 'circular|linear' blank so it is not actually possible to determine if the location is valid with regards to the molecule type.

peterjc commented 6 years ago

@jamesabbott From the partial information you've given, this looks like a problem on output - could you file a new issue with full details including a URL for this EMBL format Helicobacter_sp_mit_91_6242 record please?

In short, bits of the output code are not gracefully handling a location of None.

ybendana commented 5 years ago

I am getting this exception ValueError: End location (2) must be greater than or equal to start location (<7342) with 1.73 for this feature:

     LTR             <7343..2
                     /label="ITR-R"

I would rather have just a warning rather than raising an exception or setting the location to None (losing information even if improperly formatted is never a good thing). Vector NTI and Geneious also can output these kind of cross-origin locations. Genbank format is too ambiguous and each vendor/organization implements it differently.

Perhaps some kind of strict parameter can be added where if it is True it throws the exception and if it's False it just warns.

peterjc commented 5 years ago

@ybendana can you try Biopython from github (not 1.73 which is too old) which should give a warning?

We ought to do the Biopython 1.74 release soon anyway...

ybendana commented 5 years ago

@peterjc , just tried with 1.74.dev0 and it gives the same exception.

peterjc commented 5 years ago

@ybendana can you share the test file with this feature (privately by emailing me if needed)? Or make a similar example with the same tool which we could share publicly? Thanks.

ybendana commented 5 years ago

@peterjc , I'm unable to share the original Genbank file but I copied the pUC19c vector from NCBI and added an annotation in Geneious that is cross origin and has a "truncated" start coordinate. I then exported it as "strict Genbank" to remove the Geneious-specific annotations. It displays the start coordinate as "<2644" and biopython gives the same exception noted above. The original file extension is '.gb' but I had to change it to '.txt' to attach it here.

SYNPUC19CV_cross_origin_feat.txt

kblin commented 5 years ago

We clearly still raise a ValueError in https://github.com/biopython/biopython/blob/master/Bio/SeqFeature.py#L759

The correct way to annotate this feature would be join(<2644..2686, 1..159), and that's what Geneious should be doing. As there is no way for Biopython to know if <2644..159 is an incorrectly written cross-origin feature join(<2644..2686, 1..159) or an incorrectly written complement(159..<2544), I don't think there's a good way to fix this in the library.

For my use case raising warnings instead of exceptions make life so much harder, especially when Biopython then goes ahead to put None into the location of a SeqFeature and code only crashes way down the road and you have no idea what happened.

I agree that maybe a "relaxed" and a "strict" mode would maybe help with the different requirements here.

peterjc commented 5 years ago

You can upgrade warnings to exceptions - that makes sense in a context like Kai's use case.

I suppose we could consider adding a strict mode in Bio.SeqIO.parse(..., strict=False) and Bio.SeqIO.read(..., strict=False) with default false for the current behaviour (warnings were the parser can continue). Then when strict=True, the top level function catches any BiopythonParserWarning, and rather than returning the problematic record, raises an exception.

(This would put the switch in a single place and cover the whole of Bio.SeqIO, rather than the alternative of adding strict modes to individual parsers within the framework.)

peterjc commented 5 years ago

@ybendana I've logged the problem you found as a new issue, #2133.