Open krobison13 opened 2 years ago
thanks for reporting a bug rather than just twitter rant!
maybe others can look at this too @cjfields -- its been 15+ years with that code - I think this is more about assumptions about features vs locations here that are obscuring the problem you describe.
If one is reading and writing multi-exons as individual features which is the typical way the frame is encoded this all works as planned - but if a single feature is encoded as a split-location - frame isn't encoded in a multi-location genbank file location necessarily.
probably If you wanted it to be computed from the data that might be helpful but it also make assumptions about the Generic feature being a CDS. This goes back to pre-GFF3 when the assumptions about how parent/child relationships were encoded and there were multiple interpretations of how to do this from gff1->gff2->gff2.5 /gtf etc.
I think much better validators and correctors for GFF (perhaps http://genometools.org/ ) have implemented a more dedicated logic.
maybe you can show input data that you used - are you are converting genbank to GFF and expecting frame to be computed and the assumption that it is a CDS with a frame to be carried through?
The zip file has a simple Genbank-formatted entry and a simple program that exposes the problem -- the correct sequence of frames is 0,0,2,1,0
Yeah I agree w/ @hyphaltip , I suspect there's bit rot from prior logical assumptions that have changed over time. I also vaguely recall Bio::Tools::GFF was to be deprecated in preference to Bio::DB::SeqFeature, though I'm not sure that is still the case.
Would it be worth looking into Bio::Tools::GFF or should we check Bio::DB::SeqFeature? If @scottcain around, maybe he would know? I think there was a GenBank-to-GFF conversion script for Bio::DB::SeqFeature (maybe within the GBrowse2 code?), we could check to see if if gives the correct frames.
Ugh. Bio::Tools::GFF was old and janky a long time ago and should probably be marked as such, since I don't think it is likely to have improved with age. It is hard to remember the logic that went into that bit of code (I don't recall if I wrote it--I hope not--but I certainly might have!). I think @cjfields is right about there having been a GB to GFF3 script, but I don't recall where it lived.
There is a script with GBrowse, https://github.com/GMOD/GBrowse/blob/master/bin/load_genbank.pl, but it loads into a Bio::DB::GFF database (so, GFF2 and mysql or postgres). I don't have the time to do the code archeology to determine if it handles strand better.
Chris Mungall wrote a gbk to gff script that used feature or name overlap to assign genes mRNA CDS to common parent group. It should be in scripts folder.
I think Tools::GFF predates you Scott and was before we had really the same workflow and full feature implementations. I think split location support was an add on previously it round tripped features where locations were explicitly start/stop only. A more db style interface with Lincoln’s DB::GFF was one solution .
now I would use NCBI tbl Format more aggressively and map to GFF / GBK / / ASN.1 from there anyways.
Jason
On Fri, Apr 8, 2022 at 10:02 AM Scott Cain @.***> wrote:
Ugh. Bio::Tools::GFF was old and janky a long time ago and should probably be marked as such, since I don't think it is likely to have improved with age. It is hard to remember the logic that went into that bit of code (I don't recall if I wrote it--I hope not--but I certainly might have!). I think @cjfields https://github.com/cjfields is right about there having been a GB to GFF3 script, but I don't recall where it lived.
There is a script with GBrowse, https://github.com/GMOD/GBrowse/blob/master/bin/load_genbank.pl, but it loads into a Bio::DB::GFF database (so, GFF2 and mysql or postgres). I don't have the time to do the code archeology to determine if it handles strand better.
— Reply to this email directly, view it on GitHub https://github.com/bioperl/bioperl-live/issues/369#issuecomment-1093091967, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAL5O6BKOMI6UK57PLTAHDVEBRCHANCNFSM5S2GDZ3A . You are receiving this because you were mentioned.Message ID: @.***>
-- Sent from Gmail Mobile
Jason Stajich - @.***
I think @cjfields is right about there having been a GB to GFF3 script, but I don't recall where it lived.
There is bin/bp_genbank2gff3
which is in this repo and part of the BioPerl distribution.
There was also a bin/bp_genbank2gff
which was moved to the Bio-DB-GFF distributio.
Apologies to @krobison13 about the wait, but all of us 'old-timers' are pretty time constrained these days.
Coming back around to this, I think we should deprecate Bio::Tools::GFF particularly if there are better options, but we should definitely point in the right direction regardless what we decide.
When writing GFF, the same frame is assigned to every range in a multi-exon gene rather than correctly assigning 0,1 or 2 to specify the frame
Twitter note including image of the offending loop