Bio::Tools::GFF doesn't write eukaryotic multi-exon genes correctly

krobison13 commented 2 years ago

When writing GFF, the same frame is assigned to every range in a multi-exon gene rather than correctly assigning 0,1 or 2 to specify the frame

Twitter note including image of the offending loop

hyphaltip commented 2 years ago

thanks for reporting a bug rather than just twitter rant!

maybe others can look at this too @cjfields -- its been 15+ years with that code - I think this is more about assumptions about features vs locations here that are obscuring the problem you describe.

If one is reading and writing multi-exons as individual features which is the typical way the frame is encoded this all works as planned - but if a single feature is encoded as a split-location - frame isn't encoded in a multi-location genbank file location necessarily.

probably If you wanted it to be computed from the data that might be helpful but it also make assumptions about the Generic feature being a CDS. This goes back to pre-GFF3 when the assumptions about how parent/child relationships were encoded and there were multiple interpretations of how to do this from gff1->gff2->gff2.5 /gtf etc.

I think much better validators and correctors for GFF (perhaps http://genometools.org/ ) have implemented a more dedicated logic.

maybe you can show input data that you used - are you are converting genbank to GFF and expecting frame to be computed and the assumption that it is a CDS with a frame to be carried through?

krobison13 commented 2 years ago

The zip file has a simple Genbank-formatted entry and a simple program that exposes the problem -- the correct sequence of frames is 0,0,2,1,0

gff-bug-reveal.zip

cjfields commented 2 years ago

Yeah I agree w/ @hyphaltip , I suspect there's bit rot from prior logical assumptions that have changed over time. I also vaguely recall Bio::Tools::GFF was to be deprecated in preference to Bio::DB::SeqFeature, though I'm not sure that is still the case.

Would it be worth looking into Bio::Tools::GFF or should we check Bio::DB::SeqFeature? If @scottcain around, maybe he would know? I think there was a GenBank-to-GFF conversion script for Bio::DB::SeqFeature (maybe within the GBrowse2 code?), we could check to see if if gives the correct frames.

scottcain commented 2 years ago

Ugh. Bio::Tools::GFF was old and janky a long time ago and should probably be marked as such, since I don't think it is likely to have improved with age. It is hard to remember the logic that went into that bit of code (I don't recall if I wrote it--I hope not--but I certainly might have!). I think @cjfields is right about there having been a GB to GFF3 script, but I don't recall where it lived.

There is a script with GBrowse, https://github.com/GMOD/GBrowse/blob/master/bin/load_genbank.pl, but it loads into a Bio::DB::GFF database (so, GFF2 and mysql or postgres). I don't have the time to do the code archeology to determine if it handles strand better.

hyphaltip commented 2 years ago

Chris Mungall wrote a gbk to gff script that used feature or name overlap to assign genes mRNA CDS to common parent group. It should be in scripts folder.

I think Tools::GFF predates you Scott and was before we had really the same workflow and full feature implementations. I think split location support was an add on previously it round tripped features where locations were explicitly start/stop only. A more db style interface with Lincoln’s DB::GFF was one solution .

now I would use NCBI tbl Format more aggressively and map to GFF / GBK / / ASN.1 from there anyways.

Jason

On Fri, Apr 8, 2022 at 10:02 AM Scott Cain @.***> wrote:

Ugh. Bio::Tools::GFF was old and janky a long time ago and should probably be marked as such, since I don't think it is likely to have improved with age. It is hard to remember the logic that went into that bit of code (I don't recall if I wrote it--I hope not--but I certainly might have!). I think @cjfields https://github.com/cjfields is right about there having been a GB to GFF3 script, but I don't recall where it lived.

There is a script with GBrowse, https://github.com/GMOD/GBrowse/blob/master/bin/load_genbank.pl, but it loads into a Bio::DB::GFF database (so, GFF2 and mysql or postgres). I don't have the time to do the code archeology to determine if it handles strand better.

— Reply to this email directly, view it on GitHub https://github.com/bioperl/bioperl-live/issues/369#issuecomment-1093091967, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAL5O6BKOMI6UK57PLTAHDVEBRCHANCNFSM5S2GDZ3A . You are receiving this because you were mentioned.Message ID: @.***>

-- Sent from Gmail Mobile

Jason Stajich - @.***

carandraug commented 2 years ago

I think @cjfields is right about there having been a GB to GFF3 script, but I don't recall where it lived.

There is bin/bp_genbank2gff3 which is in this repo and part of the BioPerl distribution.

There was also a bin/bp_genbank2gff which was moved to the Bio-DB-GFF distributio.

cjfields commented 2 years ago

Apologies to @krobison13 about the wait, but all of us 'old-timers' are pretty time constrained these days.

Coming back around to this, I think we should deprecate Bio::Tools::GFF particularly if there are better options, but we should definitely point in the right direction regardless what we decide.

bioperl / bioperl-live

Bio::Tools::GFF doesn't write eukaryotic multi-exon genes correctly #369