chapmanb / bcbb

Incubator for useful bioinformatics code, primarily in Python and R
http://bcbio.wordpress.com

Trans-spliced genes cause ValueError: Did not find remapped ID location #91

Closed sjackman closed 9 years ago

sjackman commented 9 years ago

Hi, Brad. I've annotated a trans-spliced gene according to this recommendation—look for Trans-spliced transcript. The gene rps12 is composed of two gene features, two mRNA features with the same ID and each with two parents (the two genes), three exons, three CDS and one intron (the cis-spliced intron). This situation causes the following error:

❯❯❯ gff_to_genbank.py pg29-plastid-manual.gff pg29-plastid-manual.fa
…
ValueError: Did not find remapped ID location: gene84, [[112441, 113241]], [9558, 9672]

So, any hope to support trans-splicing? Thanks!

The annotation looks like this:

1   manual  gene    9559    9672    .   +   .   ID=gene83;Name=rps12|lcl|NC_021456.1_cdsid_YP_008082803.1_8-gene;exception=trans-splicing
1   manual  gene    112442  113241  .   +   .   ID=gene84;Name=rps12|lcl|NC_021456.1_cdsid_YP_008082803.1_8-gene;exception=trans-splicing
1   manual  mRNA    9559    9672    .   +   .   ID=mRNA43;Parent=gene83,gene84;Name=rps12|lcl|NC_021456.1_cdsid_YP_008082803.1_8;exception=trans-splicing
1   manual  mRNA    112442  113241  .   +   .   ID=mRNA43;Parent=gene83,gene84;Name=rps12|lcl|NC_021456.1_cdsid_YP_008082803.1_8;exception=trans-splicing
1   manual  exon    9559    9672    .   +   .   Parent=mRNA43
1   manual  CDS 9559    9672    .   +   0   Parent=mRNA43
1   manual  exon    112442  112673  .   +   .   Parent=mRNA43
1   manual  CDS 112442  112673  .   +   0   Parent=mRNA43
1   manual  intron  112674  113215  .   +   .   Parent=mRNA43
1   manual  exon    113216  113241  .   +   .   Parent=mRNA43
1   manual  CDS 113216  113241  .   +   2   Parent=mRNA43

chapmanb commented 9 years ago

Shaun; Apologies for the long delay in looking at this. I'm getting badly out of practice at looking into complex GFFs. I'll admit here I don't totally understand the ID mapping in this case. This mRNA has coordinates from 112442 to 113241:

1   manual  mRNA    112442  113241  .   +   .   ID=mRNA43;Parent=gene83,gene84

yet maps back to this parent, which has coordinates from 9559 to 9672:

1   manual  gene    9559    9672    .   +   .   ID=gene83

I had a sanity check for proper coordinate mapping, which this triggered. I've been thinking totally top down in terms of nested features, but I guess in this case the mRNA is more of the primary feature since it fuses from the parents.

Despite my not fully grasping complicated nested associations, the parent mapping is unambiguous in this case, so we can return two fused genes with the children and folks can choose which one they like downstream.
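For readers following along, the kind of coordinate sanity check described above can be sketched roughly as follows. This is a minimal illustration, not the actual BCBio.GFF code: the function name `child_within_parent` is hypothetical, and it assumes the error message lists the parent ID, the parent's locations, and the child location as 0-based start/end pairs.

```python
def child_within_parent(child, parent_locs):
    """Return True if the child interval lies inside any parent interval.

    Hypothetical helper for illustration only; intervals are (start, end) pairs.
    """
    c_start, c_end = child
    return any(p_start <= c_start and c_end <= p_end
               for p_start, p_end in parent_locs)


# Values copied from the ValueError reported in this issue.
gene84_locs = [[112441, 113241]]
first_mrna43_part = [9558, 9672]
second_mrna43_part = [112441, 113241]

# The first mRNA43 record lies far outside gene84, so a strict
# top-down containment check rejects it.
print(child_within_parent(first_mrna43_part, gene84_locs))   # False
# The second mRNA43 record matches gene84 exactly and passes.
print(child_within_parent(second_mrna43_part, gene84_locs))  # True
```

With a trans-spliced transcript, each part of the mRNA sits inside only one of its two parents, so a check that requires every child record to fit inside every listed parent would be expected to fail on this input.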

I pushed a v0.6 release which handles this case and no longer chokes on it. Thanks again for the report and sorry for being so slow at fixing it.

sjackman commented 9 years ago

Sorry I've taken so long to respond to your message (three months!). Your fix worked beautifully. Thanks!

Note that the mRNA that you pointed out as not overlapping its parent features has two components. Because the IDs of these two records are identical, they in fact represent one feature that is "split across two discontinuous genomic locations". See http://www.sequenceontology.org/gff3.shtml

1   manual  mRNA    9559    9672    .   +   .   ID=mRNA43;Parent=gene83,gene84;Name=rps12|lcl|NC_021456.1_cdsid_YP_008082803.1_8;exception=trans-splicing
1   manual  mRNA    112442  113241  .   +   .   ID=mRNA43;Parent=gene83,gene84;Name=rps12|lcl|NC_021456.1_cdsid_YP_008082803.1_8;exception=trans-splicing
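The "same ID means one feature" rule can be illustrated with a small standard-library sketch that groups GFF3 records by their ID attribute. It is only an illustration under stated assumptions: `parse_attributes` and `group_by_id` are hypothetical helpers, not part of BCBio.GFF, the attribute parsing ignores GFF3 URL-escaping, and the two records are shortened copies of the mRNA43 lines above.

```python
from collections import defaultdict

# Shortened copies of the two mRNA43 records, tab-separated as in GFF3.
GFF_LINES = """\
1\tmanual\tmRNA\t9559\t9672\t.\t+\t.\tID=mRNA43;Parent=gene83,gene84
1\tmanual\tmRNA\t112442\t113241\t.\t+\t.\tID=mRNA43;Parent=gene83,gene84
"""


def parse_attributes(column9):
    """Parse 'key=v1,v2;key2=v3' into a dict of lists (no URL-unescaping)."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attrs[key] = value.split(",")
    return attrs


def group_by_id(gff_text):
    """Collect the (start, end) of every record sharing an ID into one list."""
    parts = defaultdict(list)
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        attrs = parse_attributes(cols[8])
        if "ID" not in attrs:
            continue  # records without an ID cannot be merged this way
        parts[attrs["ID"][0]].append((int(cols[3]), int(cols[4])))
    return {feature_id: sorted(locs) for feature_id, locs in parts.items()}


print(group_by_id(GFF_LINES))
# {'mRNA43': [(9559, 9672), (112442, 113241)]}
```

Seen this way, mRNA43 is a single feature with two locations, one inside gene83 and one inside gene84.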

chapmanb commented 9 years ago

Shaun; Thanks for confirming this worked, and for the explanation. That makes a lot of sense now; I think the two-parent mapping was confusing me. Really glad it does what you need it to do.

sjackman commented 9 years ago

I see references to this mythical CompoundFeatureLocation, but can't find it anywhere. Does it exist? If it existed, BCBio.GFF could combine these two records of mRNA43 into a single CompoundFeatureLocation. https://github.com/biopython/biopython/search?q=CompoundFeatureLocation

sjackman commented 9 years ago

Trans-splicing definitely pulls out all the weird features of GFF.