GMOD / Apollo

Genome annotation editor with a Java server backend and a JavaScript client that runs in a web browser as a JBrowse plugin.
http://genomearchitect.readthedocs.io/

Genome insertion triggers CDS recalculation on non-coding features #30

Closed childers closed 9 years ago

childers commented 10 years ago

If a sequence insertion is made on the exon for a feature, Web Apollo calculates the longest CDS. This is a problem for non-coding features, which are not supposed to have CDS features. Here is an example:

Scaffold1   WebApollo   gene    870322  875660  .   +   .   Name=73DAC4E888056EE3754995061F15375C;
Scaffold1   WebApollo   tRNA    870322  875660  .   +   .   Name=LdecTmpM001045-RA;
Scaffold1   WebApollo   exon    871543  871777  .   +   .   Name=C3C416C288FA09E123B0F0F225FC685E;
Scaffold1   WebApollo   exon    870322  870366  .   +   .   Name=5B5C479A0399008A64FA8D01EF3813CE;
Scaffold1   WebApollo   exon    873246  873730  .   +   .   Name=8835B8B0B924A447756DFBD92D945E67;
Scaffold1   WebApollo   exon    875510  875660  .   +   .   Name=8DB97ABB2C1623E701E05FEE5B4E454F;

After adding insertion:

Scaffold1   WebApollo   gene    870322  875660  .   +   .   Name=73DAC4E888056EE3754995061F15375C;
Scaffold1   WebApollo   tRNA    870322  875660  .   +   .   Name=LdecTmpM001045-RA;
Scaffold1   WebApollo   exon    871543  871777  .   +   .   Name=C3C416C288FA09E123B0F0F225FC685E;
Scaffold1   WebApollo   exon    870322  870366  .   +   .   Name=5B5C479A0399008A64FA8D01EF3813CE;
Scaffold1   WebApollo   CDS     870329  870366  .   +   0   Name=BE5E8F6277CB18AEDD3684C093B5E99C-CDS;
Scaffold1   WebApollo   CDS     871543  871777  .   +   2   Name=BE5E8F6277CB18AEDD3684C093B5E99C-CDS;
Scaffold1   WebApollo   CDS     873246  873730  .   +   0   Name=BE5E8F6277CB18AEDD3684C093B5E99C-CDS;
Scaffold1   WebApollo   CDS     875510  875625  .   +   2   Name=BE5E8F6277CB18AEDD3684C093B5E99C-CDS;
Scaffold1   WebApollo   exon    873246  873730  .   +   .   Name=8835B8B0B924A447756DFBD92D945E67;
Scaffold1   WebApollo   exon    875510  875660  .   +   .   Name=8DB97ABB2C1623E701E05FEE5B4E454F;
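The difference between the two listings can be checked mechanically. The following is only a minimal sketch, not part of Web Apollo: the function names and the list of non-coding types are assumptions, and because the rows above carry no Parent attributes, it assumes all rows passed in belong to a single gene model.

```python
# Assumed (not exhaustive) set of non-coding transcript types.
NON_CODING_TYPES = {"tRNA", "ncRNA", "snRNA", "miRNA", "rRNA", "snoRNA"}

BEFORE = """\
Scaffold1   WebApollo   gene    870322  875660  .   +   .   Name=73DAC4E888056EE3754995061F15375C;
Scaffold1   WebApollo   tRNA    870322  875660  .   +   .   Name=LdecTmpM001045-RA;
Scaffold1   WebApollo   exon    870322  870366  .   +   .   Name=5B5C479A0399008A64FA8D01EF3813CE;
"""

AFTER = BEFORE + """\
Scaffold1   WebApollo   CDS 870329  870366  .   +   0   Name=BE5E8F6277CB18AEDD3684C093B5E99C-CDS;
Scaffold1   WebApollo   CDS 871543  871777  .   +   2   Name=BE5E8F6277CB18AEDD3684C093B5E99C-CDS;
"""


def parse_rows(gff_text):
    """Split whitespace-separated GFF3-style rows into small dicts."""
    rows = []
    for line in gff_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split(None, 8)  # tolerate tabs or runs of spaces
        if len(cols) < 9:
            continue
        rows.append({
            "type": cols[2],
            "start": int(cols[3]),
            "end": int(cols[4]),
            "attrs": cols[8],
        })
    return rows


def unexpected_cds(gff_text):
    """Return CDS rows found in a model whose transcript is non-coding.

    Assumes every row in gff_text belongs to the same gene model.
    """
    rows = parse_rows(gff_text)
    types = {row["type"] for row in rows}
    if "mRNA" in types or not (types & NON_CODING_TYPES):
        return []
    return [row for row in rows if row["type"] == "CDS"]


print(len(unexpected_cds(BEFORE)))  # 0
print(len(unexpected_cds(AFTER)))   # 2
```

A check like this could serve as a quick regression test on exported GFF3, though a real test would need to group rows by their parent transcript.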

childers commented 10 years ago

Oops, left out the insertion:

Scaffold1   WebApollo   insertion   875615  875615  .   +   .   Name=024CEEAB2AE0C4F9AE1490815C67FDC3;

FASTA

>024CEEAB2AE0C4F9AE1490815C67FDC3
AAAAAAAAAAAAAAAAACCCCCCCCC

monicacecilia commented 9 years ago

@nathandunn Expected output should be "same as above", not a non-coding feature (tRNA in this case) with CDSs.

nathandunn commented 9 years ago

Interesting, when you delete the insertion, everything else stays.

Also, if the insertion precedes the annotation you don't get the extra CDS calculations.

monicacecilia commented 9 years ago

just fyi -- and I haven't looked to see if it applies here -- currently, sequence alterations cannot overlap each other.

nathandunn commented 9 years ago

@monicacecilia Do we only want to calculate for "mRNA" transcripts, or other ones, as well?

nathandunn commented 9 years ago

@monicacecilia It is calculating the CDS when setting the longest ORF. I am now telling it to do this only if the transcript is of type mRNA. I'm not sure whether snRNA, miRNA, etc. should also be included in the calculation, or whether we should exclude tRNA, ncRNA, etc.

selewis commented 9 years ago

yes, also exclude tRNA (any kind of ncRNA, in other words). Only protein-coding transcripts should have the CDS calculated.

p.s. it would be difficult to allow genomic sequence alterations to overlap, because then we'd have to know the order in which to apply the alterations. However, when/if we reuse this code for indicating natural variation/alterations, we'll need to allow overlaps coming from different individuals, even if the alterations for each individual are non-overlapping... (need to draw a picture)

-S
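The ordering concern in the p.s. can be illustrated with a small sketch. This is not Apollo code: the helper names are hypothetical, and the convention that an insertion at position `pos` goes to the right of that 1-based base is an assumption.

```python
def apply_insertion(seq, pos, residues):
    """Insert residues to the right of 1-based position pos (assumed convention)."""
    return seq[:pos] + residues + seq[pos:]


def has_overlap(intervals):
    """True if any two inclusive (start, end) intervals share a position.

    After sorting by start, any overlap implies that some adjacent pair overlaps,
    so checking neighbours is enough to detect whether an overlap exists.
    """
    ordered = sorted(intervals)
    return any(b[0] <= a[1] for a, b in zip(ordered, ordered[1:]))


genome = "ACGT"
# Two insertions at the same original coordinate.
a_then_b = apply_insertion(apply_insertion(genome, 2, "GG"), 2, "TT")
b_then_a = apply_insertion(apply_insertion(genome, 2, "TT"), 2, "GG")

print(a_then_b)  # ACTTGGGT
print(b_then_a)  # ACGGTTGT
print(has_overlap([(2, 2), (2, 2)]))  # True
print(has_overlap([(1, 3), (4, 6)]))  # False
```

The two application orders give different sequences, which is why overlapping alterations would need an explicit ordering, while rejecting overlaps up front avoids the question.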

nathandunn commented 9 years ago

@cmdcolin @childers Sorry, I should have fixed this during the break, as it was failing the regression tests. The original code was correct: we want to calculate the CDS if it is a coding transcript. The original fix should do this. I just need to create a test that handles both cases. Unfortunately, the default is "transcript", which should not be encoding.

selewis commented 9 years ago

to be finicky

"transcript" is not necessarily encoding.

if it is a transcript of type mRNA (a subclass of transcript in SO) then it is an encoding transcript.

the transcripts of protein coding genes should properly be typed using the sub-class mRNA

nathandunn commented 9 years ago

You are exactly correct. “Transcript” is actually used with pseudogenes, and mRNA (at least in our current system) has no subclasses.

I’m pretty sure this implementation is correct, but I will need others to test. Once I get the rest of the 1.0.3 bugs sorted out, I’ll put it up for testing.

Nathan

monicacecilia commented 9 years ago

works now. Thanks!

nathandunn commented 9 years ago

@monicacecilia I want to retest this in 2.0.0. I looked and it appears that we have the correct code in there, but I want some separate eyes to verify.

monicacecilia commented 9 years ago

@nathandunn You are correct, genomic insertions are no longer causing the generation of a CDS in tRNA features. However, while investigating this issue, I found a few other things. Please see #263

nathandunn commented 9 years ago

@deepakunni3 Could this be relevant?

deepakunni3 commented 9 years ago

@nathandunn This bug doesn't occur in Apollo 2.0

The rationale holds for recalculating CDS only when the transcript is an instance of mRNA.
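As a rough illustration of that rule, here is a minimal sketch. It is not the Apollo implementation: the `Transcript` class, the function names, and the stand-in ORF finder are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Transcript:
    name: str
    so_type: str                     # Sequence Ontology type, e.g. "mRNA" or "tRNA"
    cds: list = field(default_factory=list)


def is_protein_coding(transcript):
    """Only transcripts typed as mRNA are treated as protein coding."""
    return transcript.so_type == "mRNA"


def on_sequence_alteration(transcripts, compute_longest_orf):
    """Recalculate the CDS after a sequence alteration, for mRNA only.

    compute_longest_orf is a caller-supplied function (hypothetical here)
    that returns a list of (start, end) CDS segments for a transcript.
    """
    for transcript in transcripts:
        if is_protein_coding(transcript):
            transcript.cds = compute_longest_orf(transcript)
        # Non-coding transcripts (tRNA, ncRNA, ...) are left untouched.


def fake_orf_finder(transcript):
    # Stand-in for a real longest-ORF calculation.
    return [(870329, 870366)]


trna = Transcript("LdecTmpM001045-RA", "tRNA")
mrna = Transcript("example-RA", "mRNA")
on_sequence_alteration([trna, mrna], fake_orf_finder)

print(trna.cds)  # []
print(mrna.cds)  # [(870329, 870366)]
```

Keeping the type check in one place means the generic "transcript" type and pseudogene transcripts are excluded as well, since only an exact mRNA type passes.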