GMOD / Apollo

Genome annotation editor with a Java Server backend and a Javascript client that runs in a web browser as a JBrowse plugin.
http://genomearchitect.readthedocs.io/

import script should import sequence alterations and read_through_stop_codons #1514

Closed nathandunn closed 7 years ago

nathandunn commented 7 years ago

add_transcripts_from_gff3_to_annotations.pl should:

FYI @deepakunni3 @monicacecilia @mpoelchau

mpoelchau commented 7 years ago

A note on stop_codon_read_through features - I had a discussion with Nathan and Deepak on why it's necessary to maintain these when importing annotations to Apollo2. The main question was - why is the stop_codon_read_through feature necessary after it's instantiated, if the primary function of this feature is to correct the CDS feature coordinates (similar to modifying the translation start or stop)? The reason we'd like them to be maintained is that this feature serves an additional function beyond recalculating the CDS, namely as an indication to 1) other curators and 2) us that the protein sequence does not terminate at the area of the stop_codon_read_through feature, even if the CDS coordinates may say otherwise. (Also - we can't assume that the CDS coordinates are always correct, since we're importing 'legacy' annotations from Apollo1 that still may be affected by issue #55.) Without this extra cue, there may not be any remaining evidence that the curator added the stop_codon_read_through (or if the curator added a Note, it may be difficult to extract programmatically). This will confuse both us and other curators.

nathandunn commented 7 years ago

From @deepakunni3

Currently if an annotator sets 'translation start' and / or 'translation end', the CDS is modified to reflect this choice but there is no additional metadata associated with the annotation itself. Thus, at the time of GFF3 export the annotation will have the modified CDS but no explicit indication of 'set translation start' or 'set translation end'.

In Apollo 1, the AnnotationEditor::setManuallySetTranslationStart method (https://github.com/GMOD/Apollo/blob/1.0/src/main/java/org/bbop/apollo/editor/AnnotationEditor.java#L1322) sets the translation start and also adds a comment to the CDS feature indicating that the start was set manually. The same applies when setting the translation end.

These comments are exported into column 9 of the CDS record in the GFF3.

So it looks like the preservation of this info was lost in the transition from Apollo 1.0.x to Apollo 2.

nathandunn commented 7 years ago

@deepakunni3 I think that you are right in that it is inconsistent with the set readthrough stop codon behavior.

However, I think the difference is that that behavior is specifically addressed as a visible annotation in @mpoelchau 's workflow that indicates a specific justification for creating a modification to an annotation. Since we are preserving the CDS location, regardless of how it is set, I'm not sure if it matters.

Of course, what would be best would be to have a notation (a pin icon!) for setting translation regions, and to create a notation for each such piece of annotation information, but I'm unsure what cases would justify that.

deepakunni3 commented 7 years ago

'Set translation start', 'Set translation end', 'Set read through stop codon' - All of these operations indicate user intervention on CDS calculated by Apollo. Thus keeping track of these actions in the database and in exports will be necessary.

nathandunn commented 7 years ago

@deepakunni3 I had some more questions.

I agree that for consistency it very much makes sense. I'm unsure if it will be necessary to address this case, but might be worth doing to maintain that consistency.

Questions:

  1. Are you advocating adding an additional feature relationship (right now I think it's a feature property) to support the translation start and end?
  2. Are you also suggesting providing a visual cue to the user?

deepakunni3 commented 7 years ago

@nathandunn

Yes, for consistency.

Answers:

  1. Right now there is no feature property. We can add a feature property / comment (as it was done in Apollo 1.x) to CDS when translation start and end is set manually.
  2. We can provide a visual cue as well, but will have to decide how to render it:
     a. Either as a block, the way 'stop codon read throughs' are shown on a transcript feature
     b. Or, as you suggested earlier, as pins that indicate manual setting of the CDS start and / or end

deepakunni3 commented 7 years ago

@nathandunn And the issue of 'set translation start' and 'set translation end' can be made into a separate issue since it doesn't affect anything related to the importing and exporting of stop codon read through.

mpoelchau commented 7 years ago

Thanks for fixing this, @deepakunni3 and @nathandunn! We pulled and deployed off master (latest commit be12c7ae03fd11e30a4f4487e8ee896858867c0e), and I tested add_features_from_gff3_to_annotations.pl on our data, and it works great for the stop_codon_read_through features. However, I'm still having trouble with importing substitutions - perhaps I'm doing something wrong?

In the gff, I have the value of the substitution as an attribute (e.g. seq=A), and I included the 'FASTA' section that Apollo1 usually exports at the end of the gff3 file when substitutions or insertions are present.

Command used: perl add_features_from_gff3_to_annotations.pl -o 'Blattella germanica' -i blager_substitutions.gff -u stuff -p stuff -U https://apollo-dev.nal.usda.gov/apollo -X -a -g substitution -G substitution

The substitution imports to the UcA, but the substitution value in the UcA is 'undefined'.

[image: sub]

Any ideas what I am doing wrong?

nathandunn commented 7 years ago

@mpoelchau If you define a substitution, export it as GFF3, and re-import it, does it work?

If so (that's what I'd tested), does that GFF3 match the GFF3 you were creating the substitution on?

My intuition is that there is a difference in the headers in Column 9 causing the problem.

deepakunni3 commented 7 years ago

@mpoelchau Could you also share the GFF3 from Apollo 1 that you used as input to the script?

mpoelchau commented 7 years ago

I tried it your way - on our end insertion works, but substitution doesn't.

Created substitution and insertion in Apollo2:

[image: sub_in]

Exported gff of reference sequence:

##gff-version 3
##sequence-region Scaffold398 1 947682
Scaffold398 . substitution 556426 556426 . + . ID=0d174ba7-7006-481f-8730-7798e7f7dc41;residues=G;seq=G
Scaffold398 . insertion 556414 556413 . + . ID=17f24137-5278-4a6b-bb77-a130838bb51f;residues=AGCGC;seq=AGCGC

Cleared UcA, then uploaded exported gff to UcA:

groovy delete_annotations_from_organism.groovy -adminusername stuff -adminpassword stuff -destinationurl https://apollo-dev.nal.usda.gov/apollo -organismname 'Blattella germanica'

perl add_features_from_gff3_to_annotations.pl -o 'Blattella germanica' -i Reference\ sequence-Scaffold398-556336..556492.gff3 -u stuff -p stuff -U https://apollo-dev.nal.usda.gov/apollo -X -a -g substitution -G substitution

Stderr:

Use of uninitialized value $transcript_type in pattern match (m//) at add_features_from_gff3_to_annotations.pl line 792, line 5.
Use of uninitialized value $transcript_type in pattern match (m//) at add_features_from_gff3_to_annotations.pl line 796, line 5.
Use of uninitialized value $type in pattern match (m//) at add_features_from_gff3_to_annotations.pl line 645, line 5.
Use of uninitialized value $type in pattern match (m//) at add_features_from_gff3_to_annotations.pl line 649, line 5.
Processing Scaffold398
Processing chunk 1: https://apollo-dev.nal.usda.gov/apollo/annotationEditor/addFeature success
Processing Scaffold398
Processing chunk 1: https://apollo-dev.nal.usda.gov/apollo/annotationEditor/addSequenceAlteration success

View of import in Apollo2:

[image: sub_in_2]

deepakunni3 commented 7 years ago

For the issue of exporting sequence alterations from Apollo 2 and reimporting it back, the following command is incorrect:

perl add_features_from_gff3_to_annotations.pl -o 'Blattella germanica' -i Reference\ sequence-Scaffold398-556336..556492.gff3 -u stuff -p stuff -U https://apollo-dev.nal.usda.gov/apollo -X -a -g substitution -G substitution

The -g and -G arguments specify the type name for gene features.

To import sequence alterations you needn't specify the type on the command line. i.e.,

perl add_features_from_gff3_to_annotations.pl -o 'Blattella germanica' -i Reference\ sequence-Scaffold398-556336..556492.gff3 -u stuff -p stuff -U https://apollo-dev.nal.usda.gov/apollo -X -a

The script is capable of identifying insertion, deletion and substitution and processing them accordingly.
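
To illustrate the idea of routing by feature type, here is a minimal Python sketch. It is not the logic of add_features_from_gff3_to_annotations.pl; the function names are made up for this example, and the only details taken from this thread are the three type names and the two endpoint names (addFeature, addSequenceAlteration) that appear in the stderr output above. It assumes tab-separated columns, as the GFF3 format requires.

```python
# Illustrative only: NOT the import script's actual code.
# Type names and endpoint names come from this thread; everything else is assumed.
SEQUENCE_ALTERATION_TYPES = {"insertion", "deletion", "substitution"}


def parse_attributes(column9):
    """Split a GFF3 attribute column ("key=value;key=value") into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def route_feature(gff3_line):
    """Return (endpoint, feature_type, attributes) for one GFF3 feature line."""
    cols = gff3_line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated GFF3 columns")
    feature_type = cols[2]
    if feature_type in SEQUENCE_ALTERATION_TYPES:
        endpoint = "addSequenceAlteration"
    else:
        endpoint = "addFeature"
    return endpoint, feature_type, parse_attributes(cols[8])


# The substitution line from the Apollo 2 export shown earlier in this thread.
line = "\t".join([
    "Scaffold398", ".", "substitution", "556426", "556426", ".", "+", ".",
    "ID=0d174ba7-7006-481f-8730-7798e7f7dc41;residues=G;seq=G",
])
endpoint, feature_type, attrs = route_feature(line)
print(endpoint, feature_type, attrs["residues"])
# prints: addSequenceAlteration substitution G
```

Because the type is read from column 3 of each line, no command-line flag is needed to mark a line as a sequence alteration.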

mpoelchau commented 7 years ago

Great, that did it! Thanks!

deepakunni3 commented 7 years ago

@mpoelchau As for the import of substitutions from Apollo 1 to Apollo 2, the quickest solution would be to run a sed command on the GFF3 from Apollo 1:

sed 's/seq=/residues=/g' Apollo1.gff3 > Apollo1_edited.gff3

This should get rid of the undefined problem that you see after you import substitutions from Apollo 1 to Apollo 2.
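
If a blanket sed substitution feels too broad, a narrower variant can be sketched in Python. This is only a sketch under assumptions: it assumes tab-separated GFF3 columns and a `##FASTA` directive marking the start of any FASTA section, and the function name is made up. It renames the `seq` key only when it is an attribute key in column 9, leaves comment and FASTA lines alone, and skips lines that already carry `residues=` (as the Apollo 2 export above does).

```python
import re

# Match "seq" / "residues" only as an attribute key: at the start of
# column 9 or immediately after a ";" separator.
SEQ_KEY = re.compile(r"(^|;)seq=")
RESIDUES_KEY = re.compile(r"(^|;)residues=")


def rename_seq_to_residues(lines):
    """Yield GFF3 lines with the seq= attribute renamed to residues=."""
    in_fasta = False
    for line in lines:
        if line.startswith("##FASTA"):
            in_fasta = True
        cols = line.rstrip("\n").split("\t")
        # Pass through directives, comments, FASTA content and malformed lines.
        if in_fasta or line.startswith("#") or len(cols) != 9:
            yield line
            continue
        # Avoid creating a duplicate residues= key.
        if not RESIDUES_KEY.search(cols[8]):
            cols[8] = SEQ_KEY.sub(r"\1residues=", cols[8])
        yield "\t".join(cols) + "\n"


# Hypothetical input: the ID "sub1" is invented for this example.
example = [
    "##gff-version 3\n",
    "Scaffold398\t.\tsubstitution\t556426\t556426\t.\t+\t.\tID=sub1;seq=A\n",
]
converted = list(rename_seq_to_residues(example))
print(converted[1], end="")
```

To convert a file, the same generator can be fed an open file handle and its output written with `writelines`.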

nathandunn commented 7 years ago

@mpoelchau ,

@deepakunni3 and I talked, and he is going to update the code to accept seq= on import as well, so you don't have to run sed on all of your files (I assume you'll have a few of them).
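
What accepting either key could look like on the reading side, as a hypothetical Python sketch (the actual change to the import script is not shown in this thread, and the function name here is invented): prefer `residues`, and fall back to `seq` when `residues` is missing or empty.

```python
def alteration_residues(attrs):
    """Return the residues of a sequence alteration, or None if absent.

    attrs is a dict of GFF3 column 9 attributes. "residues" takes
    precedence; "seq" is used only as a fallback.
    """
    for key in ("residues", "seq"):
        value = attrs.get(key)
        if value:
            return value
    return None


print(alteration_residues({"ID": "example", "seq": "A"}))
# prints: A
```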

mpoelchau commented 7 years ago

In our hands, Apollo1 exports the substitution and insertion values as a FASTA section at the end of the gff, and doesn't add attributes containing the value in the gff line (using seq= or residues=). If there's a way to change the Apollo1 config to allow the value to be exported in the gff line, please let me know! If not though, not a big deal - we were just going to post-process the exported gff to get the substitutions and insertions in the format needed for import.

nathandunn commented 7 years ago

I don't think there is anything in Apollo 1 to add that, as I think putting it in Apollo 2 was a bug fix. Maybe @deepakunni3 knows more. It sounds like we don't need to do anything else then.

deepakunni3 commented 7 years ago

@nathandunn Yes, this was a bug in Apollo 1, which has been fixed.

@mpoelchau The easiest solution would be to parse the GFF3 output for import into Apollo 2.
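
A rough Python sketch of such post-processing follows. It rests on an assumption that is not confirmed anywhere in this thread: that each FASTA record at the end of the Apollo 1 GFF3 is headed by the `ID` of the substitution or insertion it belongs to. If the headers are keyed differently, the lookup would need to change. Function names and the sample ID are invented for the example.

```python
ALTERATION_TYPES = {"insertion", "substitution"}


def read_fasta_section(lines):
    """Map each FASTA header (first word after '>') to its sequence.

    ASSUMPTION: headers in the ##FASTA section equal feature IDs.
    """
    sequences, current, in_fasta = {}, None, False
    for raw in lines:
        line = raw.strip()
        if line == "##FASTA":
            in_fasta = True
        elif in_fasta and line.startswith(">"):
            words = line[1:].split()
            current = words[0] if words else None
            if current is not None:
                sequences[current] = ""
        elif in_fasta and current is not None and line:
            sequences[current] += line
    return sequences


def add_residues(lines):
    """Return GFF3 lines with residues=<sequence> added to alterations."""
    lines = list(lines)
    sequences = read_fasta_section(lines)
    result, in_fasta = [], False
    for line in lines:
        if line.strip() == "##FASTA":
            in_fasta = True
        cols = line.rstrip("\n").split("\t")
        if (in_fasta or line.startswith("#") or len(cols) != 9
                or cols[2] not in ALTERATION_TYPES):
            result.append(line)
            continue
        attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
        residues = sequences.get(attrs.get("ID"))
        if residues and "residues" not in attrs:
            cols[8] = cols[8].rstrip(";") + ";residues=" + residues
        result.append("\t".join(cols) + "\n")
    return result


# Hypothetical Apollo 1 style input; the ID "sub1" is invented.
apollo1 = [
    "##gff-version 3\n",
    "Scaffold398\t.\tsubstitution\t556426\t556426\t.\t+\t.\tID=sub1\n",
    "##FASTA\n",
    ">sub1\n",
    "A\n",
]
merged = add_residues(apollo1)
print(merged[1], end="")
```

The FASTA section is left in place here; whether it should be stripped before import is a separate choice.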