TAMU-CPT / training-material

A collection of Galaxy-related training material
https://training.galaxyproject.org

GTN format tutorial for finding intron-containing genes #43

Closed jrr-cpt closed 4 years ago

jrr-cpt commented 5 years ago

to be continued...

jrr-cpt commented 5 years ago

What we have right now on intron-containing genes is in the Annotation in Apollo tutorial. This needs to be beefed up with material from @jasonjgill's lecture slides. The existing material is more of a tutorial on how to interpret a track that is already present from the functional workflow.

jrr-cpt commented 4 years ago

Re-orient this idea to all the types of interrupted genes that the intron tool may detect:

  1. Intron
  2. Frameshift
  3. Separated gene calls resulting from low-quality long sequence reads

jrr-cpt commented 4 years ago

There is already a branch for this, but it does not appear to have been used to generate the tutorial. Start a fresh branch when beginning to avoid merge conflicts.

Title: Finding Interrupted Genes
Directory: finding-interrupted-genes

Agenda:

Intro: Refer to the Functional workflow for when the tool is typically run, and link to the Annotation in Apollo tutorial. Describe how the tool works and link to it (the help text will explain some). Pull some material from Jason's lecture slides; JRR can supply them.

Use cases: For each of the three sections, give a screenshot of what it would look like in Apollo (JRR/Mei can help supply). Also show noise, i.e. hits that are not real interrupted genes.

Note: Link to the NCBI guidelines on handling these features in a GenBank/BankIt submission (Mei has the link).

meiliuCPT commented 4 years ago

Intron interrupted genes

Based on the tool output, the protein alignments to the target proteins in the database need to be carefully verified to determine the boundaries of the interrupted exons.
(1) When the exon boundaries can be identified (based on the alignment positions to the known proteins), drag the edges of the gene features to set the exon boundaries. The SD sequence can easily be deleted from the second or third exon. Merge the exons together (select them by clicking while holding down “shift”, then right-click and select “merge”). If needed, set the first base of the fused gene as the translation start (right-click on the first base, click translation start) and the last base of the fused gene as the translation end (right-click on the last base, click translation end). Check the accuracy of the fused protein sequence. Using an intron-interrupted terminase as an example, the fused gene will be annotated as “Terminase large subunit”, with a note stating “contains introns with known boundaries”.

image
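
For checking the accuracy of the fused protein sequence outside Apollo, something like the following Python sketch may help. It is only an illustration: the function names (`splice_exons`, `translate`, `check_fused_cds`), the toy sequence, and the exon coordinates are all made up, and it assumes forward-strand exons with 1-based inclusive coordinates and the standard genetic code.

```python
from itertools import product

# Standard genetic code, listed in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def splice_exons(genome, exons):
    """Join exon sequences in genomic order.

    exons: list of (start, end), 1-based inclusive, forward strand only
    (an assumption of this sketch).
    """
    return "".join(genome[start - 1:end] for start, end in sorted(exons))


def translate(cds):
    """Translate complete codons; unknown codons become 'X'."""
    usable = len(cds) - len(cds) % 3
    return "".join(
        CODON_TABLE.get(cds[i:i + 3].upper(), "X") for i in range(0, usable, 3)
    )


def check_fused_cds(cds):
    """Return a list of problems found in the fused coding sequence."""
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3")
    protein = translate(cds)
    if not protein.endswith("*"):
        problems.append("no terminal stop codon")
    if "*" in protein[:-1]:
        problems.append("internal stop codon")
    return problems


# Toy example: two exons separated by a 9-base "intron".
genome = "ATGAAA" + "GTAAGTTTT" + "CCCTAA"
fused = splice_exons(genome, [(1, 6), (16, 21)])
print(translate(fused), check_fused_cds(fused))
```

An internal stop codon or a length that is not a multiple of 3 after merging suggests that one of the exon boundaries was set at the wrong position.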

(2) When it is not possible to identify the exon boundaries, the exons cannot be merged (because the intron splice sites are not known, the merged sequence would not be complete). NCBI does not accept keeping the exons as separate intron-truncated CDS fragments (the result of an interrupted gene), so the whole region needs to be annotated as one gene, with a note indicating that the coding boundaries of this gene are not determined. In the example below, where the exon boundaries cannot be determined, the three exons can NOT be annotated as CDS. Instead, the three coding genes need to be DELETED, and a new gene needs to be created that spans from the start of the first gene to the end of the last gene. This gene feature needs to be created outside Apollo (either in the 5-column table submitted to GenBank or in software such as Artemis), as it cannot be promoted from any of the gene call tracks. The new gene will not have an associated CDS and will carry a note stating "coding region spans undetermined; unable to determine intron boundaries".

image
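
As a rough illustration of what the spanning gene could look like in the 5-column feature table, here is a small Python sketch. The function name `spanning_gene_feature`, the sequence ID, and the coordinates are hypothetical, and the sketch only handles forward-strand fragments; check the output against the NCBI guidelines before submitting anything.

```python
def spanning_gene_feature(seq_id, fragments, note):
    """Build a feature-table entry for one gene covering all fragments.

    fragments: list of (start, end), 1-based, forward strand only
    (an assumption of this sketch). No CDS is written, only the gene
    feature and its note qualifier.
    """
    start = min(s for s, _ in fragments)
    end = max(e for _, e in fragments)
    lines = [
        f">Feature {seq_id}",
        f"{start}\t{end}\tgene",    # columns 1-3: start, stop, feature key
        f"\t\t\tnote\t{note}",      # columns 4-5: qualifier key, value
    ]
    return "\n".join(lines) + "\n"


# Hypothetical coordinates for three deleted gene calls.
table = spanning_gene_feature(
    "my_phage",
    [(1000, 1450), (1800, 2100), (2500, 3200)],
    "coding region spans undetermined; unable to determine intron boundaries",
)
print(table)
```

The three original gene calls contribute only their outermost coordinates; none of them is kept as a CDS.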

meiliuCPT commented 4 years ago

@ltmaddox

Here is the complete tutorial https://docs.google.com/document/d/1rYXQ6CKljDvjihcV_7Jlyub6uvTFDQsdcoR1d9lksic/edit

ltmaddox commented 4 years ago

@meiliuCPT Thank you!