The-Sequence-Ontology / SO-Ontologies

Collection of SO Ontologies
Creative Commons Attribution 4.0 International

tRNA (SO:0000253) revisions #583

Open sjm41 opened 2 years ago

sjm41 commented 2 years ago

The current tRNA (SO:0000253) tree has children for each tRNA isotype plus 'mt_tRNA', but this arrangement has a few problems/deficiencies:

We propose adding terms that group cytosolic/mitochondrial/plastid tRNAs, with additional isotype terms added as necessary under each branch - see below. This arrangement would mirror the recent revisions to the rRNA terms (#493).

tRNA (SO:0000253)
|_cytosolic_tRNA (NEW)
    |_cytosolic_alanyl_tRNA (SO:0000254)
    |_cytosolic_arginyl_tRNA (SO:0001036)
    |_cytosolic_asparaginyl_tRNA (SO:0000256)
    |_cytosolic_aspartyl_tRNA (SO:0000257)
    |_cytosolic_cysteinyl_tRNA (SO:0000258)
    |_cytosolic_glutaminyl_tRNA (SO:0000259)
    |_cytosolic_glutamyl_tRNA (SO:0000260)
    |_cytosolic_glycyl_tRNA (SO:0000261)
    |_cytosolic_histidyl_tRNA (SO:0000262)
    |_cytosolic_initiator_methionyl_tRNA (NEW)
    |_cytosolic_isoleucyl_tRNA (SO:0000263)
    |_cytosolic_leucyl_tRNA (SO:0000264)
    |_cytosolic_lysyl_tRNA (SO:0000265)
    |_cytosolic_methionyl_tRNA (SO:0000266)
    |_cytosolic_phenylalanyl_tRNA (SO:0000267)
    |_cytosolic_prolyl_tRNA (SO:0000268)
    |_cytosolic_pyrrolysyl_tRNA (SO:0000766)
    |_cytosolic_selenocysteinyl_tRNA (SO:0005857)
    |_cytosolic_seryl_tRNA (SO:0000269)
    |_cytosolic_threonyl_tRNA (SO:0000270)
    |_cytosolic_tryptophanyl_tRNA (SO:0000271)
    |_cytosolic_tyrosyl_tRNA (SO:0000272)
    |_cytosolic_valyl_tRNA (SO:0000273)
|_mt_tRNA (SO:0002129)
    |_mt_alanyl_tRNA (NEW)
    |_mt_arginyl_tRNA (NEW)
    |_mt_asparaginyl_tRNA (NEW)
    |_mt_aspartyl_tRNA (NEW)
    |_mt_cysteinyl_tRNA (NEW)
    |_mt_glutaminyl_tRNA (NEW)
    |_mt_glutamyl_tRNA (NEW)
    |_mt_glycyl_tRNA (NEW)
    |_mt_histidyl_tRNA (NEW)
    |_mt_initiator_methionyl_tRNA (NEW)
    |_mt_isoleucyl_tRNA (NEW)
    |_mt_leucyl_tRNA (NEW)
    |_mt_lysyl_tRNA (NEW)
    |_mt_methionyl_tRNA (NEW)
    |_mt_phenylalanyl_tRNA (NEW)
    |_mt_prolyl_tRNA (NEW)
    |_mt_seryl_tRNA (NEW)
    |_mt_threonyl_tRNA (NEW)
    |_mt_tryptophanyl_tRNA (NEW)
    |_mt_tyrosyl_tRNA (NEW)
    |_mt_valyl_tRNA (NEW)
|_plastid_tRNA (NEW)
    |_plastid_alanyl_tRNA (NEW)
    |_plastid_arginyl_tRNA (NEW)
    |_plastid_asparaginyl_tRNA (NEW)
    |_plastid_aspartyl_tRNA (NEW)
    |_plastid_cysteinyl_tRNA (NEW)
    |_plastid_glutaminyl_tRNA (NEW)
    |_plastid_glutamyl_tRNA (NEW)
    |_plastid_glycyl_tRNA (NEW)
    |_plastid_histidyl_tRNA (NEW)
    |_plastid_initiator_methionyl_tRNA (NEW)
    |_plastid_isoleucyl_tRNA (NEW)
    |_plastid_leucyl_tRNA (NEW)
    |_plastid_lysyl_tRNA (NEW)
    |_plastid_methionyl_tRNA (NEW)
    |_plastid_phenylalanyl_tRNA (NEW)
    |_plastid_prolyl_tRNA (NEW)
    |_plastid_seryl_tRNA (NEW)
    |_plastid_threonyl_tRNA (NEW)
    |_plastid_tryptophanyl_tRNA (NEW)
    |_plastid_tyrosyl_tRNA (NEW)
    |_plastid_valyl_tRNA (NEW)

Note that the existing tRNA isotype terms, e.g. "alanyl_tRNA (SO:0000254)", will be clarified/defined as being 'cytosolic tRNAs' under this proposal, which we assume is the current intention of these terms (given there is an existing term for 'mt_tRNA') though this isn't currently specified. A potential problem here is if the existing isotype terms have been used to annotate mitochondrial/plastid tRNAs - but I don't see a good way around this...

New/revised definitions could be:

cytosolic_tRNA (NEW): A tRNA that functions in the cytosol.
cytosolic_tRNA subtypes (existing): Change current "A tRNA sequence that has..." in definition to say "A cytosolic tRNA that has..."
cytosolic_initiator_methionyl_tRNA (NEW): A cytosolic tRNA that has a methionine anticodon and a 3' methionine binding region that functions to decode the start codon, setting the frame for translation of the mRNA. Sequence elements and modifications distinguish it from the elongator methionyl tRNA and help it to perform its varied tasks. (PMID: 19925799)

mt_tRNA (SO:0002129): A tRNA that functions in mitochondria. (PMID: 25734984)
mt_tRNA subtypes (NEW): Mirror new 'cytosolic_tRNA' definitions, replacing "A cytosolic tRNA..." with "A mitochondrial tRNA..."

plastid_tRNA (NEW): A tRNA that functions in plastids (such as chloroplasts). (PMID: 9928487)
plastid_tRNA subtypes (NEW): Mirror new 'cytosolic_tRNA' definitions, replacing "A cytosolic tRNA..." with "A plastid tRNA..."
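The "mirror the definitions" step described above can be sketched as a simple prefix rewrite. This is only an illustration: the function name `compartment_definition` is hypothetical, and the alanyl definition string used as input is an assumed example of the "A tRNA sequence that has..." pattern, not text copied from the SO file.

```python
# Map each term-name prefix used in the proposal to the adjective
# used in the corresponding definition text.
COMPARTMENT_ADJECTIVE = {
    "cytosolic": "cytosolic",
    "mt": "mitochondrial",
    "plastid": "plastid",
}

GENERIC_START = "A tRNA sequence that has"


def compartment_definition(existing_definition: str, prefix: str) -> str:
    """Rewrite a generic isotype definition for one compartment."""
    if not existing_definition.startswith(GENERIC_START):
        raise ValueError("definition does not follow the expected pattern")
    adjective = COMPARTMENT_ADJECTIVE[prefix]
    remainder = existing_definition[len(GENERIC_START):]
    return f"A {adjective} tRNA that has{remainder}"


# Assumed example text following the quoted pattern (not taken from the SO file).
EXAMPLE = "A tRNA sequence that has an alanine anticodon, and a 3' alanine binding region."

if __name__ == "__main__":
    for prefix in COMPARTMENT_ADJECTIVE:
        print(f"{prefix}_alanyl_tRNA: {compartment_definition(EXAMPLE, prefix)}")
```

Definitions that do not start with the generic phrase raise an error rather than being silently rewritten, so any term with unusual wording would be flagged for manual review.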

Steven Marygold, Todd Lowe, Patricia Chan

Once all this is done, corresponding child terms should be added in the ncRNA-gene branch, under tRNA_gene (SO:0001272), which currently has no child terms at all. I'll make a separate ticket for that aspect.

egchristensen commented 2 years ago

So, if I understand correctly, it seems that the issue here is that SO is not representing commonly used sub-types of tRNA. My question is what role does location play in differentiating tRNAs here other than our own classification based on where we observe them? Is a tRNA transcript found in the cytosol fundamentally different from the same type of tRNA found in mitochondria? Does the mechanism that distributes these tRNA subtypes across these different contexts involve sequence features or attributes? If so, are these sequence features or attributes already in SO somewhere else? While it's tempting to classify tRNA polymers here, I'd like to explore differences in the information contained within the polymers. That will guide how we structure the ontology. Any comments @keilbeck @sjm41 @patriciaplchan?

egchristensen commented 2 years ago

@sjm41 Could you tag Todd Lowe? I can't seem to find his GitHub profile.

sjm41 commented 1 year ago

I don't think I know Todd's GitHub handle either - @patriciaplchan will know.

sjm41 commented 1 year ago

Thanks for your thoughtful reply @egchristensen!

"the issue here is that SO is not representing commonly used sub-types of tRNA"

Partly. There are two separable but related issues here:

  1. The current batch of terms is unsatisfactory/inconsistent: there are already separate terms for all individual tRNAs plus 'mt_tRNA', but it's unclear whether the individual tRNA terms are intended to be for cytosolic tRNAs (since there is no separate 'cyto_tRNA' term), or if they are intended to be more generic and cover cytosolic/mito/plastid tRNAs (in which case we'd need to add the corresponding 'cyto_tRNA' and 'plastid_tRNA' terms). This needs clarifying and making consistent one way or another.

  2. I think it would be better to represent cytosolic/mitochondrial/plastid tRNAs with specific terms, rather than having to co-annotate with two different SO terms - e.g. "alanyl_tRNA" + "mt_tRNA". Mitochondrial tRNAs are quite distinct from cytosolic tRNAs - in addition to their mito location, mito tRNAs are encoded by the mitochondrial genome, and I believe they have distinctive sequence features. Not sure about plastid tRNAs - I assume the same can be said for them. @patriciaplchan will be able to say more on this issue. As noted above, we've already divided the rRNA terms into cytosolic/mitochondrial/plastid axes for similar reasons.

sjm41 commented 1 year ago

It would probably be better not to change the names of the current SO terms (and assume that any current usage was intended to be for the cytosolic form) and instead make new terms for all the compartment-specific tRNAs:

tRNA (SO:0000253)
    |_alanyl_tRNA (SO:0000254)
        |_cytosolic_alanyl_tRNA (NEW)
        |_mt_alanyl_tRNA (NEW)
        |_plastid_alanyl_tRNA (NEW)
    |_arginyl_tRNA (SO:0001036)
        |_cytosolic_arginyl_tRNA (NEW)
        |_mt_arginyl_tRNA (NEW)
        |_plastid_arginyl_tRNA (NEW)
    |_asparaginyl_tRNA (SO:0000256)
        |_cytosolic_asparaginyl_tRNA (NEW)
        |_mt_asparaginyl_tRNA (NEW)
        |_plastid_asparaginyl_tRNA (NEW)
    |_aspartyl_tRNA (SO:0000257)
        |_cytosolic_aspartyl_tRNA (NEW)
        |_mt_aspartyl_tRNA (NEW)
        |_plastid_aspartyl_tRNA (NEW)
    |_cysteinyl_tRNA (SO:0000258)
        |_cytosolic_cysteinyl_tRNA (NEW)
        |_mt_cysteinyl_tRNA (NEW)
        |_plastid_cysteinyl_tRNA (NEW)
    |_glutaminyl_tRNA (SO:0000259)
        |_cytosolic_glutaminyl_tRNA (NEW)
        |_mt_glutaminyl_tRNA (NEW)
        |_plastid_glutaminyl_tRNA (NEW)
    |_glutamyl_tRNA (SO:0000260)
        |_cytosolic_glutamyl_tRNA (NEW)
        |_mt_glutamyl_tRNA (NEW)
        |_plastid_glutamyl_tRNA (NEW)
    |_glycyl_tRNA (SO:0000261)
        |_cytosolic_glycyl_tRNA (NEW)
        |_mt_glycyl_tRNA (NEW)
        |_plastid_glycyl_tRNA (NEW)
    |_histidyl_tRNA (SO:0000262)
        |_cytosolic_histidyl_tRNA (NEW)
        |_mt_histidyl_tRNA (NEW)
        |_plastid_histidyl_tRNA (NEW)
    |_initiator_methionyl_tRNA (NEW)
        |_cytosolic_initiator_methionyl_tRNA (NEW)
        |_mt_initiator_methionyl_tRNA (NEW)
        |_plastid_initiator_methionyl_tRNA (NEW)
    |_isoleucyl_tRNA (SO:0000263)
        |_cytosolic_isoleucyl_tRNA (NEW)
        |_mt_isoleucyl_tRNA (NEW)
        |_plastid_isoleucyl_tRNA (NEW)
    |_leucyl_tRNA (SO:0000264)
        |_cytosolic_leucyl_tRNA (NEW)
        |_mt_leucyl_tRNA (NEW)
        |_plastid_leucyl_tRNA (NEW)
    |_lysyl_tRNA (SO:0000265)
        |_cytosolic_lysyl_tRNA (NEW)
        |_mt_lysyl_tRNA (NEW)
        |_plastid_lysyl_tRNA (NEW)
    |_methionyl_tRNA (SO:0000266)
        |_cytosolic_methionyl_tRNA (NEW)
        |_mt_methionyl_tRNA (NEW)
        |_plastid_methionyl_tRNA (NEW)
    |_phenylalanyl_tRNA (SO:0000267)
        |_cytosolic_phenylalanyl_tRNA (NEW)
        |_mt_phenylalanyl_tRNA (NEW)
        |_plastid_phenylalanyl_tRNA (NEW)
    |_prolyl_tRNA (SO:0000268)
        |_cytosolic_prolyl_tRNA (NEW)
        |_mt_prolyl_tRNA (NEW)
        |_plastid_prolyl_tRNA (NEW)
    |_pyrrolysyl_tRNA (SO:0000766)
        |_cytosolic_pyrrolysyl_tRNA (NEW)
    |_selenocysteinyl_tRNA (SO:0005857)
        |_cytosolic_selenocysteinyl_tRNA (NEW)
    |_seryl_tRNA (SO:0000269)
        |_cytosolic_seryl_tRNA (NEW)
        |_mt_seryl_tRNA (NEW)
        |_plastid_seryl_tRNA (NEW)
    |_threonyl_tRNA (SO:0000270)
        |_cytosolic_threonyl_tRNA (NEW)
        |_mt_threonyl_tRNA (NEW)
        |_plastid_threonyl_tRNA (NEW)
    |_tryptophanyl_tRNA (SO:0000271)
        |_cytosolic_tryptophanyl_tRNA (NEW)
        |_mt_tryptophanyl_tRNA (NEW)
        |_plastid_tryptophanyl_tRNA (NEW)
    |_tyrosyl_tRNA (SO:0000272)
        |_cytosolic_tyrosyl_tRNA (NEW)
        |_mt_tyrosyl_tRNA (NEW)
        |_plastid_tyrosyl_tRNA (NEW)
    |_valyl_tRNA (SO:0000273)
        |_cytosolic_valyl_tRNA (NEW)
        |_mt_valyl_tRNA (NEW)
        |_plastid_valyl_tRNA (NEW)

Could then still consider having the orthogonal groupings/parentage via:

|_cytosolic_tRNA (NEW)
|_mt_tRNA (SO:0002129)
|_plastid_tRNA (NEW)
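To make the "orthogonal groupings" idea concrete, here is a minimal sketch that gives every compartment-specific term two parents: its isotype term and its compartment term. The isotype list and the restriction of pyrrolysyl and selenocysteinyl terms to the cytosolic branch are read off the trees in this thread; the function name and data layout are hypothetical, and no SO identifiers are assigned to the new terms.

```python
# Isotype name stems, as listed in the proposal trees in this thread.
ISOTYPES = [
    "alanyl", "arginyl", "asparaginyl", "aspartyl", "cysteinyl",
    "glutaminyl", "glutamyl", "glycyl", "histidyl", "initiator_methionyl",
    "isoleucyl", "leucyl", "lysyl", "methionyl", "phenylalanyl", "prolyl",
    "pyrrolysyl", "selenocysteinyl", "seryl", "threonyl", "tryptophanyl",
    "tyrosyl", "valyl",
]

COMPARTMENTS = ["cytosolic", "mt", "plastid"]

# In the trees above, these isotypes only appear with a cytosolic child.
CYTOSOLIC_ONLY = {"pyrrolysyl", "selenocysteinyl"}


def build_lattice() -> dict[str, list[str]]:
    """Map each compartment-specific term to its two proposed parents."""
    parents: dict[str, list[str]] = {}
    for isotype in ISOTYPES:
        for compartment in COMPARTMENTS:
            if isotype in CYTOSOLIC_ONLY and compartment != "cytosolic":
                continue
            child = f"{compartment}_{isotype}_tRNA"
            parents[child] = [f"{isotype}_tRNA", f"{compartment}_tRNA"]
    return parents


if __name__ == "__main__":
    lattice = build_lattice()
    for child, (isotype_parent, compartment_parent) in sorted(lattice.items()):
        print(f"{child}: is_a {isotype_parent}; is_a {compartment_parent}")
    print(f"{len(lattice)} compartment-specific terms in total")
```

Generating the terms from two short lists keeps the naming consistent across branches and makes it harder for a mismatch between a parent and its children to slip in.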

patriciaplchan commented 1 year ago

@sjm41 has provided a very good description of the issue. In addition to the distinct sequence features between cytosolic and mitochondrial tRNAs, plastids may use a different genetic code in some species, giving some of those tRNAs distinct properties.

egchristensen commented 1 year ago

@sjm41 I wonder if we need to be differentiating further up the ontology? I'm trying to find another part of the ontology that differentiates types of sequence or sequence features by cytological context. I see that we've done that under rRNA, but that's a relatively recent change made just last year since we started collaborating. I just wonder if there is another way we might want to go about this. If necessary, we could treat each of these tRNAs individually, but I'm curious if there are common elements here that might aid in our organization and definition of these terms.

egchristensen commented 1 year ago

@sjm41 has provided a very good description of the issue. In addition to the distinct sequence features between cytosolic and mitochondrial tRNAs, plastids may use a different genetic code in some species, giving some of those tRNAs distinct properties.

@patriciaplchan might I ask you to elaborate on the differences in sequence features and genetic code when it comes to tRNAs? If you could also suggest some literature, I'd be happy to read through it.

egchristensen commented 1 year ago

It looks like snoRNA (SO:0000275) and snRNA (SO:0000274) also mention cytological location in their definitions.

patriciaplchan commented 1 year ago

Unfortunately it is not practical to explain tRNA biology in a GitHub comment box. If you are interested in the topic, you can check out the following book: Soll and RajBhandary. tRNA: Structure, Biosynthesis, and Function. 1994, ASM Press. https://www.wiley.com/en-us/tRNA%3A+Structure%2C+Biosynthesis%2C+and+Function-p-9781683672739 Although this book is rather old, it provides the fundamental information on tRNA biology.

keilbeck commented 1 year ago

@patriciaplchan, what Evan is trying to get at is: what are the differentiae of these newly proposed terms, other than location? The Sequence Ontology contains sequence features. So do the sequence features of cytosolic tRNA differ from the sequence features of plastid tRNA? I could read a book from 1994 and hope that somewhere it covers what the gene models look like and how they differ, but that will take some time.

patriciaplchan commented 1 year ago

tRNA is one of the largest gene families. tRNAs are divided into isotype, isoacceptor, and isodecoder sub-families. The proposed ontology terms are at the isotype sub-family level - tRNAs that carry different amino acids. Cytosolic tRNAs in the nuclear genome of most organisms belong to 46 isoacceptors - tRNAs with different anticodons for translating codons. Some microorganisms may have as few as 30+ isoacceptors. The number of tRNA genes in a nuclear genome varies by clade, ranging from 30+ tRNA genes in some bacteria to 500+ tRNA genes in human. The sequences of the cytosolic tRNA genes of the same isoacceptor are mostly identical in microorganisms but vary widely in large eukaryotes. This is what we call isodecoders - tRNAs with the same anticodon but a different sequence body.

A mitochondrial genome usually has only 22 tRNA genes - one isoacceptor per isotype, except mt-tRNA-Leu and mt-tRNA-Ser, which have two isoacceptors each. The genetic code of the mitochondrial genome varies by clade. Therefore, the anticodons of some mt-tRNAs are different from those of cytosolic tRNAs. Many mt-tRNAs have degenerative secondary structure, with a shortened or missing D-arm and/or T-arm, depending on clade, making mt-tRNAs generally shorter than cytosolic tRNAs. Similarly, plastids have only 20+ to 30+ tRNA genes, missing a number of isoacceptors found in the nuclear genome. While plastids of many organisms use the same "universal" genetic code as the nuclear genome, chloroplasts in some algae have a different genetic code. Although this is not as common as in mt-tRNAs, some plastid tRNAs also have degenerative secondary structure with shortened stems, making those sequences not conserved with cytosolic tRNAs.

I hope this gives some background on the differences among cytosolic, mitochondrial, and plastid tRNAs. The complexity level of tRNA gene family is many times greater than protein-coding genes, and it is difficult to get into the details of specific sequence feature differences without going through multiple publications or giving a full presentation. If you are interested in checking out some of the tRNA sequences and alignments, you can find the cytosolic tRNA genes in over 5000 genomes at the Genomic tRNA Database (http://gtrnadb.ucsc.edu/). In addition, the mitotRNAdb (http://mttrna.bioinf.uni-leipzig.de/mtDataOutput/) has mt-tRNA genes from 1500 metazoan mitochondrial genomes. The plantRNA database (http://plantrna.ibmp.cnrs.fr) also has cytosolic, mitochondrial, and chloroplast tRNA genes in over 50 plant genomes.
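The isotype / isoacceptor / isodecoder levels described above can be illustrated with a toy grouping: genes sharing an amino acid form an isotype, genes sharing an anticodon within an isotype form an isoacceptor, and distinct sequence bodies within an isoacceptor are isodecoders. All gene names and sequences below are invented placeholders, not real tRNA records from any of the databases mentioned.

```python
from collections import defaultdict
from typing import NamedTuple


class TrnaGene(NamedTuple):
    name: str
    amino_acid: str  # isotype level
    anticodon: str   # isoacceptor level
    sequence: str    # isodecoder level (distinct sequence body)


# Invented toy records; sequences are short placeholders, not real tRNAs.
GENES = [
    TrnaGene("toy-Ala-AGC-1", "Ala", "AGC", "GGGGAATTAGCTC"),
    TrnaGene("toy-Ala-AGC-2", "Ala", "AGC", "GGGGAATTAGCTC"),  # identical body
    TrnaGene("toy-Ala-AGC-3", "Ala", "AGC", "GGGGGTGTAGCTC"),  # different body
    TrnaGene("toy-Ala-TGC-1", "Ala", "TGC", "GGGGATGTAGCTC"),
    TrnaGene("toy-Leu-CAA-1", "Leu", "CAA", "GTCAGGATGGCCG"),
]


def classify(genes):
    """Nest gene names as isotype -> isoacceptor -> isodecoder sequence."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for gene in genes:
        tree[gene.amino_acid][gene.anticodon][gene.sequence].append(gene.name)
    return tree


def isodecoder_counts(tree):
    """Count distinct isodecoders per isoacceptor within each isotype."""
    return {
        isotype: {anticodon: len(by_sequence)
                  for anticodon, by_sequence in by_anticodon.items()}
        for isotype, by_anticodon in tree.items()
    }


if __name__ == "__main__":
    print(isodecoder_counts(classify(GENES)))
```

In this toy data the Ala isotype has two isoacceptors (AGC and TGC), and the AGC isoacceptor has two isodecoders because two of its three genes share an identical sequence.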

sjm41 commented 1 year ago

This recent-ish reference may be useful wrt mitochondrial tRNAs: https://pubmed.ncbi.nlm.nih.gov/25734984/

egchristensen commented 1 year ago

It seems like classifying tRNAs strictly using sequence features may present some challenges. I wonder if this is approaching the limit of SO's scope? Maybe we should be treating tRNA sub-types as cellular components in GO instead of as sequence features? I'm curious if @sabrinatoro, @cmungall or anyone else at the @OBOFoundry has an opinion here?

@patriciaplchan Thank you for the reading suggestions! Do you feel that the terms "isotype", "isoacceptor", and "isodecoder" should be included under tRNA in SO? If so, might I ask what sets these terms apart at the sequence level for nuclear, mitochondrial, and plastid tRNA?

@sjm41 How do RNAcentral and FlyBase treat the sub-types of tRNA? Do they treat the sub-types in the same way? Do these location-dependent terms (nuclear, mitochondrial, and plastid tRNA) apply to all sub-types as proposed above?

@patriciaplchan & @sjm41, If you have any old slides or additional literature you think would be helpful, please feel free to include them in this thread. I'm going to have to read and do some thinking here.

sjm41 commented 1 year ago

How do RNAcentral and FlyBase treat the sub-types of tRNA? Do they treat the sub-types in the same way? Do these location-dependent terms (nuclear, mitochondrial, and plastid tRNA) apply to all sub-types as proposed above?

Both RNAcentral and FlyBase use the SO as the primary way to classify tRNAs, so we can currently classify the different isotypes (alanyl, valyl etc) but cannot clearly distinguish cytosolic and mitochondrial (and plastid) tRNAs (as explained above).

FlyBase additionally has gene group pages for cytosolic and mitochondrial tRNA genes.

sjm41 commented 1 year ago

@egchristensen If the proposal to have separate terms for the isotypes of cytosolic/mitochondrial/plastid tRNAs isn't going to fly (because they aren't sufficiently distinct from each other in terms of sequence features), then it doesn't make sense to keep the existing 'mt_tRNA (SO:0002129)' term for the same reason - it should be obsoleted.

Doing this would at least resolve the current inconsistency/ambiguity I mentioned above - i.e. there are no corresponding SO terms for 'cytosolic_tRNA' or 'plastid_tRNA' to distinguish those tRNA types. Then it would be clear that the existing set of tRNA SO terms are agnostic as to subcellular location or genomic origin.

Note we'll still need to add the corresponding gene-level tRNA terms for the different isotypes - currently there is only 'tRNA_gene' with no children.

egchristensen commented 1 year ago

@sjm41 I agree that the current tRNA branch needs some work and initially I'm not opposed to the current proposal (isotypes of cytosolic/mitochondrial/plastid tRNAs) since it seems that it's what's used the most. I'd just like to explore things here a bit more before committing to a specific direction. Organizing these terms in such a way could have implications for how we treat similar sequences.

cmungall commented 1 year ago

I don't think tRNAs are in scope for the GO cell component hierarchy.

At the risk of further muddying the water, it's worth noting that CHEBI has tRNA plus subtypes for each AA http://purl.obolibrary.org/obo/CHEBI_17843

I think if we were doing SO from scratch then we might not have created the location-based subtypes, and instead have encouraged post-composition via col9 of GFF. I think choosing to obsolete these would be a big step. I don't know what the current SO SOP is for obsoletion, but it's very hard to know who has used which terms where, and migrating existing GFF files is somewhere between non-trivial and impossible.

I think @sjm41's proposal (https://github.com/The-Sequence-Ontology/SO-Ontologies/issues/583#issuecomment-1313646576) is valid and will not introduce any issues with legacy data. I would not manually enter all of these in Protege, but would instead use a templating system like DOSDPs and use reasoning to infer the lattice.

egchristensen commented 1 year ago

When it comes to obsoleting, @cmungall, my training was pretty simple since we almost never do it. @davidwsant only obsoleted a term once before I started working with SO, but in the last year we've encountered more and more requests to obsolete. I wonder if this is a symptom of a larger issue with SO and its treatment of RNA?