The-Sequence-Ontology / MSO

Molecular Sequence Ontology

Definition of isoform (MSO:3000321) #1

Open nataled opened 6 years ago

nataled commented 6 years ago

The definition states, at the end, that an isoform may differ from other variants owing to post-translational modification. Is this intended to mean exclusively proteolytic cleavages, or does it include covalent attachments? Either way, I think it would be good to bring some clarity to terms like 'isoform' which have been used to mean just about anything that differs in some way from anything else. Is that term used to mean anything other than splice variants (mRNA and the resulting protein) these days?

msinclair2 commented 6 years ago

Good point. It is important that definitions reflect current usage and bring implicit distinctions to light. Personally, I have also heard the word isoform used mainly for splice variants. Polymorphism should probably be its own class to reflect the different usage. I'm not sure about post-translational modifications.

mikebada commented 6 years ago

This is an excellent ontological question, and I'm also unsure of how broadly it should be defined. The Human Protein Atlas, in discussing the human isoform proteome, says: "The structural space of the human proteome is large and diverse due to the presence of various protein variants (isoforms), including post-translational modifications, splice variants, proteolytic products, genetic variations and somatic recombination." This seems to be pretty close to anything that differs in some way from anything else, as you say. Note that we also have a variant class, essentially carried over from the current SO:sequence_variant class, so we'll have to decide if we want to merge these or differentiate them.

nataled commented 6 years ago

I can point you to this: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4114032/ which describes a term that--at least for proteins--has that all-inclusive meaning. In PRO we have 'categories' of terms that give an indication of the origin of differences. Terms belonging to the 'gene' category will differ from their siblings based on the gene that encodes them. Terms belonging to the 'sequence' category will differ based on sequence differences even if the proteins are translation products of the same gene. Within that category we distinguish between 'isoforms' (which we consider solely as deriving from distinct splice variants) and 'sequence variant' (which would be derived from distinct alleles, for example). Finally, we have the 'modification' category, in which sibling classes can come from the exact same gene and allele thereof and even the same splice variant (if you consider that to be based on exons), but differ in some other post-translational modification (which includes covalent attachments and cleavage events). That just gives an idea of what my thinking is in terms of providing some clarity.
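To make the category distinctions above concrete, here is a rough Python sketch of how two protein forms might be sorted by the origin of their difference. This is not PRO's actual data model or API: the `ProteinForm` fields, the `Category` names, the `difference_category` function, and the example identifiers are all hypothetical and chosen only for illustration. The order of the checks is also an assumption, since the description above does not say which category wins when two forms differ in more than one respect.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class Category(Enum):
    """Hypothetical labels for the origin of a difference between two forms."""
    GENE = "gene"
    ISOFORM = "sequence: isoform"
    SEQUENCE_VARIANT = "sequence: sequence variant"
    MODIFICATION = "modification"


@dataclass(frozen=True)
class ProteinForm:
    """Illustrative record; the field names are assumptions, not PRO fields."""
    gene: str
    allele: str
    splice_variant: str
    modifications: FrozenSet[str] = frozenset()


def difference_category(a: ProteinForm, b: ProteinForm) -> Optional[Category]:
    """Return the first category in which a and b differ, or None if identical.

    The precedence (gene, splice variant, allele, modification) is an
    assumption made for this sketch.
    """
    if a.gene != b.gene:
        return Category.GENE
    if a.splice_variant != b.splice_variant:
        return Category.ISOFORM
    if a.allele != b.allele:
        return Category.SEQUENCE_VARIANT
    if a.modifications != b.modifications:
        return Category.MODIFICATION
    return None


# Made-up example forms: same gene, allele, and splice variant,
# differing only by a covalent attachment.
base = ProteinForm("geneA", "allele1", "sv1")
modified = ProteinForm("geneA", "allele1", "sv1", frozenset({"phosphorylation"}))
spliced = ProteinForm("geneA", "allele1", "sv2")

print(difference_category(base, modified))
print(difference_category(base, spliced))
```

Under this reading, 'isoform' is reserved for the splice-variant case only, and differences arising from covalent attachments or cleavage fall under 'modification'.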

cmungall commented 6 years ago

The most straightforward approach is for MSO to be analogous to PRO, and to model this as a metaclass (in PRO these are represented as subsets, but really they are metaclasses). But the challenge here is the inability to answer questions such as 'how many distinct {reference proteins, structural isoforms} are there in the human genome' in a straightforward way. Perhaps these are actually GDC questions, in which case SO would not parallel MSO here.

mikebada commented 6 years ago

@cmungall I'm surprised to hear you say that metaclasses would be the most straightforward approach. Why not just create different subclasses of isoforms/variants? I use metaclasses for my work, and they're useful for me, but they don't seem to be straightforward for less-ontologically-minded folks.

Also, I'm not sure if I'm seeing everything, but the biolink-model document doesn't seem to really address the ontology of isoforms/variants briefly discussed above.