The-Sequence-Ontology / MSO

Molecular Sequence Ontology

sequence molecular entity (MSO:3100003) #2

Open nataled opened 6 years ago

nataled commented 6 years ago

This class includes (surprisingly, IMO) protein complexes. How would you go about describing the 'sequence' of a complex? You wouldn't. You'd instead describe the sequence(s) of its components. I see no reason to include protein complexes here. What's the use case and/or reason? I would suggest scoping this to include only those entities that can be described by a sequence.

Also regarding this term: it does not include glycans, which too can be described by sequence.

Finally, a typo (that might disappear if my suggestion is taken): In the comment #3, I believe it was intended to say "OR multi-stranded nucleic acids" rather than "OF multi-stranded nucleic acids" (emphasis added by me).

msinclair2 commented 6 years ago

Should protein complex (MSO:3100326) be made a child of peptide collection (MSO:0001501)?

Are glycans truly sequence molecules on the same level as nucleic acids and peptides in terms of information encoding? They serve as signals and binding sites for recognition, mainly as adjuncts to protein function, if memory serves. I'd like to see more discussion on this issue.

nataled commented 6 years ago

I wouldn't say glycans are on that same level. However, your definition as currently written would still include them, at least as I understand it. It's that definition--which I interpret to include "anything that can be described by sequence"--that makes me question the inclusion of protein complexes.

msinclair2 commented 6 years ago

Thanks for bringing this distinction to light. My personal feeling is that the MSO is not intended to represent every kind of biologically relevant polymer. Propose amending definition to "anything that can be described by sequence and is of the type of molecule that could potentially serve as a bearer for a generically dependent sequence feature, variant, or collection as described by the Sequence Ontology."

mikebada commented 6 years ago

My short, first answer is that I lean toward deleting MSO:'protein complex' because the GO CC already handles macromolecular complexes.

We've integrated most of the MSO into the ChEBI hierarchy, and the modeling of MSO:'sequence molecular entity' was based pretty directly on CHEBI:'molecular entity', whose definition explicitly includes complexes. (Notably, double-stranded DNAs, which are complexes in the sense that they're also not covalently bound, are molecular entities in ChEBI.) We've defined sequence molecular entities essentially as molecular entities in the form of either single chains or complexes of two or more such chains.

Finally, though there certainly are carbohydrate-based macromolecular entities, we haven't really tackled them yet, partly because their macromolecular structuring can be so complex. I'm not totally opposed to incorporating these into the ontological modeling, but I don't know how much stretching of the definitions would be required.

nataled commented 6 years ago

@msinclair2 that is better. @mikebada if that's the case then the definition would have to exclude the notion of 'information bearing'. Not opposed to it, but I'm not sure you'll want to mess with such a high-level term in the future. Your call.

cmungall commented 6 years ago

I agree with @mikebada, use GO for mm complexes

mikebada commented 6 years ago

I'm not sure exactly what is meant by "information-bearing" here, but ChEBI has an information biomacromolecule class; it isn't defined, but one of its synonyms is "genetically encoded biomacromolecule". If this is what is meant, then I'll point out that there are already a number of classes in the (current) SO that represent sequence entities that aren't genetically encoded, e.g., primers, clones, gene trap constructs, and even broader classes such as engineered regions and synthetic sequences, so we can't really limit the SO/MSO to genetically encoded entities. In fact, I think this is why we didn't just directly use CHEBI:'information biomacromolecule'. I strongly believe that the SO/MSO should be able to handle any sequence entity (at least nucleotide- and amino-acid-based ones).

msinclair2 commented 6 years ago

@mikebada, I think what is considered "information bearing" for our purposes is crystal clear. It is the type of molecule that SO is used to annotate (e.g., proteins, nucleic acids).

Sure, some particular subtypes of these are not annotated by SO, but they belong to the broader class of molecules that are.

To my knowledge, SO is not used to annotate glycans, at least not at present.

If you want to give up any kind of requirement that the molecules that MSO describes are of the type that SO annotates, that's fine with me. But I don't see any confusion at all about what "information bearing" means.

msinclair2 commented 6 years ago

"I strongly believe that the SO/MSO should be able to handle any sequence entity (at least nucleotide- and amino-acid-based ones)."

Based on how I define sequence molecular entities, I am in complete agreement with this for nucleic acids and polypeptides.

msinclair2 commented 6 years ago

"To my knowledge, SO is not used to annotate glycans, at least not at present."

Nor is it used to describe protein complexes, I should add, so I would support removing that term as well from the MSO.

mikebada commented 6 years ago

Sequence annotation is certainly the most prominent task to which the SO is applied, but I think it'd be a mistake to strictly limit the SO/MSO to sequence entities that are directly annotated, as the SO is used for other tasks as well. My lab and others have used it a lot for text-mining purposes, and I think I've seen it used to annotate experimental protocols, for example. Eventually we'll want to see the SO/MSO classes being used in logical definitions of classes of other OBOs so as to better integrate it within the OBO ecosystem, and those classes may or may not directly be used to annotate database sequences.

nataled commented 6 years ago

We use SO terms for logical definitions of processed proteins in PRO, for example when we say that a particular protein lacks a signal peptide. I should add that we do so somewhat begrudgingly, fully aware that it might be incorrect, and mostly as a placeholder until "SOM" (as it was tentatively called) clarified what exactly the entity in question was. I say that because the classical SO, as I understand it, was not meant to refer to physical entities, but rather to annotations of those entities, represented as information content.

As for the glycan question, whether or not to include/allow them is up to you. I understand that SO is not meant to refer to these; I raise the issue only to point out that your definition might need tweaking if you want to specifically exclude them.

mikebada commented 6 years ago

@nataled Yes, in their "Evolution of the Sequence Ontology terms and relationships" paper, it's stated that classes of the current public SO represent generically dependent continuants. However, the natural-language definitions of many/most of these classes instead describe material (molecular) sequence entities, and they acknowledge in the paper that at least some of the logical definitions involving qualities/attributes actually describe material (molecular) sequence entities, so there's been some conflation for a while. Much of the motivation for this refactoring work is deconflating (unconflating?) these different sequence senses. With the refactoring, it's our intention that the SO will remain an ontology of GDCs, and the MSO an ontology of corresponding mostly material (molecular) sequence entities upon which the sequence GDCs generically depend.

nataled commented 6 years ago

Okay, great, that's what I thought was happening. Follow-up question under the appropriate ticket #3