The-Sequence-Ontology / MSO

Molecular Sequence Ontology

biological sequence entity (MSO:3000265) vs sequence molecular entity (MSO:300003) #3

Open nataled opened 6 years ago

nataled commented 6 years ago

The names of these make it unclear which type of entity each is meant to refer to. Based on child terms of the latter, I'm guessing that it refers to the actual molecules and the former refers to the description of those molecules. If that's the case, then wouldn't biological sequence entity be a generically dependent continuant, as it would depend on the existence of the molecules they describe? Indeed http://purl.obolibrary.org/obo/BFO_0000031 (generically dependent continuant) uses 'sequence of a protein' as an example.

cmungall commented 6 years ago

Agreed, let's use molecule in name if molecular

On 16 Mar 2018, at 17:40, Darren A. Natale wrote:

The names of these make it unclear which type of entity each is meant to refer to. Based on child terms of the latter, I'm guessing that it refers to the actual molecules and the former refers to the description of those molecules. If that's the case, then wouldn't biological sequence entity be a generically dependent continuant, as it would depend on the existence of the molecules they describe? Indeed http://purl.obolibrary.org/obo/BFO_0000031 (generically dependent continuant) uses 'sequence of a protein' as an example.


msinclair2 commented 6 years ago

I believe biological sequence entity was created as an umbrella term for entities that could, if integrated into ChEBI, be put in more than one place and so result in multiple parentage. But I'll let Mike elaborate more on that.

mikebada commented 6 years ago

In the MSO, all biological sequence entities (including that class) are independent continuants. The modeling of these classes was based on ChEBI. Sequence molecular entities are ChEBI molecular entities in the form of either single chains or complexes composed of these chains. The top-level MSO:'biological sequence entity' was intended to be analogous to the top-level CHEBI:'chemical entity', which covers everything in ChEBI (apart from the role classes). So, like CHEBI:'chemical entity', a biological sequence entity is, as Michael noted, something of an umbrella class that covers sequence units, sequence unit boundaries, or collections of these.

Part of the motivation here is that different instances of some of the SO/MSO classes may be instances of different classes. For example, an allele is usually a multibase region, but it could also be a zero-length boundary in the case of a complete deletion (or, theoretically, a single base), so it can't be entirely subsumed by the sequence unit, sequence unit boundary, or collection classes. Thus, an allele is categorized as a nucleotide sequence entity, which analogously is a nucleotide unit, nucleotide unit boundary, or a collection of these.
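The allele example can be pictured with a small Python sketch. It assumes, purely for illustration, that a feature is recorded as a half-open interval on a linear coordinate. The `Feature` class, the `classify` function, and the coordinates are hypothetical and not part of the MSO. Mapping a multibase region to a collection of units is one reading of the comment above, not something the thread confirms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """A feature on a linear sequence, as a half-open interval [start, end)."""
    name: str
    start: int
    end: int

    @property
    def length(self) -> int:
        return self.end - self.start


def classify(feature: Feature) -> str:
    """Name the kind of sequence entity an interval corresponds to.

    Illustrative assumption: zero length -> boundary, one unit -> sequence
    unit, more than one unit -> collection of units.
    """
    if feature.length < 0:
        raise ValueError("end must not precede start")
    if feature.length == 0:
        return "sequence unit boundary"
    if feature.length == 1:
        return "sequence unit"
    return "sequence unit collection"


# Three hypothetical alleles at the same locus (coordinates are made up).
alleles = [
    Feature("multibase allele", 10, 25),
    Feature("single-base allele", 10, 11),
    Feature("complete-deletion allele", 10, 10),
]
kinds = {allele.name: classify(allele) for allele in alleles}
```

Because the three alleles land in three different classes, no single one of those classes can serve as the parent of allele, which is the case for a broader class covering all three.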

BTW, I obviously haven't gotten around to it yet, but I do need to add a lot of comments into the ontology explaining the motivations for the modeling.

nataled commented 6 years ago

The main confusion I (and others) had regarding SO is whether it refers to physical polymeric molecules or to the sequences of those molecules. I can think of three interrelated classes that could be made surrounding sequenceable molecules. One is the molecule itself. Another is the sequence of that molecule. The third is the representation of that sequence on a piece of paper or in a database entry. I think there's no doubt that the molecules themselves (sequence molecular entity in your parlance) are independent continuants, and other than the mildly confusing name I have no issue with it. I had thought that biological sequence entity referred to the sequences of those molecules, but now I'm not sure since you say they are independent continuants. I'm fairly certain you don't refer to pieces of paper.

msinclair2 commented 6 years ago

This is something I have been pondering as well, since I am new to the SO/MSO project. In my mind the MSO is meant to refer to the molecules themselves. The SO describes the sequence wherever it is recorded, be it in molecules or on paper. I don't know what the place is for the sequence itself*.

*After thinking about it some more, I think that the sequences themselves are the particulars of universals in SO. For example, SO:intron is a universal, while any particular sequence of an intron is an instance of SO:intron, which depends on some instance of MSO:intron as the physical molecule.

nataled commented 6 years ago

I now think I understand what's happening and the source of my confusion regarding these two terms. Basically it comes down to the expectation that sequence molecular entity (MSO:300003) would be a child of biological sequence entity (MSO:3000265). I see why it currently is not; it's because of the complexes. Considering that complexes are covered elsewhere (in other ontologies, that is) and that SO is not (cannot be?) used to annotate them, I'd suggest removing them from MSO as well. The result would be a cleaner and more consistent hierarchy.

mikebada commented 6 years ago

@nataled Good catch: Sequence molecular entities should be categorized as biological sequence entities; more specifically, they're biological sequence unit collections, which are collections of sequence units, either from the same molecular entity or from different ones. I'll edit this, and I'll also deprecate MSO:'protein complex', since it seems to be agreed it doesn't belong here.

msinclair2 commented 6 years ago

This means sequence molecular entities will no longer be material entities and a key parallel construction with ChEBI will be lost. Are we okay with that? I know that biological sequence entities are not chemical entities because the term as a whole includes boundaries, which are immaterial. But I feel a bit queasy about not categorizing molecules as chemical entities under ChEBI.

cmungall commented 6 years ago

The concern is with zero-length sequences? What's the use case for 0-len seqs in MSO?

mikebada commented 6 years ago

@msinclair2 No need to worry: You're correct that not all biological sequence entities are chemical entities because the former includes boundaries, but sequence molecular entities are more specifically sequence unit collections, all of which are material entities, and MSO:'sequence molecular entity' is subsumed by CHEBI:'molecular entity'. (Even more specifically, MSO:'sequence molecular entity' is subsumed by CHEBI:'macromolecular entity'.)

mikebada commented 6 years ago

@cmungall I don't know if I'd refer to them as sequences, but the sequence unit boundaries are essentially zero-length along the linear sequence dimension. These boundaries are needed to represent concepts such as deletions, breakpoints, and cleavage sites. This actually already partly exists in the current SO in the form of the deletion and junction classes; this has just been slightly broadened and renamed to better align with the BFO, which also has a few boundary classes.

bpeters42 commented 5 years ago

I read the announcement of the MSO/SO refactoring that Chris had forwarded to obo-discuss. I had some feedback that Chris recommended I should submit to your tracker. Looking here, it seems that similar feedback was given before. Here is what I wrote originally:

"Upon looking at it quickly, following the suggested 'places to look' recommended by Mike, I am a bit puzzled. I remain unclear what a 'biological sequence entity' is. Is that intended as a grouping term of material and immaterial entities? If so, that is very confusing. I was hoping specifically that MSO would clear up what the material entities are. And even for what I considered noncontroversial, such as an amino acid (MSO_0001237), MSO seems to be neutral on that. And for the typical things we struggle with, such as a free amino acid vs. an amino acid residue found within a protein, it is not clear to me how they are modeled here."

Reading this tracker item, it seems that I am echoing previous concerns. As it stands, I am not sure that MSO/SO is less confusing for me than SO was.

cmungall commented 5 years ago

I would be a bit hesitant before committing to BFO immaterial entities here. For one thing, as can be seen here, it results in some awkwardness when biologists want to lump on one criterion and ontologists want to split on another. This manifests in things that are intuitively alike being split across hierarchies, sometimes with the creation of "patch" grouping classes to unify them. This isn't an issue particular to MSO; I have seen this before in premature commitments to BFO.

Note that in Uberon, we originally had a bunch of classes under CARO "anatomical boundary". It later turned out that the things with names like "midbrain-hindbrain boundary", which were the main use case, were actually material entities. Same for entities like joints. We now tend to use "anatomical junction", which is an ME. We still have things under the immaterial boundary, but this is largely reserved for "structural" classes that are used in logical axioms but not in the set of things used by curators.

It may be similar for what is currently under sequence junction in SO. These may be actual physical junctions, capable of bearing dispositions such as breakability. I note that chromosomal breakpoint has "region" as the text definition genus, and in practice these are often recorded as ranges.

Not picking on MSO here, but I am increasingly realizing that committing to too many abstract upper ontology categories carries hidden complexity costs. I favor MEs, Ps and Qs for anything in the domain of discourse for scientists and other categories for "structural support" classes.

mikebada commented 5 years ago

'biological sequence entity' is really a grouping term of three main types of sequence entities that (I'd argue) can capture any sequence or sequence feature: boundaries/junctions, sequence units, and collections of any combination of these. It functions as a root for everything in the sequence_feature, sequence_collection, and sequence_variant subhierarchies of the current SO (as well as other types of sequence entities not currently captured in the current SO), and it was named as such to be analogous to the top-level ChEBI 'chemical entity'. I think it's also useful because some sequence entities don't entirely fit in any of the three main types of sequence entities, e.g., variation, which could be any one of these, so it aggregates these as well.

With regard to the sequence boundaries/junctions, they're immaterial entities in this draft, but I understand Chris's point (and have waffled on it myself), so I'm agnostic as to whether they should be represented as immaterial or material entities. Either way, 'biological sequence entity' would still function as a grouping of the three main types of sequence entities, so the nature of the boundaries/junctions is an orthogonal issue.

With regard to the amino acids, they should be explicitly asserted as material entities; that was just an oversight. (Actually, they should be subsumed by CHEBI:group, as a sibling of 'sequence molecular entity region'.) However, they all have labels ending in 'residue', which I intended to be clearer than the current amino acid labels in the SO.

I hope this is helpful; lemme know if you have any other questions.

mikebada commented 5 years ago

@cmungall I agree that we should try to avoid making things difficult for users in terms of integration with the BFO. However, with the exception of the nature of the sequence boundaries/junctions (i.e., whether they're immaterial or material--and it seems it'd be OK to make them material entities if that would make things easier), I'd argue that there's really not a lot of complexity in this draft MSO with respect to its integration with the BFO.

What I'd argue has had a greater influence on the structuring in this draft is the integration of the MSO with ChEBI. ChEBI has a high-level differentiation between molecular entities (i.e., isolable molecules, complexes, etc.) and groups, which are connected proper parts of these. Additionally, among its represented sequence-related molecular entities and groups, ChEBI has distinct axes of differentiation between peptides & their groups and nucleic acids & their groups (in turn subdivided into DNA and RNA entities, among others). I'd argue that the current SO isn't well-aligned with ChEBI with regard to either of these axes, and a good chunk of the work has gone into introducing these axes into it. I think integrating with ChEBI is even more important than with the BFO; for example, if we query for ChEBI molecular entities, it'd be great to also get back MSO entities such as transcripts, plasmids, and probes, and if we query for ChEBI groups, it'd be great to also get back MSO entities such as sequence units and regions.
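The kind of cross-ontology query described above can be sketched as a subsumption lookup over is-a edges. This is a minimal Python illustration, not how ChEBI or the MSO are actually queried. The class labels are shorthand. The edges for 'sequence molecular entity' and 'sequence molecular entity region' follow statements in this thread, while the placements of transcript, plasmid, probe, and sequence unit are assumptions made only for the example.

```python
from collections import defaultdict

# Toy is-a edges as (child, parent) pairs. Labels are illustrative shorthand.
IS_A = [
    ("CHEBI:macromolecular entity", "CHEBI:molecular entity"),
    ("MSO:sequence molecular entity", "CHEBI:macromolecular entity"),
    ("MSO:sequence molecular entity region", "CHEBI:group"),
    # Assumed placements, for illustration only:
    ("MSO:transcript", "MSO:sequence molecular entity"),
    ("MSO:plasmid", "MSO:sequence molecular entity"),
    ("MSO:probe", "MSO:sequence molecular entity"),
    ("MSO:sequence unit", "CHEBI:group"),
]


def descendants(root, edges):
    """Return every class reachable from root by following is-a edges downward."""
    children = defaultdict(set)
    for child, parent in edges:
        children[parent].add(child)
    found = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


molecular_entities = descendants("CHEBI:molecular entity", IS_A)
groups = descendants("CHEBI:group", IS_A)
```

With these edges, a query on CHEBI:'molecular entity' also returns the MSO transcript, plasmid, and probe classes, and a query on CHEBI:group returns the sequence unit and region classes. In practice a reasoner or a SPARQL query over the merged ontologies would do this work.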

bpeters42 commented 5 years ago

'Grouping terms' such as 'biological sequence entity' are fine within an ontology, to make it easy to find the terms in a certain editing scope. But once you integrate ontologies, they are a hindrance. What is the relationship of 'biological sequence entities' to 'anatomical entities' or 'chemical entities'? The division seems to be one of expertise. The one thing that BFO got right is to focus on what things actually are. Are you describing a material entity? That better have mass. Or an information content entity? That better be able to be losslessly transferred.