geneontology / go-shapes

Schema for Gene Ontology Causal Activity Models defined using RDF Shapes

Add ProteinContainingComplex -part_of-> AnatomicalEntity ? #213

Open dustine32 opened 4 years ago

dustine32 commented 4 years ago

Should we add the <ProteinContainingComplex> part_of: @<AnatomicalEntity> relation to go-cam-shapes.shex?

I found this connection missing when attempting to translate this GPAD annotation with gocamgen:

MGI     MGI:103013      part_of GO:0002095      MGI:MGI:3628972|PMID:16648270   ECO:0000314                     20070117        MGI     part_of(EMAPA:16105),part_of(CL:0000187)

Specifically, I was trying to create GO:0002095 part_of(CL:0000187), i.e. "caveolar macromolecular signaling complex" part_of "muscle cell".
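For readers unfamiliar with the last column of that GPAD line, here is a minimal sketch of splitting an annotation-extension field into relation/target pairs. The function name and regular expression are illustrative assumptions, not code from gocamgen, and the sketch only handles the comma-joined form seen in this example.

```python
import re

# Matches one relation(target) pair, e.g. part_of(CL:0000187).
EXTENSION_PATTERN = re.compile(r"^\s*(\w+)\(([^()]+)\)\s*$")


def parse_annotation_extension(field):
    """Split a comma-joined annotation-extension field into (relation, target) pairs.

    Hypothetical helper for illustration only; it does not handle
    '|'-separated alternative extensions.
    """
    if not field.strip():
        return []
    pairs = []
    for part in field.split(","):
        match = EXTENSION_PATTERN.match(part)
        if match is None:
            raise ValueError(f"unrecognized extension: {part!r}")
        pairs.append((match.group(1), match.group(2)))
    return pairs


extensions = parse_annotation_extension("part_of(EMAPA:16105),part_of(CL:0000187)")
```

Each resulting pair is a candidate edge from the annotated GO term (here GO:0002095) to the extension target, which is where the missing part_of shape shows up.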

thomaspd commented 4 years ago

This would only be OK if it's for a CC annotation, but not for an actual GO-CAM activity unit. In an activity unit, a complex can be an enabler of the MF but not a location for the activity. So in GO-CAM a gene product is part of a complex, but a complex is not part of a larger entity like a cell.

dustine32 commented 4 years ago

@thomaspd OK, so we would need to make this "not connected to activity unit" distinction in the ShEx spec if we were to add <ProteinContainingComplex> part_of: @<AnatomicalEntity>, only allowing it for "CC-only" assertions.

@vanaukenk Is this described situation possible/easy to write in ShEx syntax?

vanaukenk commented 4 years ago

@thomaspd Just want to make sure I'm clear on what you're saying.

For the statement, GP 'part of' protein-containing complex (either in the graph or the CC only version of the form), curators can say that a GP is 'part of' a protein-containing complex, but then not qualify the location of the complex further with 'part of' cell and/or 'part of' anatomy contextual information?

And for an activity unit, a protein-containing complex can only be the enabler whose activity 'occurs in' a non-protein-containing complex CC that can be qualified further with 'part of' cell and/or 'part of' anatomy contextual information?

Thx.

goodb commented 4 years ago

@thomaspd so for @dustine32 's GPAD to GO-CAM conversion project, when he sees a GPAD line like:

MGI     MGI:103013      part_of GO:0002095      MGI:MGI:3628972|PMID:16648270   ECO:0000314                     20070117        MGI     part_of(EMAPA:16105),part_of(CL:0000187)

what should he output? What does "okay for a CC annotation" mean once everything is in the GO-CAM world? To be consistent, shouldn't he assemble a set of assertions that fit the activity unit structure using top-level terms for the missing pieces? e.g. something like the following:

top-level-MF enabled_by MGI:103013
MGI:103013 part_of GO:0002095
top-level-MF occurs_in top-level-CC
top-level-CC part_of CL:0000187
top-level-CC part_of EMAPA:16105

[Screenshot attached: Screen Shot 2020-03-02 at 12 47 28 PM]
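The assembly proposed in this comment could be sketched roughly as follows. This is an illustration under stated assumptions: the function name is hypothetical, the triples use class identifiers directly (a real GO-CAM would create typed individuals), and the root terms are GO:0003674 (molecular_function) and GO:0005575 (cellular_component).

```python
ROOT_MF = "GO:0003674"  # molecular_function root term
ROOT_CC = "GO:0005575"  # cellular_component root term


def cc_annotation_to_activity_unit(gene_product, complex_term, context_terms):
    """Build activity-unit-shaped triples, using root terms for missing pieces.

    Hypothetical helper: identifiers stand in for the individuals a real
    GO-CAM model would contain.
    """
    triples = [
        (ROOT_MF, "enabled_by", gene_product),
        (gene_product, "part_of", complex_term),
        (ROOT_MF, "occurs_in", ROOT_CC),
    ]
    # Cell and anatomy context hangs off the placeholder location, not the complex.
    triples.extend((ROOT_CC, "part_of", term) for term in context_terms)
    return triples


unit = cc_annotation_to_activity_unit(
    "MGI:103013", "GO:0002095", ["CL:0000187", "EMAPA:16105"]
)
```

Whether asserting the root-level enabled_by edge says more than the GPAD line supports is exactly the question raised later in the thread.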

goodb commented 4 years ago

@thomaspd @vanaukenk note that the schema currently allows MF occurs_in Complex assertions:

occurs_in: ( @<AnatomicalEntity> OR @<ProteinContainingComplex> ) {0,1};

This seems to be in conflict with "a complex can be an enabler of the MF but not a location for the activity". So we will need to update the schema if that is the idea.

ukemi commented 4 years ago

If I am not mistaken, enables-o-occurs_in->part_of. If we don't think this is correct, we need to change this chain and the subsequent GPAD output.

goodb commented 4 years ago

(For me, I like the idea of enforcing the structure through the use of top-level classes when more specific information is lacking. We do this already in a number of places in the mod conversion and the reactome conversion, but I'm not sure if we are totally consistent with it. The stable structure makes it easier to query, helps convey the requirements of the model, and provides a nice to-do list for any curator examining a model.)

vanaukenk commented 4 years ago

I've added this ticket to the 03-04 gocam specs conference call so we can decide what we want to do.

vanaukenk commented 4 years ago

I also like the idea of enforcing the structure by using root node terms when information is lacking.

@tmushayahama and I have talked about that in the context of the Noctua form, where curators sometimes skip a field. Whenever possible, I would like to automatically populate those skipped fields with a root term, for exactly the reasons you give, @goodb.

thomaspd commented 4 years ago

A "CC-only" annotation is an assertion about a location of a gene product (or complex), but not about where the gene product is ACTIVE. We'd agreed that these should have the form: gp/complex part_of CC

An activity unit, on the other hand, specifies where the gene product is active, and has the form: (MF enabled_by gp/complex) occurs_in CC

We'd also agreed that we can specify gp part_of proteinContainingComplex, when either the gp or the complex is an active entity (connected to an MF via enabled_by).

So, as Ben and Kimberly said, when the CC instance is connected to an MF via an occurs_in relation, the CC instance cannot be a proteinContainingComplex.

Also, when we have MF enabled_by (gp part_of proteinContainingComplex), or MF enabled_by proteinContainingComplex, then the proteinContainingComplex cannot be part_of a larger entity.

But, when there is no enabled_by edge directly from the CC, or from one of its parts, then it's a CC-only annotation, and we presumably want to allow the "extension" specifying the location of the complex. However, as Kimberly suggested, I'd be fine with disallowing such "extensions" to complex-only annotations, if it would be OK with curators.
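The rules in this comment can be sketched as a small check over a model's edges. This is an illustrative Python approximation, not ShEx and not the project's validator; the function name, the triple representation, and the way complexes are identified (a set passed in by the caller) are all assumptions.

```python
def check_complex_rules(triples, complexes):
    """Return violation messages for the complex-location rules described above.

    triples   -- iterable of (subject, predicate, object) strings
    complexes -- set of node identifiers typed as protein-containing complexes
    """
    triples = list(triples)
    # Entities that enable some activity directly.
    enablers = {o for _, p, o in triples if p == "enabled_by"}
    # A complex is "active" if it enables an MF itself or one of its parts does.
    active_complexes = {c for c in complexes if c in enablers}
    active_complexes |= {
        o for s, p, o in triples
        if p == "part_of" and s in enablers and o in complexes
    }
    violations = []
    for s, p, o in triples:
        if p == "occurs_in" and o in complexes:
            violations.append(f"{s} occurs_in complex {o}")
        if p == "part_of" and s in active_complexes:
            violations.append(f"active complex {s} part_of {o}")
    return violations


cc_only = [("gp", "part_of", "complex"), ("complex", "part_of", "cell")]
activity = cc_only + [("mf", "enabled_by", "gp")]
cc_only_result = check_complex_rules(cc_only, {"complex"})
activity_result = check_complex_rules(activity, {"complex"})
```

In this sketch the same complex part_of cell edge passes in the CC-only model and is flagged once an enabled_by edge reaches the complex through one of its parts, which is the "not connected to an activity unit" distinction the schema would need to express.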

goodb commented 4 years ago

@thomaspd let me just push on this a little bit. I am concerned that what you describe here (and yes, it has been discussed before) produces two different structures for associating location information with a gene product. This means that when we want to write queries for location we will need two patterns, and we will need to teach curators two ways of building these models.

In the picture below, T0 is a CC-only annotation represented as described above (including the cell and tissue context). T1 is the activity unit representation. As information accumulates, we can predict that T0 models will eventually look like T1 models as people figure out what the activity of the gene is.

I understand that the GPAD line that initiated this discussion does not indicate where the gene product is active. My little thought experiment here is to look at model T1 with the raw, top-level MF annotation and ask whether it provides false information. If you think it does, i.e. if the presence of the 'enabled by' edge goes beyond what we can infer from the GPAD line, then okay, we are left with the two-model approach. If not, then I suggest we just use the one.

[Screenshot attached: models T0 and T1; Screen Shot 2020-03-03 at 11 50 56 AM]
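To make the "two patterns" concern concrete, here is a rough sketch of a location lookup that has to cover both the T0 and T1 shapes. The function name and the flat triple representation are assumptions for illustration; real queries would run against the model graph.

```python
def locations_of(gene_product, triples):
    """Collect location nodes for a gene product under both model shapes."""
    triples = list(triples)
    found = set()
    # Pattern for T0 (CC-only): gp part_of CC.
    for s, p, o in triples:
        if s == gene_product and p == "part_of":
            found.add(o)
    # Pattern for T1 (activity unit): MF enabled_by gp, MF occurs_in CC.
    activities = {s for s, p, o in triples if p == "enabled_by" and o == gene_product}
    for s, p, o in triples:
        if p == "occurs_in" and s in activities:
            found.add(o)
    return found


t0_locations = locations_of("gp", [("gp", "part_of", "complex")])
t1_locations = locations_of("gp", [("mf", "enabled_by", "gp"), ("mf", "occurs_in", "cc")])
```

With a single enforced structure, only the second loop would be needed.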

ukemi commented 4 years ago

In addition, models should be as rich as we can get them with respect to information and ground truth. In T0, there is information about the relationships between the gene product AND the complex with respect to the cell and anatomical structure. In T1, we can make the inference that the gene product is part_of a muscle cell, but I don't think we can go so far as to infer that the signaling complex is part of a muscle cell (thinking about cell junctions as an example, but maybe that's not valid and we should discuss it). So there is a bit of information loss.

If we want to have the mereotopological relation that the complex is part of the muscle cell, then I think we need to assert it. Note that this type of relation is different from saying the gene product functions in the cell. It is a spatial relation strictly between two continuants, with no dependency on one of the continuants executing a function. However, as the property chains now stand, we make a mereotopological inference that if an entity enables a function in a structure, the entity is part of that structure. I think that inference is perfectly reasonable.

I think in the long run we will be hampered if we choose to have mereotopological relationships between some entities but ban them for others. All of this impinges on future work about spatial representation in GO and how we will represent complexes, their roles, and the roles of their components.

ukemi commented 4 years ago

Note that the property chain above enables-o-occurs_in->part_of seems to be specific for the GPAD. In RO enables-o-occurs_in->'is active in'. So we need to resolve this issue. But my point above still holds that part_of and is_active_in are separate relations and if we are going to express a partonomy, then I think it should be complete. If we decide that gene products and complexes should not be part_of higher order structures then we need to discuss how this will go forward.
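The two variants of the property chain mentioned here can be sketched as a single inference step with a configurable result relation. This is a simplified illustration, not a reasoner: the function name and node identifiers are hypothetical, and enabled_by is treated as the inverse of enables.

```python
def infer_locations(triples, inferred_relation="part_of"):
    """Apply the chain enables o occurs_in -> inferred_relation.

    (mf enabled_by gp) is read as (gp enables mf); combined with
    (mf occurs_in cc) it yields (gp inferred_relation cc).
    """
    triples = list(triples)
    occurs_in = {}
    for s, p, o in triples:
        if p == "occurs_in":
            occurs_in.setdefault(s, []).append(o)
    inferred = []
    for s, p, o in triples:
        if p == "enabled_by":
            for location in occurs_in.get(s, []):
                inferred.append((o, inferred_relation, location))
    return inferred


model = [("mf1", "enabled_by", "gp1"), ("mf1", "occurs_in", "cc1")]
gpad_style = infer_locations(model)                  # GPAD-style result relation
ro_style = infer_locations(model, "is_active_in")    # RO-style result relation
```

The only difference between the two outputs is the relation on the inferred edge, which is the discrepancy this comment asks to resolve.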

In my mind, this brings up the importance of taking a very rigorous look at spatial relationships in GO, something we have been saying for a while. It also impinges on things like proteins embedded on a membrane executing their function outside the membrane in the aqueous environment. What exactly do we want with respect to the spatial relationships between the entities we represent?

vanaukenk commented 4 years ago

I'll update the ShEx with 'located in' and 'AnatomicalEntity' where appropriate based on the 2020-03-04 conference call.

I won't merge, though, until we've all had a chance to look at it and comment.