Closed goodb closed 5 years ago
The conclusion here is that those sub-sets are really sets and should be treated as such in the GO-CAM representation to the extent that is possible (without building deep hierarchical entity representations).
Noting that this impacts the part of the code that processes the active unit annotations, as sometimes the active unit of a complex is itself a set. See the reaction 'Tyrosine phosphorylation of STAT1, STAT3 by IL6 receptor' in the pathway 'Interleukin-6 signaling' (R-HSA-1059683).
We should look at this on the call with @deustp01, but to me it seems that you are correct. If the active unit is itself a set, how do we represent this as individuals on the canvas? I'm not even going to speculate on instances. If I understand correctly, the set is not a complex in and of itself, but a list/menu of genes that can each be a part of the complex and each of which would be the enabler of the MF if they are the menu item of choice. @deustp01 is that correct?
@ukemi in keeping with what we've done so far with sets, my intention for the active unit case would be to pull out a node representing the set and create the Union object (the list/menu thing) within it. This would be linked to the parent complex by the has_part relation just as if it was a protein.
My interpretation for the general issue here, apart from active unit information is that we need objects that represent these sets even when they are inside complexes. So when collapsing complexes that contain sets in their membership, each set would be kept together as its own node and linked to the complex as a part.
I think the OWL union is the right logical construct. The UI could be improved so that these are actually shown in some kind of 'menu' form.
I've taken a stab at implementing this. Have a look at the screenshot below of 'Tyrosine phosphorylation of STAT1, STAT3 by IL6 receptor' from the Interleukin-6 signaling pathway. Wherever you see a 'union', it should correspond to a set in the Reactome model (R-HSA-1059683). You can see that we now have a set drawn out from its parent complex enabling the reaction. You can also see the sets appearing as members of other complexes, such as the output of the reaction. Thoughts?
Further tests look good. As things stand right now, Reactome complexes, which can contain deep levels of hierarchy, are flattened to a list of member molecules and member sets linked to the complex entity with the has_part relationship. Sets are treated just like proteins, but appear as an owl:union of their members. Note again that information about protein state (e.g. phosphorylation) is not currently captured. Below are some images showing the consequences of the transformation. Looking at the inputs to the reaction Ubiquitination of phospho-p27/p21 R-HSA-187575 we can see the loss of the protein state information and the loss of the hierarchical representation of the complex. It is worth noting here that both are modeling choices, not necessities. We could capture both bits of information in the GO-CAM representations based on what is currently in the BioPAX L3 export.
Note how the first input (the long set of ubiquitin variants) gets collapsed down to a union of the 4 distinct UniProt ids from that group - see Ub at top. Above, the screenshot shows only a small part of the other input complex in the Reactome viewer. That part becomes the phospho-p27/p21 union entity on the right. Note the loss of the intermediate entity 'Cyclin E/A:p-T160-CDK2:p-S130-CDKN1A,p-T187-CDKN1B', which contains this set.
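To make the flattening described above concrete, here is a minimal Python sketch. It is not the converter's actual code: the class names (`Protein`, `EntitySet`, `Complex`), the helper functions, and the toy entity names are all hypothetical. It only shows the idea of dropping sub-complexes while keeping each set together as one unit linked by has_part.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Protein:
    name: str


@dataclass
class EntitySet:
    name: str
    members: List["Entity"] = field(default_factory=list)


@dataclass
class Complex:
    name: str
    components: List["Entity"] = field(default_factory=list)


Entity = Union[Protein, EntitySet, Complex]


def flatten_parts(complex_: Complex) -> List[Entity]:
    """Collect molecules and sets at any depth; sub-complexes are dissolved."""
    parts: List[Entity] = []
    for component in complex_.components:
        if isinstance(component, Complex):
            parts.extend(flatten_parts(component))
        else:
            # Proteins and sets are kept as single units.
            parts.append(component)
    return parts


def has_part_edges(complex_: Complex) -> List[Tuple[str, str, str]]:
    """One has_part edge per flattened part; a set is rendered as a union."""
    edges = []
    for part in flatten_parts(complex_):
        if isinstance(part, EntitySet):
            target = "Union(" + ", ".join(m.name for m in part.members) + ")"
        else:
            target = part.name
        edges.append((complex_.name, "has_part", target))
    return edges


# Toy input (hypothetical names): a complex holding a sub-complex and a protein.
example = Complex("outer", [
    Complex("inner", [
        Protein("P1"),
        EntitySet("S1", [Protein("P2"), Protein("P3")]),
    ]),
    Protein("P4"),
])

for edge in has_part_edges(example):
    print(edge)
```

Note that the intermediate entity `inner` disappears from the output, which mirrors the loss of the intermediate complex described above. This sketch also only renders set members by name, so a set whose members are themselves complexes or sets is not handled.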
> It is worth noting here that both are modeling choices, not necessities.
I think it would be worth noting this in the manuscript as well. It is a border that we set. We should discuss the issues surrounding why we set this. Perhaps a reviewer will differ with our views and suggest that we include this information. We should propose how to handle this in future iterations of the imports.
I think what we need here is a clear statement from the ontology and annotation teams about how complexes are to be dealt with by the GOC as a whole. Perhaps I'm out of the loop, but to me this is still ambiguous and that ambiguity hampers efforts like this one.
Resolution from meeting today is that current representation (sets turn into Unions, complex components may be proteins, small molecules, or Unions, all collapsed to one level) is going to be good for now. Closing - though we should open a new Noctua ticket to improve the view of Unions.
Re-opening to consider the case where a set is composed of complexes or other sets (which could recursively do the same thing). I'm not sure if there is a way to flatten this while maintaining the correct set logic.
Here is an example from reaction R-HSA-1250498: RAS guanyl-nucleotide exchange mediated by SOS1 in complex with GRB2 and phosphorylated EGFR:ERBB2 heterodimers. The question is how to represent the controller of the reaction, GRB2:SOS1:P-ERBB2:P-EGFR, in the GO-CAM model. I tried to compress the structure of the object below:
controller COMPLEX GRB2:SOS1:P-ERBB2:P-EGFR
    component SET Phosphorylated ERBB2:EGFR heterodimers
        member COMPLEX EGF:p-6Y-EGFR:p-6Y,Y1112-ERBB2
            component COMPLEX EGF:p-6Y-EGFR
                component PROTEIN p-6Y-EGFR
                component PROTEIN EGF
            component PROTEIN p-6Y,Y1112-ERBB2
        member COMPLEX EGF:p-6Y-EGFR:p-7Y,Y1112-ERBB2
            component COMPLEX EGF:p-6Y-EGFR
                component PROTEIN Phospho-EGFR
                component PROTEIN EGF
    component COMPLEX GRB2-1:SOS1
        component PROTEIN Ash-L
        component PROTEIN SOS-1
The tricky element here is the SET Phosphorylated ERBB2:EGFR heterodimers.
The correct logic is to convert that into: Union (COMPLEX EGF:p-6Y-EGFR:p-6Y,Y1112-ERBB2 & COMPLEX EGF:p-6Y-EGFR:p-7Y,Y1112-ERBB2)
But then to get to the proteins, we need to go down another level within each complex object. And there is nothing to prevent hitting another set and having to produce another level and so on indefinitely.
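One way to picture the recursion is a sketch that does not flatten at all and instead emits a nested class expression. This is only an illustration, assuming the entity hierarchy is finite and acyclic; the tuple layout and the function `to_expression` are hypothetical, and the output is an informal Manchester-like string, not something the converter produces. The tree is copied from the compressed structure above.

```python
def to_expression(node) -> str:
    """Render a (kind, name, children) tuple as a nested class expression.

    SET becomes a disjunction of its members; COMPLEX becomes a conjunction
    of 'has part' restrictions on its components. The recursion follows the
    nesting for as many levels as the data contains.
    """
    kind, name, children = node
    if kind == "PROTEIN":
        return f"'{name}'"
    parts = [to_expression(child) for child in children]
    if not parts:
        raise ValueError(f"{kind} {name} has no children")
    if kind == "SET":
        joined = " or ".join(parts)
    elif kind == "COMPLEX":
        joined = " and ".join(f"('has part' some {p})" for p in parts)
    else:
        raise ValueError(f"unknown entity kind: {kind}")
    return f"({joined})"


heterodimers = ("SET", "Phosphorylated ERBB2:EGFR heterodimers", [
    ("COMPLEX", "EGF:p-6Y-EGFR:p-6Y,Y1112-ERBB2", [
        ("COMPLEX", "EGF:p-6Y-EGFR", [
            ("PROTEIN", "p-6Y-EGFR", []),
            ("PROTEIN", "EGF", []),
        ]),
        ("PROTEIN", "p-6Y,Y1112-ERBB2", []),
    ]),
    ("COMPLEX", "EGF:p-6Y-EGFR:p-7Y,Y1112-ERBB2", [
        ("COMPLEX", "EGF:p-6Y-EGFR", [
            ("PROTEIN", "Phospho-EGFR", []),
            ("PROTEIN", "EGF", []),
        ]),
    ]),
])

print(to_expression(heterodimers))
```

The expression keeps the set logic intact, but its depth grows with the nesting, which is exactly what makes it awkward to draw as individuals on the canvas.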
Options:
OK. After having thought a lot about this last night, I'm very uncomfortable with the flattening decision we made on the call yesterday. It seems like this would be a case where what we are describing semantically is incorrect. This really hit home when we pointed out that if we flatten in the case of nested sets, then we should flatten everything. We would get the correct annotations, but moving to the future, this would put limitations on more complex queries and the development of more sophisticated reasoning because there would be some incorrect assumptions that all members of a set are necessary parts of an instance of a complex. If @thomaspd approves, I would suggest we go with option 1 above. Specific considerations:
Still open to thoughts and arguments. @huaiyumi @balhoff @vanaukenk @cmungall
A practical issue for right now is the frequency of deeply nested annotations of complexes in Reactome (typically involving sets). This might be a question for Antonio: could a tally of complexes be made, showing for each the number of gene products hidden within set components of the complex? Practically, the fraction of cases that cause a massive increase in bag content when the sets are unfurled may be quite small, so it's reasonable to ask whether information loss from flattening is large enough and frequent enough to be worth the extra effort to not flatten. Also, sets like ubiquitin are required to accommodate UniProt rules but arguably can be flattened with no information loss.
I would like to hear from @cmungall @kltm and @balhoff on the specific question of whether or not we want to allow - and therefore should really support properly - OWL constructs aside from relational assertions on instances within the Noctua/GO-CAM software framework. (Or when we want to do so.)
As things stand right now, the stack doesn't break when these other structures (e.g. Unions, Intersections) are added into the generated models. (Thanks to its semantic web, open world nature.) But (a) there is no way for curators to create them and (b) the UI does not do a great job displaying them (e.g. getting to see the insides of a Union requires going into the 'Evidence Folded' view, which causes other problems). Further, these constructs will certainly result in more complex queries.
An answer to the general question above will help answer the specific issue regarding complexes.
@deustp01 to give an idea of the scale based on what I see in the BioPAX: there are a total of 46,034 physical entities in the human database. Of these, 4,689 have 'members' and so use the set construct. Of these sets, 980 contain complexes (like the example used above) and 282 contain other sets. Entity types that contain sets include Complex, Protein, Rna, Dna, SmallMolecule and generic PhysicalEntity.
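A tally of this kind could be computed along the following lines. This is a rough sketch over a toy in-memory table, not the script that produced the numbers above; the dictionary layout, the ids, and both function names are hypothetical stand-ins for whatever a BioPAX parse would yield.

```python
from typing import Dict, Set

# Toy stand-in for parsed physical entities: id -> type and direct children.
ENTITIES: Dict[str, dict] = {
    "C1": {"type": "Complex", "children": ["P1", "S1"]},
    "S1": {"type": "Set", "children": ["P2", "C2"]},
    "C2": {"type": "Complex", "children": ["P3", "P4"]},
    "S2": {"type": "Set", "children": ["S1", "P5"]},
    "P1": {"type": "Protein", "children": []},
    "P2": {"type": "Protein", "children": []},
    "P3": {"type": "Protein", "children": []},
    "P4": {"type": "Protein", "children": []},
    "P5": {"type": "Protein", "children": []},
}


def tally_sets(entities: Dict[str, dict]) -> Dict[str, int]:
    """Count sets, and sets with a complex or another set as a direct member."""
    counts = {"sets": 0, "sets_with_complex_member": 0, "sets_with_set_member": 0}
    for record in entities.values():
        if record["type"] != "Set":
            continue
        counts["sets"] += 1
        member_types = {entities[child]["type"] for child in record["children"]}
        counts["sets_with_complex_member"] += int("Complex" in member_types)
        counts["sets_with_set_member"] += int("Set" in member_types)
    return counts


def proteins_hidden_in_sets(complex_id: str, entities: Dict[str, dict]) -> Set[str]:
    """Proteins reachable from a complex through at least one set."""
    hidden: Set[str] = set()

    def walk(entity_id: str, inside_set: bool) -> None:
        record = entities[entity_id]
        if record["type"] == "Protein":
            if inside_set:
                hidden.add(entity_id)
            return
        for child in record["children"]:
            walk(child, inside_set or record["type"] == "Set")

    walk(complex_id, False)
    return hidden


print(tally_sets(ENTITIES))
print(sorted(proteins_hidden_in_sets("C1", ENTITIES)))
```

The second function is aimed at the per-complex question raised earlier in the thread: how many gene products appear in the bag only because a set was unfurled.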
@goodb As a side comment, once upon a time there was plumbing supporting the creation of other structures in Minerva and ways of creating them in the client. I do not know if these were actively removed and may be recovered without too much work. That said, IIRC, the client methods were quite primitive and required basically manual text input.
The pathway 'Signaling by ERBB2' R-HSA-1227986 looks like it provides a good test case for this one (and there is also an example in there for #62 ). It uses Sets in many places. Perhaps the desired go-cam form for this pathway - and in particular its physical components - could be discussed at the next meeting. See the mostly current state of the conversion for this one here: http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-1227986 . (This version looks like it is missing proteins from the deeper levels of the hierarchical sets.)
The description of the second step in this pathway (reaction ERBB2 forms heterodimers with ligand-activated ERBB receptors: R-HSA-1963589) seems like a pretty good argument in favor of the use of Sets.
"ERBB2, which does not bind any known ligand, is activated through formation of a heterodimer with another ligand-activated ERBB family member. ERBB2 heterodimerization partners are EGF-stimulated EGFR (Wada et al. 1990, Karunagaran et al. 1996), ERBB3 stimulated by neuregulins NRG1 or NRG2 (Pinkas-Kramarski et al. 1996), and ERBB4 stimulated by neuregulins or EGF-like ligands (Li et al. 2007)."
I am away next week, but you should try to meet at our normal time with @balhoff to discuss this. He had some insights this morning when I discussed it with him.
That reaction is also interesting in terms of the causal relations that could be extracted from its Preceding Events. One of the inputs is the Set "Ligand-Activated EGFR/ERBB3/ERBB4". The members of the set correspond to the outputs of the different preceding events. e.g. the preceding event 'ERBB3 binds neuregulins' has the output NRG1/2:ERBB3 which is the second member of the input set. I think the implication here is that relations derived from the preceding events, e.g. 'causally upstream of', are also a logical OR. Any one of them could cause the next reaction 'ERBB2 forms heterodimers with ligand-activated ERBB receptors' and not all are required.
We only see one of these relations in the current conversion because of the limitation to stay in the same sub-pathway. If they were there, which they would be in other cases and should be in the long run, then would the GO-CAM be incorrect/incomplete/unspecified? This comes back to the basic underlying premises of the instance-based model (which I am still trying to understand completely). Are all the relationships supposed to be there - e.g. we are seeing something like a logical intersection? If not, how do we know which of them are alternates of one another?
@cmungall ?
Continuing the line of thought, if assertions in GO-CAMs are after all not fundamentally members of logical intersections, in that they may or may not all hold together, then purposely overloading entity has_part statements with targets that are both 'probably there' (e.g. complex has_part proteinA) and 'alternately there' (proteinA might either be protein B or protein C so we'd get complex has_part proteinA, complex has_part proteinB) seems less bad. At least it's no more ambiguous than other aspects of these models. Maybe we just go with the flat bag of parts and potential parts approach until/unless we provide for more specific modeling in the tooling?
@balhoff reminded me of something today that bears on this. The OWL reasoner basically does view the instance based models from the perspective that all of the edges are present. Even if Reactome may be implicitly coding OR relationships between different potential paths through pathways, the GO-CAM model doesn't really 'think' in that way. When considering an individual GO-CAM, we should assume what we see is what we get - all of it.
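The difference between the two readings of a complex that contains a set can be shown with a small check. This is a toy, closed-world simplification (a real OWL reasoner works under the open world assumption and would not treat an unlisted part as absent), and the function names and the part names `A`, `B`, `C` are hypothetical.

```python
from typing import Set


def satisfies_flat(parts: Set[str], fixed: Set[str], alternatives: Set[str]) -> bool:
    """Flat bag reading: every fixed part AND every set member must be present."""
    return (fixed | alternatives) <= parts


def satisfies_union(parts: Set[str], fixed: Set[str], alternatives: Set[str]) -> bool:
    """Union reading: every fixed part plus at least one set member is present."""
    return fixed <= parts and bool(alternatives & parts)


fixed_parts = {"C"}            # always part of the complex
set_members = {"A", "B"}       # alternatives drawn from a set

one_alternative = {"C", "A"}   # a complex instance that picked member A
both_alternatives = {"C", "A", "B"}

print(satisfies_flat(one_alternative, fixed_parts, set_members))   # False
print(satisfies_union(one_alternative, fixed_parts, set_members))  # True
print(satisfies_flat(both_alternatives, fixed_parts, set_members)) # True
```

Under the flat reading, an instance that contains only one of the alternatives does not fit the description, which is the incorrect assumption about necessary parts raised earlier in the thread.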
Have been batting around ideas, want to share one here for general review as it's my current favorite. Ping @cmungall.
Reactome Complexes, Sets, etc. are OWL classes just like gene products are. Let us load them into the tbox and use them just like we use other classes in the abox models. This allows us to capture all of the logic to define them properly in OWL without overcomplicating the instance-graph model. Pre-computed rule-based reasoning (e.g. Arachne) can then compute across these logical definitions. It is non-redundant as the same class can be re-used across instances and across models.
This could work like the following:
A. Given a full export from Reactome, generate an ontology that describes all of the physical entity classes used in it.
B. Now, when converting pathways, make use of these classes to define the instances of physical entities used in the reactions, just as we currently do when encountering a gene product class encoded in Neo.
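Steps A and B could look roughly like the sketch below, which emits Turtle as plain strings. Several things here are assumptions and not taken from the actual exporter: the use of `owl:equivalentClass` for the definitions, the helper names, and the `EX-...` ids, which are placeholders, not real Reactome ids. The base IRI is the one mentioned later in this thread, and BFO_0000051 is the OBO 'has part' relation. Because every entity gets its own named class, each definition only refers to its direct children, so no recursive flattening is needed.

```python
OWL_PREFIX = "@prefix owl: <http://www.w3.org/2002/07/owl#> ."
BASE = "http://model.geneontology.org/"
HAS_PART = "<http://purl.obolibrary.org/obo/BFO_0000051>"


def iri(entity_id: str) -> str:
    return f"<{BASE}{entity_id}>"


def combine(operator: str, expressions: list) -> str:
    """Wrap expressions in owl:unionOf / owl:intersectionOf (needs two or more)."""
    if not expressions:
        raise ValueError("a definition needs at least one operand")
    if len(expressions) == 1:
        return expressions[0]
    return f"[ a owl:Class ; owl:{operator} ( {' '.join(expressions)} ) ]"


def some_part(entity_id: str) -> str:
    return (f"[ a owl:Restriction ; owl:onProperty {HAS_PART} ; "
            f"owl:someValuesFrom {iri(entity_id)} ]")


def class_definition(entity_id: str, kind: str, children: list) -> str:
    """Step A: one named class per entity, defined via its direct children."""
    if kind == "SET":
        body = combine("unionOf", [iri(child) for child in children])
    elif kind == "COMPLEX":
        body = combine("intersectionOf", [some_part(child) for child in children])
    else:
        raise ValueError(f"unknown entity kind: {kind}")
    return f"{iri(entity_id)} a owl:Class ; owl:equivalentClass {body} ."


def typed_individual(individual_iri: str, entity_id: str) -> str:
    """Step B: a model individual is simply typed with the entity class."""
    return f"<{individual_iri}> a {iri(entity_id)} ."


print(OWL_PREFIX)
print(class_definition("EX-SET", "SET", ["EX-A", "EX-B"]))
print(class_definition("EX-COMPLEX", "COMPLEX", ["EX-SET", "EX-C"]))
print(typed_individual("http://example.org/individual-1", "EX-COMPLEX"))
```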
First cut at an entity ontology, auto-generated for the Reactome ERBB2 signaling pathway. (Remove the .txt extension and double-click to open in Protege.)
Hmmmm. This is very intriguing and I think I like it. I am still a little concerned with the union classes versus the instances that are represented in a GO_CAM model and how the class relations will translate to instance relations and the subsequent inferences. @goodb we should take a closer look at this on today's call.
Looking at this in Protege, I think I see how it would work. It will be good to go over it.
Here is a better implementation of the ontology extraction. (drop the .txt for Protege load) Comments? ERBB2_Signalling_Entities.ttl.txt
@deustp01 in this one, I took the logical constraints on entity location out of the definitions. I am thinking that if I leave these out and go ahead and create new classes for each of the different location versions, it's going to work. For a given physical entity, we will end up with multiple equivalent classes in our new ontology, but I don't think that is a really bad thing and it lets us keep the one-to-one mapping between these classes and Reactome ids. When the pathway models get converted, I will use the appropriate entity and add location information as we were doing before. We can add the appropriate grouping classes and separate children based on location if it's useful downstream. Okay @deustp01 ??
@balhoff this looks better when you classify it with Hermit than it does with Arachne (in Protege) because of the way the unions are handled. Thoughts on that? Is this a problem?
Drop location, preserve ids, put locations back later if useful sounds fine to me.
> this looks better when you classify it with Hermit than it does with Arachne (in Protege) because of the way the unions are handled. Thoughts on that? Is this a problem?
Arachne won't classify the Tbox. There are similar issues even with EL content in the main ontology. We just need to preclassify the hierarchy and save a reasoned version.
Here is a shot of the inferred class hierarchy using Hermit. @ukemi how does this look to you?
@goodb the file you linked above looks the same as the previous.
Nevermind, I looked at the wrong file!
The only things that jump out at me are that EGF:EGFR:ERBB2, NRG1/2:ERBB3:ERBB2 and NRGs/EGFLs:ERBB4:ERBB2 don't look like heterodimers, they look like trimers. But the last one has dimers as children. If I recall, I was confused by this at the source too. Why the reciprocal classes for O14511, and why doesn't NRG1 have an equivalent UniProt class?
I believe it is also bad form to put plurals in class names but I think these come directly from Reactome, right?
Ahha! At the source, the reaction is that ERBB2 forms heterodimers with the ligand-activated receptors. So in fact this is a heterodimer between ERBB2 and the receptor/ligand complex. Perhaps I am being too picky. @deustp01 ?
So why don't the ERBB2:ERBB4cyt1 and ERBB2:ERBB4cyt2 also contain NRGs/EGFLs, at least in the name?
The names for everything and the structure of the heterodimer hierarchy come directly from Reactome so I think both of those questions are curation concerns for @deustp01 to respond to. (I was pleased that Hermit inferred the hierarchy directly from the unions in the logical definitions - none of those subclasses are asserted relations).
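For intuition about why the hierarchy falls out of the unions, here is a toy sketch of one inference pattern only: a class defined as a union subsumes each of its members, and also subsumes any other union whose member list is a proper subset of its own. This is not how Hermit works internally, and the class names and the function `inferred_subclasses` are hypothetical.

```python
from itertools import permutations
from typing import Dict, FrozenSet, Set, Tuple


def inferred_subclasses(union_defs: Dict[str, FrozenSet[str]]) -> Set[Tuple[str, str]]:
    """Return (subclass, superclass) pairs implied by union definitions."""
    pairs: Set[Tuple[str, str]] = set()
    for name, members in union_defs.items():
        for member in members:
            # Every member of a union is a subclass of the union class.
            pairs.add((member, name))
    for (a, members_a), (b, members_b) in permutations(union_defs.items(), 2):
        if members_a < members_b:
            # A union over fewer alternatives is a subclass of the wider union.
            pairs.add((a, b))
    return pairs


definitions = {
    "AB": frozenset({"A", "B"}),
    "ABC": frozenset({"A", "B", "C"}),
}

for sub, sup in sorted(inferred_subclasses(definitions)):
    print(f"{sub} SubClassOf {sup}")
```

None of the pairs are asserted in the input; they all follow from the definitions, which is the behaviour described in the comment above.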
The reciprocal classes for proteins arise because, as it stands, rather than directly re-using e.g. http://identifiers.org/uniprot/O14511 I am making a new URI that corresponds directly to the Reactome entity e.g. http://model.geneontology.org/R-HSA-1227953 (which you can find in reactome at https://reactome.org/content/detail/R-HSA-1227953 ). I then declare that the classes identified by those URIs are equivalent so you see the reciprocal relations. The reason for doing it this way is that it allows me to hang reactome-specific information onto the annotations and, downstream, if we decide to add more reactome-specific logic (such as the location constraint) our URIs won't need to change. Logically it is the same as long as there is a reasoner in the mix.
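As a small illustration of the URI scheme just described, the sketch below builds one equivalence axiom as an N-Triples line. The two IRI patterns are taken from the comment above; the function names are hypothetical, and pairing R-HSA-1227953 with O14511 is an assumption based on the two examples given, not something checked against Reactome.

```python
OWL_EQUIVALENT_CLASS = "http://www.w3.org/2002/07/owl#equivalentClass"


def reactome_class_iri(reactome_id: str) -> str:
    return f"http://model.geneontology.org/{reactome_id}"


def uniprot_class_iri(accession: str) -> str:
    return f"http://identifiers.org/uniprot/{accession}"


def equivalence_ntriple(reactome_id: str, accession: str) -> str:
    """One N-Triples line declaring the two classes equivalent."""
    return (f"<{reactome_class_iri(reactome_id)}> "
            f"<{OWL_EQUIVALENT_CLASS}> "
            f"<{uniprot_class_iri(accession)}> .")


# Assumed pairing, for illustration only.
print(equivalence_ntriple("R-HSA-1227953", "O14511"))
```

Keeping the Reactome-specific IRI as the subject is what lets Reactome annotations hang off that class without touching the UniProt one.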
> So why don't the ERBB2:ERBB4cyt1 and ERBB2:ERBB4cyt2 also contain NRGs/EGFLs, at least in the name?
Have to ask Peter about the name, but the logical definitions do contain NRGs/EGFLs: ERBB2:ERBB4cyt1 - ('has part' some 'NRGs/EGF-like ligands:ERBB4cyt1 (plasma membrane)') and ('has part' some 'ERBB2 (plasma membrane)')
So I think the only issue here is a difference between Reactome names and conventions for labeling ontology classes. It looks like the classification is working.
> Ahha! At the source, the reaction is that ERBB2 forms heterodimers with the ligand-activated receptors. So in fact this is a heterodimer between ERBB2 and the receptor/ligand complex. Perhaps I am being too picky. @deustp01 ?
That's exactly what it is - one of the famous Reactome complexes assembled by accretion: first ligand binds receptor to form a heterodimer; then that heterodimer binds a molecule of ERBB2 to form a complex that contains one copy each of three different proteins. We don't have a well worked out and uniformly enforced rule for coming up with compact names for complexes (versus systematic names constructed by concatenating the names of all of the components separated by colons).
> So why don't the ERBB2:ERBB4cyt1 and ERBB2:ERBB4cyt2 also contain NRGs/EGFLs, at least in the name?
Again, no enforced naming convention. Here, a whole group of complexes is being described, each member of the group consisting of one molecule of ERBB2 protein complexed with a dimer composed of one molecule of ERBB4 JM-A CYT isoform associated with any one of seven ligand proteins, three EGF-like or four NRG-like (so, as all combinations are possible, seven complexes, each containing one ERBB2, one ERBB4 etc., and one ligand, make up the group).
Where the name indicates a different composition than the tally gotten by drilling down into the nested complex / set hierarchy, follow the tally and ignore the name. But a list of some typical discrepancies of this sort might be useful for curator training at Reactome.
Here is the conversion of the complete entity set from human Reactome. It requires a little patience and a decent computer but it can be classified and explored in Protege. I fixed a few things and added some additional metadata (synonyms, xrefs, seeAlso links out to corresponding reactome web pages).
Reactome_Physical_Entities.ttl.zip
There are a pretty large number of unclassified things in there that clutter the view. e.g. for things like https://reactome.org/content/detail/R-HSA-5675414 I just don't have anything structured to go on to figure out what kind of thing it is. It's tempting to build some other grouping classes, but probably not important at this time. Probably a better next step would be to import neo and chebi and see if anything breaks.
For this example, it eventually drills down to a complex with a modified form of a modified form of RAF (UniProt:P04049 parent) and a set of RAS paralogs. The logical def looks ok. Not for this iteration, but something that pops out at me is grouping classes like 'p21 RAS-containing complex'. I can see them naturally as I scroll through the classes. I also see groupings based on location (from the names, not the logical defs). For now is it terrible that they don't classify any better than protein-containing complex? Every one that I drill down into eventually has a Uniprot equivalence which should match something in NEO. Even all by itself, I think this is really cool. @cmungall have you seen this?
I see another well-defined extension of this project.
> Not for this iteration, but something that pops out at me is grouping classes like 'p21 RAS-containing complex'. I can see them naturally as I scroll through the classes.
That would be something to work on together with the complex portal (also in scope for PRO, so it's an issue of interest and resources).
> I also see groupings based on location (from the names, not the logical defs).
My first reaction is that location is more likely an annotated attribute, but however it's captured definitely useful information.
We have a version of the conversion that uses the reactome entity ontology concept up now on noctua-dev. see e.g. http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-1296041
There are some configuration issues that impact reasoning and the display of some property labels, but if you want to see how things look, it works. WIP to fix the config.
I think we have a pretty solid cut at the new Reactome conversion based on the new ontology of Reactome physical entities. It is live on dev now. See e.g. http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-1296041 http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-1227986
@deustp01 I see there was a Reactome release on May 27. Does this have all of the changes in it that we have discussed? The go-cams on dev now are from the March release. I can update everything when you say it is time.
@deustp01 @ukemi I'm worried about losing momentum on this project right when we are closing in on something of value. What are your next steps?
@deustp01 is away for a bit, then he will come on site here for about a month. I think the next steps for the biologists is to review some specific models and the annotations that are generated by them.
@ukemi I am processing the latest reactome release now and will have the noctua-dev models ready online with that one soon. Likely today or tomorrow depending on when @kltm is ready to update. This should provide you with what you need to do your review. If you can translate the results of that review into issues here (or chunks of manuscript :) ) that would be awesome.
I think the phrase 'the annotations that are generated by them' is an interesting one. Are those important for this exercise? Will they feed any other processing steps? If the answers are yes, then we will need to work on the gpad exporter. It does succeed now, but its output is based on the Reactome entities, not e.g. the UniProt ids. If producing more traditional gpad output is a priority, there are a few ways we can approach that - either in the gpad export code or in the code that generates the entity ontology.
I suggest that focus for now is really on the models at the go-cam level, with derivatives like gpad being something to consider downstream as they become a priority for some reason. If not, let me know!
If these models are incorporated as production models, they will generate annotations just like any other model. Biologically the annotation output should be sound/correct. In an ideal world this would become our source for Reactome annotations. I agree we need to work with @balhoff to enhance and refine the GPAD exporter.
Are they going to be incorporated as production models? I think that would be (sort of) great, but it's never been clearly stated as a goal. A large part of me wants to keep them out of the pipeline process until the endpoint of that is no longer a (lossy) tab-delimited file. Basically to use these as the test cases for the next generation of tooling that consumes the RDF directly. But I'm not the boss and will defer...
@balhoff seems pretty occupied. If he is unavailable I can make the gpad look like you want it. I understand how it is generated.
@ukemi update from in-person discussion with @cmungall
1) we want to go ahead with plan of leaving gpad export working as it is, with reactome entities as the subjects. We can then process this gpad to turn it into e.g. uniprot-oriented gpads to allow for comparison to other annotations.
2) We want to avoid the large equivalence cliques that are generated in the current version because we ignore location and ptms in the protein entity class definitions. We can mostly get around this by adding in the (and located_in CC) aspect of the logical definitions as it was in the first cut. We could also add something like that for ptm information. @ukemi can you remind me why we took the location constraint out of these definitions?
3) We should go forward with treating this as a full, public ontology. Start the process of requesting approval to put it in obo. We should have a discussion about its relation to PRO and Complex Portal with @deustp01 and the relevant leaders of the other projects.
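The post-processing in point 1 might be sketched as below. This is only an outline under stated assumptions: the rows are simplified dictionaries and not the real GPAD column layout, the lookup from Reactome entity to UniProt accessions is assumed to come from the entity ontology, the `GO:XXXXXXX` term and the `EX-SET` entity with its accessions are placeholders, and the O14511 pairing is assumed from an example earlier in this thread. Copying a set-level annotation onto every member is itself a modeling decision that would need review.

```python
from typing import Dict, List, Set


def to_uniprot_rows(rows: List[dict], entity_to_uniprot: Dict[str, Set[str]]) -> List[dict]:
    """Rewrite Reactome-entity subjects to UniProt subjects, one row per accession."""
    converted: List[dict] = []
    for row in rows:
        accessions = entity_to_uniprot.get(row["subject"])
        if not accessions:
            # No mapping known: keep the Reactome-oriented row unchanged.
            converted.append(row)
            continue
        for accession in sorted(accessions):
            converted.append({**row, "subject": f"UniProtKB:{accession}"})
    return converted


# Hypothetical lookup and rows, for illustration only.
lookup = {
    "Reactome:R-HSA-1227953": {"O14511"},
    "Reactome:EX-SET": {"EX00002", "EX00001"},
}
reactome_rows = [
    {"subject": "Reactome:R-HSA-1227953", "relation": "enables", "term": "GO:XXXXXXX"},
    {"subject": "Reactome:EX-SET", "relation": "enables", "term": "GO:XXXXXXX"},
    {"subject": "Reactome:EX-UNMAPPED", "relation": "enables", "term": "GO:XXXXXXX"},
]

for row in to_uniprot_rows(reactome_rows, lookup):
    print(row["subject"], row["relation"], row["term"])
```

Leaving unmapped rows untouched keeps the conversion lossless, so the UniProt-oriented output can still be compared against the original export.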
@goodb and @cmungall If I recall correctly we took out the location constraints because some entities were located in more than one place. @deustp01 do you recall that as well?
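To see how the equivalence cliques from point 2 arise, the toy sketch below groups entity classes by the features used in their logical definition. The entity names, the accession placeholders, and the function `equivalence_cliques` are hypothetical; the point is only that classes defined by the same features collapse into one clique, and that adding location to the definition splits some, but not all, of them.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def equivalence_cliques(entities: Dict[str, Tuple[str, str]], use_location: bool) -> List[List[str]]:
    """Group entity classes whose definitions are identical.

    entities maps a class label to (uniprot_accession, location). Only groups
    with more than one member are returned, since those form cliques.
    """
    groups = defaultdict(list)
    for label, (accession, location) in entities.items():
        key = (accession, location) if use_location else (accession,)
        groups[key].append(label)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


toy_entities = {
    "X [cytosol]": ("ACC1", "cytosol"),
    "X [nucleus]": ("ACC1", "nucleus"),
    "p-X [cytosol]": ("ACC1", "cytosol"),   # differs from X only by a PTM
    "Y [cytosol]": ("ACC2", "cytosol"),
}

print(equivalence_cliques(toy_entities, use_location=False))
print(equivalence_cliques(toy_entities, use_location=True))
```

With location included, the nuclear form drops out of the clique, while the modified and unmodified cytosolic forms remain equivalent until PTM information is also part of the definition.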
PS. To make this robust, I think we will also need to continue the work on spatial relations that we started in Geneva, continued a tiny bit in NYC with @vanaukenk and @deustp01 and had in our plans to continue with @balhoff.
PS. I am intrigued by bullet point 1 above and am interested in seeing how it will work. Bullet point 3 seems like a very logical next and productive step.
Solved by creating a class ontology for Reactome entities?
> Solved by creating a class ontology for Reactome entities?
I think so. I feel like the ontology needs further vetting and likely tweaks but I think it is the core to the solution to this problem. I am okay to close this and open other entity ontology issues as needed.
@deustp01 I'm looking at the complex Cyclin E/A:CDK2:p-S130-CDKN1A,p-T187-CDKN1B:CUL1:SKP1:SKP2:CKS1B which contains a mixture of 'has_component' and 'has_member' relationships to its parts.
As I understand so far, the 'has_member' relations indicate a set, not a complex. When these occur at the top level, they are converted to Unions of the members. Right now the converter ignores this distinction and collapses all sub-complexes and sub-sets into one long parts list as if they were all equal members of the complex.
It seems like it would be more accurate to capture the set entities properly and attach them as units to the larger complex object. e.g. complex has_part union[a,b] , complex has_part c, etc..
Thoughts on that?