Sorry for the delay. I’d written something a couple of weeks ago, and I thought I’d added it to the thread, but I think I closed my browser first and lost it.
Thanks, these are really good points. And I think maybe the problem was that I was thinking of how we get as much mechanistic detail as possible from Reactome into GO-CAM, using standard GO annotation practices. The standard practice would probably include annotating SMAD1/5/8 as “protein binding” to endofin.
But I agree that this wasn’t the right way to think about it.
There was one thing about my previous proposal we might want to keep. It was to try to assign an enabler of a protein binding event whenever possible, by choosing the input that came from the upstream step in the pathway. Then we could use the rule I think Ben has implemented, that if an enabler of an activity is an output of a previous step, then that previous step positively regulates the activity. This would help make a causal chain clearer.
But about how to model dissociation, I think there are a few ways we could go here. We could go the way of trying to make as literal a translation as possible of Reactome into GO-CAM, which is what I think you’ve been doing really well in general (with a couple of small tweaks, like making the “active subunit” the enabler of a catalytic reaction, rather than the whole complex). To do this, I agree one possibility would be to add something like a “dissociation” term to GO, to handle this type of Reactome event.
The problem with this proposal is that this kind of term would not really belong in GO MF or BP, because it is not carried out by one or more gene products. So we wouldn’t want, for example, to have a GO annotation like “dissociation enabled-by complex-X.” Dissociation is a spontaneous process, once SMAD1/5/8 has been phosphorylated. A spontaneous process is an event that just happens, and is not directly done by a gene product.
It seems to me that there are a few options of how we could handle this:
No matter which way we go, we could treat other spontaneous events, like some instances of localization, in the same way.
And on the localization/transport thread: if localization isn't spontaneous and it requires a specific protein, then that protein’s activity should be in the causal chain. If it requires a non-specific constitutive process carried out by multiple gene products, like nuclear transport, we already allow those types of “activity regulating processes” in causal chains, in the GO-CAM specification.
For the dissociation reactions, I tend to lean towards option 2 above, which is what @cmungall originally suggested. I believe the original objection to that proposal was that we would break the causal chain. However, if we make a rule that we follow the preceding event hierarchy transitively and we can deduce whether the dissociation products are inputs for the next 'usable' reaction or regulate the next 'usable' reaction, we can maintain the causal integrity.
Last night in my weird way of justification I thought about this analogy. In some cases enzymes hydrolyze a substrate (an MF) and we capture that as one reaction with an input and two outputs. We don't tend to split this into two steps. The case of dissociation is analogous, but this is split into two steps. An upstream reaction (MF) results in a complex dissociating into more than one piece. One of those pieces feeds into a downstream reaction (MF) either as an input or as an enabler. Seems like we could follow the same rule. We will lose the detail that Reactome captures about the behavior of the complexes, but I think that's ok. It's one of the pieces of value-added information that people get from Reactome, but out of scope for GO. @goodb is this reasonable and possible?
I have always liked having an enabler for binding reactions. I'm glad we are coming back around to this.
For the transport cases, I think early on we were convinced that the transport was never just diffusion (spontaneous). There was always an enabler. I propose that we just use transporter activity rather than transport. My initial hesitation about this was for cases like the nuclear pore, but I think they fit the definition of channel activity in the MF ontology. We could make the CC the enabler of the reaction, but we would not want to specify the pore gene products in the models that use the pore in a generic function capacity. Otherwise we would get annotations of those to the process in the model. I can already hear the complaints if we did that.
Transport sub-thread. "Never spontaneous" should be OK - all simple diffusion I can think of occurs within a single cell component (cytosol, lumen of vesicle X, extracellular space) so even though diffusion is happening it's outside the scope of GO and a GO-CAM model. Bullet ducked. "Nuclear pore and similar structures as enablers" - this seems analogous to the issue of figuring out how to specify that a chemical has a housekeeping or incidental role in a metabolic process without triggering a reasoner to insist that the process is therefore an instance of "metabolism of that chemical". We need something similar here to head off the reasoner. More thought is needed on how to do this in a way that is also compatible (as much as possible) with current Reactome practice for annotating a transport reaction (we would identify the nuclear pore complex explicitly as the enabler of a transport-related molecular function, which is exactly what triggers the unwanted reasoner behavior).
The simplest way to head off the reasoner is not to include the information if it isn't relevant to the point of the model. So even though we create a 'transporter activity', if we don't assign any genes as enablers, those genes won't be annotated to the pathway. The nice fix here is that this is already the case (I think) or we wouldn't be facing the issue.
A future question, not part of this project: when it is a nuclear pore, do we want to specify that it is a channel activity enabled by a nuclear pore complex at the Reactome end? Something to think about as our curation strategies mature.
Dissociation thread
> Dissociation is a spontaneous process, once SMAD1/5/8 has been phosphorylated. A spontaneous process is an event that just happens, and is not directly done by a gene product.
Chemically, this actually is not true. Go back a step. The SMAD proteins have high affinity for one another so when they are present in the same place they associate to produce a complex with lower free energy due to conformational changes. Even though we don't annotate the conformational changes to the gene products involved, this association still qualifies as an uncontroversial GO molecular function, "binding". Then, one or more proteins in the complex undergo phosphorylation (an uncontroversial GO molecular function). That change causes conformational changes that make the complex energetically unstable and it comes apart - one might think of it as GO molecular function [new] "unbinding".
All of that said, however, if Ben can indeed excise the unbinding steps and make correct causal connections to stitch the remaining pieces back together in the correct order, that does sound workable.
Dissociation thread
One more thing, on enablers. No objections from here but we will need a lot of work in the weeds to implement it. Consider the first two steps of signaling via a tyrosine kinase receptor. 1) Ligand binds to receptor, forming L:R heterodimer. 2) Two L:R heterodimers bind each other to form the heterotetramer that mediates the rest of the signaling process. For step 1, it's easy - the receptor entity enables the reaction even if the ligand is also a gene product. For other pairs, identifying the enabler may be harder and an algorithm plus look-up list will probably be needed. Weedy but do-able. For step 2, the enabler will need to be designated arbitrarily. Can that be made to work?
I think ultimately the goal would be to assign the binding functions to more informative molecular functions down the road. I think the analysis that @huaiyumi did showed that this is indeed possible, but as yet not automatable. This should be a short section in the paper.
> Chemically, this actually is not true. Go back a step. The SMAD proteins have high affinity for one another so when they are present in the same place they associate to produce a complex with lower free energy due to conformational changes. Even though we don't annotate the conformational changes to the gene products involved, this association still qualifies as an uncontroversial GO molecular function, "binding". Then, one or more proteins in the complex undergo phosphorylation (an uncontroversial GO molecular function). That change causes conformational changes that make the complex energetically unstable and it comes apart - one might think of it as GO molecular function [new] "unbinding".
Exactly! This is what in a way led me to the analogy above. In the case of hydrolysis, the enzyme causes a change in the covalent bonds of the substrate, but we don't capture the reaction mechanism in several steps. In the dissociation examples, the preceding function does something that causes a conformational change such that the hydrogen bonds that hold the complex together are severed. This is captured in two steps because, as biologists who are interested in reactions, we care about the products that are generated after the hydrogen bonds are severed. They go on to do interesting things, and thus Reactome represents the dissociation.
I had thought about an MF term for dissociation/unbinding, but abandoned it. I don't think we want GO curators to go down that route even though I think it is appropriate for Reactome.
Are we on the same page @deustp01 ?
As above, if the unbinding bit can be cleanly excised without disrupting the causal chain, then yes, we're on the same page - it's a scope and curation strategy issue
Of course this will also mean that the Reactome->GO-CAM conversion will be a one-way street because it will be lossy. But I think we have already accepted that.
Being lossy is required because Reactome embeds out-of-scope things within its in-scope material, and it is the basis for future discussions of user requirements and scope for both of us. For instance, the removal of all drugs and of the function and regulation instances involving them. Also, when, as here for dissociation (but not for drugs), we can precisely identify what's lost and trace a path to it, that does a good job of preserving maximum information.
About Peter's example of tyrosine kinase receptor signaling, the enablers of the steps should correspond to the upstream gene product in each step. So the first step would be ligand activity enabled_by L (we might need a rule for this one, using available GO annotations, to recognize that a ligand's activity is upstream of the receptor's-- we know that the ligand is considered to have been released and then diffused to the receptor, not the other way around). The second step would be tyrosine kinase receptor activity enabled_by R, acting to phosphorylate (i.e. has_input) a target protein (a tyrosine kinase receptor should be defined in the ontology as has_part protein kinase activity). We may not want to represent the receptor dimerization step since at least as of now, stoichiometry is out of scope for GO. For the second step, since we already determined that L is the enabler of the first step, acting on R, then R must be the enabler of the second step.
Reasoning to identify enablers sounds plausible to me. Skipping the dimerization step - as for dissociation, an acceptable excision because it looks like there is a reliable way to patch around it to maintain causal connectivity. @goodb ?
@deustp01 @ukemi @thomaspd I'm trying to digest here.. if someone wants to take a cut at writing out the rules for dissociation using an example, that would be helpful - if X then Y..
What is the logic for detecting a "spontaneous event"?
For right now, any use of information that is not present in the Reactome-provided BioPAX file is out of scope for the conversion process. The real integration of the existing GO knowledge with these models will be an awesome project.. but not now - please add ideas for that to https://github.com/geneontology/pathways2GO/projects/2
Here's a guideline, though Peter can help with the exact BioPAX.
For detecting dissociation: input is a complex, with multiple outputs where each output is some subset of the input complex.
For detecting homodimerization/-multimerization: input is one gp type, output is a complex with multiple subunits of the same (input) type.
For detecting the upstream ligand in a signaling pathway: in a binding reaction where one input is extracellular and the other inputs are transmembrane (and maybe where the output is transmembrane), the upstream ligand is the one with the extracellular location.
So maybe we can do this with just what's in the BioPAX.
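To make the three guidelines above concrete, here is a minimal Python sketch. It is not the pathways2GO implementation: the `Entity` class, the function names, and the location strings ("extracellular region", "plasma membrane") are all assumptions made for illustration, and a real version would read these values from the BioPAX.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical stand-in for a BioPAX physical entity."""
    components: tuple          # gene product names; more than one means a complex
    location: str = "cytosol"  # assumed location label


def is_complex(entity):
    return len(entity.components) > 1


def is_dissociation(inputs, outputs):
    """One complex in, several outputs, each a subset of the input complex."""
    if len(inputs) != 1 or not is_complex(inputs[0]) or len(outputs) < 2:
        return False
    whole = Counter(inputs[0].components)
    # Counter subtraction keeps only components NOT covered by the input complex.
    return all(not (Counter(o.components) - whole) for o in outputs)


def is_homomultimerization(inputs, outputs):
    """One gene product type in, one complex of several copies of it out."""
    kinds = {c for e in inputs for c in e.components}
    if len(kinds) != 1 or len(outputs) != 1:
        return False
    out = outputs[0]
    return is_complex(out) and set(out.components) == kinds


def upstream_ligand(inputs):
    """Return the single extracellular input of a binding reaction, if the
    remaining inputs are all membrane-located; otherwise None."""
    outside = [e for e in inputs if e.location == "extracellular region"]
    rest = [e for e in inputs if e.location != "extracellular region"]
    if len(outside) == 1 and rest and all(
        e.location == "plasma membrane" for e in rest
    ):
        return outside[0]
    return None
```

Components are compared as multisets so that a homodimer such as R:R is not confused with a single R.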
> What is the logic for detecting a "spontaneous event"?
Sorry - I think I misled you. I brought up "spontaneous" to argue that, as a matter of chemistry, binding and unbinding are equally spontaneous - rhetoric aimed at the issue of whether unbinding / dissociation is out of scope for GO.
But for detection, if that turns out to be relevant - the tests are the same as for binding but with input and output reversed.
I believe we already have the logic in place to detect reactions we were labeling 'protein complex disassembly'. The aim here would be to simply ignore those and go to the next downstream reaction in the process. I can make a model tomorrow morning, but need to go out and snowblow the driveway before it gets too dark.
Rule for the causal connection: reaction-a directly_precedes reaction-b directly_precedes reaction-c
If reaction-b is disassembly and if an output of reaction-b is an input of reaction-c, then reaction-a directly provides input for reaction-c.
If reaction-b is disassembly and if an output of reaction-b is an enabler of reaction-c, then reaction-a directly positively regulates (or whatever we decided to use here) reaction-c.
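A small Python sketch of this bridging rule follows, for a single chain a, b, c. The `Reaction` class, the `bridge` function, and the relation label strings are hypothetical names chosen here, and the sketch does not handle branching pathways.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Reaction:
    """Hypothetical, simplified view of a converted reaction."""
    name: str
    inputs: set = field(default_factory=set)
    outputs: set = field(default_factory=set)
    enabler: Optional[str] = None
    is_disassembly: bool = False


def bridge(a, b, c):
    """Return (subject, relation, object) edges for the chain a -> b -> c.

    If b is a disassembly, it is skipped and a is connected to c directly,
    depending on how b's outputs are used by c.
    """
    if not b.is_disassembly:
        return [
            (a.name, "directly_precedes", b.name),
            (b.name, "directly_precedes", c.name),
        ]
    edges = []
    if b.outputs & c.inputs:
        edges.append((a.name, "directly_provides_input_for", c.name))
    if c.enabler in b.outputs:
        edges.append((a.name, "directly_positively_regulates", c.name))
    # An empty list means neither rule applied; the caller has to decide
    # how to keep a and c connected in that case.
    return edges
```

Both edges are returned when a disassembly output is used as an input and as the enabler of reaction-c; whether that is desirable is left open here.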
Thanks, in short summary, my todo here is:
Is there a 3. ? I think that will bring nearly all models into coherence with the current shex schema. I think the only ones that will still fail will contain untyped physical entities.
Question: would we still add has_target_end_location and has_target_start_location assertions to transporter activity nodes?
I think that would require a schema adjustment.
I think also buried in this thread is an important discussion about how we really want to handle protein binding in GO-CAM. I believe there are already some tickets in other trackers about this, but I don't want to lose sight of the issue because we need to decide what we want to do. I tend to favor creating 'binding with a purpose' MF terms and indicating enablers and inputs for those over our current practice of making reciprocal 'protein binding' annotations, but we would need to work through all of the implications for that approach.
Looking at the RO definitions and schema, I think we would need to leave out has_target_end_location and has_target_start_location assertions for a node typed as an MF.
Is there a way to keep that information about where the entity is transported from/to in the model?
Wow, you're right. These are only used with biological processes in the ontology. This is very counter-intuitive to me. But I believe in this definition: "This relationship holds between p and l when p is a transport or localization process in which the outcome is to move some cargo c from some initial location l to some destination." p refers to a BFO process. GO molecular functions are BFO processes.
I think the 3 was the protein-containing complex inputs and outputs of biological processes. I think you changed the ShEX to allow for that.
@vanaukenk I think you are right about the protein binding annotations and I agree we need to decide how to handle them in the long run. @huaiyumi has already shown that for at least some of them we can manually assign 'function with a purpose' MFs to those reactions. But for now, we are accurately capturing what is in Reactome and the way we are capturing it is perfectly legal. I think we shouldn't lose sight of that global issue, but it is a future bridge. For now I think we leave them as they are and discuss them in the paper as a future objective.
@ukemi in the OWL defs, you are right, but the shex schema is different. The recent changes to the shex schema allowing for complexes were on the BP shape definition. We would need to add these relations to the MF shape definition as well for this to validate. Noting that we would probably also want to capture transports_or_maintains_localization_of in the same way. The following would be added to MF - @vanaukenk okay with that?
transports_or_maintains_localization_of: ( @<InformationBiomacromolecule> OR @<ProteinContainingComplex> ) *;
has_target_end_location: @<AnatomicalEntity> {0,1};
has_target_start_location: @<AnatomicalEntity> {0,1};
Otherwise we can add a more specific shape for transporter_activity that extends MF to include these possible attributes.
> I think also buried in this thread is an important discussion about how we really want to handle protein binding in GO-CAM.
We are making a lot of comments here and on other issues of the form, "this could be interesting / useful but will distract us if we try to handle it now". We probably need a new project, pathways2GO_the_sequel, to collect issues like this one, and ones where improving GO-CAM would require work on the Reactome (or other pathway data source) side.
@deustp01 I think @vanaukenk is proposing a discussion/decisions that may be informed by the pathways2GO work but are independent and consortium wide. Not sure where that discussion should best reside - I suspect the schema project.
For ideas and issues for the pathways2GO sequel, let's continue accumulating them as issues here in this repo, but tagged into this new project: https://github.com/geneontology/pathways2GO/projects/2.
Making things concrete for the changes specific to this thread with an example from Signaling by BMP, now:
After:
All good?
@goodb, @vanaukenk and @deustp01. I think we need to add the start and end location properties as valid for the function transporters. Whenever I think about transport, I think about the physiology of our kidneys and our intestines. Those organs carry out transport at the level of the organ. In the kidney, different transporters are precisely arranged not only with respect to cellular locations, but also along the tubules that carry out the physiological process. If I were modeling the physiology of the kidney, I'd want to go down to the level of which transporters were moving which molecules where at the sub-cellular, cellular and anatomical levels. To do this we would need to be able to specify start and end locations of the transported molecules at the level of molecular functions.
PMID:27756725, for example.
I've been looking through some examples and I'm getting less happy with the dissociation rule above. It seems like there are going to be many edge cases where the results wouldn't be desirable. Here is an example where I think deleting the dissociation node would be problematic for the model: R-HSA-191859
In this one I think the node should not be typed dissociation, but it would match the current and proposed rule.
R-HSA-3371497
Is there a way that we could leave these nodes in the models by using another upper level type? Something along the lines of 1. or 3. in Paul's comment above: https://github.com/geneontology/pathways2GO/issues/75#issuecomment-575057253 ? I think a strategy along these lines would lead to a simpler, less lossy conversion if we can identify the right class to assign to these kinds of events.
Following up on Kimberly's last comment about protein binding. The first step toward getting more informative MF annotations is to convert to an activity flow, i.e. to figure out (when possible) which of the input entities in a protein binding reaction is the enabler. In general, the active entity of a protein binding event (enabler) will be the entity that is an output of the upstream reaction. However, in some cases, either zero or more than one of the input entities will have an upstream reaction. In these cases we can use rules like the one I'd proposed above, to figure out which one came from upstream. We can also make use of the pathway step ordering in the BioPAX.
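As a rough illustration of that first step, the sketch below picks the enabler of a binding reaction only when exactly one input is an output of the upstream reaction, and otherwise returns nothing so that a fallback rule (ligand detection, pathway step ordering) can be tried. The function name and the use of plain strings for entities are assumptions for this example.

```python
def choose_enabler(binding_inputs, upstream_outputs):
    """Return the one binding input that is also an upstream output.

    Returns None when zero or more than one input qualifies, which are
    the ambiguous cases that need an additional rule.
    """
    candidates = [e for e in binding_inputs if e in upstream_outputs]
    if len(candidates) == 1:
        return candidates[0]
    return None


# Hypothetical example: only the phosphorylated SMAD comes from upstream.
enabler = choose_enabler(["p-SMAD1", "SMAD4"], {"p-SMAD1", "ADP"})
```

Returning None rather than guessing keeps the ambiguous cases visible for review.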
Following up on Ben's last comment. In the first example, it looks from the reaction label that dissociation isn't the only thing going on here, and that there's also transport/translocation. In this case it would be fine to remove the dissociation node and have a downstream activity of the translocated protein in the nucleus (it can be a root molecular function).
In the second example, this shouldn't be typed as dissociation, as it is also a binding. It looks like it is summarizing two reactions, a binding and a dissociation. So it would be good to just type it as binding, and capture only the binding aspect of it in the GO-CAM model.
> In the second example, this shouldn't be typed as dissociation, as it is also a binding.
Two thoughts here. First, there is a granularity issue, and Reactome curators sometimes (but not always) lump steps that GO sees as separate. Second, though, some displacement events are real (my opinion of the biology, not a universal truth). That is, the binding of C to an A:B complex forces the release of B, so one function converts A:B + C to A:C + B without traversing an intermediate state in which nothing is bound to A. On the Reactome side, distinguishing shortcuts from real displacement will be solved when we implement reaction types; on the GO side, "displacement" gets added to the pile of issues to be considered along with "dissociation". (None of this is directly helpful with the central issue here, though, which is reliably extracting as much molecular function and biological process information as possible from Reactome and patching around function gaps to preserve connectivity within processes, except it adds yet another item to the discussion section of the paper where we show how the pathways2GO process enables new and helpful QA on both Reactome and GO content.)
I just meant that we can modify our rules such that if there is a Reactome reaction that includes both binding and dissociation, and we're going to ignore dissociation when translating to GO-CAM (i.e. it's not an activity), then it should be represented only as a binding activity.
We will discuss this on a call, but I think that we should also consider how a GO curator would model these processes de novo. In the cases above, I think we have an analogous situation to the binding dilemma. They point out areas where we can work on collaborative curation to assign GO functions or processes to the reactions. We have the biology in GO to represent the models above. Just as we plan to refine binding to 'binding with a purpose', I think we can keep the Reactome models compliant with Reactome curation standards and xref them to GO terms that we will import. We might want to try this with a few examples.
The homework in preparation for the call today (1/22) was to review this ticket.
Okay. It's a complicated discussion that raises issues both of the proper scope of GO and of what is practical to implement within a GO-CAM model now, given its current state and resources for development. A related issue is whether it can be appropriate to ask a contributing group like Reactome to modify their scope and data structure to fit better to GO.
One strategy proposed recently is to excise reactions like dissociation events that correspond to no GO molecular function. One could imagine that this is easy: the unusable event disappears and whatever was causally immediately upstream of it is instead declared to be causally upstream of whatever was immediately downstream. If I understand right, Ben has identified cases where this fails.
An alternative strategy proposed earlier was to import the reactions with no usable GO molecular function attribute and assign them a placeholder function like "molecular function" itself or "reaction". This would preserve the connectivity of the Reactome graph, would not violate current GO scope or GO-CAM limitations, and would lose only the information that GO is not able / does not want to handle anyway.
Are there other alternatives I've missed?
I don't think we can resolve any of the scope issues in a durable way this afternoon. Instead, can we identify places where Reactome-to-GO-CAM conversion fails and ask whether we have workarounds good enough for pathways2GO version 1?
Completing my homework here as well. On the plus side, it appears that half of this issue can be closed. I see a consensus in the thread that the GO MF 'transporter activity' can be used to type reactions where the input and output entities are the same aside from their locations. Some shex changes are needed to accommodate that, but I think we are good to go.
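The "same entities aside from their locations" test could look roughly like the sketch below. Entities are modeled as hypothetical `(name, location)` pairs with unique names per reaction; the function name and return shape are made up for this example and are not taken from the converter.

```python
def transport_route(inputs, outputs):
    """Return {name: (start_location, end_location)} for a pure relocation.

    inputs/outputs are lists of (name, location) pairs. Returns None if the
    entity names differ between the two sides, if a name is repeated, or if
    nothing actually moved.
    """
    start, end = dict(inputs), dict(outputs)
    if not inputs or len(start) != len(inputs) or len(end) != len(outputs):
        return None
    if set(start) != set(end):
        return None
    moved = {n: (start[n], end[n]) for n in start if start[n] != end[n]}
    return moved or None


# Example modeled on the GCK case mentioned in this thread (labels assumed).
route = transport_route([("GCK", "nucleoplasm")], [("GCK", "cytosol")])
```

The start and end values are what would feed has_target_start_location and has_target_end_location, if the schema allows them on the node.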
The dissociation reactions are more problematic. Upon further review, the existing rule for detecting them (n inputs < n outputs) was overly promiscuous. I think as many as half of its predictions are not what we want - as we see in the example I gave above: https://github.com/geneontology/pathways2GO/issues/75#issuecomment-575824812. That rule worked well in reverse for detecting binding - probably because there is almost always some element of binding going on.
If we are going to try to e.g. eliminate dissociation reactions from the GO-CAM models or automatically produce a more sophisticated structure for them, I need a better way to detect them. This will involve more clever processing of the inputs and outputs. Unfortunately I haven't done this yet. It's not entirely trivial - especially when there are many inputs and many outputs and when their types are mixed. But I suspect I could get something working once the fog of my flu clears up.
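To illustrate why a count-based rule over-fires, the sketch below contrasts it with a stricter component check. The reaction used as a counter-example is invented for illustration (a cleavage-like event), not one of the Reactome reactions cited in this thread, and the function names are hypothetical.

```python
def naive_dissociation(inputs, outputs):
    """Count-based rule: fewer inputs than outputs."""
    return len(inputs) < len(outputs)


def strict_dissociation(inputs, outputs):
    """Single input complex whose outputs are all subsets of its components.

    inputs/outputs are lists of frozensets of component names.
    """
    return (
        len(inputs) == 1
        and len(inputs[0]) > 1
        and len(outputs) > 1
        and all(o <= inputs[0] for o in outputs)
    )


# A real dissociation: complex A:B falls apart into A and B.
split = ([frozenset({"A", "B"})], [frozenset({"A"}), frozenset({"B"})])

# Invented counter-example: one precursor yields two new products.
cleavage = ([frozenset({"proX"})], [frozenset({"X"}), frozenset({"propeptide"})])

flags = {
    "split": (naive_dissociation(*split), strict_dissociation(*split)),
    "cleavage": (naive_dissociation(*cleavage), strict_dissociation(*cleavage)),
}
```

The strict check is still only a sketch: it ignores locations, modifications, and mixed entity types, which are exactly the cases described as non-trivial.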
Likewise, identifying causal linkages from upstream to downstream reactions grows complicated when there are multiple upstream and downstream reactions to deal with. Again, better work with the inputs and outputs would be needed.
As @deustp01 mentions above, the problems here have a lot to do with varying levels of granularity among these Reactome reactions.
My take right now is to go the route of the placeholder Type for what we were calling dissociation events. This will prevent the introduction of any errors. Causal flow would be maintained, mirroring that created by Reactome curators. The more advanced processing proposed above (either for the elimination or better modeling of dissociation) could be put alongside rules like negative regulation by sequestration https://github.com/geneontology/pathways2GO/issues/62 as to-do items for a version of the conversion that either involved manual review or allowed for the presence of false positives.
Glass half full. The advantage of going with a placeholder is that it maintains integrity of the causality and it allows us to easily see where there is a reaction that we can't type yet in GO. In the future we can look at these and see if we can't make an interpretation of what GO function or process they represent.
> A related issue is whether it can be appropriate to ask a contributing group like Reactome to modify their scope and data structure to fit better to GO.
Not if it compromises the strengths of the contributing group, only if it enhances it.
Conclusions from the call:
Summarizing decisions from today's call with @ukemi @vanaukenk @deustp01 @balhoff
Justification for giving up on a better representation for dissociation at the moment is that, at this stage, we are erring on the side of caution. Looking through examples suggests that the proposed rules will not be 100% precise. Although the models generated with the simple default MF type will not be what a GO curator would or should produce, they also won't contain any incorrect assertions. Bringing these models into alignment with GO practices will involve further curation work on the Reactome side and likely much more sophisticated logic and/or manual work on the GO side.
There is some ontological dissatisfaction with tagging these reactions as molecular functions as, at least some of them, might not be enabled by a particular gene product. An ontologically better approach would be to introduce another high level term (e.g. 'reaction') but no one wants to do that right now. Hence pragmatically we use MF as a placeholder for this project.
jinx...
Yours is better.
took 9 more minutes ;)
@ukemi @deustp01 are we good to go for this issue? When you have time, could you have a look at the models now in noctua-dev so we can close this or decide what other changes to make?
First example for transport: http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-201451, specifically R-HSA-201472. This looks good to me. In this case the missing information is the nuclear pore as the enabler of the transporter activity. @deustp01? Let's look at a couple more examples.
Second example of transport: http://noctua-dev.berkeleybop.org/editor/graph/gomodel:R-HSA-170822, specifically R-HSA-170825. This looks good to me. Again the nuclear pore is missing, but there must be an enabler to get the GCK from the nucleus to the cytosol. @deustp01 I'm not going to tag you on all the ones I look at, but as always a sanity check by you is appreciated to be sure we are on the same page.
At GOC 2019b in Berkeley, we briefly discussed providing these kinds of reactions with either more specific types, e.g. something along the lines of 'protein complex disassembly' for dissociation and, as done in some prior iterations, 'transport' or 'establishment of protein localization' for transport processes. Or, defining a more appropriate generic upper level type, like 'reaction', that would be a parent of molecular function and using that for both.
Need resolution here from ontology developers in particular -> ping @ukemi
Noting previous discussions on this topic: #17 #35 #55 #73
Examples (both from Signaling by BMP R-HSA-201451 ) : translocation
dissociation