geneontology / pathways2GO

Code for converting between BioPAX pathways and Gene Ontology Causal Activity Models (GO-CAM)

Add active unit information from Reactome #31

Closed goodb closed 5 years ago

goodb commented 5 years ago

For some reactions, like 'activated human TAK1 phosphorylates MKK3/MKK6', Reactome says that a protein complex (or set) catalyzes the reaction, and then indicates a specific member of that complex (or set) as the 'active unit'. In this case, the modified protein 'p-T184,T187-MAP3K7' is the active unit and 'Activated TAK complexes' is the 'complex'. Note that several of the members of 'Activated TAK complexes' contain a MAP3K7.

To close this issue:

goodb commented 5 years ago

@fabregat what do you think would be the best way to access the active unit information?

goodb commented 5 years ago

I don't see a way to get the info from the Web API, but I did manage to find it in the graph database. Note to self: the query will look something like:

MATCH (reaction:Reaction{stId:"R-HSA-450346"})-[:catalystActivity]->(catalyst:CatalystActivity)-[:activeUnit]->(active_unit:PhysicalEntity) RETURN reaction, catalyst, active_unit

goodb commented 5 years ago

Working now with information gathered from the graph database. I'm uncomfortable about the longevity of this solution. I would prefer to use one standard and extend that standard as needed (e.g. by adding content to the BioPAX export even when it's not in the specification; doing so should not break any semantic-web-capable code that consumes it).

fabregat commented 5 years ago

To get the info using the ContentService, you will need two queries. Let's take the reaction you shared above (R-HSA-450346). The first query is

1) https://reactome.org/ContentService/data/query/R-HSA-450346/catalystActivity

Its result is in TSV format containing Identifier, DisplayName and SchemaClass:

177696 protein serine/threonine kinase activity of Activated TAK complexes [cytosol] CatalystActivity

Taking the first column, then you perform a second query

2) https://reactome.org/ContentService/data/query/177696/activeUnit

Its result follows the same format:

R-HSA-202527 p-T184,T187-MAP3K7 [cytosol] EntityWithAccessionedSequence

Please note that in case you need more info about the last EWAS, then you can query the ContentService:

3) https://reactome.org/ContentService/data/query/R-HSA-202527
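
The two-step lookup described above can be sketched in Python. This is a minimal sketch, not Reactome client code: the helper names (`attribute_url`, `parse_row`) are hypothetical, the result rows are assumed to be tab-separated as described, and no network request is made; the sample rows are the ones quoted in this comment.

```python
from typing import NamedTuple

BASE = "https://reactome.org/ContentService/data/query"


class Row(NamedTuple):
    identifier: str
    display_name: str
    schema_class: str


def attribute_url(identifier: str, attribute: str) -> str:
    """Build the ContentService URL for one attribute of one instance."""
    return f"{BASE}/{identifier}/{attribute}"


def parse_row(line: str) -> Row:
    """Split one result line into Identifier, DisplayName and SchemaClass.

    Assumes the three columns are separated by tab characters.
    """
    identifier, display_name, schema_class = line.rstrip("\n").split("\t")
    return Row(identifier, display_name, schema_class)


# Step 1: ask for the catalyst activity of the reaction.
step1_url = attribute_url("R-HSA-450346", "catalystActivity")
catalyst = parse_row(
    "177696\tprotein serine/threonine kinase activity of "
    "Activated TAK complexes [cytosol]\tCatalystActivity"
)

# Step 2: use the first column of that row to ask for the active unit.
step2_url = attribute_url(catalyst.identifier, "activeUnit")
active_unit = parse_row(
    "R-HSA-202527\tp-T184,T187-MAP3K7 [cytosol]\tEntityWithAccessionedSequence"
)
```

Fetching the two URLs (with any HTTP client) and feeding each returned line to `parse_row` would complete the round trip; that part is omitted here because it depends on the local setup.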

Since you've also mentioned the graph database, I strongly recommend going this way, since it will be faster on your end ;)

The query you suggest is almost correct, and I say almost because for other identifiers corresponding to other types of reactions (BlackBoxEvent for example) it wouldn't work. In any case, the fix is very easy (the only change is the node label, from Reaction to ReactionLikeEvent):

MATCH (reaction:ReactionLikeEvent{stId:"R-HSA-450346"})-[:catalystActivity]->(catalyst:CatalystActivity)-[:activeUnit]->(active_unit:PhysicalEntity) RETURN reaction, catalyst, active_unit
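
To reuse the corrected query across reactions, the stable identifier can be passed as a parameter instead of being edited into the text by hand. A minimal Python sketch follows; `active_unit_query` is a hypothetical helper, and actually running the query (for example through a Neo4j driver session) is left out because it depends on the local installation.

```python
from typing import Dict, Tuple

# Same pattern as the query above, with the stId supplied as a Cypher parameter.
ACTIVE_UNIT_CYPHER = (
    "MATCH (reaction:ReactionLikeEvent {stId: $stId})"
    "-[:catalystActivity]->(catalyst:CatalystActivity)"
    "-[:activeUnit]->(active_unit:PhysicalEntity) "
    "RETURN reaction, catalyst, active_unit"
)


def active_unit_query(st_id: str) -> Tuple[str, Dict[str, str]]:
    """Return the Cypher text and its parameter map for one reaction stId."""
    return ACTIVE_UNIT_CYPHER, {"stId": st_id}


query, params = active_unit_query("R-HSA-450346")
```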

Related to extending the BioPax format in favour of the options above, maybe we could set a meeting to discuss how to go forward?

goodb commented 5 years ago

@fabregat thank you! Indeed I'm happy that I ended up down the graphdb path. It would have been quite slow the other way and this will clearly open up more possibilities. (But see my email about problems getting your Java access project to build.) I updated my query to include ReactionLikeEvent and it seems to be working: it captures about 120 more relations.

Yes, I'd like to talk about how to get this information into your BioPAX export somehow. Although it would end up off-standard, an extension to the BioPAX ontology or even a hacky use of xrefs or comments could do the job for the short term and provide an example for extending the standard in the future. It might also be worth talking with @cmungall about new standardized ways for sharing graph databases like your neo4j version. Longer term, that work might end up replacing the BioPAX stuff, though I think that is likely a way off.

goodb commented 5 years ago

Need to convert the code to take this data from a newly provided statement in the comments on Control and Catalysis entities, e.g. activeUnit: #Protein26. This will replace the code that uses the graphdb. Future versions of the BioPAX export will contain this information. Note that the active unit (e.g. #Protein26) may or may not be otherwise linked to in the BioPAX file, but it should be present.
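
A minimal Python sketch of pulling the active unit out of such a comment. It assumes each comment is available as a plain string and that the statement has the `activeUnit: #Protein26` shape shown above; the function name and the regular expression are illustrative and are not taken from the pathways2GO code.

```python
import re
from typing import List

# Matches "activeUnit:" followed by a local reference such as "#Protein26".
ACTIVE_UNIT_RE = re.compile(r"activeUnit:\s*(#\S+)")


def extract_active_units(comments: List[str]) -> List[str]:
    """Collect every active-unit reference found in a list of comment strings."""
    found: List[str] = []
    for comment in comments:
        found.extend(ACTIVE_UNIT_RE.findall(comment))
    return found


units = extract_active_units(["some other comment", "activeUnit: #Protein26"])
```

Returning a list rather than a single value leaves room for entities that carry more than one such statement.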

goodb commented 5 years ago

example file:

RAF-independent_MAPK1_3_activation.owl.txt

goodb commented 5 years ago

Question for @ukemi and @deustp01: when we have an active unit annotated on a complex that is not catalyzing but rather exerting a regulatory effect on the reaction, how should that be captured? For catalysis we have the enabled by / contributes to structure. What should we use for regulates? e.g. protein involved_in_negative_regulation of reaction, complex has_part protein, complex ?relation? reaction ?

goodb commented 5 years ago

I think I can answer my own question here. The plan is to go ahead and assert the involved_in_regulation triple for the active entity. This happens in the first phase of processing. It will be picked up in the second phase and converted to the pattern from #39.

goodb commented 5 years ago

Test with RAF-independent MAPK1/3 activation

cmungall commented 5 years ago

Hmm, I haven't thought this one through but that relation is mostly intended for inference rather than assertion

On Fri, Feb 15, 2019 at 3:07 PM goodb notifications@github.com wrote:

I think I can answer my own question here. The plan is to go ahead assert the involved_in_regulation triple for the active entity. This happens in the first phase of processing. It will be picked up in the second phase and converted to the pattern from #39 https://github.com/geneontology/pathways2GO/issues/39


deustp01 commented 5 years ago

In Reactome, "active unit" is an attribute of the catalyst activity class only, and is meant to be used when the physical entity doing the catalysis is a complex and one protein or subcomplex part is known to be the actual catalyst. We don't have an equivalent attribute on our regulator class (but it sounds like a good idea). Correction: we do have that attribute, and we do use it.

goodb commented 5 years ago

@deustp01 I'm seeing it in the biopax that @guanmingwu enhanced with the additional activity information. For example, here is a biopax control structure showing negative regulation with an annotated active unit (PEA15) for the reaction "Phosphorylated MAPKs translocate into the nucleus" in pathway "RAF-independent MAPK1/3 activation"

[screenshot, 2019-02-16 at 8:41:14 pm]

This kind of thing is present in the test file he sent for this pathway, and it looks like it shows up 66 times in the full Homo sapiens batch export.

Is this an error? I don't see any indication of active unit on the corresponding control element in the public interface or the curatorial interface for that reaction.

goodb commented 5 years ago

For what it's worth, this is what the end result of the current transformation looks like for this one. Note that the involved_in_regulation triple (which I mentioned above and which starts this process) ends up being replaced with the regulatory binding activity pattern. (Sorry, Neo not loaded; UniProtKB:Q15121 = PEA15.)

[screenshot, 2019-02-16 at 9:18:01 pm]

deustp01 commented 5 years ago

@goodb The problem is at my end. Despite what I said on Saturday, "active unit" is an optional attribute of regulation instances and it has been used to annotate 578 of the 2108 regulation instances in our released database. The annotation criteria should be the same as for active unit annotations of catalyst activity: the physical entity associated with the activity is a complex of two or more different gene products and there is evidence that allows one of them (or a subcomplex) to be identified as the part of the complex that is directly doing the catalysis or regulation.

I don't know how to do a query to filter out the non-human instances, but they are a minority, so we have a new problem: why were only 66 problem instances detected when it looks like there should be hundreds? Guanming will be back in about a week, I think.

Sorry.

goodb commented 5 years ago

@deustp01 Let me confirm the 66 number with a more careful test. (I pulled that out with a very fast hack to see if there was more than one, so I want to double-check.) In the meantime, do you think the representation depicted in the figure above is suitable for representing this information in GO-CAM?

ukemi commented 5 years ago

My initial reaction to the figure above is that there is nothing enabling the binding reaction in the lower middle. It only has an input. That seems weird.

goodb commented 5 years ago

Okay, I made a mistake with the 66 number. Shouldn't have reported that specifically before verifying.

For the homo sapiens release that Guanming prepared for me in January I see: 7085 Control constructs of which 1713 contain active unit annotations.
There were 5252 catalysis events (1213 with active units) and 1833 regulatory events (500 with active units).
Also, there were 204 control events with more than one active unit annotated. e.g. the catalyst in the reaction 'Phosphorylation of PLC-gamma1'
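
As a quick consistency check on the figures above, the sketch below re-adds the reported numbers in Python. It uses only the counts given in this comment; the variable names are illustrative.

```python
# Counts as reported for the January Homo sapiens release.
catalysis_total, catalysis_with_unit = 5252, 1213
regulation_total, regulation_with_unit = 1833, 500

# Control constructs are the catalysis events plus the regulatory events.
control_total = catalysis_total + regulation_total
control_with_unit = catalysis_with_unit + regulation_with_unit

# Share of control constructs that carry an active unit annotation.
share_with_unit = control_with_unit / control_total
print(control_total, control_with_unit, f"{share_with_unit:.1%}")
```

The two sums reproduce the reported totals of 7085 control constructs and 1713 with active unit annotations.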

goodb commented 5 years ago

On the representation, it would be good if @ukemi could weigh in. From my perspective I never really understood what was wrong with the initial entity involved_in_regulation_of function pattern but everyone pretty adamantly did not like that. The only difference here as compared to that one is that we have pulled the active unit out of its complex.

According to the pattern, if the reaction there (Phosphorylated MAPKs translocate to the nucleus) had something enabling it, that something would be attached as an enabler of the binding.

ukemi commented 5 years ago

We need to make sure that each of the above types of active entities is represented in the pathways that we systematically review.

Can we get a Reactome identifier and the corresponding models for:

1) A control construct that does not have an active unit.
2) A control construct that does have an active unit.
3) A catalysis event that does not have an active unit.
4) A catalysis event that does have an active unit.
5) A regulatory event that does not have an active unit.
6) A regulatory event that does have an active unit.

The reaction above is exactly why I have argued that kinase activity and phosphorylation represent a molecular function and a biological process respectively. In this case the activities of two independent kinases modify two different residues on the target protein. This is what I have always argued is the process of phosphorylation. @vanaukenk In an activity flow model where we don't allow the process to exist, I guess we could just have a branch that has both kinase activities and then those kinase activities flow back together to the next downstream function.


EDIT (adding examples requested above): Here is one pathway that has examples of all the different situations.
Antigen activates B Cell Receptor (BCR) leading to generation of second messengers. (Click into the reaction to examine the controllers.)

ukemi commented 5 years ago

"On the representation, it would be good if @ukemi could weigh in. From my perspective I never really understood what was wrong with the initial entity involved_in_regulation_of function pattern but everyone pretty adamantly did not like that. The only difference here as compared to that one is that we have pulled the active unit out of its complex."

It is misleading to me. It looks like the chemical is being regulated. I think it is also mixing a bit of apples and oranges. These models show the flow of activities that are enabled by continuants. The old representation short-circuits the activity aspect of the regulation.

goodb commented 5 years ago

"It is misleading to me. It looks like the chemical is being regulated."

I don't really want to argue against everyone opposed to the use of the involved_in properties but the relation is unambiguous - the continuant entity is involved in the regulation of the occurrent function. Like many things, this may be confusing in the display.

ukemi commented 5 years ago

I completely agree that it is not technically wrong, but it blurs over the actual activity that is occurring. Don't you think we should state how the continuant is involved if we know? I don't want people to be misled into thinking this represents, for example, AMP regulates PFK activity. We have worked so hard to tell people that something that is happening regulates PFK activity.

I agree that the display makes it more confusing.

deustp01 commented 5 years ago

I get the point that if the chemical citrate is the product of reaction 1, and that citrate molecule regulates reaction 2, the Reactome assertion that citrate regulates reaction 2 maps cleanly and reliably onto the assertion that reaction 1 regulates reaction 2. What about cases where reaction 2 is responsive to the level of a small molecule and that level is determined by the flux through several reactions, some generating citrate and some consuming it? There is certainly not a 1:1 mapping, and a many-to-one mapping, even with flags to identify positive and negative contributors, is misleading without quantitative / stoichiometric features (which neither of us wants) to indicate how much each of the contributing reactions contributes.

Generalizing, anywhere that there is a pipeline, A acts on B acts on C acts on D and the organization of the pipeline physically excludes participation by entities outside the pipeline, mapping entities to the reactions that produce them works cleanly. Whenever we're dealing with a system where there are several sources and sinks affecting the level of an entity that does something to another reaction / activity, it's hard to see how to get a reliable mapping, especially if the mapping is also supposed to yield a direction (i.e., the net effect is to raise citrate levels and thus promote / lower citrate levels and thus suppress a downstream activity).

Hard for me to see, anyway. If there's a fix I'm missing, we move on to implementing it in the Reactome to GO-CAM process.

But maybe there is a workaround.

In classic metabolic cases where the end product of a pathway feedback-inhibits an early reaction in the pathway (e.g., the products of purine biosynthesis AMP, GMP, and IMP all negatively regulate the activity of PRPP synthase), the mapping is clean: each of those molecules is generated in a reaction in the pathway, so those reactions each negatively regulate the PRPP synthase reaction.

In a case like regulation by citrate or ATP where there may be many sources and sinks and indeed the engineering goal is to integrate over all of them to determine the need for the reaction that the citrate or ATP is regulating, would it be acceptable to create a placeholder reaction on the fly, "synthesis of ATP" and make that the positive regulator (and maybe "breakdown of ATP" as a negative regulator)?

deustp01 commented 5 years ago

"Also, there were 204 control events with more than one active unit annotated. e.g. the catalyst in the reaction 'Phosphorylation of PLC-gamma1'"

This looks like bad annotation practice: the curator appears to be trying to make one reaction instance do the job of at least two, by cramming two distinct ways of providing catalytic activity into a single catalystActivity instance attached to a single reaction instance.

If Ben could provide the list of 204 control events with more than one active unit annotated, I will take a look to see if this initial reaction is right and if so, work with Reactome people to come up with a plan for changed annotation practice henceforth and cleanup of the legacy instances. My hunch is that clean-up of the annotations could probably be automated, but all the new reactions that will appear will need to be laid out manually in pathway diagrams and this part will be painful.

goodb commented 5 years ago

@deustp01 sorry again, it was 108 multi-active-unit reactions. Got fooled as some of them appear in multiple pathways. That probably inflated some of the other counts above. Here is the list of reactions:

ukemi commented 5 years ago

Looks like we will have plenty to do next week. If you don't get around to these, we can look together.

"If Ben could provide the list of 204 control events with more than one active unit annotated, I will take a look to see if this initial reaction is right and if so, work with Reactome people to come up with a plan for changed annotation practice henceforth and cleanup of the legacy instances. My hunch is that clean-up of the annotations could probably be automated, but all the new reactions that will appear will need to be laid out manually in pathway diagrams and this part will be painful."

The difficulty in annotating a reaction (or two) here is that choosing which one comes first might not be known or even consistent. That's why I would invoke a process with two kinase activities as parts. Not that I stubbornly argue a point. :)

ukemi commented 5 years ago

"I get the point that if the chemical citrate is the product of reaction 1, and that citrate molecule regulates reaction 2, the Reactome assertion that citrate regulates reaction 2 maps cleanly and reliably onto the assertion that reaction 1 regulates reaction 2."

But our view since the beginning has been that it is not the citrate regulating reaction 2, but that the citrate is taking part in something that is happening that regulates reaction 2. That's why we have processes 'regulation of x'. Since we are gene-centric, we express the something that is happening in the context of what genes are doing. So the PFK binding the citrate is what is regulating the kinase activity. I think this also cleans up the mass action difficulties. If PFK binds citrate, it negatively regulates the kinase activity.

ukemi commented 5 years ago

PS. It makes total sense from a reaction point of view that the citrate (product of some reaction) is able to negatively regulate some downstream reaction.

ukemi commented 5 years ago

"@deustp01 sorry again, it was 108 multi-active-unit reactions. Got fooled as some of them appear in multiple pathways. That probably inflated some of the other counts above. Here is the list of reactions:"

OK. Some of these are definitely processes from a GO perspective.

ukemi commented 5 years ago

@goodb What about splitting out the regulatory processes into a separate model, but referring to the same Reactome pathway? Would that give us the best of both worlds? It would get rid of the clutter that you find undesirable and would give me the explicit process-based representation.

goodb commented 5 years ago

I'd really prefer to keep a 1 to 1 relationship between Reactome pathways and GO-CAM models.

@ukemi I find the clutter and layout mess annoying from a UI perspective, but that is a separate issue from the modeling. We should not let weaknesses in the UI translate into weaknesses or hacks in the data model. The UI can be fixed once the structures in the model are stable. (Worst case is that I spend a week and hack together a new layout algorithm that is aware of the recent changes. Best case is that we get a client UI developer involved that has more power to control the system and can do more expand/contract operations on the graph entities...)

goodb commented 5 years ago

After further pondering, I am slowly coming around to what you are referring to as the process-based representation for this case. The key for me is that it matches up better with the structure of everything else happening in the GO-CAM kb and this is really important for query.

As an aside, the structures that are being established here should be stored somewhere as templates for curators once they are agreed upon...

deustp01 commented 5 years ago

"Since we are gene-centric, we express the something that is happening in the context of what genes are doing. So the PFK binding the citrate is what is regulating the kinase activity."

OK. We will need to look at a sample of regulation instances to be sure that we can translate Reactome's "entity X regulates an activity" into GO-CAM "entity X interacts with the enabler of the activity and that regulates the activity". Note the swapping of "binds" for "interacts": sometimes we do know that the interaction takes the form of binding, but we probably can't guarantee this. Indeed, there are likely to be at least some cases where the data say that an activity increases or decreases when an entity is present (so "entity regulates activity" is valid) but we have no data as to mechanism (so "binds" goes beyond the data).

deustp01 commented 5 years ago

"split out regulation" etc.

I think the way we were going last week and also here, we aren't really doing this. We are more nearly looking for ways of collapsing the whole Reactome annotation of a regulatory process that affects glycolysis or BMP signaling into a simple assertion that X entity regulates Y activity in the process being converted into GO-CAM. If there's a lot of Reactome annotations hidden inside that simple assertion, that's lost for now and can be brought back in a future UI that enables navigation between GO-CAM models. Our goal now is to make GO-CAMs that are good templates for future GO curation, so suppression of some human-specific, possibly out-of-scope annotations seems OK, especially as the surviving regulation stubs can be used in the future to reassemble the lost pieces.

ukemi commented 5 years ago

If we can't invoke binding, then I think we should explore your idea above. We could invoke a regulatory process in which the entity is simply a participant. When you curate one of these types of regulatory events, is it always because the entity is somehow involved in regulating the whole pathway?

ukemi commented 5 years ago

So maybe Ben's original representation is the best for now, which essentially said entity X is somehow involved in the regulation of MF A. Sorry to be going round and round with this, I thought we could always assert binding in these representations.