biolink / biolink-model

Schema and generated objects for biolink data model and upper ontology
https://biolink.github.io/biolink-model/

Modeling Mechanism of Action in Chemical-Gene Interaction Associations #570

Closed mbrush closed 2 years ago

mbrush commented 3 years ago

This issue follows from a tangential discussion spawned in the TRAPI ticket here - regarding how to capture 'Mechanism of Action' terms in associations describing how a chemical (e.g. aspirin) affects a gene or gene product (e.g. COX2). The TRAPI model pushes this info into an 'Attribute' object:

"attributes": [
    {
        "relation_id": "NCIT:C54680",
        "relation_name": "Mechanism of Action",
        "source": "ChEMBL",
        "type": "MESH:C0085387",
        "name": "Cyclooxygenase inhibitor",
        "url": "https://www.ncbi.nlm.nih.gov/medgen/43164"
    }
]

. . . and @RichardBruskiewich asked "why not set the (Q)Edge relation property itself to MESH:C0085387?"

My understanding is that the Association.relation slot is used to hold the original or a more granular relationship between the S and O node than what is captured in the predicate slot (which holds a standardized Biolink predicate term).

It was mistakenly assumed in the earlier TRAPI ticket that the Association from which the Attribute above hung held that ChemicalX -treats - DiseaseY. I'll run with this for a second to illustrate an important point - which is that for such an association, a Mechanism of Action is not a more specific form of the treats relation, but rather a qualifier that refines the meaning of the core triple itself. It extends the core S-P-O semantics to assert that ChemicalX treats DiseaseY via mechanismZ. We might structure the association as follows:

subject: chemical X 
predicate: biolink:treats
object: DiseaseY 
qualifier*: mechanismZ (e.g. MESH:C0085387  # "Cyclooxygenase inhibitor")

(*Biolink often defines more specific types of qualifiers, so we might define a *'mechanism of action qualifier'* slot to use here)

Now, returning to the actual case here, where the association holds between a Chemical and an affected Gene: in this context I think it is a valid option to frame the Mechanism of Action as a relationship and capture it in the relation field. Here we map to a more general predicate from Biolink to put in the Association.predicate slot, and put the more specific relationship from the source in the Association.relation slot:

subject: chemical X (Aspirin)
predicate: biolink:affects
relation: more specific mechanistic relationshipY (e.g. wikidata:is_inhibitor_of,  wikidata:is_agonist_of, inxight:is_allosteric_inhibitor_of)
object: gene Z (COX2)

Alternatively, the approach that several of us preferred in such cases, as discussed on a recent Biolink helpdesk call about MolePro data from InXight, is to go ahead and define the more granular relationship in Biolink, and use this as the value of the predicate:

subject: chemical X (aspirin)
predicate: biolink:is_inhibitor_of
object: gene Y (COX2)

NOTE: This is how Wikidata ultimately decided to model this type of knowledge. They initially used a general relationship and qualified it with a more specific type of mechanistic interaction, but have since refactored their modeling to use more granular predicates. The rationale for this is explained in their discussion wiki here.

mbrush commented 3 years ago

On the 12-7-20 Biolink Helpdesk call, we noted that there are several sources of this type of information that describe the specific mechanism by which a Chemical affects a Gene. And there is a finite set of terms describing these mechanisms. Our feeling was that we can create these more specific predicates in Biolink. So if a knowledge source says that a Chemical interacts with a Gene as an 'allosteric inhibitor', we can define a Biolink predicate for 'is allosteric inhibitor of', and use it to represent the knowledge without a need for a qualifier or more granular term in the relation slot. Simply:

subject: chemical X 
predicate: biolink:is_allosteric_inhibitor_of
object: gene Y

The Biolink Helpdesk team (@RichardBruskiewich @nlharris @deepakunni3 @cmungall) is going to work through a list of Mechanism of Action terms from InXight and other knowledge sources, and define a specific proposal more carefully, before bringing it to the Translator DM team on a call.

RichardBruskiewich commented 3 years ago

@vdancik, please also take a look at this and the corresponding example in https://github.com/NCATSTranslator/ReasonerAPI/issues/185#issuecomment-740322400. Maybe have Paul C. ponder this as well, from a molecular modelling perspective.

I don't have very strong opinions about this except 1) it is still supportive of the need for a richer TRAPI Attribute schema and 2) upon some iteration between various expert scientists, we should be able to iterate toward a sensible solution.

RichardBruskiewich commented 3 years ago

Staring at @vdancik 's example, as modified by Eric, I sensed the following:

1) that I do agree with @mbrush's general perspective that "mechanism of action" is an Edge qualifier, i.e.

subject: chemical X 
predicate: biolink:treats
object: DiseaseY 
qualifier*: mechanismZ (e.g. MESH:C0085387  # "Cyclooxygenase inhibitor")

however, this triggers the question of

2) how to properly represent a qualifier in TRAPI? Can TRAPI attributes be considered, a priori, to all be 'qualifiers' of the edge? In which case, Eric's de facto reformulation of Vlado's example should work fine, i.e.

{
    "attributes": [
        {
            "relation_id": "NCIT:C54680",
            "relation_name": "Mechanism of Action",
            "source": "ChEMBL",
            "type": "MESH:C0085387",
            "name": "Cyclooxygenase inhibitor",
            "url": "https://www.ncbi.nlm.nih.gov/medgen/43164"
        }
    ]
}

(leaving aside, for the moment, what the final name of the TRAPI Attribute properties relation_id and relation_name will be... this is a separate discussion, but the utility of these properties is somewhat clear here).

The practical side of this issue will be how best to capture this semantics in the Biolink Model context of Knowledge Graphs, since to some extent, the resulting qualifier is a "tag = value" 2-tuple:

( NCIT:C54680["Mechanism of Action"], MESH:C0085387["Cyclooxygenase inhibitor"] )

And this holds only if the value is single-valued (i.e. not a list/vector/array), which may not always be true.

As @mbrush hints above, do we decide to add a whole parallel is_a hierarchy of Edge qualifier slots, comparable to the predicate related_to hierarchy, to partially deal with this challenge, then add such slots to the specific biolink:Association entities to which they best belong? In which case, we'd provide NCIT:C54680 as one of the exact_mappings of a new slot definition mechanism_of_action_qualifier: is_a: qualifier is_a: association slot, then decide which biolink:Association subclass needs it. Our final encoding of this example would then be:

{
    "attributes": [
        {
            "relation_id": "biolink:mechanism_of_action_qualifier",
            "relation_name": "Mechanism of Action",
            "source": "ChEMBL",
            "type": "MESH:C0085387",
            "name": "Cyclooxygenase inhibitor",
            "url": "https://www.ncbi.nlm.nih.gov/medgen/43164"
        }
    ]
}

which would be directly mappable in Biolink onto the biolink:Association.qualifier slot.
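
To make the "tag = value" idea concrete, here is a minimal Python sketch of how such an attribute might be lifted onto a qualifier slot. It is only an illustration under stated assumptions: the slot name mechanism_of_action_qualifier is the one proposed in this comment and does not exist in Biolink, and the lookup table and function name are made up for the example. Values are collected into lists so that the multivalued case is not lost.

```python
from typing import Any, Dict, List

# Hypothetical lookup from an attribute's relation_id to a qualifier slot name.
# "mechanism_of_action_qualifier" is the slot proposed in this thread,
# not an existing Biolink slot.
RELATION_TO_QUALIFIER_SLOT = {
    "NCIT:C54680": "mechanism_of_action_qualifier",
    "biolink:mechanism_of_action_qualifier": "mechanism_of_action_qualifier",
}


def attributes_to_qualifiers(attributes: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Collect (slot, value) pairs from TRAPI-style attribute objects.

    Attributes whose relation_id has no known qualifier slot are skipped.
    A single value is wrapped in a list so that single-valued and
    multivalued qualifiers are handled the same way.
    """
    qualifiers: Dict[str, List[str]] = {}
    for attribute in attributes:
        slot = RELATION_TO_QUALIFIER_SLOT.get(attribute.get("relation_id", ""))
        if slot is None:
            continue
        values = attribute["type"]
        if not isinstance(values, list):
            values = [values]
        qualifiers.setdefault(slot, []).extend(values)
    return qualifiers


# The attribute from the example above.
attribute = {
    "relation_id": "NCIT:C54680",
    "relation_name": "Mechanism of Action",
    "source": "ChEMBL",
    "type": "MESH:C0085387",
    "name": "Cyclooxygenase inhibitor",
    "url": "https://www.ncbi.nlm.nih.gov/medgen/43164",
}

edge_qualifiers = attributes_to_qualifiers([attribute])
print(edge_qualifiers)  # {'mechanism_of_action_qualifier': ['MESH:C0085387']}
```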

saramsey commented 3 years ago

My understanding was that the relation slot is supposed to contain the original source relation as a CURIE. For example,

{
  "predicate": "biolink:chemically_similar_to",
  "negated": "False",
  "simplified_edge_label": "chemically_similar_to",
  "publications_info": "{}",
  "subject": "CHEBI:73275",
  "provided_by": [
    "OBO:chebi.owl"
  ],
  "edge_label": "has_parent_hydride",
  "update_date": "2020-10-06 23:54:48 GMT",
  "relation": "CHEBI:has_parent_hydride",
  "object": "CHEBI:26775"
}

Has that guidance changed?

mbrush commented 3 years ago

@saramsey I don't think that guidance has changed. I think confusion may have been caused by the fact that a slot in Richard's example of an Attribute object above has the unfortunate name of relation_id. This slot is unrelated to the relation slot that hangs from an Association, and the name of the relation_id slot in the Attribute object has since been changed to avoid such confusion.
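
For illustration, the split between the two slots could be sketched in Python as below. This is a rough sketch, not how any KP actually builds edges: the lookup table, the fallback to biolink:related_to, and the function name make_edge are assumptions for the example. Only the CHEBI:has_parent_hydride entry comes from the edge shown above.

```python
from typing import Dict

# Hypothetical lookup from source relation CURIEs to Biolink predicates.
# The single entry mirrors the edge quoted earlier in this thread.
SOURCE_RELATION_TO_PREDICATE = {
    "CHEBI:has_parent_hydride": "biolink:chemically_similar_to",
}

# Assumed fallback when a source relation has no mapping yet.
FALLBACK_PREDICATE = "biolink:related_to"


def make_edge(subject: str, source_relation: str, obj: str) -> Dict[str, str]:
    """Build an edge with a standardized predicate and the original relation.

    The predicate slot holds the Biolink term; the relation slot keeps the
    CURIE used by the source, unchanged, for provenance.
    """
    predicate = SOURCE_RELATION_TO_PREDICATE.get(source_relation, FALLBACK_PREDICATE)
    return {
        "subject": subject,
        "predicate": predicate,
        "relation": source_relation,
        "object": obj,
    }


edge = make_edge("CHEBI:73275", "CHEBI:has_parent_hydride", "CHEBI:26775")
print(edge["predicate"], edge["relation"])
```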

mbrush commented 3 years ago

Course correcting this thread back to the original topic of how to model chemical-gene product interaction associations.

On the 12-17-20 Data Modeling call, three different options were laid out for consideration. The central issue here concerns whether to capture the specific interaction mechanism using the predicate, relation, or qualifier slot of an Association/Edge. There is a dependency here on the level of granularity at which Biolink defines chemical interaction predicates based on specific mechanisms of interaction, which could then be used to populate the Association.predicate slot.

1. Capture specific mechanism in the Association.predicate slot

Here we would allow for Biolink to define granular chemical interaction predicates based on specific mechanisms of interaction (e.g. 'is agonist of', 'is negative allosteric modulator of'). The granularity of the Biolink predicate used in the Association could then match what was asserted in the source. No qualifier would be required.

Example:

subject: chemical X (aspirin)
predicate: biolink:is_agonist_of
object: gene product Y (COX2)

The KP could still use the relation slot to capture the specific term/language used by the source to describe the relationship if they wanted - for provenance sake. But it wouldn't be necessary from a semantic perspective (since the Biolink predicates will be granular enough to capture exactly what this source asserted).

2. Capture specific mechanism in the Association.relation slot:

Here we would keep interaction predicates in Biolink relatively high level (e.g. at present they stop at predicates like 'affects', 'molecularly interacts with', 'increases activity of'), and leave it to the relation field to capture the more specific interaction relationship asserted by the source.

Example:

subject: chemical X (Aspirin)
predicate: biolink:affects
relation: {more specific mechanistic relationship described by the source (e.g. ro:is_agonist_of, wikidata:is_agonist_of, inxight:is_agonist_of, chebi:agonist)}
object: gene product Y (COX2)

Note that we still need to sort out conventions for how KPs will create the term that goes in the relation slot. In the example above, the interaction mechanism terms come from several possible sources (RO, Wikidata, InXight, ChEBI).

3. Capture specific mechanism in an Association.qualifier slot:

Here we would also keep interaction predicates in Biolink relatively high level, and capture the specific interaction mechanism as a qualifier that modifies the meaning of the core S-P-O triple. We could use existing ChEBI role terms to populate the qualifier slot, with their true semantics - so the association becomes "Chemical X interacts with Gene Product Y by realizing Role Z" (e.g. agonist role).

Example:

subject: chemical X (Aspirin)
predicate: biolink:affects
object: gene product Y (COX2)
interaction mechanism qualifier: chebi:agonist

As noted for Option 1, the KP could still use the relation slot to capture the specific term/language used by the source to describe the relationship if they wanted - for provenance sake.


Approach 2 is what is currently implemented. It seemed that most people on the call were favoring Approach 1 above - one reason being that graph operations to interrogate knowledge are easier if the semantics are directly encoded in the edge. But there was enough interest in the other two to warrant documenting them and discussing here, and on the next DM call. The qualifier-based pattern in particular is appealing as we could leverage ChEBI role terms that already exist, instead of duplicating them as relationships in Biolink and/or RO.
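
To compare what a consumer has to inspect under each approach, here is a small Python sketch that encodes the same assertion three ways and asks "is this an agonist edge?". It is illustrative only: the node identifiers are placeholders, biolink:is_agonist_of and the interaction_mechanism_qualifier key are the proposals from this comment rather than existing Biolink terms, and the set of relation terms is copied from the Option 2 example.

```python
from typing import Dict

Edge = Dict[str, str]

# Source-specific terms copied from the Option 2 example in this thread.
AGONIST_RELATIONS = {
    "ro:is_agonist_of",
    "wikidata:is_agonist_of",
    "inxight:is_agonist_of",
    "chebi:agonist",
}

# Placeholder node identifiers; the predicate and qualifier names below are
# the ones proposed in this thread, not confirmed Biolink terms.
option_1_edge: Edge = {
    "subject": "EX:chemicalX",
    "predicate": "biolink:is_agonist_of",
    "object": "EX:geneProductY",
}
option_2_edge: Edge = {
    "subject": "EX:chemicalX",
    "predicate": "biolink:affects",
    "relation": "inxight:is_agonist_of",
    "object": "EX:geneProductY",
}
option_3_edge: Edge = {
    "subject": "EX:chemicalX",
    "predicate": "biolink:affects",
    "object": "EX:geneProductY",
    "interaction_mechanism_qualifier": "chebi:agonist",
}


def is_agonist_option_1(edge: Edge) -> bool:
    """Option 1: the predicate alone carries the mechanism."""
    return edge["predicate"] == "biolink:is_agonist_of"


def is_agonist_option_2(edge: Edge) -> bool:
    """Option 2: check the general predicate, then every known source term."""
    return edge["predicate"] == "biolink:affects" and edge.get("relation") in AGONIST_RELATIONS


def is_agonist_option_3(edge: Edge) -> bool:
    """Option 3: check the general predicate plus the qualifier slot."""
    return (
        edge["predicate"] == "biolink:affects"
        and edge.get("interaction_mechanism_qualifier") == "chebi:agonist"
    )


results = [
    is_agonist_option_1(option_1_edge),
    is_agonist_option_2(option_2_edge),
    is_agonist_option_3(option_3_edge),
]
print(results)  # [True, True, True]
```

Under these assumptions, the Option 1 check needs only the predicate, the Option 2 check needs a maintained list of source-specific relation terms, and the Option 3 check needs to know about one extra slot.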

RichardBruskiewich commented 3 years ago

Supportive of option 1 (dedicated predicates), see https://github.com/biolink/biolink-model/pull/587/commits/6bf04c8ace5315b0d78613b1799daa798b0bf425. Briefly, we could tag chemical (and other) specific predicates using semantic mixins.

remontoire-pac commented 3 years ago

@mbrush @vdancik @RichardBruskiewich Reading through this, I wonder about the presentation of different levels of granularity from different primary Knowledge Sources. Does option 2 afford the ability to have multiple (possibly different or discordant) Association.relation values? It seems to me that it might -- the examples are presented as alternatives, but what about the case where we get conflicting information from two sources: (inxight:is_agonist_of, chebi:antagonist)? Does option 2 allow expressing both? What about option 3? It seems to me option 1 forces us to choose the "correct" one, but maybe I'm missing something here...

mbrush commented 3 years ago

Hi @remontoire-pac - if I understand correctly, you want to know how the models would handle cases where different knowledge sources describe interaction between the same chemical and gene product at different levels of granularity (e.g. 'inhibits' vs 'negative allosteric modulator of'), or in ways that outright conflict with each other (e.g. 'agonist of' vs 'antagonist of'). And you raise the possibility that in such cases, a single Association object in our data might hold different or conflicting values from different sources.

My short answer is that our models should not be expected to represent different or discordant interaction mechanisms that come from different agents/sources inside a single Association. Rather, the assertions made by different agents/sources would be represented in separate Associations - each of which would also hold the evidence/provenance info specific to that source's assertion.

This answer is based on how we (Biolink/SRI) view the 'scope' of the statement put forth in an Association - and the scenario you lay out provides an opportunity to explore this issue more generally. We consider an Association to express an assertion as made by a particular agent on a particular occasion. So whether two agents/sources make the exact same assertion (e.g. that ChemicalX 'inhibits' ProteinY), consistent assertions at different granularities ('inhibits' vs 'negative allosteric modulator of'), or conflicting assertions ('agonist of' vs 'antagonist of') - two separate Associations would be created in our knowledge graph in all of these cases. They might express the same or closely related propositions/facts, but they would have different provenance information (who said it, when, how, based on what evidence).

In the case of conflicting Associations ('agonist of' vs 'antagonist of'), it may be that an agent (e.g. a KP or ARA) comes along and notices that different sources are saying different things about the relationship between ChemicalX and ProteinY. This agent may review the evidence and provenance for each Association, and decide which they think is the correct or most precise relationship. They could then create a new Association that expresses their derived assertion - and points to the original Associations as part of the evidence/provenance supporting their claim. This is where it is critical that we have an EPC model that is rich enough to capture Association provenance at a level of detail that lets us resolve conflicts by re-evaluating the totality of evidence, and can represent how a new Association that resulted from this re-evaluation was generated.
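
A rough Python sketch of this pattern follows: one Association per source assertion, a check that spots conflicting pairs, and a derived Association that points back at the ones it reviewed. Everything here is hypothetical scaffolding for illustration: the Association fields, the predicate names, the identifiers, and the agent name are assumptions, and a real EPC model would carry far richer provenance than a list of ids.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Association:
    """One assertion made by one agent/source (fields are illustrative)."""
    id: str
    subject: str
    predicate: str
    object: str
    source: str
    supporting_associations: List[str] = field(default_factory=list)


# Hypothetical pairs of predicates treated as mutually conflicting.
OPPOSING = {frozenset({"biolink:is_agonist_of", "biolink:is_antagonist_of"})}


def conflicting_pairs(associations: List[Association]) -> List[Tuple[str, str]]:
    """Return id pairs that link the same nodes with opposing predicates."""
    pairs = []
    for i, first in enumerate(associations):
        for second in associations[i + 1:]:
            same_nodes = (first.subject, first.object) == (second.subject, second.object)
            if same_nodes and frozenset({first.predicate, second.predicate}) in OPPOSING:
                pairs.append((first.id, second.id))
    return pairs


def derive(new_id: str, chosen: Association, reviewed: List[Association], agent: str) -> Association:
    """Create a new Association for the reviewing agent's own assertion.

    The originals are left untouched; the new Association records them as
    supporting evidence.
    """
    return Association(
        id=new_id,
        subject=chosen.subject,
        predicate=chosen.predicate,
        object=chosen.object,
        source=agent,
        supporting_associations=[a.id for a in reviewed],
    )


a1 = Association("assoc-1", "EX:chemicalX", "biolink:is_agonist_of", "EX:proteinY", "inxight")
a2 = Association("assoc-2", "EX:chemicalX", "biolink:is_antagonist_of", "EX:proteinY", "chebi")

conflicts = conflicting_pairs([a1, a2])
derived = derive("assoc-3", chosen=a1, reviewed=[a1, a2], agent="example-ARA")
print(conflicts, derived.supporting_associations)
```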

remontoire-pac commented 3 years ago

@mbrush -- Thanks for this great clarification. What you are saying here makes a great deal of sense to me and seems to represent well the cases that I was considering in my earlier inquiry. One follow-up question: how does your clear model here relate to the original 3 options from earlier in the thread? Is each of the 3 options able to express what you are saying here, or does this expanded understanding mean we need to lean toward one of the 3 options? Best, Paul


mbrush commented 3 years ago

Yes - the general approach of treating Associations as representing individual Assertions is an orthogonal issue to the structure we choose to represent chemical interaction Associations. Everything I said above fits with all three modeling approaches.

saramsey commented 3 years ago

I think I favor option 2 or 1, rather than 3, because option 3 increases the complexity of the biolink metamodel by adding an edge attribute that must be checked, but only for chemical interactions (this means that reasoning about chemical interactions must be handled as a special case, compared to other types of reasoning). Between Option 2 and Option 1, I have a slight preference for Option 2 because it would have the least impact on our KG2 build system and on how ARAX currently works. But we could adapt to Option 1 if need be.

nlharris commented 3 years ago

Is this in progress?