biolink / biolink-model

Schema and generated objects for biolink data model and upper ontology
https://biolink.github.io/biolink-model/

Adding GeneExpressionToDrugAssociation, etc. #765

Closed sierra-moxon closed 2 years ago

sierra-moxon commented 3 years ago

From a one-off helpdesk call with Guangrong Qin (multiomics provider)

Here is the summary of our discussion:

  1. Question 1: The current biolink model doesn't cover the concept of what we want to present.
    • [ ] Action item: Add a new biolink category term. (Currently they are manually adding to it) (GeneToDrugAssociation : GeneExpressionToDrugAssociation) (GeneToDrugAssociation : GeneMutationToDrugAssociation)
  2. Question 2: We can easily query all associations in the biolink model, but how do we return a CURIE for each association in biolink?
    • [x] Action item: the biolink team will develop a function to return RO id for a given category.
  3. Question 3: The associations are in hierarchical structure, which will help users to get broader results or a precise answer. It requires us to have the hierarchy of associations in both the biolink model as well as the knowledge graphs.
    • [x] Action item: the SRI team will add a function to get the hierarchy of associations.
  4. Question 4: Similar to question 3, there should be a hierarchical structure for the predicates.
    • [x] Action item: A function to get hierarchy of predicates.
  5. The EPC standard and the biolink model will be merged together.
    • [ ] Action item: Work with both the EPC team and biolink team to present the KGs provided by multiple omics KP.
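
To make the hierarchy action items above concrete, here is a minimal sketch of what "get the hierarchy" functions could look like. It is not the Biolink toolkit's API: the `PARENT` table, `ancestors`, and `descendants` are hypothetical names, and only the two proposed child categories and their `GeneToDrugAssociation` parent come from this issue (the `Association` root is an assumption).

```python
# Hypothetical parent links (child -> parent). Only the two
# "...ToDrugAssociation" children and their parent are taken from this
# issue; the "Association" root is an assumption for illustration.
PARENT = {
    "GeneExpressionToDrugAssociation": "GeneToDrugAssociation",
    "GeneMutationToDrugAssociation": "GeneToDrugAssociation",
    "GeneToDrugAssociation": "Association",
}


def ancestors(term, parent=PARENT):
    """Return the broader terms of `term`, nearest first."""
    chain = []
    while term in parent:
        term = parent[term]
        chain.append(term)
    return chain


def descendants(term, parent=PARENT):
    """Return every narrower term of `term`, sorted by name."""
    found = set()
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for child, par in parent.items():
            if par == current and child not in found:
                found.add(child)
                frontier.append(child)
    return sorted(found)
```

A query for `GeneToDrugAssociation` could then be widened with `descendants(...)` to pick up both the mutation and expression associations, or narrowed by asking for one child only. The same two functions would work for a predicate hierarchy if given a predicate parent table.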

gloriachin commented 3 years ago

For Q1, it may be related to the wide coverage of biological concepts in the Translator consortium and the limited number of annotations in the current Biolink model. In this case, we first compared drug sensitivity between the mutated group and the wild-type group, and reported the genes that show a statistical association with drug sensitivity. Second, we computed the correlation between the drug sensitivity values (IC50 or AUC) and the gene expression values, and reported the associations between gene expression and drug sensitivity. The two KGs from the two pipelines can both provide evidence for a "Gene to drug association". But they address different molecular levels (the genetic level and the transcriptomic level), so users need a clear understanding of which level the results come from. We also need to assign more precise biolink predicates to the two KGs.
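
As a rough illustration of the two pipelines described above, the sketch below computes a group comparison and a correlation. The comment does not say which statistical test or correlation measure was used, so Welch's t statistic and Pearson's r are assumptions here, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two groups of drug sensitivity values,
    e.g. mutated vs. wild-type samples (the test choice is an assumption)."""
    standard_error = sqrt(
        variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    )
    return (mean(group_a) - mean(group_b)) / standard_error


def pearson_r(xs, ys):
    """Pearson correlation between gene expression values and drug
    sensitivity values such as IC50 or AUC (measure is an assumption)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mx) ** 2 for x in xs))
    spread_y = sqrt(sum((y - my) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)
```

The first function corresponds to the mutation pipeline (one statistic per gene and drug, comparing two sample groups) and the second to the expression pipeline (one coefficient per gene and drug across samples), which is why the two resulting KGs describe different molecular levels.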

gloriachin commented 3 years ago

For Q3, is the hierarchical structure for the predicates currently defined in the biolink model? If so, where can I take a look at it, and how should we implement it in the KGs?

ehinderer commented 3 years ago

Any progress on this? I anticipate CHP will also benefit from these associations.

mikebada commented 3 years ago

Is it intended that the proposed GeneExpressionToDrugAssociation be used to make assertions of drugs increasing or decreasing the expression of corresponding genes? If so, I think this is already representable by making ChemicalToGeneAssociation triples and using the increases_expression_of and decreases_expression_of predicates.

And is it intended that the proposed GeneMutationToDrugAssociation be used to make assertions of drugs interacting with variants of genes? If so, I think this may also be already representable by making ChemicalToGeneAssociation triples and adding sequence variant qualifiers to the triples.

mbrush commented 3 years ago

@mikebada my understanding is a bit different than what you lay out above.

I think the GeneMutationToDrugAssociations are intended to assert that 'Mutations in gene x' are associated with sensitivity or resistance to 'treatment with drug y'. . . e.g. the presence of mutations in gene x correlate with a better patient/cellular response to treatment with a given drug.

I think the GeneExpressionToDrugAssociations are meant to state that 'Expression level of gene x' is associated with sensitivity or resistance to 'treatment with drug y' . . . e.g. cells/patients where gene x shows increased expression are more likely to be resistant to treatment with a given drug.

If this understanding is correct, below are a few modeling ideas:

Based on this proposal, a GeneMutationToDrugAssociation stating that "Mutations in ALK are associated with sensitivity to Crizotinib treatment" might look something like the following:

subject: HGNC:427 (ALK)
subject_modifier: bl:SequenceVariation       # or go:GeneExpressionProcess for the expression use case  
predicate: bl:associated with sensitivity to  
object: CHEBI:64310   (Crizotinib)
object_modifier: bl:Treatment 

The subject and object modifiers here serve to extend/modify the semantics of the subject or object node - e.g. cast 'ALK' in the subject slot as 'variants in ALK', and 'Crizotinib' in the object slot as 'treatment with Crizotinib'. This approach to 'post-composing' node semantics may not be fully explicit. But it allows us to continue using existing community IRIs for core domain entities like Genes and Drugs as node ids in these associations, without having to mint IRIs for concepts like "Mutations in ALK". If we document it well and apply it consistently, I think it could be sufficient. And importantly, this approach supports simpler queries with fewer hops to find connections between core domain entities, which seems to be a key requirement for Translator reasoning tools. Finally, I think the proposal is consistent with our principle that the core S-P-O triple should remain true even if qualifiers are ignored.
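
The qualifier pattern above can be sketched as a small data structure. This is only an illustration: the `QualifiedAssociation` class and its method names are hypothetical, while the field values are the ALK / Crizotinib example from this comment.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class QualifiedAssociation:
    """Hypothetical container for an S-P-O triple plus node modifiers."""
    subject: str
    predicate: str
    object: str
    subject_modifier: Optional[str] = None
    object_modifier: Optional[str] = None

    def core_triple(self) -> Tuple[str, str, str]:
        """The statement that should stay true if qualifiers are ignored."""
        return (self.subject, self.predicate, self.object)

    def qualified_reading(self) -> str:
        """Render the post-composed meaning, wrapping each node in its modifier."""
        subj = f"{self.subject_modifier}({self.subject})" if self.subject_modifier else self.subject
        obj = f"{self.object_modifier}({self.object})" if self.object_modifier else self.object
        return f"{subj} {self.predicate} {obj}"


# The example from this comment: variants in ALK vs. treatment with Crizotinib.
alk_crizotinib = QualifiedAssociation(
    subject="HGNC:427",
    predicate="bl:associated with sensitivity to",
    object="CHEBI:64310",
    subject_modifier="bl:SequenceVariation",
    object_modifier="bl:Treatment",
)
```

A consumer that only understands plain triples can call `core_triple()` and still get a gene-to-drug edge keyed on the community identifiers, while a qualifier-aware consumer can use the modifiers to distinguish the mutation and expression cases.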

Note also that the EPC metadata will provide cues to highlight that these assertions report correlations based on statistical analysis of cellular survival and expression data. This is important, as other sources (e.g. CIViC) generate drug response assertions using the same S-P-O-Q pattern/semantics, but based on the results of clinical trials directly testing the drugs on patients. For example, see this CIViC assertion reporting that ALK Fusions are associated with sensitivity to Crizotinib.

mikebada commented 3 years ago

@mbrush Upon closer reading I think your interpretation is closer to what was intended. However, the central entities in the proposed associations are genes and drugs, so I think we should try as much as we can to use the already existing ChemicalToGeneAssociation class to represent these; otherwise the drug-gene assertions will be spread among multiple association types. affected_by would seem to work as a predicate, perhaps along with qualifiers for the subject and object nodes.

gloriachin commented 3 years ago

Matt's understanding is correct. I also like the idea of adding a 'subject_modifier' or 'subject_levels' for each subject or object. In particular, when we are talking about one gene, we can precisely annotate which level we mean, such as gene mutation or gene expression.

mbrush commented 3 years ago

@mikebada I think it could be fine to type this as a ChemicalToGeneAssociation, unless the data creators have a use case for a more specific association type. However, it is a bit disingenuous to say that a Gene is a participant in these associations as they are defined above. Even though we use a Gene IRI as the subject node identifier, the actual category of the subject (when its qualifiers are considered) is 'Expression Level of Gene' or 'Mutations in Gene'. Also, I think we may want to use this same category to cover assertions about a specific variant affecting response to a drug (which would not be accommodated by a Chemical-Gene association).

As for the predicate, I am proposing we add new predicates that more precisely capture the relationship being asserted to hold here. I think 'affects' is too general, and perhaps not really correct (gene expression / mutations are 'affecting' the patient response to a drug, not the drug itself). The predicates we would have to add for the proposal above minimally include bl:associated with sensitivity to and bl:associated with resistance to.

Finally, all of this does raise an alternative modeling possibility where the subject is truly a Gene (free of any qualifier), and we move the meaning captured above using 'Expression' / 'Mutation' qualifiers into the predicate. e.g.

# Second option: less reliance on qualifiers

subject: HGNC:427 (ALK)
predicate: bl:expression_level_associated_with_sensitivity_to    # or bl:has_mutations_associated_with_sensitivity_to
object: CHEBI:64310   (Crizotinib)

For this approach, we would need to minimally add the following predicates: bl:expression level associated with sensitivity to, bl:expression level associated with resistance to, bl:mutations associated with sensitivity to, and bl:mutations associated with resistance to.

I think either approach could work, but my initial inclination is to go with the qualifier approach because I think it will allow for more flexibility to accommodate new use cases without predicate proliferation.
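
The predicate proliferation concern can be illustrated with a short count. This is a sketch under assumptions: the function names are hypothetical, and the third level used in the usage note below is invented purely to show growth.

```python
from itertools import product

# Levels and responses named in this thread.
LEVELS = ["expression level", "mutations"]
RESPONSES = ["sensitivity to", "resistance to"]


def precomposed_predicates(levels, responses):
    """Second option: one predicate per (level, response) pair."""
    return [
        f"bl:{level} associated with {response}"
        for level, response in product(levels, responses)
    ]


def qualifier_predicates(responses):
    """Qualifier option: one predicate per response; the level moves
    into a subject modifier instead of the predicate."""
    return [f"bl:associated with {response}" for response in responses]
```

With the two levels discussed here, the precomposed option needs four predicates and the qualifier option needs two. Adding a hypothetical third level (say, `"methylation level"`) raises the precomposed count to six, while the qualifier option still needs two predicates plus one new modifier value.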

mbrush commented 3 years ago

Third option: More reliance on qualifiers - move 'patient response' aspect to a qualifier on the drug

subject: HGNC:427 (ALK)
subject_modifier: bl:SequenceVariation
predicate: bl:affects? associated with? bl:related_to?    
object: CHEBI:64310   (Crizotinib)
object_modifier: Sensitivity 

In all of these proposals, the question remains what category we use/define to type the association.

  1. GeneMutationToDrugAssociation - requires adding a new Association type. But this does allow us to then define clear constraints / guardrails for how data is created using this association type.
  2. ChemicalToGeneAssociation - the current / canonical direction for Chemical-Gene associations. But as noted above - problematic for a few reasons.
  3. GeneToChemicalAssociation - not the canonical direction for Chemical-Gene associations, but allows us to make a more natural statement about a gene/variant affecting response to a drug.
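
To show what "constraints / guardrails" on a dedicated association type might mean in practice, here is a minimal validation sketch. Everything in it is an assumption for illustration: the `CONSTRAINTS` table, the allowed prefixes and modifiers, and the `violations` function are hypothetical and are not part of the Biolink model.

```python
# Hypothetical guardrails for one association type; the allowed
# prefixes and modifiers are assumptions chosen to match the
# ALK / Crizotinib example in this thread.
CONSTRAINTS = {
    "GeneMutationToDrugAssociation": {
        "subject_prefixes": {"HGNC"},
        "object_prefixes": {"CHEBI"},
        "subject_modifiers": {"bl:SequenceVariation"},
    },
}


def violations(category, edge):
    """Return a list of human-readable problems; an empty list means valid."""
    rules = CONSTRAINTS.get(category)
    if rules is None:
        return [f"unknown category: {category}"]
    problems = []
    if edge["subject"].split(":", 1)[0] not in rules["subject_prefixes"]:
        problems.append("subject prefix not allowed")
    if edge["object"].split(":", 1)[0] not in rules["object_prefixes"]:
        problems.append("object prefix not allowed")
    if edge.get("subject_modifier") not in rules["subject_modifiers"]:
        problems.append("subject_modifier not allowed")
    return problems
```

A broadly defined category would admit any chemical-gene edge, so checks like these could not be attached to it; a dedicated category gives such rules somewhere to live.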

mikebada commented 3 years ago

@mbrush I don't think we'd need to broaden the definition of ChemicalToGeneAssociation, as it's already very broadly defined as an interaction between a chemical entity and a gene or gene product.