biolink / biolink-model

Schema and generated objects for biolink data model and upper ontology
https://biolink.github.io/biolink-model/

Semantic description of gene co-expression in specific tissues #324

Closed rtroper closed 3 years ago

rtroper commented 4 years ago

In a previous round of the Translator project, we developed Big GIM, a database containing information about co-expression of genes in different tissues (healthy and cancerous). We plan to develop this database into a full-fledged KP (Knowledge Provider) with a standardized API.

We're attempting to map information we have in Big GIM to terms/concepts found in the Biolink model. The intent of this exercise is to determine if appropriate terms/concepts exist already in Biolink for representing this information as a knowledge graph.

At the same time, we're grappling with the question of whether the information already in Big GIM is well-suited for answering the kinds of questions that Translator is trying to answer, or if it should be supplemented or replaced with information (still related to gene expression) that is better-suited for this purpose.

Data currently available in Big GIM ('parsed' or decomposed in terms of nodes, edges, and slots) is as follows:

  1. Level of gene product A (i.e. gene expression)
  2. is correlated with
  3. level of gene product B
  4. within
  5. [healthy tissue X | cancer tissue Y]
  6. with Spearman rank correlation value r
  7. with significance (p-value) value p
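
The seven parts above can be held in one flat record before deciding how they map to Biolink. This is a minimal sketch: the class name, field names, and values are hypothetical placeholders, not the actual Big GIM schema or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoexpressionRecord:
    """One Big GIM-style row; field names are hypothetical."""
    gene_a: str        # (1) gene product A
    gene_b: str        # (3) gene product B
    tissue: str        # (5) tissue identifier or label
    is_cancer: bool    # (5) healthy vs. cancerous tissue
    spearman_r: float  # (6) Spearman rank correlation
    p_value: float     # (7) significance of the correlation


def decompose(record: CoexpressionRecord) -> dict:
    """Split a record into candidate nodes, one edge, and leftover values."""
    return {
        "nodes": [record.gene_a, record.gene_b],
        "edge": {
            "subject": record.gene_a,
            "predicate": "correlated with",  # (2)
            "object": record.gene_b,
        },
        # Parts (4)-(7): where these belong is the open question in Q1-Q3.
        "unplaced": {
            "tissue": record.tissue,
            "is_cancer": record.is_cancer,
            "spearman_r": record.spearman_r,
            "p_value": record.p_value,
        },
    }


# Placeholder values for illustration only, not real Big GIM data.
example = CoexpressionRecord("GENE:A", "GENE:B", "TISSUE:X", False, 0.5, 0.01)
parts = decompose(example)
print(parts["edge"])
```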

Here's a series of questions that reflects our thinking and some gaps in our understanding of how this type of information can be translated to a knowledge graph that's useful to an ARA (Autonomous Relay Agent):

Q1. In terms of Biolink concept mapping, the entity GeneProduct maps to (1) and (3). But where does the concept of "level" or "amount" come in? Is this degree of semantic precision necessary? If so, how is the concept of "level"/"amount" incorporated? As a slot?

Q2. The Biolink association "correlated with" maps to (2), but how do we qualify it with the numerical values in (6) and (7)? There is a "has confidence level" association that seems relevant to (7), but how should this be incorporated into a graph that's returned to the ARA? And how do we (in the graph) qualify (2) with (6)? Would these be slots?

Q3. It seems that the entity MaterialSample may map to (5), but how should this be qualified as being of a particular tissue type (e.g. organ location and whether it is healthy or cancerous tissue)?

It seems very easy to over-specify semantic information to the point that it becomes unwieldy to translate into a knowledge graph. Beyond the questions in Q1, Q2, and Q3 above, we're wondering how to strike a pragmatic balance, so that we encode only the minimal semantic information the ARAs need to do their job.

Any insight or guidance is much appreciated.

cmungall commented 4 years ago

Apologies for brevity...

How about a gene-gene edge with edge properties for site, disease state, and magnitude?

It would look like this in the YAML:

  gene to gene co-expression association:
    is_a: gene to gene association
    description: >-
      Indicates that two genes are co-expressed, possibly under the same conditions
    slots:
      - quantifier qualifier
      - expression site
      - stage qualifier
      - phenotypic state
    slot_usage:
      relation:
        subproperty_of: co-expressed with
        symmetric: true
        description: >-
          This will typically be the relationship type 'co-expressed with', but may be a sub-relation
      expression site:
        range: anatomical entity
        description: "location in which the two genes are co-expressed. May be cell, tissue, or organ"
        examples:
          - value: UBERON:0002037
            description: cerebellum
      stage qualifier:
        range: life stage
        description: "stage at which the gene is expressed in the site"
        examples:
          - value: UBERON:0000069
            description: larval stage
      phenotypic state:
        range: disease or phenotype
        description: >-
          if the co-expression is in diseased or unhealthy tissue the phenotypic state can
          be put here, e.g. MONDO ID. For healthy, use XXX
      quantifier qualifier:
        description: >-
          optional quantitative value indicating degree of expression
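
A single edge following this proposal might be serialized as below. This is only a sketch: the slot names are copied from the slot_usage block above, but the dict layout, the helper names `make_coexpression_edge` and `edge_key`, and the gene IDs and numeric value are assumptions for illustration. The healthy-tissue value is left out because the schema leaves it undecided.

```python
# Slot names copied from the slot_usage block above; everything else is assumed.
QUALIFIER_SLOTS = {
    "quantifier qualifier",
    "expression site",
    "stage qualifier",
    "phenotypic state",
}


def make_coexpression_edge(subject: str, obj: str, qualifiers: dict) -> dict:
    """Build one gene-gene edge, rejecting qualifiers the class does not declare."""
    unknown = set(qualifiers) - QUALIFIER_SLOTS
    if unknown:
        raise ValueError(f"unknown qualifier slots: {sorted(unknown)}")
    edge = {"subject": subject, "relation": "co-expressed with", "object": obj}
    edge.update(qualifiers)
    return edge


def edge_key(edge: dict) -> tuple:
    """Key that ignores subject/object order, since the relation is symmetric."""
    genes = frozenset((edge["subject"], edge["object"]))
    return (genes, edge.get("expression site"))


# Placeholder gene IDs and value; UBERON:0002037 is the cerebellum example above.
edge = make_coexpression_edge(
    "GENE:A",
    "GENE:B",
    {"expression site": "UBERON:0002037", "quantifier qualifier": 0.5},
)
print(edge)
```

Keying on an unordered gene pair is one way to respect `symmetric: true`, so that A-B and B-A in the same site are treated as the same association.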

cbizon commented 4 years ago

I think this is OK, but I'm not excited by the expression site being a property. In general, I feel that having entities that are nodes in some places and edge property values in others is a bad pattern to get into.

This same issue will come up in chemical reactions as well as eQTL variants.

Two other options have been discussed previously:

  1. Making the correlation (or reaction) into a node itself (reifying it), so that it can have edges to gene A, gene B, and the site/tissue, with the description and p-values becoming node properties on this new node type.
  2. Having three pairwise edges (geneA-geneB, geneA-site, geneB-site) and putting a common hyperedge ID on them to indicate that they go together. We've tried this solution before, but it produces complicated Cypher queries, and it makes a mess when the same entity (especially a tissue) appears in many hyperedges.
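
To make the trade-off concrete, here is a rough sketch of the same fact in the three shapes under discussion: edge property, reified node, and pairwise edges sharing a hyperedge ID. The function names, dict layout, IDs, and the predicate labels "has participant", "occurs in", and "expressed in" are illustrative assumptions, not settled Biolink modeling.

```python
def as_edge_property(gene_a, gene_b, site, r, p):
    """Tissue is a property value on the single gene-gene edge, not a node."""
    nodes = [{"id": gene_a}, {"id": gene_b}]
    edges = [{"subject": gene_a, "predicate": "co-expressed with", "object": gene_b,
              "expression site": site, "r": r, "p": p}]
    return nodes, edges


def as_reified_node(gene_a, gene_b, site, r, p, assoc_id):
    """The correlation becomes its own node carrying r and p."""
    nodes = [{"id": gene_a}, {"id": gene_b}, {"id": site},
             {"id": assoc_id, "r": r, "p": p}]
    edges = [
        {"subject": assoc_id, "predicate": "has participant", "object": gene_a},
        {"subject": assoc_id, "predicate": "has participant", "object": gene_b},
        {"subject": assoc_id, "predicate": "occurs in", "object": site},
    ]
    return nodes, edges


def as_hyperedge(gene_a, gene_b, site, r, p, hyperedge_id):
    """Three pairwise edges tied together by a shared hyperedge ID."""
    nodes = [{"id": gene_a}, {"id": gene_b}, {"id": site}]
    edges = [
        {"subject": gene_a, "predicate": "co-expressed with", "object": gene_b,
         "r": r, "p": p, "hyperedge_id": hyperedge_id},
        {"subject": gene_a, "predicate": "expressed in", "object": site,
         "hyperedge_id": hyperedge_id},
        {"subject": gene_b, "predicate": "expressed in", "object": site,
         "hyperedge_id": hyperedge_id},
    ]
    return nodes, edges


# Placeholder IDs and values for illustration only.
args = ("GENE:A", "GENE:B", "TISSUE:X", 0.5, 0.01)
for name, (nodes, edges) in {
    "edge property": as_edge_property(*args),
    "reified node": as_reified_node(*args, assoc_id="ASSOC:1"),
    "hyperedge": as_hyperedge(*args, hyperedge_id="HE:1"),
}.items():
    print(name, len(nodes), "nodes,", len(edges), "edges")
```

In the first shape the tissue never appears as a node; in the third, a consumer has to group edges by the shared ID to reassemble one fact.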

cmungall commented 3 years ago

Let's take the general discussion to #566.

sierra-moxon commented 3 years ago

I'm closing after following the ticket trail to a merged PR for gene-gene association with tissue as an edge property. Please feel free to reopen if I got this wrong, or we need to discuss it further to appropriately answer this use case! :)