Closed rtroper closed 3 years ago
Apologies for brevity...
How about a gene-gene edge with an edge property for site, disease state, magnitude?
It would look like this in the YAML:
gene to gene co-expression association:
  is_a: gene to gene association
  description: >-
    Indicates that two genes are co-expressed, possibly under the same conditions
  slots:
    - quantifier qualifier
    - expression site
    - life stage
    - phenotypic state
  slot_usage:
    relation:
      subproperty_of: co-expressed with
      symmetric: true
      description: >-
        This will typically be the relationship type 'co-expressed with', but may be a sub-relation
    expression site:
      range: anatomical entity
      description: "location in which the two genes are co-expressed. May be cell, tissue, or organ"
      examples:
        - value: UBERON:0002037
          description: cerebellum
    stage qualifier:
      range: life stage
      description: "stage at which the gene is expressed in the site"
      examples:
        - value: UBERON:0000069
          description: larval stage
    phenotypic state:
      range: disease or phenotype
      description: >-
        if the co-expression is in diseased or unhealthy tissue the phenotypic state can
        be put here, e.g. MONDO ID. For healthy, use XXX
    quantifier qualifier:
      description: >-
        optional quantitative value indicating degree of expression
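To make the proposal concrete, here is a minimal sketch of what one edge instance might look like under that class, written as a plain Python dict. The gene identifiers (GENE:A, GENE:B) and the quantifier value are placeholders, not real data, and the helper name make_coexpression_edge is hypothetical. Because the relation is declared symmetric, the sketch sorts the two genes so that A-B and B-A produce the same edge.

```python
def make_coexpression_edge(gene_a, gene_b, site,
                           quantifier=None, phenotypic_state=None, stage=None):
    """Build one gene-gene co-expression edge with the site as an edge property.

    Hypothetical helper for illustration only; slot names follow the YAML above.
    """
    # The relation is symmetric, so store the pair in a canonical (sorted) order.
    subject, obj = sorted((gene_a, gene_b))
    edge = {
        "subject": subject,
        "relation": "co-expressed with",
        "object": obj,
        "expression site": site,
    }
    # Optional slots are only emitted when a value was supplied.
    optional = {
        "quantifier qualifier": quantifier,
        "phenotypic state": phenotypic_state,
        "stage qualifier": stage,
    }
    edge.update({k: v for k, v in optional.items() if v is not None})
    return edge


# Placeholder gene IDs and quantifier; UBERON:0002037 (cerebellum) is from the YAML example.
edge_ab = make_coexpression_edge("GENE:A", "GENE:B", "UBERON:0002037", quantifier=0.8)
edge_ba = make_coexpression_edge("GENE:B", "GENE:A", "UBERON:0002037", quantifier=0.8)
```

Here edge_ab and edge_ba compare equal, which is the behaviour you would want from a symmetric relation, and slots that were not supplied (such as phenotypic state) are simply absent from the edge.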
I think this is ok, but I'm not excited by the expression site being a property. In general, I feel like having entities that are nodes in some places and the value of an edge property in another place is a bad pattern to get into.
This same issue will come up in chemical reactions as well as eQTL variants.
The other options that have been discussed previously are: 1) making the correlation (or reaction) into a node itself (reifying it), so that it can have edges to gene A, gene B, and the site/tissue, with the description and p-values becoming node properties on this new node type; 2) having three pairwise edges (geneA-geneB, geneA-site, geneB-site) and putting a common hyperedge id on them to indicate that they go together. We've tried this hyperedge approach before, but it produces complicated Cypher queries, and it makes sort of a mess when the same entity (especially a tissue) appears in many hyperedges.
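A rough sketch of the two alternatives, in Python rather than Cypher, may help compare them. Everything here is illustrative: the function names, the generated ids, the "has participant" / "occurs in" predicates, and the GENE:* identifiers are assumptions made up for this example, not Biolink terms.

```python
from collections import defaultdict
from itertools import count

_ids = count(1)  # simple id generator for the illustration


def reified(gene_a, gene_b, site, props):
    """Option 1: the co-expression becomes a node with edges to both genes and the site."""
    node_id = f"coexp:{next(_ids)}"
    node = {"id": node_id, "category": "co-expression", **props}
    edges = [
        (node_id, "has participant", gene_a),  # hypothetical predicate
        (node_id, "has participant", gene_b),
        (node_id, "occurs in", site),          # hypothetical predicate
    ]
    return node, edges


def hyperedge(gene_a, gene_b, site, props):
    """Option 2: three pairwise edges tied together by a shared hyperedge id."""
    hid = f"hyper:{next(_ids)}"
    return [
        {"subject": gene_a, "object": gene_b, "hyperedge_id": hid, **props},
        {"subject": gene_a, "object": site, "hyperedge_id": hid},
        {"subject": gene_b, "object": site, "hyperedge_id": hid},
    ]


# Three placeholder gene pairs, all co-expressed in the same tissue.
pairs = [("GENE:A", "GENE:B"), ("GENE:A", "GENE:C"), ("GENE:B", "GENE:C")]
site = "UBERON:0002037"

reified_edges = []
for a, b in pairs:
    _, edges = reified(a, b, site, {})
    reified_edges.extend(edges)
hyper_edges = [e for a, b in pairs for e in hyperedge(a, b, site, {})]

# How many edges touch the tissue node under each option?
site_degree_reified = sum(1 for _, _, target in reified_edges if target == site)
site_degree_hyper = sum(1 for e in hyper_edges if e["object"] == site)

# Recovering one association from option 2 requires grouping edges by hyperedge id.
by_hyperedge = defaultdict(list)
for e in hyper_edges:
    by_hyperedge[e["hyperedge_id"]].append(e)
```

With three associations in one tissue, the tissue node picks up 3 edges in the reified form and 6 in the hyperedge form, and every query against the hyperedge form needs the extra group-by step on hyperedge_id. That is roughly the query complexity and clutter described above, shown at a very small scale.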
Let's take the general discussion to #566.
I'm closing after following the ticket trail to a merged PR for gene-gene association with tissue as an edge property. Please feel free to reopen if I got this wrong, or we need to discuss it further to appropriately answer this use case! :)
In a previous round of the Translator project, we developed Big GIM, a database containing information about co-expression of genes in different tissues (healthy and cancerous). We plan to develop this database into a full-fledged KP with a standardized API.
We're attempting to map information we have in Big GIM to terms/concepts found in the Biolink model. The intent of this exercise is to determine if appropriate terms/concepts exist already in Biolink for representing this information as a knowledge graph.
At the same time, we're grappling with the question of whether the information already in Big GIM is well-suited for answering the kinds of questions that Translator is trying to answer, or if it should be supplemented or replaced with information (still related to gene expression) that is better-suited for this purpose.
Data currently available in Big GIM ('parsed' or decomposed in terms of nodes, edges, and slots) is as follows:
Here's a series of questions that reflects our thinking and some gaps in our understanding of how this type of information can be translated to a knowledge graph that's useful to an ARA:
Q1. In terms of Biolink concept mapping, the entity GeneProduct maps to (1) and (3). But where does the concept of "level" or "amount" come in? Is this degree of semantic precision necessary? If so, how is the concept of "level"/"amount" incorporated? As a slot?
Q2. The 'correlated with' Biolink association maps to (2), but how do we qualify it with information (numerical values) in (6) and (7)? There is a 'has confidence level' association that seems relevant to (7), but how should this be incorporated into a graph that's returned to the ARA? And how do we (in the graph) qualify (2) with (6)? Would these be slots?
Q3. It seems that the entity MaterialSample may map to (5), but how should this be qualified as being of a particular tissue type (e.g. organ location and whether it is healthy or cancerous tissue)?
It seems it could be very easy to over-specify semantic information so that it becomes rather unwieldy to translate into a knowledge graph. In addition to questions in Q1, Q2, and Q3 above, we're wondering how a pragmatic balance can be struck such that we only incorporate/encode the minimal semantic information necessary for the ARAs to do their job.
Any insight or guidance is much appreciated.