Open mbrush opened 6 years ago
TL;DR (earlier/longer notes from which the above summary was condensed)
To date, most CQ notebooks have explored relatively simple retrieval and faceting queries. But the real utility of the Translator will be in supporting more open-ended exploration of data, enabling users to populate a blackboard with knowledge that drives serendipitous discovery and novel insight.
Given the graph-based nature of much of the knowledge in the Translator system, 'pathfinding' operations are one potentially useful approach for this type of exploration and blackboard construction. Here, the system would return paths through the graph connecting entities of interest, and allow users to filter and facet these paths to hone in on those representing meaningful evidence in support of their larger question or use case.
For example, given a set of candidate FA modifier genes, explore paths linking these genes to FA in the data to provide evidence for prioritizing/ranking these candidates and to suggest possible mechanisms of action. Here, we have positive controls that we can start with, as ALDH2, ADH5, and TGFbeta are known modifiers with established mechanisms. We will write pathfinding Cypher queries that return all paths through the Monarch and SemMedDB data connecting these genes to FA, and explore requirements and approaches for refining/constraining these paths to hone in on those representing the most meaningful evidence.
Example: Return all paths between Aldh2 and FA -> filter, facet, and expand the results to identify the most meaningful paths that support this known fact and that might have led to its hypothesis before its official discovery.
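As a rough starting point for the "return all paths" step, here is a minimal Python sketch that only builds the Cypher text for a bounded path search between two nodes. It assumes (this is not in the notes above) that nodes carry an `id` property and that the caller passes `$source`, `$target`, and `$limit` as query parameters; the function name `build_path_query` is made up for illustration. The hop bound is written into the query as a validated integer literal.

```python
def build_path_query(max_hops: int = 4, shortest_only: bool = False) -> str:
    """Return parameterized Cypher for paths between two nodes.

    Assumptions (hypothetical, not from the notes): nodes have an `id`
    property; $source, $target and $limit are supplied by the caller.
    """
    # Reject anything that is not a plain positive integer, because the
    # bound is interpolated into the query text rather than parameterized.
    if isinstance(max_hops, bool) or not isinstance(max_hops, int) or max_hops < 1:
        raise ValueError("max_hops must be a positive integer")

    # Undirected, untyped, variable-length pattern with an upper hop bound.
    pattern = f"(a {{id: $source}})-[*..{max_hops}]-(b {{id: $target}})"

    # Optionally restrict the result to the shortest paths only.
    if shortest_only:
        pattern = f"allShortestPaths({pattern})"

    return f"MATCH p = {pattern} RETURN p LIMIT $limit"


query_all = build_path_query(3)
query_shortest = build_path_query(3, shortest_only=True)
print(query_all)
print(query_shortest)
```

Keeping the hop bound small and always applying a `LIMIT` is one way to keep an unconstrained "all paths" query from exploding; restricting relationship types in the pattern would be a natural next refinement.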
Using these controls for pilot experiments, we can think about the types of evidence that would support inference to these answers, the types of data that would support these inferences, how the data would have to be modeled, queried, and presented to users to support such inference, and the tooling required to support these tasks. e.g. for arriving at Aldh2 as an FA modifier:
Ultimately, we hope that this exercise will inform requirements for many aspects of Translator development:
Tasks:
1. Build Data Graph: Load a single neo4j database (on an NCATS AWS server) with data from Monarch-SciGraph, which contains semantically integrated data from a variety of curated biological/biomedical knowledgebases, with a focus on genotype-phenotype related resources. Possibly load additional/complementary graph databases (e.g. SemMedDB) into this neo4j instance, normalizing where possible to facilitate traversal across databases/sources. The goal here is to approximate the ability to do pathfinding across the distributed architecture currently being explored in the Translator.
2. Evaluate Interface: Create a neo4j explorer interface to allow query and visualization of results.
3. Define/Optimize Pathfinding Queries: Use the Cypher query language to write 'pathfinding' queries (e.g. show all paths through the graph connecting entity1 to entity2). These are likely to be computationally expensive, so they will require optimization of the query and/or the data. Cypher may offer specific query constructs and algorithms for pathfinding analyses.
4. Visualization of Results: Query results will be large numbers of 'paths', which are inherently difficult to display and process in a way that supports comprehensive understanding by humans. We will need to explore output formats and approaches to visualizing, summarizing, and operating on results so as to allow efficient and actionable human understanding.
5. Query Refinement: A key task will be refining queries to hone in on the most meaningful paths through the data.
6. Evaluation and Expansion: If successful, we will identify a small subset of paths that provide meaningful evidence for our query, which an expert can evaluate and use to seed further exploration of the data.
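One possible way to approach the Visualization of Results and Query Refinement tasks is to group returned paths by the sequence of relationship types they traverse, so that a reader sees a handful of path "shapes" rather than thousands of individual paths. The sketch below is only an illustration: the path representation (a list of subject/relation/object triples), the helper names, and all node and relation names are made-up placeholders, not Monarch or SemMedDB data.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

# A path is a list of (subject, relation, object) triples (assumed format).
Triple = Tuple[str, str, str]
Path = List[Triple]


def signature(path: Path) -> Tuple[str, ...]:
    """Sequence of relation types along a path, used as its facet key."""
    return tuple(rel for _, rel, _ in path)


def facet_by_signature(paths: List[Path]) -> Dict[Tuple[str, ...], List[Path]]:
    """Group paths that share the same relation-type sequence."""
    groups: Dict[Tuple[str, ...], List[Path]] = defaultdict(list)
    for path in paths:
        groups[signature(path)].append(path)
    return dict(groups)


def exclude_relations(paths: List[Path], banned: Set[str]) -> List[Path]:
    """Drop every path that uses at least one banned relation type."""
    return [p for p in paths if not banned.intersection(signature(p))]


# Placeholder paths between a hypothetical gene G1 and disease D1.
paths = [
    [("G1", "interacts_with", "G2"), ("G2", "causes", "D1")],
    [("G1", "interacts_with", "G3"), ("G3", "causes", "D1")],
    [("G1", "coexists_with", "C1"), ("C1", "associated_with", "D1")],
]

facets = facet_by_signature(paths)
counts = Counter(signature(p) for p in paths)
refined = exclude_relations(paths, {"coexists_with"})

for sig, n in counts.most_common():
    print(" -> ".join(sig), n)
```

Faceting on the relation sequence keeps the summary independent of which specific intermediate nodes appear, and the exclusion filter is one simple example of a refinement a user might apply after seeing the facet counts.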
Regarding Task 5 (but probably also 3 and 4), I'm thinking a machine learning approach may be useful here. It could work much like prediction in drug repurposing: given a set of known drug-disease pairs, the paths through the network connecting these known "true" pairs are selected for and weighted more strongly than paths and edge types that do not connect them or are less useful for connecting them. A technique like that could be applied here to refine queries and try to select more meaningful paths. I or @veleritas could look into this more deeply...
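To make the weighting idea a bit more concrete, here is a deliberately simple sketch, not the method used in any particular drug-repurposing work. It assumes each path has already been reduced to a tuple of relation types, and it scores each such path type by a smoothed log-ratio of how often it appears among paths linking known "true" pairs versus paths linking background pairs. The function name, the smoothing scheme, and the toy inputs are all assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

PathType = Tuple[str, ...]


def path_type_weights(
    positive: List[PathType],
    background: List[PathType],
    smoothing: float = 1.0,
) -> Dict[PathType, float]:
    """Smoothed log-ratio of path-type frequency: positives vs. background.

    A weight above zero means the path type is relatively more common
    among paths connecting the known "true" pairs.
    """
    if smoothing <= 0:
        raise ValueError("smoothing must be positive to avoid log(0)")

    pos_counts = Counter(positive)
    bg_counts = Counter(background)
    path_types = set(pos_counts) | set(bg_counts)
    n_types = len(path_types)
    n_pos = sum(pos_counts.values())
    n_bg = sum(bg_counts.values())

    weights: Dict[PathType, float] = {}
    for pt in path_types:
        # Additive smoothing so unseen path types get a finite weight.
        p = (pos_counts[pt] + smoothing) / (n_pos + smoothing * n_types)
        q = (bg_counts[pt] + smoothing) / (n_bg + smoothing * n_types)
        weights[pt] = math.log(p / q)
    return weights


# Toy input with placeholder relation names (not real data).
positive = [("interacts_with", "causes")] * 3 + [("coexists_with",)]
background = [("interacts_with", "causes")] + [("coexists_with",)] * 3

weights = path_type_weights(positive, background)
ranked = sorted(weights, key=weights.get, reverse=True)
print(ranked)
```

Weights like these could then be used to rank or filter the paths returned for a new candidate gene; a real implementation would likely need a proper model and held-out evaluation rather than raw frequency ratios.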
Use an integrated neo4j database to explore how human and machine learning agents might collaborate to extract evidence from knowledge graphs to derive predictions and mechanistic hypotheses.
Tasks
Goals
Valued Expertise