NCATS-Tangerine / translator-api-registry

This repo hosts the API metadata for the Translator project

Pathfinding approaches to ML-Guided knowledge exploration #9

Open mbrush opened 6 years ago

mbrush commented 6 years ago

Use an integrated neo4j database to explore how human and machine learning agents might collaborate to extract evidence from knowledge graphs to derive predictions and mechanistic hypotheses.

Tasks

  1. Load a neo4j instance with diverse data types from the Monarch and SemMedDB databases.
  2. Define and optimize cypher 'pathfinding' query templates.
  3. Apply templates toward answering selected CQs - start with 'positive control' queries that look for paths through the graph providing evidence supporting a known fact/mechanism (e.g. ALDH2 as a known modifier of FA, or cyclodextrin as a successful repurposing for Niemann-Pick disease); see the query sketch after this list.
  4. Manually explore query results by evaluating the types of paths returned, defining rules/approaches to identify the most meaningful evidence, and refining queries to home in on these paths in the data.
  5. Explore machine learning approaches to automate this process, and derive evidence-based predictions from data in knowledge graphs.
  6. Explore approaches/interfaces for human intervention in this process - i.e. how to present underlying rationale for automated predictions in a way that allows human users to evaluate the evidence, refine and extend queries based on this, and inform new experiments and analyses.
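
A minimal sketch of what the positive-control query in task 3 might look like, using the neo4j python driver. The node labels, property names, connection URI, and credentials here are assumptions for illustration, not the actual schema of the loaded Monarch/SemMedDB graph.

```python
# Sketch of a 'positive control' pathfinding query: all short paths linking a
# known modifier gene (ALDH2) to Fanconi anemia. Labels/properties are
# hypothetical placeholders for whatever schema the loaded graph ends up using.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

QUERY = """
MATCH (g:Gene {symbol: $gene}), (d:Disease {name: $disease})
MATCH p = (g)-[*..3]-(d)
RETURN p
LIMIT 100
"""

with driver.session() as session:
    for record in session.run(QUERY, gene="ALDH2", disease="Fanconi anemia"):
        path = record["p"]
        # Print the chain of relationship types as a crude first look at the path.
        print([rel.type for rel in path.relationships])

driver.close()
```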

Goals

  1. Understand data and modeling requirements for this type of approach
  2. Inform architectural requirements for BB and reasoner applications - particularly w.r.t. automated/machine learning methods that can help weight evidence and make predictions, and interfaces for human intervention in refining and extending ML results.
  3. Provide end-to-end examples of what open-ended, ML-guided exploration and discovery in the Translator might look like in practice.

Valued Expertise

  1. Monarch/SemMedDB data
  2. Cypher query language and graph-based algorithms (e.g. for pathfinding, traversals, edge-weighting)
  3. Visualization of graph data and paths
  4. Machine learning approaches
mbrush commented 6 years ago

Full notes (the earlier, longer text from which the above TL;DR summary was condensed)


To date, most CQ notebooks have explored relatively simple retrieval and faceting type queries. But the real utility of the Translator will be in supporting more open-ended exploration of data, enabling users to populate a blackboard with knowledge that drives serendipitous discovery and novel insight.

Given the graph-based nature of much of the knowledge in the Translator system, 'pathfinding' operations are one potentially useful approach for this type of exploration and blackboard construction. Here, the system would return paths through the graph connecting entities of interest, and allow users to filter and facet these paths to home in on those representing meaningful evidence in support of their larger question or use case.

For example, given a set of candidate FA modifier genes, explore paths linking these genes to FA in the data to provide evidence for prioritizing/ranking these candidates and suggesting possible mechanisms of action. Here we have positive controls to start with, as ALDH2, ADH5, and TGFbeta are known modifiers with established mechanisms. We will write pathfinding cypher queries that return all paths through the Monarch and SemMedDB data connecting these genes to FA, and explore requirements and approaches for refining/constraining these paths to home in on those representing the most meaningful evidence.

Example: Return all paths between ALDH2 and FA -> filter, facet, and expand the results to identify the most meaningful paths that support this known fact and might have led to its hypothesis before its official discovery.
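
A hedged sketch of what the "filter" step might look like in cypher, reusing the driver/session pattern from the earlier sketch. The node labels, relationship types, and the choice of which predicates to exclude are assumptions, not settled modeling decisions.

```python
# Hypothetical refinement of the ALDH2 <-> FA query: keep only paths whose
# intermediate nodes and relationship types could plausibly describe a
# modifier mechanism. All labels/types here are placeholders.
REFINED_QUERY = """
MATCH p = (g:Gene {symbol: $gene})-[*..3]-(d:Disease {name: $disease})
WHERE ALL(n IN nodes(p)[1..-1] WHERE n:Gene OR n:Pathway OR n:ChemicalSubstance)
  AND NONE(r IN relationships(p) WHERE type(r) IN $excluded_predicates)
RETURN p
LIMIT 100
"""

# e.g. session.run(REFINED_QUERY, gene="ALDH2", disease="Fanconi anemia",
#                  excluded_predicates=["COEXISTS_WITH"])  # low-information predicate
```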

Using these controls for pilot experiments, we can think about the types of evidence that would support inference to these answers, the types of data that would support these inferences, how the data would have to be modeled, queried, and presented to users to support such inference, and the tooling required to support these tasks - e.g. for arriving at ALDH2 as an FA modifier:

Ultimately, we hope that this exercise will inform requirements for many aspects of Translator development:

Tasks:

  1. Build Data Graph: Load a single neo4j database (on an NCATS AWS server) with data from Monarch-SciGraph, which contains semantically integrated data from a variety of curated biological/biomedical knowledgebases, with a focus on genotype-phenotype resources. Possibly load additional/complementary graph databases (e.g. SemMedDB) into this neo4j instance, normalizing where possible to facilitate traversal across databases/sources. The goal is to approximate, in a single graph, the ability to do pathfinding across the distributed architecture currently being explored in the Translator.

  2. Evaluate Interface: Create a neo4j explorer interface to allow querying and visualization of results.

  3. Define/Optimize Pathfinding Queries: Use the cypher query language to write 'pathfinding' queries (e.g. show all paths through the graph connecting entity1 to entity2). These are likely to be computationally expensive, so they will require optimization of the query and/or the data. Cypher offers constructs such as shortestPath/allShortestPaths and bounded variable-length patterns that should help here; see the first sketch after this list.

  4. Visualization of Results: Query results will be large numbers of 'paths', which are inherently difficult to display and process in a way that supports comprehensive understanding by humans. We will need to explore output formats and approaches for visualizing, summarizing, and operating on results so as to allow efficient and actionable human understanding; see the second sketch after this list.

  5. Query Refinement: A key task will be refining queries to home in on the most meaningful paths through the data.

  6. Evaluation and Expansion: If successful, we will identify a small subset of paths that provide meaningful evidence for our query, which an expert can evaluate and use to seed further exploration of the data.
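
Regarding task 3 above, a minimal sketch of two possible optimizations, assuming the same hypothetical labels/properties as the earlier sketches: cypher's built-in shortest-path functions, which explore far less of the graph than an unconstrained variable-length match, and indexes on the lookup properties so the anchor nodes are found without a scan.

```python
# (a) Built-in shortest-path search between the two anchor nodes.
SHORTEST_PATHS = """
MATCH (g:Gene {symbol: $gene}), (d:Disease {name: $disease})
MATCH p = allShortestPaths((g)-[*..6]-(d))
RETURN p
"""

# (b) Indexes on the properties used to look up the anchor nodes
#     (neo4j 3.x CREATE INDEX syntax).
CREATE_INDEXES = [
    "CREATE INDEX ON :Gene(symbol)",
    "CREATE INDEX ON :Disease(name)",
]
```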
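
Regarding task 4, one possible way to make a large set of returned paths digestible is to group them by their 'metapath' (the sequence of node labels and relationship types) and show counts. A hedged sketch using the Path objects returned by the neo4j python driver:

```python
# Summarize a list of neo4j Path objects by metapath so a human can review
# path "shapes" rather than thousands of individual paths.
from collections import Counter

def metapath(path):
    """Reduce a Path to its sequence of node labels and relationship types."""
    labels = ["|".join(sorted(n.labels)) for n in path.nodes]
    rel_types = [r.type for r in path.relationships]
    signature = []
    for label, rel_type in zip(labels, rel_types):
        signature.extend([label, rel_type])
    signature.append(labels[-1])
    return tuple(signature)

def summarize(paths, top_n=20):
    counts = Counter(metapath(p) for p in paths)
    for signature, n in counts.most_common(top_n):
        print(f"{n:6d}  " + " - ".join(signature))
```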

stuppie commented 6 years ago

Regarding Task 5 (but also probably 3 and 4), I'm thinking a machine learning approach may be useful here. It could work similarly to how prediction in drug repurposing works: using a set of known drug-disease pairs, the types of paths through the network that connect these known "true" pairs are selected for and weighted more strongly than edge types that don't connect them (or are less useful for doing so). A technique like that could be applied here to refine queries and try to select more meaningful paths. I or @veleritas could look into this more deeply...
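
A minimal sketch of the kind of approach described above, under the assumption that each candidate gene-disease pair has already been reduced to counts of connecting metapaths (as in the summarization sketch in the previous comment). The featurization, the choice of classifier, and the negative sampling are all placeholders, not a worked-out design.

```python
# Featurize each (gene, disease) pair by how many paths of each metapath
# connect them, then fit a classifier on known 'true' pairs vs. sampled
# negatives. Learned weights hint at which metapaths carry signal, which can
# feed back into refining the cypher queries themselves.
from collections import Counter
from sklearn.linear_model import LogisticRegression

def featurize(pair_metapaths, metapath_vocab):
    """pair_metapaths: list of metapath signatures observed for one pair."""
    counts = Counter(pair_metapaths)
    return [counts.get(m, 0) for m in metapath_vocab]

def train(X, y):
    # X: one row of metapath counts per candidate pair.
    # y: 1 for known positives (e.g. ALDH2-FA), 0 for sampled negative pairs.
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)
    return model
```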