NCATS-Gamma / robokop

Master UI for ROBOKOP
MIT License

Add Synonymizers, change planning to use them #25

Closed cbizon closed 6 years ago

cbizon commented 6 years ago

Currently, we plan the building of a knowledge graph of entities using our type graph of identifiers. In this type graph, nodes are identifiers and edges are services that transform one identifier into another.

There are two types of identifier transformation: those that change semantic type (Drug to Gene) and those that do not (HGNC Gene to NCBI Gene). The latter are referred to as synonyms. The type graph as implemented complicates planning and execution in two ways.

First, we expect that good answers through graphs will be short. But in any shortness metric, a synonym edge should count less than an edge that changes semantic type. This leads to a very complicated Cypher query on the type graph to find shortest paths.
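
To make the shortness metric concrete, here is a minimal Python sketch of a weighted shortest-path search over a toy type graph. The identifier systems, the edges, and the weights (0.1 for a synonym edge, 1.0 for a type-changing edge) are illustrative assumptions, not values from ROBOKOP, and this is plain Dijkstra in memory rather than the Cypher query the planner actually runs.

```python
import heapq

# Hypothetical weights: a synonym hop costs far less than a type-changing hop.
SYNONYM_COST = 0.1
TRANSFORM_COST = 1.0

# Toy type graph: identifier system -> [(neighbor, is_synonym_edge)].
# Node names and edges are made up for illustration.
TYPE_GRAPH = {
    "DRUGBANK": [("HGNC", False), ("CHEMBL", True)],
    "CHEMBL": [("NCBIGENE", False)],
    "HGNC": [("NCBIGENE", True), ("DOID", False)],
    "NCBIGENE": [("DOID", False)],
    "DOID": [],
}


def cheapest_path(graph, start, goal):
    """Dijkstra's algorithm; returns (cost, path) or None if unreachable."""
    queue = [(0.0, [start])]
    settled = set()
    while queue:
        cost, path = heapq.heappop(queue)
        node = path[-1]
        if node == goal:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, is_synonym in graph[node]:
            step = SYNONYM_COST if is_synonym else TRANSFORM_COST
            heapq.heappush(queue, (cost + step, path + [neighbor]))
    return None
```

Even in this small form, every path has to carry a per-edge weight that depends on the edge kind, which is the bookkeeping that makes the equivalent query on the type graph awkward.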

Second, we need to be able to transform arbitrarily among the synonyms for a concept. At best, this would be a hub-and-spoke topology in which a single identifier system is considered canonical for a semantic type, and any synonym conversion goes through that canonical identifier in at most two hops. But for every spoke in that wheel, we would need an entry in the yaml file and a specific function implementing it.
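
A rough sketch of that hub-and-spoke routing, assuming HGNC as the canonical system for genes and CURIE-style `PREFIX:local_id` identifiers. The lookup tables stand in for the per-spoke yaml entries and functions; the names `SPOKE_TO_HUB`, `HUB_TO_SPOKE`, and `convert` are hypothetical.

```python
CANONICAL = "HGNC"  # assumed hub for the Gene type, for illustration only

# One table per spoke and direction, standing in for a yaml entry plus function.
SPOKE_TO_HUB = {
    "NCBIGENE": {"NCBIGENE:7157": "HGNC:11998"},
    "ENSEMBL": {"ENSEMBL:ENSG00000141510": "HGNC:11998"},
}
HUB_TO_SPOKE = {
    "NCBIGENE": {"HGNC:11998": "NCBIGENE:7157"},
    "ENSEMBL": {"HGNC:11998": "ENSEMBL:ENSG00000141510"},
}


def convert(identifier, target_prefix):
    """Return (converted_id, hops), routing through the canonical hub."""
    source_prefix = identifier.split(":", 1)[0]
    if source_prefix == target_prefix:
        return identifier, 0
    hops = 0
    if source_prefix != CANONICAL:
        # Spoke -> hub
        identifier = SPOKE_TO_HUB[source_prefix][identifier]
        hops += 1
    if target_prefix != CANONICAL:
        # Hub -> spoke
        identifier = HUB_TO_SPOKE[target_prefix][identifier]
        hops += 1
    return identifier, hops
```

Adding another identifier system means adding two more tables (or functions) here, which is the maintenance cost described above.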

For many of the types, however, full synonymizers exist. That is, you pass in one identifier for a concept (like an HGNC id) and get back a set of identifiers for the same concept in other identifier systems. For genes, both mygene.info and an HGNC service work this way. To incorporate those into our present system, we need to implement many individual transformation methods, each of which calls the general synonymizer, extracts and returns only what was asked for, and throws the rest away.
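
The wasteful pattern looks roughly like the sketch below. `synonymize_gene` is a hypothetical stand-in that returns hard-coded sample data instead of calling mygene.info or the HGNC service, and the wrapper names are made up for illustration.

```python
def synonymize_gene(identifier):
    """Hypothetical full synonymizer: one id in, all known ids out.

    A real version would call a remote service; this one uses sample data.
    """
    group = {
        "HGNC:11998",
        "NCBIGENE:7157",
        "ENSEMBL:ENSG00000141510",
        "UniProtKB:P04637",
    }
    return set(group) if identifier in group else {identifier}


def _pick(identifiers, prefix):
    """Keep the first identifier with the given prefix; discard the rest."""
    matches = sorted(i for i in identifiers if i.startswith(prefix + ":"))
    return matches[0] if matches else None


# One wrapper per edge in the type graph. Each repeats the same full lookup
# and then throws away everything except a single identifier.
def hgnc_to_ncbigene(hgnc_id):
    return _pick(synonymize_gene(hgnc_id), "NCBIGENE")


def hgnc_to_ensembl(hgnc_id):
    return _pick(synonymize_gene(hgnc_id), "ENSEMBL")


def hgnc_to_uniprot(hgnc_id):
    return _pick(synonymize_gene(hgnc_id), "UniProtKB")
```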

We can simplify this process by making use of these synonymizers directly. We maintain the information about type-transforming APIs, and use it to generate a concept-level knowledge map (this is already done). Then only this concept map is used for query planning. After every function call, the value returned from a knowledge source (KS) is passed to the appropriate synonymizer, which collects all synonyms and stores them in the node. When the node is passed to the next KS, the appropriate identifier is retrieved and used.
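
A minimal sketch of the proposed flow, under stated assumptions: the `Node` class, the `SYNONYMIZERS` registry, `synonymize`, and `ncbigene_only_source` are hypothetical names, identifiers are assumed to be CURIE-style `PREFIX:local_id` strings, and the synonymizer returns sample data rather than calling a real service.

```python
def synonymize_gene(identifier):
    """Hypothetical stand-in for a full gene synonymizer (no network call)."""
    group = {"HGNC:11998", "NCBIGENE:7157", "ENSEMBL:ENSG00000141510"}
    return set(group) if identifier in group else {identifier}


# Semantic type -> synonymizer. Types without an entry are left untouched.
SYNONYMIZERS = {"Gene": synonymize_gene}


class Node:
    """Hypothetical graph node that carries every known identifier."""

    def __init__(self, identifier, node_type):
        self.identifier = identifier
        self.node_type = node_type
        self.synonyms = {identifier}

    def id_for(self, prefix):
        """Return the stored identifier in the requested system, or None."""
        matches = sorted(s for s in self.synonyms if s.startswith(prefix + ":"))
        return matches[0] if matches else None


def synonymize(node):
    """Run after every knowledge-source call: collect and store all synonyms."""
    synonymizer = SYNONYMIZERS.get(node.node_type)
    if synonymizer is not None:
        node.synonyms |= synonymizer(node.identifier)
    return node


def ncbigene_only_source(node):
    """Hypothetical knowledge source that accepts only NCBI Gene identifiers."""
    gene_id = node.id_for("NCBIGENE")
    if gene_id is None:
        raise ValueError("no NCBIGENE identifier on node")
    return f"looked up {gene_id}"
```

With this shape, a node that arrives as `HGNC:11998` can be handed to a source that only understands NCBI Gene ids without any per-pair conversion function, and the planner never sees synonym edges at all.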

cbizon commented 6 years ago

Semantic Types and possible synonymizers:

Gene

Substance

Disease

GeneticCondition

GeneticCondition is-a Disease. So we should be able to use the Disease synonymizer for it.

Phenotype

Pathway

I'm not sure that these can be synonymized - further understanding is required.

Single ID system identifiers:

The following entities only have a single identifier system (that we are using) so no synonymizing needs to be done (yet).

Anatomy (Uberon)

Cell (Cell Ontology)

BiologicalProcess (GO)

CellularComponent (GO)

MolecularFunction (GO)
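
One way the lookup by semantic type could work, covering both the is-a case (GeneticCondition falling back to the Disease synonymizer) and the single-identifier types that need no synonymizing. This is only a sketch: `PARENT_TYPE`, `SYNONYMIZERS`, `find_synonymizer`, and the stub synonymizer functions are hypothetical names, and the stubs return their input unchanged instead of calling a real service.

```python
def synonymize_gene(identifier):
    """Stub: a real version would query a gene synonymizer service."""
    return {identifier}


def synonymize_disease(identifier):
    """Stub: a real version would query a disease synonymizer service."""
    return {identifier}


# Assumed is-a relationships between semantic types (child -> parent).
PARENT_TYPE = {"GeneticCondition": "Disease"}

# Only types with a known synonymizer are registered. Single-identifier
# types (Anatomy, Cell, BiologicalProcess, ...) are deliberately absent.
SYNONYMIZERS = {
    "Gene": synonymize_gene,
    "Disease": synonymize_disease,
}


def find_synonymizer(semantic_type):
    """Walk up the is-a chain until a registered synonymizer is found."""
    current = semantic_type
    while current is not None:
        if current in SYNONYMIZERS:
            return SYNONYMIZERS[current]
        current = PARENT_TYPE.get(current)
    return None  # nothing to do for this type (yet)
```

Registering a synonymizer for a currently single-identifier type later would then be a one-line change to the registry.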