Add Synonymizers, change planning to use them

cbizon commented 6 years ago

Currently, we plan the building of a knowledge graph of entities using our type graph of identifiers. In this type graph, nodes are identifiers and edges are services that transform one identifier into another.

There are two types of identifier transformations: those that change semantic types (Drug to Gene) and those that do not (HGNC Gene to NCBI Gene). The latter are referred to as synonyms. The type graph as implemented complicates planning and execution in a couple of ways.

First, we expect that good answers through graphs will be short. But in any kind of shortness metric, a synonym edge should count less than a semantic-type-transforming edge. This leads to a very complicated Cypher query on the type graph to create shortest paths.
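To illustrate the weighting problem, here is a minimal sketch of a shortest-path search in which synonym edges cost less than type-transforming edges. The graph contents, the cost values, and the function name `cheapest_path` are all assumptions made for illustration; this is plain Python, not the Cypher query we actually run.

```python
import heapq

# Hypothetical edge costs: a synonym hop is cheap, a type-changing hop is expensive.
SYNONYM_COST = 0.1
TRANSFORM_COST = 1.0

# Toy type graph (made up): identifier system -> list of (neighbor, is_synonym_edge).
TYPE_GRAPH = {
    "DRUGBANK": [("CHEMBL", True)],
    "CHEMBL": [("HGNC", False)],
    "HGNC": [("NCBIGENE", True)],
    "NCBIGENE": [("MONDO", False)],
    "MONDO": [],
}


def cheapest_path(graph, start, goal):
    """Dijkstra over the type graph; returns (cost, path) or None if unreachable."""
    queue = [(0.0, start, [start])]
    best = {}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in best and best[node] <= cost:
            continue  # already reached this node at least as cheaply
        best[node] = cost
        for neighbor, is_synonym in graph.get(node, []):
            step = SYNONYM_COST if is_synonym else TRANSFORM_COST
            heapq.heappush(queue, (cost + step, neighbor, path + [neighbor]))
    return None
```

Even in this toy form, the planner has to carry two kinds of edges and a cost model, which is the complexity that planning on a concept-level graph would avoid.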

Second, we need to be able to transform arbitrarily across the synonyms for a concept. At best, this would be kind of a hub-spoke topology where a single identifier system was considered canonical for a semantic type and any arbitrary synonyming would go through that canonical identifier in two hops. But for every spoke in that wheel, we'd need to have an entry in the yaml file, and implement a specific function.

For many of the types, however, full-synonymizers exist. That is, you pass in one identifier for a concept (like an HGNC id) and get back a set of identifiers for the same concept in other identifier systems. For genes, both mygene.info and an HGNC service perform this way. To incorporate those into our present system, we need to implement many individual transformation methods. Each method calls the general synonymizer, then extracts and returns only what is asked for and throws the rest away.
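The wasteful pattern described above looks roughly like the following sketch. The function names and the identifier values are hypothetical placeholders, and `gene_synonymizer` is a hard-coded stand-in for a real service such as mygene.info or the HGNC API, not a call to either.

```python
def gene_synonymizer(identifier):
    """Stand-in for a full synonymizer: one identifier in, all known synonyms out.

    The table below holds made-up placeholder values, not real gene identifiers.
    """
    concept = {"HGNC:0001", "NCBIGENE:0001", "ENSEMBL:0001"}
    return set(concept) if identifier in concept else set()


def hgnc_to_ncbigene(identifier):
    """One edge in the type graph: keeps NCBIGENE synonyms, discards the rest."""
    return {s for s in gene_synonymizer(identifier) if s.startswith("NCBIGENE:")}


def hgnc_to_ensembl(identifier):
    """Another edge: the same underlying call, a different slice of the result."""
    return {s for s in gene_synonymizer(identifier) if s.startswith("ENSEMBL:")}
```

Every pair of identifier systems needs its own thin wrapper like these, even though each wrapper makes the same underlying call.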

We can simplify this process by making use of these synonymizers directly. We maintain the information about type-transforming APIs, and use it to generate a concept-level knowledge map (this is already done). Then only this concept map is used for query planning. After every function call, the value returned from a knowledge source (KS) is passed to the appropriate synonymizer, which collects all synonyms and stores them in the node. When the node is passed to the next KS, the appropriate identifier is retrieved and used.
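A rough sketch of that flow, under stated assumptions: the class `ConceptNode`, the `SYNONYMIZERS` registry, and the placeholder identifiers are all invented here for illustration and are not the existing code.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptNode:
    """Hypothetical node that carries every known identifier for one concept."""
    identifier: str
    node_type: str
    synonyms: set = field(default_factory=set)

    def get_identifier(self, prefix):
        """Return a stored identifier with the given prefix, or None if absent."""
        matches = sorted(s for s in self.synonyms if s.startswith(prefix + ":"))
        return matches[0] if matches else None


def fake_gene_synonymizer(identifier):
    """Stand-in for a real gene synonymizer; returns made-up placeholder values."""
    return {"HGNC:0001", "NCBIGENE:0001", "ENSEMBL:0001"}


# One synonymizer per semantic type (assumed registry layout).
SYNONYMIZERS = {"Gene": fake_gene_synonymizer}


def synonymize(node):
    """Run after each KS call: collect all synonyms and store them on the node."""
    node.synonyms.add(node.identifier)
    synonymizer = SYNONYMIZERS.get(node.node_type)
    if synonymizer is not None:
        node.synonyms.update(synonymizer(node.identifier))
    return node
```

A downstream KS that only understands NCBI Gene identifiers would then call `node.get_identifier("NCBIGENE")` rather than requiring its own edge in the type graph.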

cbizon commented 6 years ago

[x] Identify a synonymizer for each semantic type
[x] Identify a canonical identifier system for each semantic type
[x] Implement services for each synonymizer
[x] refactor rosetta.yml
[x] refactor query planning
[x] refactor query execution, including how synonyms are bubbled to neo4j / UI

cbizon commented 6 years ago

Semantic Types and possible synonymizers:

Gene

mygene.info: Has the advantage of having a smartAPI included in the registry. I think that BioThings Explorer uses this as its gene synonymizer.
HGNC API: The advantage is that we already have some code for calling this one

Substance

Pharos uses its own IDs internally, but understands CHEMBL identifiers and (for drugs) INN/USAN names.
CTDBase has text names, MeSH, CasRN, and DrugBank IDs. The text names are the internal key between tables in CTDBase, but the terms come from MeSH, and every term has a mapping to a MeSH identifier.
DrugBank's API is $$.
OXO maps between MeSH, UMLS, SNOMED, NCIT, DrugBank, CAS and CHEBI, though it requires hops set to 3 and also brings back a bunch of PubMed IDs as identifiers (?)
PubChem sort of does this, but I think you need to start with a PubChem ID (or name), and then CHEMBL is only included if a user has added it.
I don't believe that the KBA includes drugs (but Richard might correct me).
EBI's UniChem does convert to CHEMBL! It doesn't take everything (not MeSH), but it does take DrugBank and CHEBI.

Disease

MONDO: We could just use the MONDO ontology x-refs and build something
OXO: Basically this is the same thing, but includes more (maybe good, maybe bad), and we already have some code for it.

GeneticCondition

GeneticCondition is-a Disease, so we should be able to use the Disease synonymizer for it.

Phenotype

HPO synonyms/x-refs
OXO

Pathway

I'm not sure that these can be synonymized; further understanding is required.

Single ID system identifiers:

The following entities only have a single identifier system (that we are using) so no synonymizing needs to be done (yet).

Anatomy (Uberon)
Cell (Cell Ontology)
BiologicalProcess (GO)
CellularComponent (GO)
MolecularFunction (GO)

NCATS-Gamma / robokop

Add Synonymizers, change planning to use them #25
