NCATS-Tangerine / translator-knowledge-beacon

NCATS Translator Knowledge Beacon Application Programming Interface plus Sample code
MIT License

Add a path to return a knowledge map #40

Open cmungall opened 6 years ago

cmungall commented 6 years ago

with @stevencox @balhoff @jmcmurry

For smartapi and beacons we will define a way a knowledge source can advertise what types and identifiers are related by what relations/predicates.

This would be a list such as the following

-
  subject:
    semantic_type: gene
    prefixes:
       - NCBIGene
       - ENSEMBL
       - HGNC
       - MGI
  predicate:
    id: RO:nnn
    label: has phenotype
  object:
    semantic_type: phenotype
    prefixes:
       - MP
       - HP
  count: 500 # optional
  description: >
     blah blah
-
  ...      
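
A minimal sketch of how a client might consume such a list, assuming it has already been parsed into Python dictionaries (for example by a YAML loader). The function name `index_by_types` is hypothetical and not part of any beacon API, and `RO:nnn` is the placeholder from the proposal above, not a real relation id.

```python
from collections import defaultdict

# One knowledge-map entry mirroring the YAML above.
# "RO:nnn" is the placeholder used in the proposal, not a real identifier.
KMAP = [
    {
        "subject": {"semantic_type": "gene",
                    "prefixes": ["NCBIGene", "ENSEMBL", "HGNC", "MGI"]},
        "predicate": {"id": "RO:nnn", "label": "has phenotype"},
        "object": {"semantic_type": "phenotype", "prefixes": ["MP", "HP"]},
        "count": 500,  # optional
    },
]


def index_by_types(kmap):
    """Group entries by (subject semantic type, object semantic type)."""
    index = defaultdict(list)
    for entry in kmap:
        key = (entry["subject"]["semantic_type"],
               entry["object"]["semantic_type"])
        index[key].append(entry)
    return dict(index)


index = index_by_types(KMAP)
labels = [e["predicate"]["label"] for e in index[("gene", "phenotype")]]
print(labels)
```

With an index like this, a planner could look up which predicates connect two semantic types before deciding which source to call.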
cbizon commented 6 years ago

This is good and should go a long way towards helping the reasoner teams. Vlado and I showed an example of how the yaml for one of the biolink endpoints could be annotated like this.

I am still struggling a little with the object prefixes. The idea is that these are "might get" prefixes. You can't know ahead of time which of them you will get. So if a reasoner is planning a set of calls, it has to anticipate dealing with all of the possibilities. That probably means calling an identifier translation service in between the output of one call and the input of the next (think PIKR or OXO).

So how would the yaml above describe one of these ID translators? Usually they take any kind of ID (for one or more entities) and return all the kinds of IDs they know about for that object. So if I had a gene ID service I would get

subject:
  prefixes:
    - NCBIGene
    - ENSEMBL
    - HGNC
    ...
object:
  prefixes:
    - NCBIGene
    - ENSEMBL
    - HGNC

But here the output prefixes are not "might gets" but "will give you these if I have them".

Again, the reason that this is important is in trying to plan a path through sources/IDs ahead of time. If the identifier service above can only tell me that it might give me HGNC ids, then it doesn't help me any in the planning. I can imagine 3 ways around this:

  1. modify the yaml above somehow to distinguish between these cases (maybe an annotation for required prefixes?)

  2. If we want to use a service like OXO, wrap it in a service that we stand up that implements a bunch of endpoints like NCBIGeneToHGNC, and then annotate those endpoints (this is our current solution)

  3. Treat ID translation as a separate activity from Knowledge Source querying. I don't think this is very elegant, but it would probably work.
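
Option 2 can be sketched roughly as follows: one narrowly annotated endpoint per ordered pair of prefixes, so that each endpoint has exactly one input prefix and one guaranteed output prefix. This is only an illustration under assumptions; `pairwise_endpoints` and the dictionary layout are hypothetical, and only the `NCBIGeneToHGNC` naming pattern comes from the comment above.

```python
from itertools import permutations


def pairwise_endpoints(prefixes):
    """Return one annotated endpoint per ordered (input, output) prefix pair."""
    return [
        {
            "name": f"{src}To{dst}",           # e.g. NCBIGeneToHGNC
            "subject": {"prefixes": [src]},    # the single accepted input prefix
            "object": {"prefixes": [dst]},     # the single promised output prefix
        }
        for src, dst in permutations(prefixes, 2)
    ]


endpoints = pairwise_endpoints(["NCBIGene", "ENSEMBL", "HGNC"])
names = [e["name"] for e in endpoints]
print(names)
```

The number of endpoints grows as n(n-1) for n prefixes, which is the price of making every output prefix certain rather than a "might get".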

Sorry for the long comment...

micheldumontier commented 6 years ago

one option is for us to specify a triple-based metamap, in which we have subject type, relation, object type so that we maintain the relationship between subject and object types e.g.

alternatively, we can specify this relation with the more precise identifier type that holds the relation e.g

if we have a mapping between drugbank and drug, and doid and disease, then we get back to the first declaration. if we go with the first option, then we would get a list of all possible drug and disease identifier types.

cmungall commented 6 years ago

@cbizon - ok, so if each path object represents a potential path through semantic types, it could additionally be annotated by ID paths within that concept path, such as in the following

-
  subject:
    semantic_type: gene
    prefixes:
       - NCBIGene
       - ENSEMBL
       - HGNC
       - MGI
  predicate:
    id: RO:nnn
    label: has phenotype
  object:
    semantic_type: phenotype
    prefixes:
       - MP
       - HP
  idpaths:
    -
      subj: MGI
      obj: MP
    -
      subj: HGNC
      obj: HP
    -
      subj: ENSEMBL
      obj: HP
    -
      subj: ENSEMBL
      obj: MP
  count: 500 # optional

here we are saying that for a gene-phenotype path, if you give an MGI you will get an MP, and if you give an HGNC you get an HP. However, if you give an ENSEMBL then you can get either an HP or an MP. There are a variety of ways a planner could infer which of the two it will get (based on the species of the input, knowledge that the root node of each ontology is for a respective species, and that has-phenotype is generally species-preserving) but this is probably out of scope.
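
As a rough sketch of how a planner might use the `idpaths` annotation, the function below returns the object prefixes to expect for a given subject prefix. The function name is hypothetical, the data follows the prose description above, and the fallback to the full object prefix list when no id path is declared is an assumption, not part of the proposal.

```python
ENTRY = {
    "subject": {"semantic_type": "gene",
                "prefixes": ["NCBIGene", "ENSEMBL", "HGNC", "MGI"]},
    "object": {"semantic_type": "phenotype", "prefixes": ["MP", "HP"]},
    "idpaths": [
        {"subj": "MGI", "obj": "MP"},
        {"subj": "HGNC", "obj": "HP"},
        {"subj": "ENSEMBL", "obj": "HP"},
        {"subj": "ENSEMBL", "obj": "MP"},
    ],
}


def possible_object_prefixes(entry, subject_prefix):
    """Return the sorted object prefixes that may come back for subject_prefix."""
    matches = {p["obj"] for p in entry.get("idpaths", [])
               if p["subj"] == subject_prefix}
    if matches:
        return sorted(matches)
    # Assumption: with no declared id path, any listed object prefix is possible.
    return sorted(entry["object"]["prefixes"])


print(possible_object_prefixes(ENTRY, "MGI"))      # ['MP']
print(possible_object_prefixes(ENTRY, "ENSEMBL"))  # ['HP', 'MP']
```

A result with a single prefix lets the planner chain calls directly; a result with several means it still has to handle more than one identifier type.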

micheldumontier commented 6 years ago

I would go for an explicit list of the triple patterns:

- subject_type: gene
  subject_id_type: MGI
  predicate_type: has phenotype
  predicate_id: RO:nnn
  object_type: phenotype
  object_id_type: MP
- subject_type: gene
  subject_id_type: MGI
  predicate_type: has phenotype
  predicate_id: RO:nnn
  object_type: phenotype
  object_id_type: HP

or using lists, this would indicate all s-o combinations possible through the specified relation

- subject_type: gene
  subject_id_type: 
   - NCBIGene
   - ENSEMBL
   - HGNC
   - MGI
  predicate_type: has phenotype
  predicate_id: RO:nnn
  object_type: phenotype
  object_id_type: 
    - MP
    - HP
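
To make the relationship between the two forms concrete, here is a small sketch that expands the list form into the explicit triple patterns, one per subject/object id type combination. The function name `expand` is hypothetical, and `RO:nnn` remains the placeholder from the examples above.

```python
from itertools import product

LIST_FORM = {
    "subject_type": "gene",
    "subject_id_type": ["NCBIGene", "ENSEMBL", "HGNC", "MGI"],
    "predicate_type": "has phenotype",
    "predicate_id": "RO:nnn",  # placeholder, as in the proposal
    "object_type": "phenotype",
    "object_id_type": ["MP", "HP"],
}


def expand(pattern):
    """Expand a list-form pattern into one explicit triple per s-o combination."""
    return [
        {**pattern, "subject_id_type": s, "object_id_type": o}
        for s, o in product(pattern["subject_id_type"],
                            pattern["object_id_type"])
    ]


triples = expand(LIST_FORM)
print(len(triples))  # 8 combinations: 4 subject id types x 2 object id types
```

Note that the list form necessarily claims all eight combinations, whereas the explicit form can declare only those a source actually supports.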
cbizon commented 6 years ago

I really like the explicitness of these proposals. The more explicitly the services are defined, the easier it is for an automated tool (reasoner, or whatever else) to incorporate them into a plan.

At the risk of backpedalling a little, though, some of these examples are helping me to realize that even with these very explicit specifications, there are going to be services where it is not possible to say with certainty exactly what ID type I will get back for a given ID input type. So I (or other automators) will have to write some code to handle this general problem case.

And if I have to write this kind of code anyway, then I start to question the wisdom of asking others to do this level of specification. And just to be clear: I still definitely want semantic types and id types, I'm just waffling on the need for extra annotation to handle the "might-get" id case.

micheldumontier commented 6 years ago

There's no guarantee, but rather it is a specification for what can possibly be returned by the service.

RichardBruskiewich commented 6 years ago

"...if we have a mapping between drugbank and drug, and doid and disease, then we get back to the first declaration..."

Since the mapping from identifier namespace prefixes to semantic data types is not cleanly one-to-one, we likely introduce ambiguity by reducing the description to triples of either semantic data types or identifier prefixes (or data sources).

The original specification seems to offer a better way of resolving such ambiguities.

RichardBruskiewich commented 6 years ago

We're iterating on an implementation of this on our side.

Along the way, we are thinking that perhaps the /types, /predicates and /kmaps JSON outputs can be more tightly harmonised with emerging standards.

Here is the iteration as of April 27, 2018, incorporating the spirit of the emerging Translator Knowledge Graph standard.

First of all, we also propose that the '/types' call be renamed to '/categories' since that is what concept types are called in the 'node' properties of the Translator Knowledge Graph standard.

The proposed JSON for /kmap is:

{
  "subject": {
    "category": "chemical substance",
    "prefixes": []
  },
  "predicate": {
    "edge_label": "affects_risk_for",  # Snake case standard - minimal predicate
    "relation": "reduces_condition",   # the maximal predicate returned
    "negated": false                   # is the logical opposite of the predicate assumed in this statement?
  },
  "object": {
    "category": "disease",
    "prefixes": []
  },
  "frequency": 42,
  "description": "chemical substance reduces condition disease"
}

Note that the Biolink Model terms will generally be used in the /kmap unless circumstances prevent this (e.g. a proper mapping of local terms to Biolink has not yet been implemented for a given beacon).

The JSON for /categories and /predicates could be something like this (note: the above kmap, categories and predicates are not aligned, but you should get the general idea)

For /categories, something like this:

{
  "id": "biolink:AnatomicalEntity",
  "uri": "http://bioentity.io/vocab/AnatomicalEntity",
  /* we use 'category' instead of 'name' to reflect the emerging TKG standard */
  "category": "anatomical entity",
  "local_id": "UMLSSG:ANAT",
  "local_uri": "https://metamap.nlm.nih.gov/Docs/SemGroups_2013#ANAT",
  "local_category": "ANAT",
  "description": "A subcellular location, cell type or gross anatomical part",
  "frequency": 4295
},

For /predicates, something like this:

{
  "id": "biolink:gene_associated_with_condition",
  "uri": "http://bioentity.io/vocab/gene_associated_with_condition",
  /* we use 'predicate' here instead of 'name' to align with the emerging TKG standard */
  "edge_label": "gene_associated_with_condition",  # Snake case standard - minimal predicate?
  "relation": "exacerbates disease course",        # maximal predicate
  "local_id": "SIO:000983",
  "local_uri": "http://semanticscience.org/resource/SIO_000983",
  "local_relation": "gene-disease association",    # only need 'relation' because the local id is "precise"
  "description": "A gene-disease association is an association between a gene and a disease.",
  "frequency": 1234
},

It is implied here that local identifiers map one-to-one to Biolink, but I am not sure how to handle degenerate lists of local variables that map many-to-one to Biolink.

micheldumontier commented 6 years ago

ok

RichardBruskiewich commented 5 years ago

After a few months' experience with the above /categories and /predicates interfaces, it seems apparent that some of the content may be implied by the Biolink Model, or adequately documented in a single CURIE, so some fields are semantically redundant. The following changes to those metadata APIs are therefore proposed for the next iteration of the beacon API (possibly release 1.3.0):

For /categories, we remove the 'id' and 'uri' fields (these are inferred from the Biolink Model) plus the 'local_id' and 'local_uri' fields, and simply insist that the 'local_category' field be a CURIE. Note that for some concept types - like the SemMedDb UMLS semantic groups, e.g. 'ANAT' - a namespace may need to be defined within the Translator project (i.e. UMLSSG) which can be documented somewhere by the project and exposed by Translator tools.

{
  "category": "anatomical entity",
  "local_category": "UMLSSG:ANAT",
  "description": "A subcellular location, cell type or gross anatomical part",
  "frequency": 4295
},
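
For beacons that already emit the earlier, fuller /categories record, the change might be applied with a small conversion like the sketch below. The function name `slim_category` is hypothetical, and it assumes the old 'local_id' already holds the CURIE that should become the new 'local_category', as it does in the example records in this thread.

```python
# Fields dropped from /categories in the proposed next iteration.
DROPPED = ("id", "uri", "local_id", "local_uri")

OLD_RECORD = {
    "id": "biolink:AnatomicalEntity",
    "uri": "http://bioentity.io/vocab/AnatomicalEntity",
    "category": "anatomical entity",
    "local_id": "UMLSSG:ANAT",
    "local_uri": "https://metamap.nlm.nih.gov/Docs/SemGroups_2013#ANAT",
    "local_category": "ANAT",
    "description": "A subcellular location, cell type or gross anatomical part",
    "frequency": 4295,
}


def slim_category(old):
    """Convert an old-style /categories record to the slimmer proposed form."""
    new = {k: v for k, v in old.items() if k not in DROPPED}
    # Assumption: the old 'local_id' is already the CURIE for the local category.
    new["local_category"] = old["local_id"]
    return new


print(slim_category(OLD_RECORD))
```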

For /predicates, we remove the 'id' and 'uri' fields (these are inferred from the Biolink Model) plus the 'local_id', 'local_uri' and 'local_relation' fields, but insist that the 'relation' field contain the 'local relation', which should usually be a CURIE (as documented in the Translator Knowledge Graph standard) although technically it could be a simpler string (we discourage this, since a plain string cannot be resolved to an authority).

{
  "edge_label": "gene_associated_with_condition",  # Snake case standard - minimal predicate?
  "relation": "SIO:000983",
  "description": "A gene-disease association is an association between a gene and a disease.",
  "frequency": 1234
},

Note that in both of the above cases, the 'description' field should ideally be the 'local' description but could also default to the Biolink one, if a local one is not available.

In addition, there are some knowledge sources for which the 'frequency' of usage of a given category or predicate term may be indeterminate (e.g. beacons that harvest knowledge dynamically from a network of APIs, such as the new Biothings Explorer beacon). For such beacons, the 'frequency' field may be set to 'null', meaning 'not computed or computable'.
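
Clients that aggregate frequencies would then need to treat 'null' as "unknown" rather than zero. A minimal sketch, assuming the records arrive as JSON (Python's `json` module decodes `null` to `None`); the function name `summarize_frequencies` and the sample payload are illustrative only.

```python
import json

# Illustrative payload: two records with counts, one with an indeterminate count.
PAYLOAD = """
[
  {"category": "anatomical entity", "frequency": 4295},
  {"category": "disease", "frequency": 1234},
  {"category": "gene", "frequency": null}
]
"""


def summarize_frequencies(records):
    """Sum the known frequencies and count records whose frequency is unknown."""
    known = [r["frequency"] for r in records if r.get("frequency") is not None]
    unknown = sum(1 for r in records if r.get("frequency") is None)
    return {"total_known": sum(known), "unknown": unknown}


summary = summarize_frequencies(json.loads(PAYLOAD))
print(summary)
```

Reporting the unknown count separately avoids silently understating totals for dynamically harvested beacons.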