biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://api.bte.ncats.io
Apache License 2.0

update edge provenance info to comply with Translator Standard -- July 1 #208

Closed andrewsu closed 2 years ago

andrewsu commented 3 years ago

The parent ticket is here: https://github.com/NCATSTranslator/TranslatorArchitecture/issues/48

This is an example edge with provenance:

            "edges": {
                "CHEBI:41423-biolink:metabolic_processing_affected_by-NCBIGene:1576": {
                    "predicate": "biolink:metabolic_processing_affected_by",
                    "subject": "CHEBI:41423",
                    "object": "NCBIGene:1576",
                    "attributes": [
                        {
                            "attribute_type_id": "provided_by",
                            "value": [
                                "drugbank"
                            ],
                            "value_type_id": "biolink:provided_by"
                        },
                        {
                            "attribute_type_id": "api",
                            "value": [
                                "MyChem.info API"
                            ],
                            "value_type_id": "bts:api"
                        },
                        {
                            "attribute_type_id": "publications",
                            "value": [
                                "PMID:22336956"
                            ],
                            "value_type_id": "biolink:publication"
                        },
                        {
                            "attribute_type_id": "action",
                            "value": "substrate",
                            "value_type_id": "bts:action"
                        },
                        {
                            "attribute_type_id": "function",
                            "value": "Vitamin d3 25-hydroxylase activity",
                            "value_type_id": "bts:function"
                        }
                    ]
                },

These are the desired edge properties (copied from the parent ticket):

  primary knowledge source:
    is_a: knowledge source
    description: >-
      The most upstream source of the knowledge expressed in an Association that an
      implementer can identify (may or may not be the 'original' source).
    range: information resource
    multivalued: false

  original knowledge source:
    is_a: primary knowledge source
    description: >-
      The Information Resource that created the original record of the knowledge expressed
      in an Association (e.g. via curation of the knowledge from the literature, or
      generation of the knowledge de novo through computation, reasoning, inference over
      data).
    range: information resource
    multivalued: false

  aggregator knowledge source:
    is_a: knowledge source
    description: >-
      An intermediate aggregator resource from which knowledge expressed in an Association was
      retrieved downstream of the original source, on its path to its current serialized form.
    range: information resource
    multivalued: true
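
One way to read the `multivalued` flags above is as a cardinality rule on an edge's attributes: at most one primary (or original) knowledge source, but any number of aggregators. The following is only a sketch of such a check, not BTE code. The function name `check_source_cardinality` is hypothetical, and the `biolink:`-prefixed slot names are assumed spellings of the slots defined above.

```python
from collections import Counter

# Assumed CURIE spellings of the slots defined above, mapped to their
# "multivalued" flags (True = may appear with more than one value).
MULTIVALUED = {
    "biolink:primary_knowledge_source": False,
    "biolink:original_knowledge_source": False,
    "biolink:aggregator_knowledge_source": True,
}


def check_source_cardinality(attributes):
    """Return the single-valued source slots that carry more than one value."""
    counts = Counter()
    for attr in attributes:
        slot = attr.get("attribute_type_id")
        if slot in MULTIVALUED:
            value = attr.get("value")
            # A value may be a bare string or a list of strings.
            counts[slot] += len(value) if isinstance(value, list) else 1
    return sorted(
        slot for slot, n in counts.items() if n > 1 and not MULTIVALUED[slot]
    )
```

An empty list means the edge respects the flags; anything else names the slots that were over-filled.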
colleenXu commented 3 years ago

[updated 7/21 to reflect the discussion in the 7/20 lab call; rearranged to put what we're doing first at the top]


Situation B: BTE uses x-bte to get the edge, and the API it calls counts as an "aggregator". This is the situation for most APIs BTE uses (see Situations A/C for the exceptions).

Update notes:

What to do:

BTE has to add all the source-related information to the edge attributes array:

  1. BTE should say it is an aggregator: add a hard-coded object @ariutta. This is needed for all 3 situations.
  2. The API BTE called to get the edge is an aggregator. Currently this info is the "attribute_type_id":"api" object.
    • code @ariutta :
      1. CHANGE "attribute_type_id":"api" to "attribute_type_id":"biolink:aggregator_knowledge_source".
      2. CHANGE where BTE gets the value of the "attribute_type_id":"api" object, so that it comes from the info.x-translator.infores-curie field of the API's SmartAPI registry file.
      3. CHANGE the "value_type_id":"bts:api" to "value_type_id":"biolink:InformationResource".
    • CX: All APIs that BTE uses with x-bte have been updated; they have the info.x-translator.infores-curie property.
  3. The x-bte annotation says "where the edge is from" using a hard-coded "source" field. This can count as a "primary" knowledge source (where BTE thinks this info is from). Currently this info is the "attribute_type_id":"provided_by" object.
    • code @ariutta :
      1. CHANGE "attribute_type_id":"provided_by" to "attribute_type_id":"biolink:primary_knowledge_source".
      2. CHANGE the "value_type_id":"biolink:provided_by" to "value_type_id":"biolink:InformationResource".
    • CX: All APIs that BTE uses with x-bte have been updated, so the hard-coded source is either set to the corresponding infores ID or absent.
  4. For now, don't do anything to the "bts:source" attribute. It comes from the response-mapped field of the API (the API telling us where it thinks the info is from).
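
Step 2.2 above amounts to a nested lookup in the parsed SmartAPI record. This is a minimal sketch, assuming the registry file has already been loaded into a Python dict; `get_infores_curie` and the trimmed `record` are hypothetical, and only the `info.x-translator.infores-curie` path comes from this thread.

```python
def get_infores_curie(smartapi_record):
    """Return info.x-translator.infores-curie from a parsed SmartAPI record.

    Returns None when any level of the path is missing, so the caller can
    decide whether to skip the attribute or fall back to something else.
    """
    info = smartapi_record.get("info") or {}
    x_translator = info.get("x-translator") or {}
    return x_translator.get("infores-curie")


# Hypothetical, heavily trimmed record for illustration only.
record = {
    "info": {
        "title": "BioLink API",
        "x-translator": {"infores-curie": "infores:biolink-api"},
    }
}

print(get_infores_curie(record))  # prints "infores:biolink-api"
```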

A Current-and-Desired example:

Current (the source-related attribute objects for an edge):


                    "attributes": [
                        {
                            "attribute_type_id": "api",
                            "value": [
                                "BioLink API"
                            ],
                            "value_type_id": "bts:api"
                        },
                        {
                            "attribute_type_id": "provided_by",
                            "value": [
                                "Monarch Initiative"
                            ],
                            "value_type_id": "biolink:provided_by"
                        },
                        {
                            "attribute_type_id": "source",
                            "value": [
                                "https://archive.monarchinitiative.org/#omim"
                            ],
                            "value_type_id": "bts:source"
                        },
                        ......
                     ]

Desired (comments as //):

                       {  // add this
                            "attribute_type_id": "biolink:aggregator_knowledge_source",
                            "value": ["infores:translator-biothings-explorer"],
                            "value_type_id": "biolink:InformationResource"
                        },
                       { // corresponds to the "api" object above
                            "attribute_type_id": "biolink:aggregator_knowledge_source",
                            "value": ["infores:biolink-api"],
                            "value_type_id": "biolink:InformationResource"
                        },
                       { // corresponds to the "provided_by" object above
                            "attribute_type_id": "biolink:primary_knowledge_source",
                            "value": ["infores:monarchinitiative"],
                            "value_type_id": "biolink:InformationResource"
                        },
                        { // no change to the "source" object above
                            "attribute_type_id": "source",
                            "value": [
                                "https://archive.monarchinitiative.org/#omim"
                            ],
                            "value_type_id": "bts:source"
                        },
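
The Current-to-Desired mapping for Situation B could be sketched roughly as below. This is not the BTE implementation (which is not shown in this thread); `source_attribute` and `build_situation_b_attributes` are hypothetical helper names. The infores IDs and attribute/value type IDs are taken from the Desired example above. The primary source is optional because the hard-coded x-bte source may be absent.

```python
BTE_INFORES = "infores:translator-biothings-explorer"


def source_attribute(slot, infores_id):
    """Build one source-related attribute object in the Desired shape."""
    return {
        "attribute_type_id": slot,
        "value": [infores_id],
        "value_type_id": "biolink:InformationResource",
    }


def build_situation_b_attributes(api_infores, source_infores=None, other_attributes=()):
    """Assemble the source-related attributes for a Situation B edge.

    api_infores:      infores ID of the API BTE called (an aggregator)
    source_infores:   hard-coded x-bte source, if any (the primary source)
    other_attributes: attributes passed through untouched, e.g. "bts:source"
    """
    attributes = [
        # BTE always declares itself as an aggregator.
        source_attribute("biolink:aggregator_knowledge_source", BTE_INFORES),
        # Replaces the old "api" object.
        source_attribute("biolink:aggregator_knowledge_source", api_infores),
    ]
    if source_infores is not None:
        # Replaces the old "provided_by" object.
        attributes.append(
            source_attribute("biolink:primary_knowledge_source", source_infores)
        )
    attributes.extend(other_attributes)
    return attributes


unchanged_source = {
    "attribute_type_id": "source",
    "value": ["https://archive.monarchinitiative.org/#omim"],
    "value_type_id": "bts:source",
}
desired = build_situation_b_attributes(
    "infores:biolink-api", "infores:monarchinitiative", [unchanged_source]
)
```

With the inputs shown, `desired` holds the four objects from the Desired example, in the same order.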

Here are some thoughts on how to update provenance. The situations below are based on which API BTE called to get that edge.

Important notes to read first:


Situation A: BTE ingests edge from a TRAPI API

Currently BTE ingests these TRAPI APIs:

What to do:

Situation C: BTE uses x-bte to get the edge, the API we call counts as "primary".

This is the situation for APIs from the multiomics and text mining providers, since they create knowledge from their analysis of data/publications, and perhaps for some external APIs that we bring in.

The APIs BTE ingests right now that fit this are:

Other APIs that fit this (but BTE doesn't ingest right now):

What to do:

BTE has to add all the source-related information to the edge attributes array:

  1. Talk to those teams. They should have their own ideas of how to model their source-related info in TRAPI, and may have source info that's not currently in the APIs but that they want to add.
  2. BTE should say it is an aggregator (same as the other two situations).
  3. The API BTE called to get the edge is a primary knowledge source. Currently this info is the "attribute_type_id":"api" object. Do the same as Situation B above, but set attribute_type_id to "biolink:primary_knowledge_source".
  4. If the team can describe the data source it used to make its knowledge, we could put that in as a hard-coded x-bte source; perhaps this is a "supporting data source". We could then treat it like the corresponding section in Situation B, except setting the attribute_type_id to "biolink:supporting_data_source".
  5. It would be awesome, but maybe a reach, if we could add the URL for more info on the KP APIs (see the desired example's primary_knowledge_source object).

An example:

Ideally from the Clinical Risk KP API (the source-related attribute objects for an edge); this doesn't exist right now:

                    "attributes": [
                        {
                            "attribute_type_id": "api",
                            "value": [
                                "Clinical Risk KP API"
                            ],
                            "value_type_id": "bts:api"
                        },
                        {
                            "attribute_type_id": "provided_by",
                            "value": [
                                "clinical-records-washington-2018"
                            ],
                            "value_type_id": "biolink:provided_by"
                        },
                        {
                            "attribute_type_id": "provenance",
                            "value": "https://github.com/NCATSTranslator/Translator-All/wiki/EHR-Risk-KP",
                            "value_type_id": "bts:provenance"
                        }
                        ......
                     ]

Desired (comments as //): Notice that the URL the Clinical Risk KP API gave was moved under the primary knowledge source. Also, I made up the supporting data source below since I don't know what it is; it's not in the info above.

                       {  // added
                            "attribute_type_id": "biolink:aggregator_knowledge_source",
                            "value": ["infores:translator-biothings-explorer"],
                            "value_type_id": "biolink:InformationResource"
                        },
                       {  // was "api" object above
                            "attribute_type_id": "biolink:primary_knowledge_source",
                            "value": ["infores:biothings-multiomics-clinical-risk"],
                            "value_url": "https://github.com/NCATSTranslator/Translator-All/wiki/EHR-Risk-KP",
                            "value_type_id": "biolink:InformationResource"
                        },
                       {  // was "provided_by" object above
                            "attribute_type_id": "biolink:supporting_data_source",
                            "value": ["infores:clinical-records-washington-2018"],
                            "value_type_id": "biolink:InformationResource"
                        },
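
The Situation C variant differs from Situation B in two ways: the called API becomes the primary knowledge source (optionally carrying a `value_url`), and the hard-coded source becomes a supporting data source. A rough sketch under those assumptions follows; `build_situation_c_attributes` is a hypothetical name, and the supporting data source ID is the made-up placeholder from the example above, not a real infores entry.

```python
BTE_INFORES = "infores:translator-biothings-explorer"
INFORMATION_RESOURCE = "biolink:InformationResource"


def build_situation_c_attributes(api_infores, api_url=None, supporting_infores=None):
    """Assemble the source-related attributes for a Situation C edge."""
    primary = {
        # The API BTE called is the primary source here, not an aggregator.
        "attribute_type_id": "biolink:primary_knowledge_source",
        "value": [api_infores],
        "value_type_id": INFORMATION_RESOURCE,
    }
    if api_url is not None:
        # Optional link to more info about the KP API.
        primary["value_url"] = api_url

    attributes = [
        {
            # BTE declares itself as an aggregator, as in the other situations.
            "attribute_type_id": "biolink:aggregator_knowledge_source",
            "value": [BTE_INFORES],
            "value_type_id": INFORMATION_RESOURCE,
        },
        primary,
    ]
    if supporting_infores is not None:
        attributes.append(
            {
                "attribute_type_id": "biolink:supporting_data_source",
                "value": [supporting_infores],
                "value_type_id": INFORMATION_RESOURCE,
            }
        )
    return attributes


clinical_risk = build_situation_c_attributes(
    "infores:biothings-multiomics-clinical-risk",
    api_url="https://github.com/NCATSTranslator/Translator-All/wiki/EHR-Risk-KP",
    supporting_infores="infores:clinical-records-washington-2018",  # placeholder ID
)
```

Leaving both optional arguments out yields just the BTE aggregator object and the primary knowledge source, which covers teams that have not yet described a supporting data source.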

Additional reference:

AlexanderPico commented 3 years ago

[Screenshot: Screen Shot 2021-07-06 at 10 59 42 AM]

andrewsu commented 3 years ago

As a very quick recap of today's discussion, @ariutta will take the lead on modifying the structure of the JSON output in the edge attributes, and @colleenXu will take the lead on updating the SmartAPI records for where most of those values are drawn. There undoubtedly will be other details and edge cases to fix later, but let's start with that...

colleenXu commented 3 years ago

I have edited my post above to reflect today's call. @andrewsu and @ariutta, please review at minimum the section under "Situation B" and confirm whether these tasks/decisions correctly reflect today's decisions.

andrewsu commented 3 years ago

Quick note that the ARAX results viewer for Translator now has a nice visualization for the edge provenance info. For example, from https://arax.ncats.io/?source=ARS&id=a7af1e97-eae3-430d-b570-4da271ea56c7

[Image: screenshot of the edge provenance visualization in the ARAX results viewer]

colleenXu commented 2 years ago

@ariutta All APIs with yamls in the registry update here have been updated to address this issue. Note that 3 APIs don't have the "hard-coded" source field anymore; this is fine. They just won't have the corresponding attribute object in their attributes array.

Once the 2 multiomics API yamls have their PRs merged and their SmartAPI registry entries updated, they may also not have the "hard-coded" source field anymore.

colleenXu commented 2 years ago

Note that Provenance situation A may be dealt with once this PR is merged.

I notice that this PR seems to add the BTE provenance object mentioned above and included below:

                       {  // add this
                            "attribute_type_id": "biolink:aggregator_knowledge_source",
                            "value": ["infores:translator-biothings-explorer"],
                            "value_type_id": "biolink:InformationResource"
                        },
colleenXu commented 2 years ago

I'm okay with closing this issue for now, and opening it again to deal with Provenance situation C related issues as they come up...

This is going to happen soon with text mining targeted association, where the plan is to ingest the edge attributes field from records and preserve its structure...