biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0

adjust SmartAPI yaml, x-bte annotation for Biolink/Monarch API migration #774

Closed by colleenXu 6 months ago

colleenXu commented 7 months ago

EDIT: see below for update, actually migrating to v3 https://api-v3.monarchinitiative.org/v3/docs#/

We are using Biolink/Monarch API v1, which will soon be shut down and replaced by v2 http://api-v2.monarchinitiative.org/api.

So we'll want to adjust the SmartAPI yaml using the v2's swagger spec + adjust the x-bte annotation if needed.

What's unclear at the moment:

colleenXu commented 7 months ago

Jackson @tokebe noticed some increased request failures, so I updated the SmartAPI yaml / registration to use the v2 server url (see lab Slack convo). We'll monitor to see if there's any improvement.


Potential queries for directly comparing v1 and v2:

kevinschaper commented 7 months ago

Hi @colleenXu,

We're shutting down api.monarchinitiative.org, and our new production api is served from api-v3.monarchinitiative.org. As a transition to let people know that api.monarchinitiative.org is going away, we're planning to put a message up on that host but continue to make it available on another hostname - we picked api-v2 for that, but unfortunately it does make total sense that it would appear to be the replacement.

The v3 api format is different, but the good news is that we should be better able to address performance problems (within limits). The v3 api is served from the new core graph, which is built on the biolink data model with new ingests.

Side note, I'm actually not seeing any direct gene expression for spinal cord or pancreas in the new graph:

http://api-v3.monarchinitiative.org/v3/api/association?predicate=biolink:expressed_in&object=UBERON:0001264&direct=true

http://api-v3.monarchinitiative.org/v3/api/association?predicate=biolink:expressed_in&object=UBERON:0002240&direct=true

I created an issue for specifying the subject/object taxon, and a second issue to look at our gene expression ingests.

colleenXu commented 7 months ago

[EDITED w/ updated info]

Latest info on the Biolink/monarch migration to v3 https://api-v3.monarchinitiative.org/v3/docs#/:

So the next steps are:

colleenXu commented 7 months ago

Info from Kevin Schaper

On using api-biolink vs api-v2 for now:

Either one is ok, they're just DNS entries for the same VM - I added api-biolink.mi.org because I realized that it made total sense to assume that api-v2 comes after api, and I wanted to avoid that confusion.

On what endpoint to use:

The new API is very much biolink/kgx-centric, I'm guessing you'll just be using the /associations endpoint, which largely uses biolink slots for filtering, with biolink predicates & categories as values, etc.
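For reference, here is a minimal sketch of building a query URL for the association endpoint. The helper name `build_association_url` is made up for illustration; the base URL and the filter names (`predicate`, `object`, `direct`) are copied from the example URLs earlier in this thread, and nothing here is confirmed as the full set of supported filters.

```python
from urllib.parse import urlencode

# Base URL as it appears in the example links in this thread.
ASSOCIATION_URL = "https://api-v3.monarchinitiative.org/v3/api/association"


def build_association_url(**filters: str) -> str:
    """Hypothetical helper: turn keyword filters into an association query URL."""
    # Leave ':' unescaped so CURIEs such as UBERON:0001264 stay readable,
    # matching the hand-written URLs in this thread.
    return f"{ASSOCIATION_URL}?{urlencode(filters, safe=':')}"


url = build_association_url(
    predicate="biolink:expressed_in",
    object="UBERON:0001264",
    direct="true",
)
print(url)
```

The biolink slot names are passed straight through as query parameters, so adding another filter is just another keyword argument.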

colleenXu commented 7 months ago

Notes

On writing SmartAPI yaml

Querying the v3 API

(@kevinschaper and any others working on the Monarch API may find this post interesting)

colleenXu commented 7 months ago

[defunct: using association endpoint instead]

On BTE post-processing

A. Directionality

We query with an input ID, which will match to the subject or object fields in the hit/item depending on the association type (which is fixed in the biolink-model canonical predicate direction).

If the input ID matches the subject, then each item's direction field == outgoing and the output entity ID will be in the object field...

VS If the input ID matches the object, then each item's direction field == incoming and the output entity ID will be in the subject field...
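The directionality rule above could be sketched roughly like this. It assumes each item is a dict with `direction`, `subject` and `object` keys, as described for the entity endpoint; the function name `output_entity_id` is hypothetical and this is not BTE's actual post-processing code.

```python
def output_entity_id(item: dict) -> str:
    """Pick the output entity ID from one hit, based on its direction field."""
    direction = item["direction"]
    if direction == "outgoing":
        # Input ID matched the subject, so the output entity is the object.
        return item["object"]
    if direction == "incoming":
        # Input ID matched the object, so the output entity is the subject.
        return item["subject"]
    raise ValueError(f"unexpected direction: {direction!r}")


# Made-up item shaped like the description above (IDs taken from this thread).
example = {"direction": "incoming", "subject": "MONDO:0020380", "object": "HP:0030084"}
print(output_entity_id(example))
```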

B. Publications

For now: within an item/hit, only keep elements in the publications field array that have the prefix PMID. These will be in the format PMID:24468074.

I've noticed other kinds of elements like OMIM curies and orphanet curies.

Also, there's a publications_links field but we may need special logic to decide when to use the publications_links.id (for PMID) vs publications_links.url (for other kinds of references?).
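A minimal sketch of the "keep only PMIDs" rule, assuming each item is a dict whose `publications` field is a list of curie strings (or missing/null). The function name `keep_pmids` is hypothetical, and the non-PMID values in the example are placeholders rather than real identifiers.

```python
def keep_pmids(item: dict) -> list[str]:
    """Return only the publications that use the PMID prefix."""
    publications = item.get("publications") or []
    return [pub for pub in publications if pub.startswith("PMID:")]


# Placeholder item: only the PMID value is taken from this thread.
example = {"publications": ["PMID:24468074", "OMIM:000000", "orphanet:000000"]}
print(keep_pmids(example))
```

This leaves `publications_links` untouched, since the logic for choosing between `id` and `url` is still undecided.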

colleenXu commented 7 months ago

[defunct: using association endpoint instead]

Queries to test the post-processing checks

input ID matches subject field

When querying Monarch API for Disease autosomal dominant cerebellar ataxia (MONDO:0020380) -> PhenotypicFeature, a lot of edges are returned that connect to a subclass of that disease instead: https://api-v3.monarchinitiative.org/v3/api/entity/MONDO:0020380/biolink:DiseaseToPhenotypicFeatureAssociation?format=json&limit=30&offset=0

When we query this API through BTE (saved response: [example.json](https://github.com/biothings/biothings_explorer/files/14145855/example.json)), we find multiple examples that BTE is handling this correctly...aka it isn't creating incorrect edges between the input ID and the phenotypes that are actually connected to the subclasses:

* results where there's only 1 edge, which has aux-graphs. This means there were no hits/items/records where the input ID was the subject. BTE correctly didn't make any direct edges.
  * (7) Difficulty walking
  * (10) Gait imbalance
  * (11) Horizontal nystagmus
* results where the publications in the direct edge don't match the publications in the aux-graph edges. So BTE correctly didn't merge direct edges and edges to the subclasses.
  * (2) direct edge has [PMID:36516086](https://pubmed.ncbi.nlm.nih.gov/36516086/) but indirect edge for spinocerebellar ataxia 45 (MONDO:0033480) has [PMID:29053796](https://pubmed.ncbi.nlm.nih.gov/29053796/)
  * (24) direct edge has no publications, but indirect edge for spinocerebellar ataxia type 38 (MONDO:0014417) has [PMID:25065913](https://pubmed.ncbi.nlm.nih.gov/25065913/)

input ID matches object field

When querying Monarch API for PhenotypicFeature Clinodactyly (HP:0030084) -> Disease, a lot of edges are returned that connect to a subclass of that pheno instead: https://api-v3.monarchinitiative.org/v3/api/entity/HP:0030084/biolink:DiseaseToPhenotypicFeatureAssociation?format=json&limit=30&offset=0

When we query this API through BTE (saved response: [example-2.json](https://github.com/biothings/biothings_explorer/files/14146177/example-2.json)), we find multiple examples that BTE is handling this correctly...aka it isn't creating incorrect edges between the input ID and the diseases that are actually connected to the subclasses:

* results where there's only 1 edge, which has aux-graphs. This means there were no hits/items/records where the input ID was the object. BTE correctly didn't make any direct edges.
  * (1) trisomy 8p
  * (6) paternal uniparental disomy of chromosome 14
  * (7) rhizomelic limb shortening with dysmorphic features

colleenXu commented 7 months ago

Stuff to follow up on later?

Example of wonky behavior

The edge source info would look like this:

* service-provider trapi says biolink-api is upstream of it
* but there's no entry for biolink-api (and...then monarchinitiative should be upstream?)
* then there's entries for monarchinitiative and its upstream sources (which include the primary). These come from post-processing the raw API response.

```
"sources": [
  {
    "resource_id": "infores:hpo-annotations",
    "resource_role": "primary_knowledge_source"
  },
  {
    "resource_id": "infores:monarchinitiative",
    "resource_role": "aggregator_knowledge_source",
    "upstream_resource_ids": [
      "infores:hpo-annotations"
    ]
  },
  {
    "resource_id": "infores:service-provider-trapi",
    "resource_role": "aggregator_knowledge_source",
    "upstream_resource_ids": [
      "infores:biolink-api"
    ]
  }
]
```
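One way to spot this kind of gap automatically is to check that every ID listed in `upstream_resource_ids` also has its own entry in the sources array. This is only a sketch of that check, using the field names from the TRAPI snippet above; `dangling_upstreams` is a made-up helper name, not something that exists in BTE.

```python
def dangling_upstreams(sources: list[dict]) -> set[str]:
    """Return upstream IDs that are referenced but have no entry of their own."""
    present = {source["resource_id"] for source in sources}
    referenced = {
        upstream
        for source in sources
        for upstream in source.get("upstream_resource_ids", [])
    }
    return referenced - present


# The sources array from the wonky example above.
wonky_sources = [
    {"resource_id": "infores:hpo-annotations", "resource_role": "primary_knowledge_source"},
    {
        "resource_id": "infores:monarchinitiative",
        "resource_role": "aggregator_knowledge_source",
        "upstream_resource_ids": ["infores:hpo-annotations"],
    },
    {
        "resource_id": "infores:service-provider-trapi",
        "resource_role": "aggregator_knowledge_source",
        "upstream_resource_ids": ["infores:biolink-api"],
    },
]
print(dangling_upstreams(wonky_sources))
```

For the example above, the only dangling ID is `infores:biolink-api`.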

MetaEdges

* Chem to Pathway: unclear how helpful this is, since chemicals seem generic (water, ADP, ATP...). [Example](https://api-v3.monarchinitiative.org/v3/api/association?category=biolink:ChemicalToPathwayAssociation&object=Reactome:R-HSA-1369007&direct=true&format=json&limit=500&offset=0). 1 Predicate: `participates_in`
  * their prefix `Reactome` differs from what we use (REACT)...so this may require extra post-processing support (depends on how helpful setting the subject/object namespace is)
  * unclear if other Pathway namespaces exist
* Gene to Pathway: previously chose not to annotate because MyGene also covers this info. Also has prefix issue (see Chem to Pathway above). 1 Predicate: `participates_in`
* Gene to GO BiologicalProcess (989349 items): previously chose not to annotate because MyGene also covers this info. Each kind has multiple possible predicates, lots of diff primary knowledge sources
  * `actively_involved_in` (797927)
  * `acts_upstream_of_or_within` (180729)
  * `acts_upstream_of` (9327)
  * `acts_upstream_of_or_within_positive_effect` (507)
  * `acts_upstream_of_positive_effect` (506)
  * `acts_upstream_of_or_within_negative_effect` (178)
  * `acts_upstream_of_negative_effect` (175)
* Gene to GO MolecularActivity (848151 items): see notes for BiologicalProcess above
  * `enables` (841330)
  * `contributes_to` (6821)
* Gene to GO CellularComponent (745837 items): see notes for BiologicalProcess above
  * `located_in` (502225)
  * `active_in` (145515)
  * `part_of` (94049)
  * `colocalizes_with` (4048)
* [Gene to Gene ortholog](https://api-v3.monarchinitiative.org/v3/api/association?category=biolink%3AGeneToGeneHomologyAssociation&subject=HGNC%3A9508&direct=false&format=json&limit=10&offset=0): previously chose not to annotate because MyGene also covers this info. 1 predicate (`orthologous_to`, 551383 hits). Seems to be 1 primary knowledge source (panther)

colleenXu commented 7 months ago

Jackson @tokebe:

I changed the x-bte annotation to use the associations endpoint:

So now the post-processing is different, but hopefully simpler...

STILL NEED:

Publication info from old comment

#### B. Publications

For now: within an item/hit, only keep elements in the `publications` field array that have the prefix `PMID`. These will be in the format `PMID:24468074`.

I've noticed other kinds of elements like:

* `OMIM` curies
* `orphanet` curies

Also, there's a `publications_links` field but we may need special logic to decide when to use the `publications_links.id` (for PMID) vs `publications_links.url` (for other kinds of references?).

DON'T NEED:

colleenXu commented 7 months ago

[EDITED to add info on what we learned / addressed while working on the API post-processing]

Update

The basic set of updates is done:

Working on

Jackson @tokebe discussed the following, and they're going to try it out: post-processing the primary_knowledge_source and aggregator_knowledge_source response fields to create a new, custom field formatted as TRAPI edge sources (an array of objects). BTE can then ingest it with the same response-mapping key, trapi_sources, that the Multiomics/Text-Mining APIs use.

Example

first hit in https://api-v3.monarchinitiative.org/v3/api/association?category=biolink:CausalGeneToDiseaseAssociation&subject=HGNC:11138&predicate=biolink:causes&direct=true&format=json&limit=10&offset=0

A. `"primary_knowledge_source": "infores:omim"` (value of this field is always a string: infores curie)

➡️ element for TRAPI sources array

```
{
  "resource_id": "infores:omim",
  "resource_role": "primary_knowledge_source"
}
```

B. `"aggregator_knowledge_source": ["infores:monarchinitiative", "infores:medgen"]`. Value of this field is always an array of string infores-curies, in order from furthest to closest to the primary source. So medgen has omim (the primary source) as its upstream.

➡️ >=1 elements for TRAPI sources array

```
{
  "resource_id": "infores:medgen",
  "resource_role": "aggregator_knowledge_source",
  "upstream_resource_ids": ["infores:omim"]
},
{
  "resource_id": "infores:monarchinitiative",
  "resource_role": "aggregator_knowledge_source",
  "upstream_resource_ids": ["infores:medgen"]
},
```

Putting this together: create a new, custom field with the TRAPI sources array

```
{
  "sources": [
    {
      "resource_id": "infores:omim",
      "resource_role": "primary_knowledge_source"
    },
    {
      "resource_id": "infores:medgen",
      "resource_role": "aggregator_knowledge_source",
      "upstream_resource_ids": ["infores:omim"]
    },
    {
      "resource_id": "infores:monarchinitiative",
      "resource_role": "aggregator_knowledge_source",
      "upstream_resource_ids": ["infores:medgen"]
    }
  ]
}
```
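The A/B mapping above could be sketched in Python as follows. This is only an illustration of the described logic, not the actual api-response-transform code: the function name `to_trapi_sources` is made up, and it assumes the aggregator array is always ordered from furthest to closest to the primary source, as stated above.

```python
def to_trapi_sources(hit: dict) -> list[dict]:
    """Build a TRAPI-style sources array from one association hit."""
    primary = hit["primary_knowledge_source"]
    sources = [{"resource_id": primary, "resource_role": "primary_knowledge_source"}]

    # Aggregators are listed furthest-to-closest to the primary source,
    # so walk them in reverse and chain each one to the previous resource.
    upstream = primary
    for aggregator in reversed(hit.get("aggregator_knowledge_source") or []):
        sources.append(
            {
                "resource_id": aggregator,
                "resource_role": "aggregator_knowledge_source",
                "upstream_resource_ids": [upstream],
            }
        )
        upstream = aggregator
    return sources


# Field values from the example hit above.
hit = {
    "primary_knowledge_source": "infores:omim",
    "aggregator_knowledge_source": ["infores:monarchinitiative", "infores:medgen"],
}
for source in to_trapi_sources(hit):
    print(source)
```

For this hit the result is the omim, medgen, monarchinitiative chain shown in the combined example.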

implementation notes

Example showing this

Send the following TRAPI query to Monarch API only, through BTE:

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "categories": ["biolink:Gene"],
          "ids": ["HGNC:7551"]
        },
        "n1": {
          "categories": ["biolink:Gene"]
        }
      },
      "edges": {
        "e01": {
          "subject": "n0",
          "object": "n1"
        }
      }
    }
  }
}
```

BTE should make the following requests:

* https://api-v3.monarchinitiative.org/v3/api/association?subject=HGNC:7551&category=biolink:PairwiseGeneToGeneInteraction&subject_namespace=HGNC&predicate=biolink:interacts_with&object_namespace=HGNC&direct=true&format=json&limit=500
  * retrieves 2 records showing relationships with TRIM63 (HGNC:16007 / NCBIGene:84676): 1 from biogrid (w/ PMID:19850579) and 1 from string
* https://api-v3.monarchinitiative.org/v3/api/association?object=HGNC:7551&category=biolink:PairwiseGeneToGeneInteraction&subject_namespace=HGNC&predicate=biolink:interacts_with&object_namespace=HGNC&direct=true&format=json&limit=500
  * retrieves 2 other records showing relationships with TRIM63: 1 from biogrid (w/ diff PMID:18157088) and 1 from string

Then bundle these into two Edges: 1 for biogrid and 1 for string

```
"313161c093025842c0f60162954b3340": {
  "predicate": "biolink:interacts_with",
  "subject": "NCBIGene:4607",
  "object": "NCBIGene:84676",
  "attributes": [
    {
      "attribute_type_id": "biolink:publications",
      "value": [
        "PMID:19850579",
        "PMID:18157088"
      ],
      "value_type_id": "linkml:Uriorcurie"
    }
  ],
  "sources": [
    {
      "resource_id": "infores:biogrid",
      "resource_role": "primary_knowledge_source"
    },
    {
      "resource_id": "infores:monarchinitiative",
      "resource_role": "aggregator_knowledge_source",
      "upstream_resource_ids": [
        "infores:biogrid"
      ]
    },
    {
      "resource_id": "infores:service-provider-trapi",
      "resource_role": "aggregator_knowledge_source",
      "upstream_resource_ids": [
        "infores:monarchinitiative"
      ]
    }
  ]
},
"7e8fb0a590bff1f4fc71564d36bd2bc5": {
  "predicate": "biolink:interacts_with",
  "subject": "NCBIGene:4607",
  "object": "NCBIGene:84676",
  "attributes": [],
  "sources": [
    {
      "resource_id": "infores:string",
      "resource_role": "primary_knowledge_source"
    },
    {
      "resource_id": "infores:monarchinitiative",
      "resource_role": "aggregator_knowledge_source",
      "upstream_resource_ids": [
        "infores:string"
      ]
    },
    {
      "resource_id": "infores:service-provider-trapi",
      "resource_role": "aggregator_knowledge_source",
      "upstream_resource_ids": [
        "infores:monarchinitiative"
      ]
    }
  ]
},
```

A similar example is TTN (HGNC:12403 / NCBIGene:7273)

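To make the bundling step in the example above concrete, here is a rough sketch of one way the four records could collapse into two edges. It is NOT how BTE actually hashes or merges edges; it simply assumes that `interacts_with` can be treated as symmetric, so records are grouped by the unordered gene pair, the predicate, and the primary knowledge source, and their publications are merged. The function name `bundle_records` and the simplified record shape are assumptions for illustration.

```python
def bundle_records(records: list[dict]) -> dict[tuple, dict]:
    """Group records into edges and merge their publications (illustrative only)."""
    edges: dict[tuple, dict] = {}
    for record in records:
        # Assumption: direction is ignored because the predicate is symmetric.
        pair = frozenset((record["subject"], record["object"]))
        key = (pair, record["predicate"], record["primary_knowledge_source"])
        edge = edges.setdefault(key, {"publications": []})
        for pub in record.get("publications") or []:
            if pub not in edge["publications"]:
                edge["publications"].append(pub)
    return edges


# Simplified versions of the four records described in the example above.
records = [
    {"subject": "HGNC:7551", "object": "HGNC:16007", "predicate": "biolink:interacts_with",
     "primary_knowledge_source": "infores:biogrid", "publications": ["PMID:19850579"]},
    {"subject": "HGNC:7551", "object": "HGNC:16007", "predicate": "biolink:interacts_with",
     "primary_knowledge_source": "infores:string", "publications": []},
    {"subject": "HGNC:16007", "object": "HGNC:7551", "predicate": "biolink:interacts_with",
     "primary_knowledge_source": "infores:biogrid", "publications": ["PMID:18157088"]},
    {"subject": "HGNC:16007", "object": "HGNC:7551", "predicate": "biolink:interacts_with",
     "primary_knowledge_source": "infores:string", "publications": []},
]
edges = bundle_records(records)
print(len(edges))
```

Keeping the primary knowledge source in the key is what keeps the biogrid and string records on separate edges.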
colleenXu commented 7 months ago

Knowledge source infores IDs used by this resource

From Kevin Schaper (Translator Slack link)

current possible knowledge source combos on edges

| aggregator knowledge source | primary knowledge source |
|---|---|
| infores:monarchinitiative | infores:agbase |
| infores:monarchinitiative | infores:alzheimers-university-of-toronto |
| infores:monarchinitiative | infores:aruk-ucl |
| infores:monarchinitiative | infores:bgee |
| infores:monarchinitiative | infores:bhf-ucl |
| infores:monarchinitiative | infores:biogrid |
| infores:monarchinitiative | infores:cacao |
| infores:monarchinitiative | infores:cafa |
| infores:monarchinitiative | infores:complexportal |
| infores:monarchinitiative | infores:dflat |
| infores:monarchinitiative | infores:dibu |
| infores:monarchinitiative | infores:dictybase |
| infores:monarchinitiative | infores:disprot |
| infores:monarchinitiative | infores:ensembl |
| infores:monarchinitiative | infores:flybase |
| infores:monarchinitiative | infores:gdb |
| infores:monarchinitiative | infores:go-central |
| infores:monarchinitiative | infores:go-noctua |
| infores:monarchinitiative | infores:goc |
| infores:monarchinitiative | infores:goc-owl |
| infores:monarchinitiative | infores:hgnc |
| infores:monarchinitiative | infores:hgnc-ucl |
| infores:monarchinitiative | infores:hpa |
| infores:monarchinitiative | infores:hpo-annotations |
| infores:monarchinitiative | infores:intact |
| infores:monarchinitiative | infores:interpro |
| infores:monarchinitiative | infores:lifedb |
| infores:monarchinitiative | infores:mgi |
| infores:monarchinitiative | infores:mtbbase |
| infores:monarchinitiative | infores:ntnu-sb |
| infores:monarchinitiative | infores:orphanet |
| infores:monarchinitiative | infores:panther |
| infores:monarchinitiative | infores:parkinsonsuk-ucl |
| infores:monarchinitiative | infores:phi-base |
| infores:monarchinitiative | infores:pinc |
| infores:monarchinitiative | infores:pombase |
| infores:monarchinitiative | infores:reactome |
| infores:monarchinitiative | infores:rgd |
| infores:monarchinitiative | infores:rhea |
| infores:monarchinitiative | infores:rnacentral |
| infores:monarchinitiative | infores:roslin-institute |
| infores:monarchinitiative | infores:sgd |
| infores:monarchinitiative | infores:string |
| infores:monarchinitiative | infores:syngo |
| infores:monarchinitiative | infores:syngo-ucl |
| infores:monarchinitiative | infores:syscilia-ccnet |
| infores:monarchinitiative | infores:uniprot |
| infores:monarchinitiative | infores:wb |
| infores:monarchinitiative | infores:xenbase |
| infores:monarchinitiative | infores:yubiolab |
| infores:monarchinitiative | infores:zfin |
| infores:monarchinitiative, infores:alliancegenome | infores:flybase |
| infores:monarchinitiative, infores:alliancegenome | infores:mgi |
| infores:monarchinitiative, infores:alliancegenome | infores:rgd |
| infores:monarchinitiative, infores:alliancegenome | infores:sgd |
| infores:monarchinitiative, infores:alliancegenome | infores:wormbase |
| infores:monarchinitiative, infores:alliancegenome | infores:zfin |
| infores:monarchinitiative, infores:medgen | infores:omim |
| infores:phenio | infores:HsapDv |
| infores:phenio | infores:bfo |
| infores:phenio | infores:chebi |
| infores:phenio | infores:cl |
| infores:phenio | infores:eco |
| infores:phenio | infores:emapa |
| infores:phenio | infores:envo |
| infores:phenio | infores:fao |
| infores:phenio | infores:fbbt |
| infores:phenio | infores:fma |
| infores:phenio | infores:fypo |
| infores:phenio | infores:go |
| infores:phenio | infores:hp |
| infores:phenio | infores:iao |
| infores:phenio | infores:ma |
| infores:phenio | infores:mondo |
| infores:phenio | infores:mp |
| infores:phenio | infores:mpath |
| infores:phenio | infores:nbo |
| infores:phenio | infores:ncbitaxon |
| infores:phenio | infores:obi |
| infores:phenio | infores:ogms |
| infores:phenio | infores:pato |
| infores:phenio | infores:po |
| infores:phenio | infores:pr |
| infores:phenio | infores:ro |
| infores:phenio | infores:so |
| infores:phenio | infores:uberon |
| infores:phenio | infores:upheno |
| infores:phenio | infores:wbbt |
| infores:phenio | infores:wbphenotype |
| infores:phenio | infores:xpo |
| infores:phenio | infores:zfa |
| infores:phenio | infores:zp |

colleenXu commented 6 months ago

@tokebe

This is now ready for deployment!

PRs for push to Prod:

Once these are fully deployed to Prod, we can update the registered yaml (PR) and start the process of removing the override...

colleenXu commented 6 months ago

Notes

Stuff to follow up on

Short-term

EDIT, DONE: Sierra and Kevin confirmed 2/28 that it's fine to change infores, and we could deprecate biolink-api infores...


Longer-term?

EDIT: moving to separate issues


colleenXu commented 6 months ago

I've confirmed that the changes have been deployed to BTE Prod. So I've:

How I tested

We can tell that BTE is using the new v3 Monarch API by doing a test query for the `gene-disease-contributesTo` operation - which didn't exist in the old API. If we have edges with the `contributes_to` predicate and the enhanced sources info (omim <- medgen <- monarchinitiative <- service provider), then we know that BTE is using the new SmartAPI yaml and api-response-transform code.

POST to Monarch-API-only, thru BTE: https://bte.transltr.io/v1/smartapi/d22b657426375a5295e7da8a303b9893/query

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "categories": ["biolink:Gene"],
          "ids": ["HGNC:6294", "HGNC:9652"]
        },
        "n1": {
          "categories": ["biolink:Disease"]
        }
      },
      "edges": {
        "e01": {
          "subject": "n0",
          "object": "n1",
          "predicates": ["biolink:contributes_to"]
        }
      }
    }
  }
}
```

Should get this edge in the response, showing the `contributes_to` predicate and the enhanced sources info (omim <- medgen <- monarchinitiative <- service provider)

```
"1ff8a4f5ade3639ebd6b951ac8984627": {
  "predicate": "biolink:contributes_to",
  "subject": "NCBIGene:3784",
  "object": "MONDO:0100316",
  "attributes": [],
  "sources": [
    {
      "resource_id": "infores:omim",
      "resource_role": "primary_knowledge_source"
    },
    {
      "resource_id": "infores:medgen",
      "resource_role": "aggregator_knowledge_source",
      "upstream_resource_ids": [
        "infores:omim"
      ]
    },
    {
      "resource_id": "infores:monarchinitiative",
      "resource_role": "aggregator_knowledge_source",
      "upstream_resource_ids": [
        "infores:medgen"
      ]
    },
    {
      "resource_id": "infores:service-provider-trapi",
      "resource_role": "aggregator_knowledge_source",
      "upstream_resource_ids": [
        "infores:monarchinitiative"
      ]
    }
  ]
}
```


BUT before closing this, I'd like to discuss "stuff to follow up on" with Jackson @tokebe first...(open new issues?)

colleenXu commented 6 months ago

Discussed the "stuff to follow up on" with Jackson and Sierra/Kevin (see edited post). I'll open new issues, but we're ready to close this one