biothings / pending.api

Set of standalone APIs built with the BioThings SDK for the Translator Project
https://biothings.ncats.io
Apache License 2.0

Data source: repoDB #77

Closed erikyao closed 6 months ago

erikyao commented 2 years ago

Requirement originally discussed in: smartAPI - Issue#85

Plugin repo: https://github.com/erikyao/repoDB

Bug description: For the reason explained in this comment, the parser previously (back in 2020) relied on MyChem to query drugbank.id => drugbank.name. However, since 2021, MyChem no longer provides drugbank data (see https://docs.mychem.info/en/latest/doc/data_source.html#drugbank).

Solution: find another API for drugbank.id => drug_name queries, or pre-process the data file full.csv

rjawesome commented 2 years ago

This CSV Drugbank Vocabulary seems to be open source and contains drugbank id to name data. It also contains names for some of the IDs you were not able to find in mychem (e.g., DB12430).
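One way to pre-process the lookup locally is to load the vocabulary CSV into a dictionary once and resolve names from it. The sketch below is only an illustration, not the code in the plugin repo: it assumes the file has `DrugBank ID` and `Common name` columns, and the function name `load_drugbank_names` is hypothetical.

```python
import csv
import io


def load_drugbank_names(csv_file, id_col="DrugBank ID", name_col="Common name"):
    """Build a {drugbank_id: drug_name} dict from an open CSV file object.

    The column names are assumptions about the vocabulary file's header;
    adjust them if the actual header differs.
    """
    names = {}
    for row in csv.DictReader(csv_file):
        drug_id = (row.get(id_col) or "").strip()
        name = (row.get(name_col) or "").strip()
        # Skip rows that are missing either the ID or the name.
        if drug_id and name:
            names[drug_id] = name
    return names


# Tiny in-memory stand-in for the real vocabulary file, using two
# drugs that are mentioned in this thread.
sample = io.StringIO(
    "DrugBank ID,Common name\n"
    "DB00002,Cetuximab\n"
    "DB00073,Rituximab\n"
)
drugbank_names = load_drugbank_names(sample)
```

With the real file, the same function could be called on `open(path, newline="", encoding="utf-8")`, and IDs missing from the dictionary can be handled with `drugbank_names.get(drug_id)`.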

erikyao commented 2 years ago

Thank you @rjawesome for the information! That CSV would definitely help!

rjawesome commented 2 years ago

I can also make a PR for this on the parser if you want...

erikyao commented 2 years ago

Sure, @rjawesome, I appreciate your help!

rjawesome commented 2 years ago

See this pr

erikyao commented 2 years ago

Thank you, @rjawesome! Yep I realized that injective relation is enough for "one-to-one"...

colleenXu commented 2 years ago

Don't know if this needs SmartAPI annotation...

erikyao commented 2 years ago

> Don't know if this needs SmartAPI annotation...

Hi @colleenXu, this is a bug fix to the old repoDB API. It should have been annotated before.

colleenXu commented 2 years ago

If it does, it's likely very old. It's not incorporated into BTE at the moment.

andrewsu commented 2 years ago

Let's use this as an opportunity to add a SmartAPI annotation for BTE integration. I'm going to reopen the ticket, unassign @erikyao and @rjawesome, and add it to the "Needs SmartAPI / BTE annotation" section of our project tracker...

andrewsu commented 1 year ago

example record https://biothings.ncats.io/repodb/chemical/DB14707 :

{
  "_id": "DB14707",
  "_version": 1,
  "repodb": {
    "drugbank": "DB14707",
    "indications": [
      {
        "NCT": "NA",
        "detailed_status": "NA",
        "name": "Squamous cell carcinoma",
        "phase": "NA",
        "status": "Approved",
        "umls": "C0007137"
      }
    ],
    "name": "Cemiplimab"
  }
}

colleenXu commented 10 months ago

Related infores stuff is ready:

colleenXu commented 10 months ago

Here's the SmartAPI yaml w/ x-bte annotation for BioThings repoDB. This yaml is registered in SmartAPI Registry.

I haven't made a PR to add this to BTE's regular use (for the config file, API_LIST variable): I'm waiting until we're closer to the next release cycle to make a PR with all the KPs we want to add.

Example query

Send a POST request to the API-specific endpoint (BioThings repoDB only), like `http://localhost:3000/v1/smartapi/1138c3297e8e403b6ac10cff5609b319/query`. This works even when the KP isn't included in BTE's config.

Put this in the request body. It's querying with the drug Cetuximab (aka `DRUGBANK:DB00002`):

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": ["DRUGBANK:DB00002"],
          "categories": ["biolink:SmallMolecule"]
        },
        "n1": {
          "categories": ["biolink:Disease"]
        }
      },
      "edges": {
        "e01": {
          "subject": "n0",
          "object": "n1",
          "predicates": ["biolink:treats"]
        }
      }
    }
  }
}
```

You should get a response with this edge (from this [record in the BioThings API](https://biothings.ncats.io/repodb/query?q=repodb.drugbank:DB00002), based on this [operation's example](https://github.com/NCATS-Tangerine/translator-api-registry/blob/d0ffea982bf949c67f87c72790d3f52252ee449d/repodb/smartapi.yaml#L615)):

* subject: Cetuximab (primary ID in SRI NodeNorm `PUBCHEM.COMPOUND:14122979`, DRUGBANK ID in the BioThings API is `DB00002`)
* object: Malignant tumor of colon (primary ID in SRI NodeNorm `MONDO:0021063`, UMLS ID in BioThings API is `C0007102`)

```
  "c50bcf1f5d6c4c55c44535cc3e9c49d2": {
    "predicate": "biolink:treats",
    "subject": "PUBCHEM.COMPOUND:14122979",
    "object": "MONDO:0021063",
    "attributes": [],
    "sources": [
      {
        "resource_id": "infores:repodb",
        "resource_role": "primary_knowledge_source"
      },
      {
        "resource_id": "infores:biothings-repodb",
        "resource_role": "aggregator_knowledge_source",
        "upstream_resource_ids": [
          "infores:repodb"
        ]
      },
      {
        "resource_id": "infores:service-provider-trapi",
        "resource_role": "aggregator_knowledge_source",
        "upstream_resource_ids": [
          "infores:biothings-repodb"
        ]
      }
    ]
  }
}
```
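The same request can also be sent from a script. This is a minimal Python sketch using only the standard library, assuming a BTE instance is running locally at the URL given above; the helper names `build_treats_query` and `post_query` are hypothetical.

```python
import json
import urllib.request


def build_treats_query(drug_curie):
    """Return a one-hop TRAPI query graph: drug -[treats]-> disease."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {
                        "ids": [drug_curie],
                        "categories": ["biolink:SmallMolecule"],
                    },
                    "n1": {"categories": ["biolink:Disease"]},
                },
                "edges": {
                    "e01": {
                        "subject": "n0",
                        "object": "n1",
                        "predicates": ["biolink:treats"],
                    }
                },
            }
        }
    }


def post_query(url, query, timeout=60):
    """POST the query as JSON and return the parsed response body."""
    request = urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


query = build_treats_query("DRUGBANK:DB00002")
# Needs a locally running BTE instance, so the call is left commented out:
# result = post_query(
#     "http://localhost:3000/v1/smartapi/1138c3297e8e403b6ac10cff5609b319/query",
#     query,
# )
```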

colleenXu commented 10 months ago

However, I have some observations / possible next steps:

  1. Does this API use the latest data from repoDB? I notice a data update (v2.1 2023-06-15) in the version history section of the repodb website
  2. What does the field value "NA" mean? If it basically means "not available/applicable", I'd find it helpful if the parser removed the fields with "NA" values. That way BTE would be able to use this field without post-processing to remove "NA".
"NA" is a common value for these fields:

* [repodb.indications.NCT](https://biothings.ncats.io/repodb/query?q=repodb.indications.NCT:NA), but the non-"NA" info could be useful publication ref info for BTE
* [repodb.indications.phase](https://biothings.ncats.io/repodb/query?q=repodb.indications.phase:NA): BTE may need to use this info in the future as part of the treats-refactor
* [repodb.indications.detailed_status](https://biothings.ncats.io/repodb/query?q=repodb.indications.detailed_status:NA)

  3. I think changing the parser to create association-centric data (unique combos of drug-disease-status) rather than drug-centric (current) would be helpful, particularly for the upcoming "treats" refactor. Currently, there are problems retrieving info when querying with the disease ID ("reverse operations", related to https://github.com/biothings/biothings_explorer/issues/316 and https://github.com/biothings/biothings_explorer/issues/727#issuecomment-1784476295).
Mockup of what association-centric data may look like

[Right now, there's 1 record for the drug Rituximab](https://biothings.ncats.io/repodb/query?q=repodb.drugbank:DB00073). It'd be transformed into multiple records, 1 for each combo of rituximab + unique disease + unique status.

So for rituximab + "Lymphoma, Non-Hodgkin" `C0024305`, there'd be 3 records (3 diff statuses). I didn't include all the info for the "Terminated" record since there's currently 18 objects/clinical-trials in the data.

```
[
  {
    "drug_drugbank_id": "DB00073",
    "drug_name": "rituximab",
    "indication_umls": "C0024305",
    "indication_name": "Lymphoma, Non-Hodgkin",
    "status": "Approved"
  },
  {
    "drug_drugbank_id": "DB00073",
    "drug_name": "rituximab",
    "indication_umls": "C0024305",
    "indication_name": "Lymphoma, Non-Hodgkin",
    "status": "Terminated",
    "clinical_trial_info": [
      {
        "NCT": "NCT00057343",
        "phase": "Phase 3"
      },
      {
        "NCT": "NCT00057447",
        "detailed_status": "administrative reasons",
        "phase": "Phase 1/Phase 2"
      },
      ....
    ]
  },
  {
    "drug_drugbank_id": "DB00073",
    "drug_name": "rituximab",
    "indication_umls": "C0024305",
    "indication_name": "Lymphoma, Non-Hodgkin",
    "status": "Withdrawn",
    "clinical_trial_info": [
      {
        "NCT": "NCT02408042",
        "phase": "Phase 1/Phase 2"
      }
    ]
  }
]
```
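A parser-side transform along these lines could produce the association-centric records and drop "NA" values in the same pass. This is a hedged sketch, not the repoDB plugin's actual parser: the function names `drop_na` and `to_association_records` are hypothetical, the output field names follow the mockup above, and the sample input is an abbreviated rituximab record that is assumed to have the same shape as the example record earlier in the thread.

```python
def drop_na(obj, na="NA"):
    """Return a copy of a flat dict without the fields whose value is "NA"."""
    return {key: value for key, value in obj.items() if value != na}


def to_association_records(doc):
    """Split one drug-centric doc into one record per (disease, status) combo."""
    repodb = doc["repodb"]
    grouped = {}
    for indication in repodb.get("indications", []):
        ind = drop_na(indication)
        key = (ind.get("umls"), ind.get("status"))
        if key not in grouped:
            grouped[key] = {
                "drug_drugbank_id": repodb["drugbank"],
                "drug_name": repodb["name"],
                "indication_umls": ind.get("umls"),
                "indication_name": ind.get("name"),
                "status": ind.get("status"),
            }
        # Collect whatever trial-level fields survived the "NA" removal.
        trial = {k: ind[k] for k in ("NCT", "detailed_status", "phase") if k in ind}
        if trial:
            grouped[key].setdefault("clinical_trial_info", []).append(trial)
    return list(grouped.values())


# Abbreviated input, assumed shape; the real record has many more trials.
sample_doc = {
    "_id": "DB00073",
    "repodb": {
        "drugbank": "DB00073",
        "name": "rituximab",
        "indications": [
            {"NCT": "NA", "detailed_status": "NA", "name": "Lymphoma, Non-Hodgkin",
             "phase": "NA", "status": "Approved", "umls": "C0024305"},
            {"NCT": "NCT00057343", "detailed_status": "NA",
             "name": "Lymphoma, Non-Hodgkin", "phase": "Phase 3",
             "status": "Terminated", "umls": "C0024305"},
            {"NCT": "NCT00057447", "detailed_status": "administrative reasons",
             "name": "Lymphoma, Non-Hodgkin", "phase": "Phase 1/Phase 2",
             "status": "Terminated", "umls": "C0024305"},
        ],
    },
}
records = to_association_records(sample_doc)
```

Grouping on the `(umls, status)` pair is what makes each output record a single association, so a query on disease ID plus status would match at the record level rather than inside a nested list.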

Problem 1: When I query with a disease ID and a specific indication status, I can't retrieve only the hits where both constraints are true in the same nested object.

Related to https://github.com/biothings/biothings_explorer/issues/727#issuecomment-1784476295

For example, I can try querying for the indication `C0032797` (Postpartum Hemorrhage) and I want only drugs where the indication status isn't approved:

```
curl --location --globoff 'https://biothings.ncats.io/repodb/query?size=1000&fields=repodb.indications%2Crepodb.drugbank%2Crepodb.name&jmespath=repodb.indications%7C[%3F(status%3D%3D%60Terminated%60%7C%7Cstatus%3D%3D%60Withdrawn%60%7C%7Cstatus%3D%3D%60Suspended%60)]' \
--header 'Content-Type: application/json' \
--data '{
    "q": "C0032797",
    "scopes":"repodb.indications.umls"
}'
```

I'll get hits like this in the response, which show that the indication matched but the status didn't. At the moment, we don't have BTE post-processing to recognize and remove hits like this: BTE will use them for answer edges even though they didn't actually match what I wanted.

```
  {
    "query": "C0032797",
    "_id": "DB00353",
    "_score": 8.514726,
    "repodb": {
      "drugbank": "DB00353",
      "indications": [],
      "name": "Methylergometrine"
    }
  },
  {
    "query": "C0032797",
    "_id": "DB00429",
    "_score": 8.514726,
    "repodb": {
      "drugbank": "DB00429",
      "indications": [],
      "name": "Carboprost tromethamine"
    }
  },
```
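The missing post-processing could look roughly like the sketch below: keep a hit only if at least one nested indication object satisfies both constraints, and trim the list to those objects. This is an illustration, not existing BTE code; the function name `keep_matching_hits` is hypothetical, and the third sample hit (`DB_EXAMPLE`) is made up to show a match.

```python
def keep_matching_hits(hits, umls_id, statuses):
    """Keep hits where one nested indication matches BOTH the disease and a status."""
    kept = []
    for hit in hits:
        indications = hit.get("repodb", {}).get("indications", [])
        matching = [
            ind for ind in indications
            if ind.get("umls") == umls_id and ind.get("status") in statuses
        ]
        if matching:
            # Copy the hit so the caller's data is not modified in place.
            trimmed = dict(hit)
            trimmed["repodb"] = dict(hit["repodb"], indications=matching)
            kept.append(trimmed)
    return kept


hits = [
    # The two hits from the response above: the status filter left nothing.
    {"query": "C0032797", "_id": "DB00353",
     "repodb": {"drugbank": "DB00353", "indications": [], "name": "Methylergometrine"}},
    {"query": "C0032797", "_id": "DB00429",
     "repodb": {"drugbank": "DB00429", "indications": [], "name": "Carboprost tromethamine"}},
    # Hypothetical hit, invented for illustration only.
    {"query": "C0032797", "_id": "DB_EXAMPLE",
     "repodb": {"drugbank": "DB_EXAMPLE", "name": "example drug", "indications": [
         {"umls": "C0032797", "status": "Terminated"},
         {"umls": "C0000000", "status": "Withdrawn"},
     ]}},
]
filtered = keep_matching_hits(hits, "C0032797", {"Terminated", "Withdrawn", "Suspended"})
```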

Problem 2: When I query with a disease ID, I don't get back info for only that disease ID. So I have to exclude useful info like the disease-name field from the response-mapping

Related to https://github.com/biothings/biothings_explorer/issues/316#issuecomment-939232795

I can take the `rev-disease-drug` operation and try to include the disease-name field:

* add `repodb.indications.name` to the parameters.field section
* add `input_name: repodb.indications.name` to the drug response-mapping

And then test the operation with a local BTE override and a disease ID that SRI NodeNorm doesn't recognize ([C0334634](https://biothings.ncats.io/repodb/query?q=repodb.indications.umls:C0334634), `Malignant lymphoma, lymphocytic, intermediate differentiation, diffuse` in BioThings repodb):

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": ["UMLS:C0334634"],
          "categories": ["biolink:Disease"]
        },
        "n1": {
          "categories": ["biolink:SmallMolecule"]
        }
      },
      "edges": {
        "e01": {
          "subject": "n0",
          "object": "n1",
          "predicates": ["biolink:related_to"]
        }
      }
    }
  }
}
```

In the response [repodbC0334634.txt](https://github.com/biothings/pending.api/files/13749298/repodbC0334634.txt), BTE has given that ID the wrong label "Precocious Puberty"...probably because the [subquery](https://biothings.ncats.io/repodb/query?q=repodb.indications.umls:C0334634)'s first hit has `C0034013` "Precocious Puberty" in the first nested object, rather than the disease I asked for.

```
  "nodes": {
    "UMLS:C0334634": {
      "categories": [
        "biolink:Disease"
      ],
      "name": "Precocious Puberty",
      "attributes": [
        {
          "attribute_type_id": "biolink:xref",
          "value": [
            "UMLS:C0334634"
          ]
        },
        {
          "attribute_type_id": "biolink:synonym",
          "value": [
            "UMLS:C0334634"
          ]
        }
      ]
    },
```
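To make the mislabeling concrete, the sketch below contrasts taking the name from the first nested object with looking up the nested object whose UMLS ID matches the query. It is only an illustration of the lookup, not how BTE's response-mapping works internally; `indication_name_for` is a hypothetical helper and the drug fields in the sample hit are placeholders.

```python
def indication_name_for(hit, umls_id):
    """Return the name of the nested indication whose UMLS ID matches, or None."""
    for indication in hit.get("repodb", {}).get("indications", []):
        if indication.get("umls") == umls_id:
            return indication.get("name")
    return None


# Shape assumed from the thread: the queried disease is not the first
# nested object. The drug ID and name are placeholders.
hit = {
    "_id": "DB_EXAMPLE",
    "repodb": {
        "drugbank": "DB_EXAMPLE",
        "name": "example drug",
        "indications": [
            {"umls": "C0034013", "name": "Precocious Puberty"},
            {"umls": "C0334634",
             "name": "Malignant lymphoma, lymphocytic, intermediate differentiation, diffuse"},
        ],
    },
}

# Taking the first nested object reproduces the wrong label.
first_name = hit["repodb"]["indications"][0]["name"]
# Matching on the queried ID picks the right one.
matched_name = indication_name_for(hit, "C0334634")
```

Association-centric records would make this lookup unnecessary, since each record would carry exactly one disease ID and name.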

colleenXu commented 9 months ago

After discussion with Andrew yesterday, I've opened an issue for the next steps.

However, it should be fine if these next steps aren't done by the time we add this API to BTE's regular use - we can still go forward with deploying.

colleenXu commented 8 months ago

Will need to update the x-bte annotation once https://github.com/biothings/pending.api/issues/169 is addressed for all instances (ncats.io and all ITRB instances at transltr.io).

Can create separate operations depending on status, so we can map it to different predicates during the treats refactor/biolink-model update

colleenXu commented 8 months ago

repoDB has been updated on all instances (under the hood, the internal routing is now to biothings.transltr.io - ITRB Prod instance...not biothings.ncats.io).

So I'm moving this issue back to a to-do, to update the x-bte annotation.

colleenXu commented 7 months ago

Updated the SmartAPI yaml w/ x-bte annotation to match the parser/API updates - master branch only uses the "approved" treatment operations https://github.com/NCATS-Tangerine/translator-api-registry/commit/fa1f36e74d03ae4a96abee0e8ddda0b6b7b58b51

Also updated the SmartAPI registration. So it's ready to add to BTE's regular use (for the config file, API_LIST variable) - so I added it to the PR linked above.

We'll try to get it into Translator's Lobster release (dev/CI -> Test this Friday).


There's another version in biolink-4-update https://github.com/biothings/biothings_explorer/issues/788 with "clinical trial only" operations available: https://github.com/NCATS-Tangerine/translator-api-registry/commit/50634e74980cffc18bc5e0e43cd5d091ee497baa. I've adjusted the PR https://github.com/biothings/bte-server/pull/19 to add an override to this.

tokebe commented 6 months ago

@colleenXu Should this issue be closed?

colleenXu commented 6 months ago

Yep, confirmed that it's live by posting an example query to https://bte.transltr.io/v1/team/Service Provider/query (Prod instance).

Example

```
{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": ["DRUGBANK:DB00002"],
          "categories": ["biolink:SmallMolecule"]
        },
        "n1": {
          "categories": ["biolink:Disease"]
        }
      },
      "edges": {
        "e01": {
          "subject": "n0",
          "object": "n1",
          "predicates": ["biolink:treats"]
        }
      }
    }
  }
}
```

There should be edges like this that come from repodb:

```
  "7cc54b63aaf016ef67d50252c2323b04": {
    "predicate": "biolink:treats",
    "subject": "PUBCHEM.COMPOUND:14122979",
    "object": "MONDO:0021063",
    "attributes": [],
    "sources": [
      {
        "resource_id": "infores:repodb",
        "resource_role": "primary_knowledge_source"
      },
      {
        "resource_id": "infores:biothings-repodb",
        "resource_role": "aggregator_knowledge_source",
        "upstream_resource_ids": [
          "infores:repodb"
        ]
      },
      {
        "resource_id": "infores:service-provider-trapi",
        "resource_role": "aggregator_knowledge_source",
        "upstream_resource_ids": [
          "infores:biothings-repodb"
        ]
      }
    ]
  },
```
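A check like this can be scripted by filtering the response's edges on their `sources`. The sketch below assumes the edges have already been parsed into a dict keyed by edge ID (as in the fragment above); the function name `edges_from_source` is hypothetical.

```python
def edges_from_source(edges, infores_id, role="primary_knowledge_source"):
    """Return the edges that list `infores_id` with the given resource role."""
    return {
        edge_id: edge
        for edge_id, edge in edges.items()
        if any(
            source.get("resource_id") == infores_id
            and source.get("resource_role") == role
            for source in edge.get("sources", [])
        )
    }


# The edge from the response above, as it would look after json parsing.
edges = {
    "7cc54b63aaf016ef67d50252c2323b04": {
        "predicate": "biolink:treats",
        "subject": "PUBCHEM.COMPOUND:14122979",
        "object": "MONDO:0021063",
        "attributes": [],
        "sources": [
            {"resource_id": "infores:repodb",
             "resource_role": "primary_knowledge_source"},
            {"resource_id": "infores:biothings-repodb",
             "resource_role": "aggregator_knowledge_source",
             "upstream_resource_ids": ["infores:repodb"]},
            {"resource_id": "infores:service-provider-trapi",
             "resource_role": "aggregator_knowledge_source",
             "upstream_resource_ids": ["infores:biothings-repodb"]},
        ],
    }
}
repodb_edges = edges_from_source(edges, "infores:repodb")
```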