NCATS-Tangerine / translator-api-registry

This repo hosts the API metadata for the Translator project

SuppKG API YAML #122

Closed mnarayan1 closed 1 year ago

mnarayan1 commented 1 year ago

YAML for the SuppKG API. The API is located here.

Notes:

I've been trying to test my yaml file with this query:

curl --request POST \
  --url http://localhost:3000/v1/smartapi/suppkg/query \
  --header 'Content-Type: application/json' \
  --data '{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": ["UMLS:C0008780"]
                },
                "n1": {
                    "categories": ["biolink:NamedThing"]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                }
            }
        }
    }
}'

Here is my smartapi_overrides.json file:

{
  "conf": {
    "only_overrides": true
  },
  "apis": {
    "suppkg": "https://raw.githubusercontent.com/mnarayan1/translator-api-registry/master/suppkg/suppkg.yaml"  
  }
}

However, I'm getting this error: {"error":"Your input query graph is invalid","more_info":"Your Input Query Graph is invalid."}

Are there any issues with my annotations? Should I format my query differently?

colleenXu commented 1 year ago

I know this post is long and kinda intimidating >.<. I think you've done a good job overall (great attention to detail!).

I'll summarize the feedback as:


Addressing the issues you raised

1. I'm not sure why you're getting that error when testing locally. I can paste the cURL you provided into my terminal and BTE will execute without errors
   * Postman converts my queries to the cURL snippet below, which is a bit different from what you provided. Maybe try that format?
   * Maybe run `git pull` and `npm run pull` to see if BTE updates. If it does, then run `npm run compile` to make sure any changes are incorporated.
   * If you still have issues, please post to our lab's #ncats-translator channel so multiple people can look into what's going on...
   * potentially useful reminder: you can test a local file, but it'll need 3 slashes (like I'm using `file:///Users/colleenxu/Desktop/translator-api-registry/_temp_testing/suppkg.yaml`). And if you adjust your yaml and then want to test it, you'll want to save your yaml file, stop/quit BTE, run the `API_OVERRIDE=true npm run smartapi_sync` command again, and then start BTE again. BTE won't automatically pull in the changes

   cURL from Postman

   ```
   curl --location 'http://localhost:3000/v1/smartapi/suppkg/query' \
   --header 'Content-Type: application/json' \
   --data '{
       "message": {
           "query_graph": {
               "nodes": {
                   "n0": {
                       "ids": ["UMLS:C0008780"]
                   },
                   "n1": {
                       "categories": ["biolink:NamedThing"]
                   }
               },
               "edges": {
                   "e01": {
                       "subject": "n0",
                       "object": "n1"
                   }
               }
           }
       }
   }
   '
   ```

2. use the not-created-yet `infores:biothings-suppkg` in the x-translator section (infores for the BioThings API). Then use `infores:suppkg` in the operations. Once we're close to registering this yaml, I'll make a PR to [biolink-model](https://github.com/biolink/biolink-model/blob/master/infores_catalog.yaml) with these new infores IDs.
3. yeah, there seems to be issues with writing operations as general as NamedThing - NamedThing. Related to the "But regarding operations" section below.
   * from my testing, it looks like BTE won't use these operations because every starting/input ID is found to be something more specific than NamedThing (the queryEdge is more specific than the operation so BTE doesn't match this operation with the queryEdge). To get your example query working, once I made the changes to the operations/response-mapping (mentioned in the next section), I had to change the inputs.semantic to a more specific category that matched that input ID (Disease).

Minor yaml suggestions

* add a sentence in the `info.description` of the API: about the publication / what suppKG has inside, with a link to the publication
* change version to what the BioThings API version is. It looks like it might be `2021`?
* tag the endpoints with the path `/association/` as `association`. Right now they're tagged as `interaction`. My understanding is that this tag on endpoints is to group endpoints by path
* in the currently commented-out `testExamples`, I provide the IDs in the format that they would be when querying in TRAPI format. So they have the biolink-model prefix (specified in inputs.id and outputs.id).

Feedback on the current operations

* the `parameter.fields` section needs to list / encompass all the fields in the response-mapping because that part of the query tells the BioThings API what fields you want in the response.
  * also I suggest adding more fields: `relation.conf` and `relation.sentence` seem useful.
* in the response-mapping, `input_name` and `output_name` are special keywords that tell BTE that these aren't edge-attributes - their values should actually be used to replace the node names when SRI Node Normalizer doesn't provide a human-readable name for those node's IDs. They should match the inputs/outputs of the operation:
  * subject-object's response-mapping `object` should have `input_name: subject.name`
  * object-subject's response-mapping `subject` should have `input_name: object.name`
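The point about `parameter.fields` needing to cover the response-mapping can be checked mechanically. Below is a minimal Python sketch, not part of BTE: the helper name `missing_fields` is hypothetical, and it assumes that requesting a parent field such as `relation` also returns its dotted sub-fields (e.g. `relation.pmid`).

```python
def missing_fields(fields_param, response_mapping):
    """Return response-mapping paths that the `fields` parameter does not cover.

    Assumption: requesting a parent field (e.g. "relation") also returns
    its dotted sub-fields (e.g. "relation.pmid").
    """
    requested = [f.strip() for f in fields_param.split(",") if f.strip()]

    def covered(path):
        # A path is covered if it was requested directly or via a parent field.
        return any(path == r or path.startswith(r + ".") for r in requested)

    return sorted({p for p in response_mapping.values() if not covered(p)})


# Example mapping using field names that appear in this thread.
mapping = {
    "UMLS": "object.umls",
    "pubmed": "relation.pmid",
    "input_name": "subject.name",
    "output_name": "object.name",
}
print(missing_fields("object.umls,relation,subject.name,object.name", mapping))
print(missing_fields("object.umls,relation.pmid", mapping))
```

An empty list means every mapped field is requested; anything listed would need to be added to `parameters.fields`.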

But regarding the operations

EDIT: @andrewsu and I have decided that this is a good next step.

With this resource, I think we'll need to write more specific operations, based on the set of unique combos of subject.semtypes,predicate,object.semtypes values (meta-triples). This probably involves analyzing the data underlying this API.

Then, depending on how many unique combos there are, we could then decide whether we want to map to biolink-model / write operations manually or through code (like what we do with semmeddb).

Here's an example of what I think the format for operations would be (I've worked through it and tested it):

the x-bte operations and response-mapping section

```
SmallMolecule-treats-Disease:   ## 595,222 records
  - supportBatch: true
    useTemplating: true   ## flag to say templating is being used below
    inputs:
      - id: UMLS
        semantic: SmallMolecule
    requestBodyType: object
    requestBody:
      body: >-
        {"q": {{ queryInputs | replPrefix('predicate:TREATS AND object.semtypes:((dsyn) OR (neop)) AND subject.umls')| dump }}, "scopes": []}
    outputs:
      - id: UMLS
        semantic: Disease
    parameters:
      fields: object.umls,relation,subject.name,object.name
      size: 1000
    predicate: treats
    source: "infores:suppkg"   # no infores for suppkg yet
    response_mapping:
      "$ref": "#/components/x-bte-response-mapping/object"
    # testExamples:
    #   - qInput: "UMLS:C0062737"   ## histaglobin
    #     oneOutput: "UMLS:C0002103"   ## allergic rhinitis
SmallMolecule-treats-Disease-rev:
  - supportBatch: true
    useTemplating: true
    inputs:
      - id: UMLS
        semantic: Disease
    requestBodyType: object
    requestBody:
      body: >-
        {"q": {{ queryInputs | replPrefix('predicate:TREATS AND subject.semtypes:((phsu) OR (orch)) AND object.umls')| dump }}, "scopes": []}
    outputs:
      - id: UMLS
        semantic: SmallMolecule
    parameters:
      fields: subject.umls,relation,subject.name,object.name
      size: 1000
    predicate: treated_by
    source: "infores:suppkg"   # no infores for suppkg yet
    response_mapping:
      "$ref": "#/components/x-bte-response-mapping/subject"
    # testExamples:
    #   - qInput: "UMLS:C0263338"   ## urticaria, chronic
    #     oneOutput: "UMLS:C0062737"   ## histaglobin
x-bte-response-mapping:
  object:
    UMLS: object.umls
    suppkg_confidence_score: relation.conf   ## not sure what to name this...you may know better?
    pubmed: relation.pmid
    "biolink:supporting_text": relation.sentence
    input_name: subject.name
    output_name: object.name
  subject:
    UMLS: subject.umls
    suppkg_confidence_score: relation.conf   ## not sure what to name this...you may know better?
    pubmed: relation.pmid
    "biolink:supporting_text": relation.sentence
    input_name: object.name
    output_name: subject.name
```

Example response from testing: suppkg.txt

notes:

colleenXu commented 1 year ago

well...now I'm done editing my comment >.<. Hopefully this makes it easier to digest

mnarayan1 commented 1 year ago

@colleenXu Thank you for the feedback! I've updated the yaml with your suggestions, and replaced the operations section with what you wrote.

With this resource, I think we'll need to write more specific operations, based on the set of unique combos of subject.semtypes,predicate,object.semtypes values (meta-triples). This probably involves analyzing the data underlying this API.

Regarding the above, I can get counts for the predicates and how many subjects/objects have multiple semtypes.

colleenXu commented 1 year ago

@mnarayan1 (CC @andrewsu )

I'd like to check in: how is the analysis of the data's predicates/semtypes going? And are you able to test YAMLs locally now?

mnarayan1 commented 1 year ago

@colleenXu Sorry for the late response, I was out of town. I fixed the issue with my local installation of BTE, and I am able to test the yaml now.

Here is the analysis I've gotten on the data.

Number of records with only one semtype: 190314

Occurrences of each predicate:
CAUSES: 28792
COEXISTS_WITH: 73720
COMPARED_WITH: 12826
PREDISPOSES: 4647
AUGMENTS: 17074
STIMULATES: 14759
ASSOCIATED_WITH: 17417
ISA: 11234
AFFECTS: 49248
INTERACTS_WITH: 43273
PART_OF: 40920
ADMINISTERED_TO: 10329
PROCESS_OF: 54557
PRODUCES: 8031
PRECEDES: 2453
USES: 25120
LOCATION_OF: 77989
DIAGNOSES: 4895
DISRUPTS: 14084
COMPLICATES: 443
INHIBITS: 16856
TREATS: 43353
PREVENTS: 10247
CONVERTS_TO: 896
SAME_AS: 142
HIGHER_THAN: 1411
LOWER_THAN: 93
METHOD_OF: 5588
MEASURES: 3449
OCCURS_IN: 1139
MANIFESTATION_OF: 237

Is there any other information I should get?

colleenXu commented 1 year ago

@mnarayan1

Based on your info, it sounds like:


I think it would be helpful to have more specific info:

A) Do you know what exact semtypes field values correspond to supplements? If you don't, is there a way to analyze the data and figure this out?

B) Is it possible to generate a table containing counts of how many records there are for each unique combo of subject.semtypes, predicate, object.semtypes values (meta-triples)? Something like this:

| subject semtype | predicate | object semtype | count |
| --- | --- | --- | --- |
| phsu,orch | TREATS | dsyn | 4000 |
| phsu | TREATS | dsyn | 6000 |
| orch | TREATS | dsyn | 300 |

What would be most helpful are exact matches: so phsu,orch represents just that, and not stuff that's an inexact match like phsu or phsu,orch,bacs.
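To make the "exact match" idea concrete, here is a minimal Python sketch of counting meta-triples. It is only an illustration under assumptions: the function name `count_meta_triples` is hypothetical, and each record is assumed to be a dict where `subject.semtypes` and `object.semtypes` are lists of strings and `predicate` is a string.

```python
from collections import Counter


def count_meta_triples(records):
    """Count records per exact (subject semtypes, predicate, object semtypes) combo.

    Assumption: each record looks like
    {"subject": {"semtypes": [...]}, "predicate": "...", "object": {"semtypes": [...]}}.
    """
    counts = Counter()
    for rec in records:
        # Sort so that the same set of semtypes always yields the same key,
        # regardless of the order in which they are stored.
        subj = ",".join(sorted(rec["subject"]["semtypes"]))
        obj = ",".join(sorted(rec["object"]["semtypes"]))
        counts[(subj, rec["predicate"], obj)] += 1
    return counts


# Made-up records for illustration only.
example = [
    {"subject": {"semtypes": ["phsu", "orch"]}, "predicate": "TREATS", "object": {"semtypes": ["dsyn"]}},
    {"subject": {"semtypes": ["orch", "phsu"]}, "predicate": "TREATS", "object": {"semtypes": ["dsyn"]}},
    {"subject": {"semtypes": ["phsu"]}, "predicate": "TREATS", "object": {"semtypes": ["dsyn"]}},
]
for triple, n in count_meta_triples(example).most_common():
    print(triple, n)
```

Because the key is the full joined list, a record with `phsu` alone is counted separately from one with `phsu,orch`, which is the exact-match behavior described above.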

C) I see a relation.conf field in the records. Do we have a sense of the distribution of this value? A range would be helpful, or something like this

My brainstorming

This KP is very similar to semmeddb...which is problematic because semmeddb has thousands of operations and requires a TON of special processing (pmid count, semtype/domain-predicate/range-predicate exclusions, novelty, etc.). My tentative ideas are:

* figure out what meta-triples cover useful info on supplements, and only make x-bte annotation for those.
  * my guess on supplement semtypes:
    * [inch,phsu](https://pending.biothings.io/suppkg/query?q=(subject.semtypes:inch%20AND%20subject.semtypes:phsu)%20AND%20predicate:TREATS%20AND%20object.semtypes:dsyn): inorganic chemical + pharmacological substance
    * [vita: vitamin](https://pending.biothings.io/suppkg/query?q=subject.semtypes:vita%20AND%20predicate:TREATS%20AND%20object.semtypes:dsyn)
    * [bact,phsu](https://pending.biothings.io/suppkg/query?q=(subject.semtypes:bact%20AND%20subject.semtypes:phsu)%20AND%20predicate:TREATS%20AND%20object.semtypes:dsyn): bacteria + pharmacological substance (probiotics)
    * [fngs,phsu](https://pending.biothings.io/suppkg/query?q=(subject.semtypes:fngs%20AND%20subject.semtypes:phsu)%20AND%20predicate:TREATS%20AND%20object.semtypes:dsyn): fungus + pharmacological substance (probiotics)
    * [plnt: plant](https://pending.biothings.io/suppkg/query?q=subject.semtypes:plnt%20AND%20predicate:TREATS%20AND%20object.semtypes:dsyn)
    * [elii: Element, Ion, or Isotope](https://pending.biothings.io/suppkg/query?q=subject.semtypes:elii%20AND%20predicate:TREATS%20AND%20object.semtypes:dsyn)
  * Double-check against the semmeddb exclusions (the type/domain/range stuff) to make sure they're allowed.
* can we filter by relation.conf? I think we can only query "have at least 1 of the relation.conf values for this record be > X" but that may still be helpful....or we could adjust the parser to only include data with a relation.conf value > X...
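The "at least 1 of the relation.conf values > X" condition can be sketched in Python as below. This is an illustration only, not parser code: the function name `has_confident_relation` is hypothetical, and it assumes a record's `relation` field is either a single object or a list of objects that each carry a numeric `conf`.

```python
def has_confident_relation(record, threshold):
    """True if at least one relation in the record has conf above the threshold.

    Assumption: record["relation"] is a dict or a list of dicts with a numeric "conf".
    """
    relations = record.get("relation", [])
    if isinstance(relations, dict):
        # Normalize a single relation object to a one-element list.
        relations = [relations]
    return any(rel.get("conf", 0) > threshold for rel in relations)


# Made-up record for illustration only.
record = {"relation": [{"conf": 0.55}, {"conf": 0.9}]}
print(has_confident_relation(record, 0.8))
print(has_confident_relation(record, 0.95))
```

Filtering at parse time would instead drop the individual relations below the threshold, which is a stricter rule than this record-level check.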
colleenXu commented 1 year ago

Err...and the table from B) may be way too large for a github comment. A csv / tsv file may be the best way to share this table (along with a jupyter notebook or google colab notebook of the data analysis you're doing and how you're generating the table).

mnarayan1 commented 1 year ago

@colleenXu

Here is the notebook where I've done my work. It has a list of semtypes that could correspond to supplements, distribution of relation.conf values, and code used to generate the table of meta-triples.

A) There doesn't seem to be anywhere in SuppKG that explicitly states whether or not something is a dietary supplement. However, I looked through this list (containing all 133 UMLS semantic types) and compiled a list of semtypes that could possibly correspond to a supplement (excluding objects, body parts, diseases, etc.)

B) Here is the csv file with unique triples and their counts.

C) The distribution of relation.conf values is in the notebook. All relation.conf values are between 0.5 and 0.968.

andrewsu commented 1 year ago

So while there are many metatriples in suppkg, we are really only interested in the ones that directly relate to supplements. So if you took your list of possible semantic types associated with supplements from your notebook, can you redo the analysis showing the counts of each metatriple in this csv?

mnarayan1 commented 1 year ago

Here are the counts of metatriples with only supplements.

andrewsu commented 1 year ago

Hmm, that still results in a huge list of metatriples. So let's change gears a little bit. Rather than trying to come up with exclusion filters to remove what we don't want, let's instead focus on defining a small set of inclusion filters for triples that we do want. For this resource, the most unique triples we get are [supplements] - TREATS - [disease]. So, if I restrict your CSV to rows where the predicate is TREATS, the object is "dsyn", and the count is > 100, I get this list:

| subject | predicate | object | count |
| --- | --- | --- | --- |
| ['orch', 'phsu'] | TREATS | ['dsyn'] | 2180 |
| ['phsu'] | TREATS | ['dsyn'] | 2066 |
| ['phsu', 'plnt'] | TREATS | ['dsyn'] | 1307 |
| ['orch', 'phsu', 'dsp'] | TREATS | ['dsyn'] | 746 |
| ['orch', 'phsu', 'vita', 'dsp'] | TREATS | ['dsyn'] | 301 |
| ['phsu', 'plnt', 'dsp'] | TREATS | ['dsyn'] | 299 |
| ['food', 'phsu', 'dsp'] | TREATS | ['dsyn'] | 297 |
| ['bacs', 'orch', 'phsu', 'dsp'] | TREATS | ['dsyn'] | 281 |
| ['antb', 'orch'] | TREATS | ['dsyn'] | 236 |
| ['bact', 'phsu', 'dsp'] | TREATS | ['dsyn'] | 218 |
| ['antb'] | TREATS | ['dsyn'] | 202 |
| ['aapp', 'gngm', 'bacs', 'phsu', 'dsp'] | TREATS | ['dsyn'] | 176 |
| ['bact', 'phsu'] | TREATS | ['dsyn'] | 167 |
| ['bacs', 'phsu'] | TREATS | ['dsyn'] | 150 |
| ['aapp', 'gngm', 'phsu'] | TREATS | ['dsyn'] | 132 |
| ['bacs', 'orch', 'phsu'] | TREATS | ['dsyn'] | 128 |
| ['inch', 'phsu'] | TREATS | ['dsyn'] | 119 |
| ['phsu', 'dsp'] | TREATS | ['dsyn'] | 106 |

I would take the union of all the subject types, and see if you can create a smartAPI operation (or a set of operations) to retrieve those triples specifically. Does that make sense?

mnarayan1 commented 1 year ago

@andrewsu @colleenXu I've finished writing the operations to retrieve the above triples. I've tested them out on my local BTE instance, and the queries for each triple type seem to work (I included the testExamples in the yaml). Is there anything else I should add?

colleenXu commented 1 year ago

@mnarayan1

Suggested major edits:

I think it'll be simpler and more elegant to have 2 operations

* One for `supplement-treats-disease`. It would be very similar to the current `SmallMolecule-treats-Disease`, but the object.semtypes would be set to all SEMMED semantic types that are mapped to Disease: `(acab OR anab OR cgab OR comd OR dsyn OR mobd OR neop)`. So the requestBody would be like the code chunk below
  * During my testing, the nested parentheses weren't needed.
  * This list of semtypes comes from my [analysis of the valid SEMMEDDB metatriples](https://github.com/NCATS-Tangerine/translator-api-registry/blob/master/semmeddb/AutoGen_SEMMEDDB.ipynb) after Translator-curated exclusions were applied.

  ```
  requestBody:
    body: >-
      {"q": {{ queryInputs | replPrefix('predicate:TREATS AND object.semtypes:(acab OR anab OR cgab OR comd OR dsyn OR mobd OR neop) AND subject.umls')| dump }}, "scopes": []}
  ```

* The other for `disease-treated_by-supplement`. It would be similar to one of the rev operations, but the subject.semtypes would be set to some of the SEMMED semantic types for supplements: `(aapp OR antb OR bacs OR dsp OR food OR inch OR orch OR phsu OR vita)`. AND subject.semtypes would be set to NOT be other semantic-types for supplements: (bact OR gngm OR plnt). So the requestBody would be like the code chunk below
  * excluding bact, gngm, plnt because there are Translator-curated exclusions (domain-predicate) that say it isn't valid to have these as the subject for a TREATS statement

  ```
  requestBody:
    body: >-
      {"q": {{ queryInputs | replPrefix('predicate:TREATS AND (NOT subject.semtypes:(bact OR gngm OR plnt)) AND subject.semtypes:(aapp OR antb OR bacs OR dsp OR food OR inch OR orch OR phsu OR vita) AND object.umls')| dump }}, "scopes": []}
  ```

adjust response-mapping

* change the keyword `pubmed` to `ref_pmid`, due to this [issue](https://github.com/biothings/biothings_explorer/issues/677#issue-1821677954). We recently [pushed changes to all SmartAPI yamls](https://github.com/biothings/biothings_explorer/issues/677#issuecomment-1678213370) for this.
* Given the [TRAPI validation issues](https://github.com/biothings/biothings_explorer/issues/587#issuecomment-1635346625), please comment out the `suppkg_confidence_score` lines. The final response-mapping may look something like this:

  ```
  object:
    UMLS: object.umls
    ref_pmid: relation.pmid
    "biolink:supporting_text": relation.sentence
    input_name: subject.name
    output_name: object.name
    ## not including these fields due to data-processing / biolink-modeling issues
    # suppkg_confidence_score: relation.conf
  ```

change the parameter.fields to match the response-mapping

For the two operations, the parameter.fields can be changed since we'll only need the fields that are referenced in the response-mapping. So something like this could work for the supplement-treats-disease operation (object.umls contains the output): `object.umls,relation.pmid,relation.sentence,subject.name,object.name`
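For anyone regenerating the long query prefixes by script rather than by hand, here is a small Python sketch that assembles the string passed to `replPrefix` for the `disease-treated_by-supplement` operation. The helper name `build_treats_prefix` is hypothetical and is not part of BTE or the BioThings API; it only concatenates the include and exclude semtype lists given in this comment.

```python
def build_treats_prefix(include, exclude, id_field="object.umls"):
    """Assemble the query prefix for a TREATS operation.

    include / exclude are lists of subject semtypes; id_field is the field
    the input IDs are matched against (hypothetical helper, for illustration).
    """
    parts = ["predicate:TREATS"]
    if exclude:
        parts.append("(NOT subject.semtypes:(" + " OR ".join(exclude) + "))")
    parts.append("subject.semtypes:(" + " OR ".join(include) + ")")
    parts.append(id_field)
    return " AND ".join(parts)


include = ["aapp", "antb", "bacs", "dsp", "food", "inch", "orch", "phsu", "vita"]
exclude = ["bact", "gngm", "plnt"]
print(build_treats_prefix(include, exclude))
```

With these two lists the output matches the prefix in the second requestBody above, so the semtype lists can be kept in one place and the YAML string regenerated when they change.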

Minor edits

* set `info.x-translator.biolink-version` to `"3.5.3"` instead
* in `servers`, you can probably remove the `Production server` object since it's identical to the `Encrypted Production server` object
* in `description`, also include the [link to the suppKG paper](https://arxiv.org/abs/2106.12741) (ref: Andrew's post [here](https://github.com/biothings/pending.api/issues/55))

colleenXu commented 1 year ago

@andrewsu

This API still seems to have "fake" UMLS:DC IDs, and I suggest discussing this (parser enhancements?) before registering the SmartAPI yaml, which would make it accessible via the api-specific endpoints (v1/smartapi/).

This was previously brought up starting here and the comments below it all seem relevant.

andrewsu commented 1 year ago

@colleenXu Let's go ahead and allow these "fake UMLS IDs" to be returned. Presumably, NodeNormalizer will fail to resolve these, and BTE will use the original names from SuppKG as the human-readable names for presentation in the ARAX UI and Translator UI. At least that's how I think it will work -- let's see how it works in practice...

@mnarayan1 let us know when you have the updates done from @colleenXu's suggestions above...

mnarayan1 commented 1 year ago

@andrewsu @colleenXu I've finished with the edits, and the testing is still working for me.

colleenXu commented 1 year ago

I'm going to merge this PR, since the yaml looks ready. Good job @mnarayan1!

We'll continue discussion and next steps in https://github.com/biothings/biothings_explorer/issues/706