I know this post is long and kinda intimidating >.<. I think you've done a good job overall (great attention to detail!).
I'll summarize the feedback as:
EDIT: @andrewsu and I have decided that this is a good next step.
With this resource, I think we'll need to write more specific operations, based on the set of unique combos of subject.semtypes, predicate, object.semtypes values (meta-triples). This probably involves analyzing the data underlying this API.
Then, depending on how many unique combos there are, we could then decide whether we want to map to biolink-model / write operations manually or through code (like what we do with semmeddb).
Here's an example of what I think the format for operations would be (I've worked through it and tested it):
Example response from testing: suppkg.txt
notes:
well...now I'm done editing my comment >.<. Hopefully this makes it easier to digest
@colleenXu Thank you for the feedback! I've updated the yaml with your suggestions, and replaced the operations section with what you wrote.
With this resource, I think we'll need to write more specific operations, based on the set of unique combos of subject.semtypes,predicate,object.semtypes values (meta-triples). This probably involves analyzing the data underlying this API.
Regarding the above, I can get counts for the predicates and how many subjects/objects have multiple semtypes.
@mnarayan1 (CC @andrewsu )
I'd like to check in: how is the analysis of the data's predicates/semtypes going? And are you able to test YAMLs locally?
@colleenXu Sorry for the late response, I was out of town. I fixed the issue with my local installation of BTE, and I am able to test the yaml now.
Here is the analysis I've gotten on the data.
Number of records with only one semtype: 190314
Occurrences of each predicate:
CAUSES: 28792
COEXISTS_WITH: 73720
COMPARED_WITH: 12826
PREDISPOSES: 4647
AUGMENTS: 17074
STIMULATES: 14759
ASSOCIATED_WITH: 17417
ISA: 11234
AFFECTS: 49248
INTERACTS_WITH: 43273
PART_OF: 40920
ADMINISTERED_TO: 10329
PROCESS_OF: 54557
PRODUCES: 8031
PRECEDES: 2453
USES: 25120
LOCATION_OF: 77989
DIAGNOSES: 4895
DISRUPTS: 14084
COMPLICATES: 443
INHIBITS: 16856
TREATS: 43353
PREVENTS: 10247
CONVERTS_TO: 896
SAME_AS: 142
HIGHER_THAN: 1411
LOWER_THAN: 93
METHOD_OF: 5588
MEASURES: 3449
OCCURS_IN: 1139
MANIFESTATION_OF: 237
Is there any other information I should get?
@mnarayan1
Based on your info, it sounds like:
I think it would be helpful to have more specific info:
A) Do you know what exact semtypes field values correspond to supplements? If you don't, is there a way to analyze the data and figure this out?
B) Is it possible to generate a table containing counts of how many records there are for each unique combo of subject.semtypes, predicate, object.semtypes values (meta-triples)? Something like this:
subject semtype | predicate | object semtype | count |
---|---|---|---|
phsu,orch | TREATS | dsyn | 4000 |
phsu | TREATS | dsyn | 6000 |
orch | TREATS | dsyn | 300 |
What would be most helpful are exact matches: so phsu,orch represents just that, and not stuff that's an inexact match like phsu or phsu,orch,bacs.
C) I see a relation.conf field in the records. Do we have a sense of the distribution of this value? A range would be helpful, or something like this
Err...and the table from B) may be way too large for a github comment. A csv / tsv file may be the best way to share this table (along with a jupyter notebook or google colab notebook of the data analysis you're doing and how you're generating the table).
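For reference, the exact-match counting described in B) could be sketched roughly as follows. This is only an illustration: the record layout (nested `subject`/`object` dicts with a `semtypes` value that may be a string or a list) is an assumption based on the field names mentioned above, and the sample records are hypothetical. Sorting the semtypes before joining means `phsu,orch` and `orch,phsu` are counted as the same combo; drop the `sorted` call if order should matter.

```python
from collections import Counter


def semtype_key(semtypes):
    """Turn a semtypes value (string or list) into a sorted, comma-joined key."""
    if isinstance(semtypes, str):
        semtypes = [semtypes]
    return ",".join(sorted(semtypes))


def count_meta_triples(records):
    """Count records per exact (subject semtypes, predicate, object semtypes) combo."""
    counts = Counter()
    for rec in records:
        key = (
            semtype_key(rec["subject"]["semtypes"]),
            rec["predicate"],
            semtype_key(rec["object"]["semtypes"]),
        )
        counts[key] += 1
    return counts


# Hypothetical records, only to show the expected shape (assumed, not real data).
records = [
    {"subject": {"semtypes": ["phsu", "orch"]}, "predicate": "TREATS",
     "object": {"semtypes": ["dsyn"]}},
    {"subject": {"semtypes": ["orch", "phsu"]}, "predicate": "TREATS",
     "object": {"semtypes": "dsyn"}},
    {"subject": {"semtypes": ["phsu"]}, "predicate": "TREATS",
     "object": {"semtypes": ["dsyn"]}},
]

meta_triple_counts = count_meta_triples(records)

# Print tab-separated rows, most frequent combo first.
for (subj, pred, obj), n in meta_triple_counts.most_common():
    print(f"{subj}\t{pred}\t{obj}\t{n}")
```

Because the key is the full joined semtype list, a record typed `phsu` never contributes to the `orch,phsu` row, which is the exact-match behaviour asked for above.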
@colleenXu
Here is the notebook where I've done my work. It has a list of semtypes that could correspond to supplements, distribution of relation.conf values, and code used to generate the table of meta-triples.
A) There doesn't seem to be anywhere in SuppKG that explicitly states whether or not something is a dietary supplement. However, I looked through this list (containing all 133 UMLS semantic types) and compiled a list of semtypes that could possibly correspond to a supplement (excluding objects, body parts, diseases, etc.).
B) Here is the csv file with unique triples and their counts.
C) The distribution of relation.conf values is in the notebook. All relation.conf values are between 0.5 and 0.968.
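A range-plus-quartiles summary of relation.conf could be produced along these lines. This is a minimal sketch, not the notebook's code: `summarize_conf` is a hypothetical helper, and the sample values are made up to fall inside the reported 0.5 to 0.968 range.

```python
import statistics


def summarize_conf(values):
    """Return count, min, quartiles, and max for a list of confidence scores."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "count": len(values),
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


# Hypothetical relation.conf values (assumed, not taken from SuppKG).
conf_values = [0.5, 0.62, 0.71, 0.83, 0.9, 0.968]
conf_summary = summarize_conf(conf_values)

for name, value in conf_summary.items():
    print(f"{name}: {value}")
```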
So while there are many metatriples in suppkg, we are really only interested in the ones that directly relate to supplements. So if you took your list of possible semantic types associated with supplements from your notebook, can you redo the analysis showing the counts of each metatriple in this csv?
Hmm, that still results in a huge list of metatriples. So let's change gears a little bit. Rather than trying to come up with exclusion filters to remove what we don't want, let's instead focus on defining a small set of inclusion filters for triples that we do want. For this resource, the most unique thing we get is [supplements] - TREATS - [disease]. So, if I restrict your CSV to rows where the predicate is TREATS, the object is "dsyn", and the count is > 100, I get this list:
subject | predicate | object | count |
---|---|---|---|
['orch', 'phsu'] | TREATS | ['dsyn'] | 2180 |
['phsu'] | TREATS | ['dsyn'] | 2066 |
['phsu', 'plnt'] | TREATS | ['dsyn'] | 1307 |
['orch', 'phsu', 'dsp'] | TREATS | ['dsyn'] | 746 |
['orch', 'phsu', 'vita', 'dsp'] | TREATS | ['dsyn'] | 301 |
['phsu', 'plnt', 'dsp'] | TREATS | ['dsyn'] | 299 |
['food', 'phsu', 'dsp'] | TREATS | ['dsyn'] | 297 |
['bacs', 'orch', 'phsu', 'dsp'] | TREATS | ['dsyn'] | 281 |
['antb', 'orch'] | TREATS | ['dsyn'] | 236 |
['bact', 'phsu', 'dsp'] | TREATS | ['dsyn'] | 218 |
['antb'] | TREATS | ['dsyn'] | 202 |
['aapp', 'gngm', 'bacs', 'phsu', 'dsp'] | TREATS | ['dsyn'] | 176 |
['bact', 'phsu'] | TREATS | ['dsyn'] | 167 |
['bacs', 'phsu'] | TREATS | ['dsyn'] | 150 |
['aapp', 'gngm', 'phsu'] | TREATS | ['dsyn'] | 132 |
['bacs', 'orch', 'phsu'] | TREATS | ['dsyn'] | 128 |
['inch', 'phsu'] | TREATS | ['dsyn'] | 119 |
['phsu', 'dsp'] | TREATS | ['dsyn'] | 106 |
I would take the union of all the subject types, and see if you can create a smartAPI operation (or a set of operations) to retrieve those triples specifically. Does that make sense?
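The filter and the union of subject types described above could be sketched like this. It assumes the CSV has the columns shown in the table (subject, predicate, object, count) with list-style semtype values; the real file's column names may differ. The first four data rows are copied from the table; the last two are hypothetical rows added only to show what gets filtered out.

```python
import ast
import csv
import io

# Excerpt in the layout of the table above (column names are an assumption).
CSV_TEXT = '''subject,predicate,object,count
"['orch', 'phsu']",TREATS,"['dsyn']",2180
"['phsu']",TREATS,"['dsyn']",2066
"['antb']",TREATS,"['dsyn']",202
"['phsu', 'dsp']",TREATS,"['dsyn']",106
"['phsu']",PREVENTS,"['dsyn']",500
"['vita']",TREATS,"['dsyn']",40
'''


def select_rows(rows, predicate="TREATS", object_types=("dsyn",), min_count=100):
    """Keep rows matching the predicate and object semtypes with count > min_count."""
    selected = []
    for row in rows:
        subj = ast.literal_eval(row["subject"])  # "['orch', 'phsu']" -> list
        obj = ast.literal_eval(row["object"])
        count = int(row["count"])
        if (row["predicate"] == predicate
                and obj == list(object_types)
                and count > min_count):
            selected.append((subj, count))
    return selected


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
selected = select_rows(rows)

# Union of all subject semtypes across the selected meta-triples.
subject_types = sorted({t for subj, _ in selected for t in subj})
print(subject_types)
```

The resulting list of semtypes is what a set of operations would need to cover on the subject side; whether that is one operation or several is a separate decision.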
@andrewsu @colleenXu I've finished writing the operations to retrieve the above triples. I've tested them out on my local BTE instance, and the queries for each triple type seem to work (I included the testExamples in the yaml). Is there anything else I should add?
@mnarayan1
@andrewsu
This API seems to still have "fake" UMLS:DC IDs, and I suggest discussing this (parser enhancements?) before registering the SmartAPI yaml (which would make it accessible via the api-specific endpoints, v1/smartapi/).
This was previously brought up starting here and the comments below it all seem relevant.
@colleenXu Let's go ahead and allow these "fake UMLS IDs" to be returned. Presumably, NodeNormalizer will fail to resolve these, and BTE will use the original names from SuppKG as the human-readable names for presentation in the ARAX UI and Translator UI. At least that's how I think it will work -- let's see how it works in practice...
@mnarayan1 let us know when you have the updates done from @colleenXu's suggestions above...
@andrewsu @colleenXu I've finished with the edits, and the testing is still working for me.
I'm going to merge this PR, since the yaml looks ready. Good job @mnarayan1!
We'll continue discussion and next steps in https://github.com/biothings/biothings_explorer/issues/706
YAML for the SuppKG API. The API is located here.
Notes:
I used NamedThing for the semantic field of the x-bte-operations section. Is there something more specific I should use instead?
I've been trying to test my yaml file with this query:
Here is my smartapi_overrides.json file:
However, I'm getting this error:
{"error":"Your input query graph is invalid","more_info":"Your Input Query Graph is invalid."}
Are there any issues with my annotations? Should I format my query differently?