Closed by andrewsu 1 year ago
Note:
Possibly relevant, from Slack:
Andy Crouse (Unsecret Agent, UI team) 1:13 PM This is the test set of genes and drugs for the next creative mode query. Note there are some ‘destroyer of worlds’ genes and drugs (valproic acid, vitamin A, TP53, P450), so if the lights dim when testing, that is why. But they should be good for pressure testing! I included some that are not classically druggable, like non-coding genes and transcription factors. So hopefully this will cover the gamut. https://docs.google.com/document/d/1X3XbHhS_AIGkSyxSaYUuiCqVNh33Wz7f3EEyKZEctL8/edit?usp=sharing
The proposed TRAPI query templates are defined in these issues:
Noting:
@colleenXu Presently, the qualifier-matching only supports one qualifier-set per templateGroup. Would you prefer I change the implementation to support multiple qualifier-sets in a templateGroup for OR matching within a single group (presently, this would be covered by making multiple templateGroups)?
Examples:
Presently:
{
  "name": "Some template group",
  "subject": ["ChemicalEntity"],
  "predicate": ["affects"],
  "object": ["Gene"],
  "qualifiers": {
    "some_qualifier_type": "some_qualifier_value"
  }
}
Proposed:
{
  "name": "Some template group",
  "subject": ["ChemicalEntity"],
  "predicate": ["affects"],
  "object": ["Gene"],
  "qualifiers": [
    {
      "some_qualifier_type": "some_qualifier_value"
    },
    {
      "some_other_type": "this_would_be_OR"
    }
  ]
}
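To make the difference between the two shapes concrete, here is a minimal Python sketch of how OR matching across qualifier-sets could behave. It is illustrative only and is not BTE's implementation: `normalize` and `matches_template_group` are hypothetical names, and treating a group with no `qualifiers` as matching any query is an assumption.

```python
from typing import Dict, List, Optional, Union

QualifierSet = Dict[str, str]


def normalize(
    qualifiers: Optional[Union[QualifierSet, List[QualifierSet]]]
) -> List[QualifierSet]:
    """Accept the present form (one dict) or the proposed form (list of dicts)."""
    if qualifiers is None:
        # Assumption: a group without qualifiers places no constraint.
        return [{}]
    if isinstance(qualifiers, dict):
        return [qualifiers]
    return list(qualifiers)


def matches_template_group(query_qualifiers: QualifierSet, group: dict) -> bool:
    """True if the query satisfies ANY qualifier-set in the group (OR).

    Satisfying one set means matching ALL of its type/value pairs (AND).
    """
    return any(
        all(query_qualifiers.get(q_type) == q_value
            for q_type, q_value in q_set.items())
        for q_set in normalize(group.get("qualifiers"))
    )


present = {"qualifiers": {"some_qualifier_type": "some_qualifier_value"}}
proposed = {
    "qualifiers": [
        {"some_qualifier_type": "some_qualifier_value"},
        {"some_other_type": "this_would_be_OR"},
    ]
}
query = {"some_other_type": "this_would_be_OR"}
print(matches_template_group(query, present))   # only one set to satisfy
print(matches_template_group(query, proposed))  # second set is an alternative
```

With the present form, the same OR behaviour would need two separate templateGroups, one per qualifier-set.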
I think using 1 qualifier-set for matching to templateGroup is fine because that's what we see in this issue (and near-future).
In the third bullet point of my post above, I'm saying that for OUR individual templates, I'll need to use multiple qualifier-sets to query for "activity" vs "activity_or_abundance" because we don't do qualifier-hierarchy expansion yet.
This is a work-in-progress, so edits will be done over time
Note that the Multiomics APIs have some potentially relevant info, but we have some concerns:
- drug response: has gene-gene relationships, for negative and positive correlation. But it's not clear where this comes from or how to use it
- Wellness: has chem-chem and chem-gene relationships, for correlation and related_to. But it's not clear where this comes from or how to use it
`agonist` of this gene...`inhibitor` of this gene.

`chembl.drug_mechanisms` data, in subject-association-object format `action_type`.

`drugcentral.bioactivity` data, in subject-association-object format `action_type`.

I found that 2 templateGroup items worked: one for Chem-increases-Gene and one for Chem-decreases-Gene (input ID can be on the Chem QNode or the Gene QNode). I wrote the templateGroups and the templates @andrewsu + I discussed on Monday and did a bit of testing (see next post). These are in this branch https://github.com/biothings/bte_trapi_query_graph_handler/pull/132
EDIT: discussed below + during Wednesday 1/11 meeting. Decisions added to each point
@tokebe
After testing https://github.com/biothings/bte_trapi_query_graph_handler/pull/132 (it's branched off dev, and I tested it with all my other branches as dev), overall it looks close to done. However, I have the following questions / issues:
`PUBCHEM.COMPOUND:5358` is a good example (runs fairly fast, results from multiple templates and merging happens). See its entries in the Chem tables of the next post.

`bte:call-apis:query QEdge eA obtained 59903 records, exceeding maximum of 30000. Skipping remaining 1 (0 planned/1 paged) queries for this edge. Your query may be too general? +0ms`. -> It checks after each individual sub-query. Sometimes the excess is a lot! JC will adjust code to truncate and remove the records over the maximum (he'll decide how to implement).

`increases` GAPDH but there are between 30k-40k records there (still takes too long for ID resolution). [reverted; this records the testing done 1/10]
Chem Name | Chem ID | Final results | Template breakdown and notes |
---|---|---|---|
metformin | PUBCHEM.COMPOUND:4091 | 1000 | > 1000 on first template |
palmitic acid | PUBCHEM.COMPOUND:12358543 | 0 | hmmm...example of no results |
sumatriptan | PUBCHEM.COMPOUND:5358 | 137 | 37, 4, 2, 85, 32 (160 sum). Results dropping or merging? |

Gene Name | Gene ID | Final results | Template breakdown and notes |
---|---|---|---|
MAPK8IP3 | NCBIGene:23162 | 4 | 4 from first template, none from the rest. Compare to decreases entry. |
P450 (CYP27B1?) | NCBIGene:1594 | 832 | 48, 0, 0, 770, 18 (836 sum). Results dropping or merging? |
MEF2C | NCBIGene:4208 | 1000 | 48, 1, 0, 6453 (stop). |
GAPDH | NCBIGene:2597 | presume 1000 | 220, 88, 112, then get stuck on 4th template's "physically_interacts_with" QEdge execution. Example of explosion. |

Chem Name | Chem ID | Final results | Template breakdown and notes |
---|---|---|---|
metformin | PUBCHEM.COMPOUND:4091 | 1000 | > 1000 on first template |
palmitic acid | PUBCHEM.COMPOUND:12358543 | 0 | hmmm...example of no results |
sumatriptan | PUBCHEM.COMPOUND:5358 | 177 | 56, 19, 3, 132, 32 (242 sum). Results dropping or merging? |

Gene Name | Gene ID | Final results | Template breakdown and notes |
---|---|---|---|
MAPK8IP3 | NCBIGene:23162 | 761 | 7, 0, 0, 754, 0 (sum 761, so no dropping/merging) |
P450 (CYP27B1?) | NCBIGene:1594 | 1000 | 31, 0, 0, 6694 (stop) |
MEF2C | NCBIGene:4208 | 1000 | 34, 0, 0, 6923 (stop) |
GAPDH | NCBIGene:2597 | presume 1000 | 290, 120, 24, then get stuck on 4th template's "physically_interacts_with" QEdge execution. Example of explosion. |

5. I'd suggest 500 to match what ARAX returns.
@colleenXu I've fixed the results merging count and logging. As a result, I can confirm that no results are being dropped. Running sumatriptan, I get 433 results added across the templates, and 244 results in the final response. Logging now shows 189 results that were combined. 433 - 189 = 244, everything checks out. I was actually running in circles for a while because I thought I had to account for the number of results merged into, but this is simply a count of results from the 244 that were combined from multiple templates.
I'm not sure what's different between our locals that causes a difference in numbers (perhaps I haven't updated specs recently?), but all results are accounted for, and no code is dropping results.
I've also fixed issue 4 -- a piece of code in call-apis was still using pre-qedge-refactor accessors, breaking the counts.
Additionally, I've pushed changes so records are truncated to the cutoff, and changed the creative limit to 500.
All set to proceed with testing again 👍
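For readers following along, the record truncation can be pictured roughly as below. This is a hedged Python sketch, not the code that was pushed: `collect_records` is a hypothetical name, and the default maximum of 30000 is taken from the log message quoted earlier in the thread. The check still runs after each sub-query, but the overflow is now cut off instead of kept.

```python
from typing import Iterable, List, Tuple

MAX_RECORDS_PER_EDGE = 30000  # value from the log message quoted earlier


def collect_records(
    sub_query_batches: Iterable[List[object]],
    max_records: int = MAX_RECORDS_PER_EDGE,
) -> Tuple[List[object], bool]:
    """Gather records batch by batch; truncate and stop once over the maximum.

    Returns the kept records and a flag saying whether truncation happened.
    """
    records: List[object] = []
    for batch in sub_query_batches:
        records.extend(batch)
        if len(records) > max_records:
            # Drop everything past the cutoff and skip remaining sub-queries.
            del records[max_records:]
            return records, True
    return records, False


kept, truncated = collect_records([[1, 2, 3], [4, 5, 6], [7]], max_records=4)
print(kept, truncated)
```

In this toy run the second batch pushes the total to 6, so the list is cut back to 4 records and the third batch is never fetched.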
@tokebe
Questions I have after my second round of testing:
For the `decreased` query for sumatriptan (PUBCHEM.COMPOUND:5358), I notice that there are 234 final results but one of the templates reports getting 247 results (in the logs). I'm then confused...if we're not truncating results, how can a template get more results than there are in the final count?
{
  "message": {
    "query_graph": {
      "nodes": {
        "gene": {
          "categories": ["biolink:Gene"]
        },
        "chemical": {
          "categories": ["biolink:ChemicalEntity"],
          "ids": ["PUBCHEM.COMPOUND:5358"]
        }
      },
      "edges": {
        "t_edge": {
          "object": "gene",
          "subject": "chemical",
          "predicates": ["biolink:affects"],
          "knowledge_type": "inferred",
          "qualifier_constraints": [
            {
              "qualifier_set": [
                {
                  "qualifier_type_id": "biolink:object_aspect_qualifier",
                  "qualifier_value": "activity_or_abundance"
                },
                {
                  "qualifier_type_id": "biolink:object_direction_qualifier",
                  "qualifier_value": "decreased"
                }
              ]
            }
          ]
        }
      }
    }
  }
}
I circled in orange the numbers mentioned above.
(Points 4-5 from the post have been addressed by the recent changes (yay), and point 2 is still pending / doesn't seem to be an issue)
[EDITED 1/23-1/26 after `is_set: true` added to templates and fixes added to logging. Rearranged list to match original Translator issue curie lists]
I tested more chemicals and genes this time, including all chemicals listed in the Translator posts
@colleenXu Regarding Point 1, I think your confusion comes from assuming that the merge log is supposed to be per-template. It isn't. Records are merged per-template, but the only merge log is a summary at the end of all merged results across all templates. Please pull and run again, and you'll see the log has been updated to show the number of results merged, the number they were merged into, and the actual result count decrease. If you add up the results for each template, and then subtract the actual result count decrease, the math checks out (I spent a considerable amount of time verifying this last round...).
We could log per-template as well, but I don't particularly see the need to do so?
Running GAPDH on my local takes 19.9 minutes, which I agree is a little too much. Some of my optimizations may help with this, but I do think there's a case to be made for further decreasing the max records allowed.
After a meeting with @colleenXu the problem was confirmed. I've investigated the issue and found the reason:
Multiple results from the same template can end up being merged if that template is a multi-hop and the results connect the same subject and object via different intermediate nodes. IIRC, this was an intended behavior to keep results relatively well-organized. Such results would not be merged in non-creative execution (which is another question worth asking somewhere else).
As a side effect of this, merging can show more results merged than what one might expect: instead of the maximum number merged in a step being equal to the smallest of either the current result set or the current template, it is actually the sum of those two.
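A small Python sketch of this behaviour may help when reading the logs. It assumes, for illustration only, that results are keyed by their (subject, object) pair and that each result carries one intermediate path; `merge_template_results` and the tuple layout are hypothetical and not BTE's data model.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (subject, object)


def merge_template_results(
    existing: Dict[Key, List[str]],
    template_results: List[Tuple[str, str, str]],  # (subject, object, path)
) -> Tuple[Dict[Key, List[str]], int, int]:
    """Merge one template's results into the running result set.

    Returns the merged set plus two counts: results merged with other results
    from the same template, and results merged with earlier templates.
    """
    merged = {key: list(paths) for key, paths in existing.items()}
    seen_this_template = set()
    within_template = 0
    with_previous = 0
    for subject, obj, path in template_results:
        key = (subject, obj)
        if key in seen_this_template:
            within_template += 1  # same endpoints, different intermediate node
        elif key in merged:
            with_previous += 1    # endpoints already found by an earlier template
        seen_this_template.add(key)
        merged.setdefault(key, []).append(path)
    return merged, within_template, with_previous


existing = {("chem", "geneA"): ["direct"]}
template = [
    ("chem", "geneA", "via X"),
    ("chem", "geneB", "via Y"),
    ("chem", "geneB", "via Z"),
    ("chem", "geneC", "via W"),
]
merged, within, previous = merge_template_results(existing, template)
# Bookkeeping: final count = existing + added - both kinds of merges.
print(len(merged), len(existing) + len(template) - within - previous)
```

Because both kinds of merge can happen in the same step, the number merged in a step is not bounded by the smaller of the two sets, which is the effect described above.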
@colleenXu I'm working on a fix to change the logging behavior to explicitly point this out when it occurs, and will push that change to this branch (and main) when it's done.
@colleenXu I've pushed multiple creative mode logging fixes and improvements and the math appears to check out now. Please run a couple tests and let me know if the logging seems better.
Feedback:
2023-01-25T03:02:54.891Z INFO: [Template-1]: Execution Summary: (906) nodes / (942) edges / (905) results; (3/36) queries returned results from (2) unique APIs
2023-01-25T03:02:54.891Z INFO: [Template-1]: APIs: BioThings SEMMEDDB API, Text Mining Targeted Association API
2023-01-25T03:02:54.895Z INFO: (0) results from Template-1 were merged with other results from the template. (0) results were merged with existing results from previous templates. Current result count is 905 (+905)
2023-01-25T03:02:54.895Z INFO: Addition of 905 results from Template 1 exceeds creative result maximum of 500 (reaching 905 merged). Response will be truncated to top-scoring 500 results. Skipping remaining 4 templates.
2023-01-25T03:02:54.895Z INFO: Final result count (before truncation): 905
2023-01-25T03:02:54.897Z INFO: Execution Summary: (501) nodes / (537) edges / (500) results; (0/36) queries returned results from (0) unique APIs
2023-01-25T03:02:54.897Z INFO: APIs:
2023-01-25T03:02:54.897Z INFO: Scoring Summary: (273) scored / (227) unscored
Otherwise, new logs look good!
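For context on the log lines above, here is a rough Python sketch of the stop-and-truncate step they describe. It is a simplification under stated assumptions: `run_templates` is a hypothetical name, merging between templates is left out, and placing unscored results after scored ones when picking the top-scoring results is an assumption, not something the logs confirm.

```python
from typing import Callable, List, Optional, Tuple

Result = Tuple[str, Optional[float]]  # (result id, score or None if unscored)


def run_templates(
    templates: List[Callable[[], List[Result]]],
    limit: int = 500,  # creative result maximum mentioned in the thread
) -> Tuple[List[Result], int]:
    """Run templates in order; stop once the limit is exceeded, then truncate.

    Returns the kept results and the number of templates skipped.
    """
    results: List[Result] = []
    skipped = 0
    for index, run in enumerate(templates):
        results.extend(run())
        if len(results) > limit:
            skipped = len(templates) - index - 1
            break
    # Assumption: scored results first (highest score first), unscored last.
    results.sort(key=lambda r: (r[1] is None, -(r[1] or 0.0)))
    return results[:limit], skipped


templates = [
    lambda: [("a", 0.9), ("b", None), ("c", 0.5)],
    lambda: [("d", 1.0)],
]
kept, skipped = run_templates(templates, limit=2)
print(kept, skipped)
```

In the toy run the first template alone exceeds the limit of 2, so the second template is skipped, mirroring the "Skipping remaining 4 templates" message.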
I didn’t miss your update. Regardless of `is_set` behavior, the behavior without `is_set` looked wrong and needed fixing. I confirmed what was wrong with log clarity for those cases and fixed them.
Fix for the API end summary incoming.
Pushed the fix; yet another fun case of a change somehow not making it into a commit while silently remaining on my local, making me think I'm losing my mind lol.
Sorry for the late reply. I reran a bunch of queries and I think things look good! I like the new logs.
Perhaps we're ready to make a request for ITRB CI?
Note that templates were changed, replacing `physically_interacts_with` predicate with `interacts_with` (more general). Allows us to use dgidb for those templates (and mychem after its edits https://github.com/biothings/pending.api/issues/101#issuecomment-1418656362)
https://github.com/biothings/bte_trapi_query_graph_handler/pull/135
Deployed to prod 🚀
Noting here just in case:
Old template ideas that weren't implemented (intended effect is "downregulates"):
Translator consortium target for implementation in Feb 2023. Exact TRAPI query templates will be created soon per Architecture meeting 2022-12-06...
EDIT Implementation dates: