Closed andrewsu closed 1 year ago
Just adding my general plan here for quick reference:
Main issue before start of implementation: I need that template query library, or a good place to start in generating one.
At least initially, we can use this query template library created by @Carolina1396 https://github.com/Carolina1396/drug_mechanisms_rare_diseases_BioThingsExplorer/tree/main/src/query_templates, which was based at least in part on an analysis of the most common metapaths in DrugMechDB shown in https://github.com/SuLab/DMDB_Analysis/blob/main/1_code/1_basic_dmdb_analysis.ipynb.
@andrewsu I'm realizing there are some questions I've failed to ask regarding scope:
Should BTE support inferred-mode edges in multi-hop queries? More than one inferred edge?
I believe this should still be relatively simple in this implementation: we use our template query to replace the inferred edge, keeping the 'outside' topology the same, for as many templated queries as we have (in the case of multiple inferred edges, one graph for each possible combination of template queries as appropriate).
Implementation-wise this isn't very complex -- just a few extra loops to make sure we get all the permutations. Since we're not re-assembling results afterward, just de-duplicating and concatenating them, there's no overt complications with results assembly. I do however see some possible cases where we have a single multi-hop query with multiple inferred edges exploding very quickly.
Another simple clarifying question: When encountering an inferred edge, obviously BTE must create every query that can be generated from templates related to that inferred edge. I assume BTE should also still execute the original edge in addition?
Should BTE support inferred-mode edges in multi-hop queries? More than one inferred edge?
No, we can make the assumption (and enforce the limit) that inferred-mode queries are single-hop queries. If that's not true, then feel free to have BTE fail out with some appropriate error message.
When encountering an inferred edge, obviously BTE must create every query that can be generated from templates related to that inferred edge. I assume BTE should also still execute the original edge in addition?
Yes, I think that makes sense. Feel free to implement that in code, or by adding the appropriate template to the query library. Whichever route makes the most sense to you...
@tokebe @andrewsu
note on the query that is sent:
they may use "biolink:Drug" as the category....and we will likely want to add "ChemicalEntity" to this because we don't keep much under "Drug".
if it's clear how to fix it, go for it. If not, skip it...
@tokebe @andrewsu
I suggest some changes to the templates for this particular issue...
ChemicalEntity
, not SmallMolecule
I've taken all the templates and turned them into 7 TRAPI queries, with some mods....https://github.com/biothings/BioThings_Explorer_TRAPI/issues/453#issuecomment-1147082666
However, these may need further tweaking (predicates, node constraint for the drug regulatory status aka whether it is a drug, EDIT: is_set:True on QNodes...)?
EDIT: node constraints not working: https://github.com/biothings/BioThings_Explorer_TRAPI/issues/174#issuecomment-1147956168
And...it's unclear to me how we'll assemble the separate TRAPI queries / QGraphs and their responses into 1 response / thing....is that the goal?
@colleenXu As for the response assembly: This was discussed in a meeting, the graphs and results get de-duplicated (remove/combine duplicate node/edge references) and then concatenated into one response. Inferred query graphs are dropped, at least until behavior to the contrary is specified.
Regarding the templates, I'll attempt to include the suggestions you made to the best of my ability (where appropriate -- I think there are some things that require more discussion, such as adding predicates). My goal is to make a system where we can easily add/change templates, so anything else can be easily fixed after primary implementation.
Note that we may be discussing more implementation details during this relay. I have a note from current attendance of the "creative mode" session: "grouping answers by drug". I'm not sure exactly what that would look like right now.
EDIT: I'm going to adjust my templates to keep the qNodeID naming given in the "initial query" thingy.
I also noted some question as to predicate/direction -- something along the lines of "does this creative query result specify something that actually helps, or is it the opposite?"
I'm not sure I was entirely clear if that reached an answer. Do we want to be incredibly careful with our predicate selection (both matching a creative qEdge to appropriate templates, and possibly controlling additional predicates in our templates, either statically or dynamically), or for this initial implementation should the focus be more on getting a result graph that is definitely related, whether it strictly fits the creative qEdge's predicate or not entirely?
My previous comment may be safely ignored for now - it has been reasonably answered by discussion in the relay, and my implementation will be trivial to fix if I misunderstood.
I'm cataloguing items which need to be fixed/combined when combining/de-duplicating results, which has led me to certain node attributes, namely:
num_source_node
num_target_node
source_qg_node
target_qg_node
I'm not totally familiar with their purpose -- how should these behave when combining multiple results? Remove source/target qg_nodes which aren't present in the original query_graph and adjust counts accordingly? Should counts be added between responses for overlapping nodes?
Regarding the node attributes listed above, check to see if they have any internal usage. If not, I think they can be removed from the output for now. I believe these were added as some sort of attempt to look at node degree (so that paths using very common nodes, or "hub nodes", can eventually be down-weighted in scoring). But, I don't think we actually got around to implementing anything using those attributes.
My current implementation searches for folders matching the subjectCategory-predicate-objectCategory
of the original query (making multiple combinations if there are multiple categories/predicates), and assumes that any query templates in matching folders will be appropriate for the original query. I'm of course planning on adding functionality to ensure it's possible to search for reverse cases so we don't have to rewrite templates backwards.
This has, however, led to another question: should BTE attempt to expand the search to possible additional appropriate template groups based on the biolink hierarchy?
For instance, if the user queries something to the effect of ChemicalEntity-treats-DiseaseOrPhenotypicFeature
, should BTE also try to find templates for ChemicalEntity-treats-Disease
and ChemicalEntity-treats-PhenotypicFeature
? Similarly, should the reverse case be checked -- expanding Disease
to DiseaseOrPhenotypicFeature
?
I suppose expanding the query classes to more specific subclasses would be useful, i.e., ChemicalEntity-treats-DiseaseOrPhenotypicFeature
-> ChemicalEntity-treats-Disease
and ChemicalEntity-treats-PhenotypicFeature
. I think we shouldn't do the reverse.
But since we only have one creative mode template right now, I think it's also fine to punt on this class-traversal functionality until later. Up to you to decide based on the effort involved in just tackling it now...
@andrewsu @colleenXu As a note for the creation of query templates, I'm setting up the handling such that a template is expected to have one node named subject
and one named object
-- for instance, subject
would be a ChemicalEntity, while object
would be a DiseaseOrPhenotypicFeature, with the rest of the template adding the inferred nodes/edges (which don't need to conform to any naming convention) between these two. Templates are formatted just as normal TRAPI queries, for example:
{
"message": {
"query_graph": {
"nodes": {
"subject": {
"categories": ["biolink:ChemicalEntity"]
},
"n1": {
"categories": ["biolink:Gene"]
},
"object": {
"categories": ["biolink:DiseaseOrPhenotypicFeature"]
}
},
"edges": {
"e01": {
"subject": "subject",
"object": "n1",
"predicates": ["biolink:affects"]
},
"e03": {
"subject": "object",
"object": "n1",
"predicates": ["biolink:caused_by"]
}
}
}
}
}
@tokebe I was using "disease" and "drug" as the node names in my query templates. I imagine that the final node names/qNodeIDs depend on what the UI needs, but I don't know what they need (....asking them as a next step).
@tokebe I've read up on this issue, and:
subject
/object
naming convention is specifically limited to the query templates. Whatever node names the user uses are what are returned; subject/object is just used so my code can know specifically to which template nodes it should merge the user query nodes. I could have used "Drug" and "Disease", however subject/object felt like it would be much clearer in-code, and would be distinct enough in-template for cases where multiple disease/drug nodes might exist in a pathway.Also, with reversing I think you've misunderstood what I mean? I have to handle reverse cases by nature of how my code searches for templates, something like:
"Did the user ask for
Disease X -treated_by-> Drug
instead ofDrug -treats-> Disease X
? Ok, check if this direction or the other matches one of the groups of templates we have."
The reason for this is that template groups are searched for by folder name, where the actual name of the folder is something like Drug_treats_Disease
. Actual reversing for template execution/etc. is handled by query execution and is not what I'm concerned about here.
From discussion (first Andrew-I, then Jackson-I)
Some of my notes related to this discussion:
For clarity in discussion of templates:
Regarding different qNodeIDs/qEdgeIDs for responses from different templates:
(Moving from Slack discussion to here for ease-of-record)
It's been proposed that a result cap should be imposed to prevent oversized responses. What should this cap be, and are there any special considerations regarding its behavior/implementation?
I imagine the simplest form is something like "If the cumulative results reaches/surpasses
Tagging @colleenXu @andrewsu for discussion.
I imagine the simplest form is something like "If the cumulative results reaches/surpasses , do not proceed with additional template sub-queries; Assemble final results and return."
I agree that this is the simplest form of adding a result cap, and I think it's where we should start...
Further testing: with @colleenXu's template revision and now template ordering, it's possible now to stop at a reasonable number of results a fair amount of the time. However, it's still possible at times to get very large responses for one template, which causes BTE to stop after that one, but still maintain possibly very large numbers of results. It might be worth figuring out how to pass a result cap in sub-queries, or even to implement sub-query time-outs.
Additionally, I'm still concerned that we'll end up stopping after just the first template, which is going to be the basic one-hop, so I think further discussion is still warranted about the presence (or perhaps just the ordering) of the basic one-hop?
Additional behavioral question: currently, the user-query subject and object categories and IDs are merged with those of any matching templates, meaning if the user-query specifies ChemicalEntity
, then sub-queries specifying SmallMolecule
will actually query for both.
Given current template revision efforts, it seems like category merging may not be the best idea, so I've removed it in the latest commit. Should I re-enable this?
Replying to @tokebe's first post https://github.com/biothings/BioThings_Explorer_TRAPI/issues/449#issuecomment-1164660364, first paragraph on "too many results / takes too long":
Replying to @tokebe's first post https://github.com/biothings/BioThings_Explorer_TRAPI/issues/449#issuecomment-1164660364, second paragraph on "is it okay to do only the 1-hop and finish":
I'm a little confused by @tokebe 's second comment https://github.com/biothings/BioThings_Explorer_TRAPI/issues/449#issuecomment-1165737767:
I've introduced undue confusion here by used the term "sub-query" instead of something like "template-query" (I should change sub-query to template-query in the new code's logging as well...). Each template is handled as its own independent query, just bypassing our endpoints and with some special handling. What I was proposing was either passing a result limit to each template's execution, or limiting each template's execution time.
That said, there are definitely some complications with either of those solutions. Given your recent extended testing in Slack, I think it might be fine to leave the behavior as-is, and instead focus on just how many results is an acceptable amount before stopping?
The impression I've gotten is that "creative" is more important than "fast", however this is partially assumption on my part based on the nature of expecting BTE to sort of "expand" the initial query (as well as what it looked like other teams were doing seeming to lean more toward "creative".
Additionally, I thought @andrewsu had commented more along the lines of concern that the straightforward one-hop may not be "creative enough" for Translator's taste. In the end, I'm just implementing, but I'd like to double-check with @andrewsu with regards to this.
I'm not sure what's confusing (except for my incorrect use of sub-query instead of template-query), especially as your response perfectly addresses why I shouldn't re-enable category merging between the user query and our templates. I'll leave it disabled and remove the commented-out code as it won't be used.
Regarding how many results to limit creative mode to, I propose a two-number method.
The first number, X, dictates when to stop. If the queries added together have X or more results, don't execute any more templates.
The second number, Y, can be a sort of hard-coded buffer. If the currently-executed results are less than X, we only add the newest query's results if that number plus X is less than X + Y. (Otherwise, we might reasonably assume that all queries will exceed X + Y, and stop executing templates)
This should prevent cases where, say, X = 1000, and we reach 980 results, and then the next query has another 1000 results, meaning our response would have some number less than or equal to 1980, way over our intended limit.
I would say that Y could be somewhere between 100 and 500. I don't have a solid idea of what X should be, perhaps something between 1000-2000?
@tokebe
vocab:
Regarding limits, I've been assuming:
creativity:
"merging UI incoming query with templates":
@tokebe also adding a note on what another team is doing to "recognize creative-mode and its qNodes" (Translator Slack) https://ncatstranslator.slack.com/archives/C013Q5TVC87/p1656113934859089?thread_ts=1656022814.576499&cid=C013Q5TVC87
This brings up an interesting point -- currently, I have no special behavior for multiple IDs. If the user/ui supplies multiple IDs, they're all merged in an used in template-queries. Should we require that the user/ui query has only 1 ID?
Slight caveat to the X + Y method -- I can only have that take affect after the first template-query, or else it could return just a little over the proposed 1000 + 200 limit and then nothing gets returned. So there's still no guarantee we're returning 1200 or fewer results in the event the first template gets more, just a decent likelihood...
Noting some old UI discussion of creative-mode in internal Translator Slack: https://ncatstranslator.slack.com/archives/C02PG1W7HD0/p1651610305586929
recording the "Chris Bizon" provided incoming/creative-mode query:
{
"message": {
"query_graph": {
"nodes": {
"disease": {
"ids": ["MONDO:0005147"]
},
"chemical": {
"categories": ["biolink:ChemicalEntity"]
}
},
"edges": {
"t_edge": {
"object": "disease",
"subject": "chemical",
"predicates": ["biolink:treats"],
"knowledge_type": "inferred"
}
}
}
}
}
Also looks like we decided not to run the literal edge. We previously chose to run the literal edge, but I couldn't find a documented reason why we decided not to.
I think this is fine. We basically run the "literal edge" with the first template.
And we decided not to do a time-limit on each template-run at the moment (only results capping)
Recording decisions on what the creative-mode incoming query looks like (from UI / ARS):
1 Disease ID, 1 "treats" predicate
What another team was checking:
From the 8/10 lab meeting on Translator stuff (added feature):
(Jackson's plate of work) "should be technically feasible to "top off" results and trim resulting KG"
Also edit the TRAPI logs to make the cap clearer: "the cap of 1500 is reached because the template just run has 3000 results"
MVP is complete
Up to this point, Translator has focused on explicitly-declared query topologies, expressed as a subgraph of nodes and edges. Translator now wants ARAs to accept "inferred edges" in the query, which essentially give the ARA the freedom to return answers that are supported by any query topology. This design has been colloquially referred to as "creative mode".
By the end of June, Translator would like ARAs to be able to respond to one type of "inferred edge" query focusing on the question "what drugs may treat disease X". "Inferred mode" is specified by adding a
"knowledge_type": "inferred"
on the edge of a one-hop predict query (as proposed in this PR).For the output, paths between the disease and the drug should have a maximum of three edges. There is currently no cap on the overall number of nodes or edges in a result.
This doc contains a list of example diseases to be queried: https://docs.google.com/document/d/1cuYrWHv6MzT3w7lZqpHQVc7WbMY5WuLiLOh9qI3TVRo/edit
In terms of our BTE implementation of inferred mode, I think we should build on the idea of using a library of common query topologies / metapaths extracted from DrugMechDB. This would be a relatively straightforward implementation of this feature, and it would build on work that we've already done for https://github.com/biothings/BioThings_Explorer_TRAPI/issues/181.