biothings / biothings_explorer

TRAPI service for BioThings Explorer
https://explorer.biothings.io
Apache License 2.0
9 stars 10 forks source link

implement "creative/inferred mode" for "what drugs may treat disease X" query #449

Closed andrewsu closed 1 year ago

andrewsu commented 2 years ago

Up to this point, Translator has focused on explicitly-declared query topologies, expressed as a subgraph of nodes and edges. Translator now wants ARAs to accept "inferred edges" in the query, which essentially give the ARA the freedom to return answers that are supported by any query topology. This design has been colloquially referred to as "creative mode".

By the end of June, Translator would like ARAs to be able to respond to one type of "inferred edge" query focusing on the question "what drugs may treat disease X". "Inferred mode" is specified by adding a "knowledge_type": "inferred" on the edge of a one-hop predict query (as proposed in this PR).

For the output, paths between the disease and the drug should have a maximum of three edges. There is currently no cap on the overall number of nodes or edges in a result.

This doc contains a list of example diseases to be queried: https://docs.google.com/document/d/1cuYrWHv6MzT3w7lZqpHQVc7WbMY5WuLiLOh9qI3TVRo/edit

In terms of our BTE implementation of inferred mode, I think we should build on the idea of using a library of common query topologies / metapaths extracted from DrugMechDB. This would be a relatively straightforward implementation of this feature, and it would build on work that we've already done for https://github.com/biothings/BioThings_Explorer_TRAPI/issues/181.

tokebe commented 2 years ago

Just adding my general plan here for quick reference:

Main issue before start of implementation: I need that template query library, or a good place to start in generating one.

andrewsu commented 2 years ago

At least initially, we can use this query template library created by @Carolina1396 https://github.com/Carolina1396/drug_mechanisms_rare_diseases_BioThingsExplorer/tree/main/src/query_templates, which was based at least in part on an analysis of the most common metapaths in DrugMechDB shown in https://github.com/SuLab/DMDB_Analysis/blob/main/1_code/1_basic_dmdb_analysis.ipynb.

tokebe commented 2 years ago

@andrewsu I'm realizing there are some questions I've failed to ask regarding scope:

Should BTE support inferred-mode edges in multi-hop queries? More than one inferred edge?

I believe this should still be relatively simple in this implementation: we use our template query to replace the inferred edge, keeping the 'outside' topology the same, for as many templated queries as we have (in the case of multiple inferred edges, one graph for each possible combination of template queries as appropriate).

Implementation-wise this isn't very complex -- just a few extra loops to make sure we get all the permutations. Since we're not re-assembling results afterward, just de-duplicating and concatenating them, there's no overt complications with results assembly. I do however see some possible cases where we have a single multi-hop query with multiple inferred edges exploding very quickly.

Another simple clarifying question: When encountering an inferred edge, obviously BTE must create every query that can be generated from templates related to that inferred edge. I assume BTE should also still execute the original edge in addition?

andrewsu commented 2 years ago

Should BTE support inferred-mode edges in multi-hop queries? More than one inferred edge?

No, we can make the assumption (and enforce the limit) that inferred-mode queries are single-hop queries. If that's not true, then feel free to have BTE fail out with some appropriate error message.

When encountering an inferred edge, obviously BTE must create every query that can be generated from templates related to that inferred edge. I assume BTE should also still execute the original edge in addition?

Yes, I think that makes sense. Feel free to implement that in code, or by adding the appropriate template to the query library. Whichever route makes the most sense to you...

colleenXu commented 2 years ago

@tokebe @andrewsu

note on the query that is sent:

they may use "biolink:Drug" as the category....and we will likely want to add "ChemicalEntity" to this because we don't keep much under "Drug".

andrewsu commented 2 years ago

if it's clear how to fix it, go for it. If not, skip it...

colleenXu commented 2 years ago

@tokebe @andrewsu

I suggest some changes to the templates for this particular issue...

colleenXu commented 2 years ago

I've taken all the templates and turned them into 7 TRAPI queries, with some mods....https://github.com/biothings/BioThings_Explorer_TRAPI/issues/453#issuecomment-1147082666

However, these may need further tweaking (predicates, node constraint for the drug regulatory status aka whether it is a drug, EDIT: is_set:True on QNodes...)?

EDIT: node constraints not working: https://github.com/biothings/BioThings_Explorer_TRAPI/issues/174#issuecomment-1147956168

colleenXu commented 2 years ago

And...it's unclear to me how we'll assemble the separate TRAPI queries / QGraphs and their responses into 1 response / thing....is that the goal?

tokebe commented 2 years ago

@colleenXu As for the response assembly: This was discussed in a meeting, the graphs and results get de-duplicated (remove/combine duplicate node/edge references) and then concatenated into one response. Inferred query graphs are dropped, at least until behavior to the contrary is specified.

Regarding the templates, I'll attempt to include the suggestions you made to the best of my ability (where appropriate -- I think there are some things that require more discussion, such as adding predicates). My goal is to make a system where we can easily add/change templates, so anything else can be easily fixed after primary implementation.

colleenXu commented 2 years ago

Note that we may be discussing more implementation details during this relay. I have a note from current attendance of the "creative mode" session: "grouping answers by drug". I'm not sure exactly what that would look like right now.


EDIT: I'm going to adjust my templates to keep the qNodeID naming given in the "initial query" thingy.

tokebe commented 2 years ago

I also noted some question as to predicate/direction -- something along the lines of "does this creative query result specify something that actually helps, or is it the opposite?"

I'm not sure I was entirely clear if that reached an answer. Do we want to be incredibly careful with our predicate selection (both matching a creative qEdge to appropriate templates, and possibly controlling additional predicates in our templates, either statically or dynamically), or for this initial implementation should the focus be more on getting a result graph that is definitely related, whether it strictly fits the creative qEdge's predicate or not entirely?

tokebe commented 2 years ago

My previous comment may be safely ignored for now - it has been reasonably answered by discussion in the relay, and my implementation will be trivial to fix if I misunderstood.

I'm cataloguing items which need to be fixed/combined when combining/de-duplicating results, which has led me to certain node attributes, namely:

I'm not totally familiar with their purpose -- how should these behave when combining multiple results? Remove source/target qg_nodes which aren't present in the original query_graph and adjust counts accordingly? Should counts be added between responses for overlapping nodes?

andrewsu commented 2 years ago

Regarding the node attributes listed above, check to see if they have any internal usage. If not, I think they can be removed from the output for now. I believe these were added as some sort of attempt to look at node degree (so that paths using very common nodes, or "hub nodes", can eventually be down-weighted in scoring). But, I don't think we actually got around to implementing anything using those attributes.

tokebe commented 2 years ago

My current implementation searches for folders matching the subjectCategory-predicate-objectCategory of the original query (making multiple combinations if there are multiple categories/predicates), and assumes that any query templates in matching folders will be appropriate for the original query. I'm of course planning on adding functionality to ensure it's possible to search for reverse cases so we don't have to rewrite templates backwards.

This has, however, led to another question: should BTE attempt to expand the search to possible additional appropriate template groups based on the biolink hierarchy?

For instance, if the user queries something to the effect of ChemicalEntity-treats-DiseaseOrPhenotypicFeature, should BTE also try to find templates for ChemicalEntity-treats-Disease and ChemicalEntity-treats-PhenotypicFeature? Similarly, should the reverse case be checked -- expanding Disease to DiseaseOrPhenotypicFeature?

andrewsu commented 2 years ago

I suppose expanding the query classes to more specific subclasses would be useful, i.e., ChemicalEntity-treats-DiseaseOrPhenotypicFeature -> ChemicalEntity-treats-Disease and ChemicalEntity-treats-PhenotypicFeature. I think we shouldn't do the reverse.

But since we only have one creative mode template right now, I think it's also fine to punt on this class-traversal functionality until later. Up to you to decide based on the effort involved in just tackling it now...

tokebe commented 2 years ago

@andrewsu @colleenXu As a note for the creation of query templates, I'm setting up the handling such that a template is expected to have one node named subject and one named object -- for instance, subject would be a ChemicalEntity, while object would be a DiseaseOrPhenotypicFeature, with the rest of the template adding the inferred nodes/edges (which don't need to conform to any naming convention) between these two. Templates are formatted just as normal TRAPI queries, for example:

{
  "message": {
    "query_graph": {
      "nodes": {
        "subject": {
          "categories": ["biolink:ChemicalEntity"]
        },
        "n1": {
          "categories": ["biolink:Gene"]
        },
        "object": {
          "categories": ["biolink:DiseaseOrPhenotypicFeature"]
        }
      },
      "edges": {
        "e01": {
          "subject": "subject",
          "object": "n1",
          "predicates": ["biolink:affects"]
        },
        "e03": {
          "subject": "object",
          "object": "n1",
          "predicates": ["biolink:caused_by"]
        }
      }
    }
  }
}
colleenXu commented 2 years ago

@tokebe I was using "disease" and "drug" as the node names in my query templates. I imagine that the final node names/qNodeIDs depend on what the UI needs, but I don't know what they need (....asking them as a next step).

colleenXu commented 2 years ago

@tokebe I've read up on this issue, and:

tokebe commented 2 years ago

Also, with reversing I think you've misunderstood what I mean? I have to handle reverse cases by nature of how my code searches for templates, something like:

"Did the user ask for Disease X -treated_by-> Drug instead of Drug -treats-> Disease X? Ok, check if this direction or the other matches one of the groups of templates we have."

The reason for this is that template groups are searched for by folder name, where the actual name of the folder is something like Drug_treats_Disease. Actual reversing for template execution/etc. is handled by query execution and is not what I'm concerned about here.

colleenXu commented 2 years ago

From discussion (first Andrew-I, then Jackson-I)

Big picture

Template handling details

tokebe commented 2 years ago

Some of my notes related to this discussion:

For clarity in discussion of templates:

Regarding different qNodeIDs/qEdgeIDs for responses from different templates:

tokebe commented 2 years ago

(Moving from Slack discussion to here for ease-of-record)

It's been proposed that a result cap should be imposed to prevent oversized responses. What should this cap be, and are there any special considerations regarding its behavior/implementation?

I imagine the simplest form is something like "If the cumulative results reaches/surpasses , do not proceed with additional template sub-queries; Assemble final results and return."

Tagging @colleenXu @andrewsu for discussion.

andrewsu commented 2 years ago

I imagine the simplest form is something like "If the cumulative results reaches/surpasses , do not proceed with additional template sub-queries; Assemble final results and return."

I agree that this is the simplest form of adding a result cap, and I think it's where we should start...

tokebe commented 2 years ago

Further testing: with @colleenXu's template revision and now template ordering, it's possible now to stop at a reasonable number of results a fair amount of the time. However, it's still possible at times to get very large responses for one template, which causes BTE to stop after that one, but still maintain possibly very large numbers of results. It might be worth figuring out how to pass a result cap in sub-queries, or even to implement sub-query time-outs.

Additionally, I'm still concerned that we'll end up stopping after just the first template, which is going to be the basic one-hop, so I think further discussion is still warranted about the presence (or perhaps just the ordering) of the basic one-hop?

tokebe commented 2 years ago

Additional behavioral question: currently, the user-query subject and object categories and IDs are merged with those of any matching templates, meaning if the user-query specifies ChemicalEntity, then sub-queries specifying SmallMolecule will actually query for both.

Given current template revision efforts, it seems like category merging may not be the best idea, so I've removed it in the latest commit. Should I re-enable this?

colleenXu commented 2 years ago

Replying to @tokebe's first post https://github.com/biothings/BioThings_Explorer_TRAPI/issues/449#issuecomment-1164660364, first paragraph on "too many results / takes too long":

colleenXu commented 2 years ago

Replying to @tokebe's first post https://github.com/biothings/BioThings_Explorer_TRAPI/issues/449#issuecomment-1164660364, second paragraph on "is it okay to do only the 1-hop and finish":

colleenXu commented 2 years ago

I'm a little confused by @tokebe 's second comment https://github.com/biothings/BioThings_Explorer_TRAPI/issues/449#issuecomment-1165737767:

tokebe commented 2 years ago

In response to @colleenXu's first response:

I've introduced undue confusion here by used the term "sub-query" instead of something like "template-query" (I should change sub-query to template-query in the new code's logging as well...). Each template is handled as its own independent query, just bypassing our endpoints and with some special handling. What I was proposing was either passing a result limit to each template's execution, or limiting each template's execution time.

That said, there are definitely some complications with either of those solutions. Given your recent extended testing in Slack, I think it might be fine to leave the behavior as-is, and instead focus on just how many results is an acceptable amount before stopping?

In response to the second:

The impression I've gotten is that "creative" is more important than "fast", however this is partially assumption on my part based on the nature of expecting BTE to sort of "expand" the initial query (as well as what it looked like other teams were doing seeming to lean more toward "creative".

Additionally, I thought @andrewsu had commented more along the lines of concern that the straightforward one-hop may not be "creative enough" for Translator's taste. In the end, I'm just implementing, but I'd like to double-check with @andrewsu with regards to this.

In response to the third:

I'm not sure what's confusing (except for my incorrect use of sub-query instead of template-query), especially as your response perfectly addresses why I shouldn't re-enable category merging between the user query and our templates. I'll leave it disabled and remove the commented-out code as it won't be used.

tokebe commented 2 years ago

Regarding how many results to limit creative mode to, I propose a two-number method.

The first number, X, dictates when to stop. If the queries added together have X or more results, don't execute any more templates.

The second number, Y, can be a sort of hard-coded buffer. If the currently-executed results are less than X, we only add the newest query's results if that number plus X is less than X + Y. (Otherwise, we might reasonably assume that all queries will exceed X + Y, and stop executing templates)

This should prevent cases where, say, X = 1000, and we reach 980 results, and then the next query has another 1000 results, meaning our response would have some number less than or equal to 1980, way over our intended limit.

I would say that Y could be somewhere between 100 and 500. I don't have a solid idea of what X should be, perhaps something between 1000-2000?

colleenXu commented 2 years ago

@tokebe

vocab:

Regarding limits, I've been assuming:

creativity:

"merging UI incoming query with templates":

colleenXu commented 2 years ago

@tokebe also adding a note on what another team is doing to "recognize creative-mode and its qNodes" (Translator Slack) https://ncatstranslator.slack.com/archives/C013Q5TVC87/p1656113934859089?thread_ts=1656022814.576499&cid=C013Q5TVC87

tokebe commented 2 years ago

This brings up an interesting point -- currently, I have no special behavior for multiple IDs. If the user/ui supplies multiple IDs, they're all merged in an used in template-queries. Should we require that the user/ui query has only 1 ID?

tokebe commented 2 years ago

Slight caveat to the X + Y method -- I can only have that take affect after the first template-query, or else it could return just a little over the proposed 1000 + 200 limit and then nothing gets returned. So there's still no guarantee we're returning 1200 or fewer results in the event the first template gets more, just a decent likelihood...

colleenXu commented 2 years ago

Noting some old UI discussion of creative-mode in internal Translator Slack: https://ncatstranslator.slack.com/archives/C02PG1W7HD0/p1651610305586929

colleenXu commented 2 years ago

recording the "Chris Bizon" provided incoming/creative-mode query:

{
    "message": {
        "query_graph": {
            "nodes": {
                "disease": {
                    "ids": ["MONDO:0005147"]
                },
                "chemical": {
                    "categories": ["biolink:ChemicalEntity"]
                }
            },
            "edges": {
                "t_edge": {
                    "object": "disease",
                    "subject": "chemical",
                    "predicates": ["biolink:treats"],
                    "knowledge_type": "inferred"
                }
            }
        }
    }
}
colleenXu commented 2 years ago

Also looks like we decided not to run the literal edge. We previously chose to run the literal edge, but I couldn't find a documented reason why we decided not to.

I think this is fine. We basically run the "literal edge" with the first template.


And we decided not to do a time-limit on each template-run at the moment (only results capping)

colleenXu commented 2 years ago

Recording decisions on what the creative-mode incoming query looks like (from UI / ARS):

1 Disease ID, 1 "treats" predicate

What another team was checking:

colleenXu commented 2 years ago

From the 8/10 lab meeting on Translator stuff (added feature):

(Jackson's plate of work) "should be technically feasible to "top off" results and trim resulting KG"

Also edit the TRAPI logs to make the cap clearer: "the cap of 1500 is reached because the template just run has 3000 results"

andrewsu commented 1 year ago

MVP is complete