RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License
33 stars 21 forks source link

PathFinder TRAPI request to ARAXi request conversion. #2369

Closed mohsenht closed 1 month ago

edeutsch commented 2 months ago

Is there a Translator-wide agreement on what the TRAPI request for Pathfinder should look like?

mohsenht commented 2 months ago

Yes, you can find them here (https://docs.google.com/spreadsheets/d/1fxEUUOFP3X37KbLRrf7cSxkag1-kFaJN/edit?gid=1669218812#gid=1669218812)

edeutsch commented 2 months ago

Great, thanks. I'll think about how we could hook that up.

One concern/observation. I was recently manually doing some connect-the-dots queries, and will write my observation here:

I was to connect indomethacin and DHFR If I do a one-hop direct query, I get one result with a direct link from CTD https://arax.ci.transltr.io/?r=292446

If I do a two-hop query, I get 4 results including cataract, angiodema, CRC, and DILI: https://arax.ci.transltr.io/?r=292449 I'm uncertain if these are great answers, but come from translator results

Then in if I do the new Pathfinder query using ARAXi, I get 97 results: https://arax.ci.transltr.io/?r=292582

One concern about these is that as far as I can tell none of the 97 include what is found via direct one hop and two hop queries.

There's also no ranking yet.

mohsenht commented 2 months ago

Is it related to this issue?

mohsenht commented 2 months ago

Pathfinder is not completed yet, it is just the first version of it.

edeutsch commented 2 months ago

Maybe not directly related, but it is partially related in that I am trying to understand the context of the whole issue and what is currently working and not working.

My understanding is the the current Translator-agreed query_graph for this kind of Pathfinder query is this:

{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": [
            "CHEBI:49662"
          ]
        },
        "un": {
          "categories": [
            "biolink:NamedThing"
          ]
        },
        "n2": {
          "ids": [
            "NCBIGene:1719"
          ]
        }
      },
      "edges": {
        "e0": {
          "subject": "n0",
          "object": "un",
          "predicates": [
            "biolink:related_to"
          ],
          "knowledge_type": "inferred"
        },
        "e1": {
          "subject": "un",
          "object": "n2",
          "predicates": [
            "biolink:related_to"
          ],
          "knowledge_type": "inferred"
        },
        "e2": {
          "subject": "n0",
          "object": "n2",
          "predicates": [
            "biolink:related_to"
          ],
          "knowledge_type": "inferred"
        }
      }
    }
  }
}

As written, this returns zero results. I'm a bit perplexed why it returns zero results because there are both 1 hop and 2 hop results.

But in any case, what we want is to to short-circuit this format of query into custom Patherfinder ARAXi. OR potentially do 3 queries:

edeutsch commented 2 months ago

Documenting the current state of musings...

  1. The incoming query_graph will consist of two pinned nodes with a related_to edge between them with KT=inferred PLUS another unpinned node with related_to inferred edges between it and the two pinned nodes. That seems decided by the consortium.
  2. The query_graph_interpreter should be updated to recognize this pattern and inject option_group="directLookup" to the first edge and then option_group="Pathfinder" to the second two edges.
  3. The query will then proceed with the default ARAXi and Expand as it normally would, expanding all edges in the conventional manner.
  4. Expand and Resultify will be modified so that they no longer require one non-optional edge path. All edge paths may be optional. Each Result must still satisfy all non-optional edges if any, but if all edges are optional, then each Result must satisfy all edges of at least one option_group. Tricky.
  5. Expand will be modified so that if it encounters an option_group=Pathfinder, it will ALSO send all edges in that group and attached nodes to the special Pathfinder algorithm
  6. The Pathfinder algorithm must express its answer in terms of the 3 nodes and 2 edges given to it. Can that be done? What if there are 3 hops? Does the QG need an optional self edge on the unpinned node??

Not done musing yet...

amykglen commented 2 months ago

@mohsenht - here's some example code for calling Expand - in this example I'm assuming we want to do pathfinder between CHEBI:49662 and NCBIGene:1719:

    from ARAX_messenger import ARAXMessenger
    from ARAX_expander import ARAXExpander
    from ARAX_response import ARAXResponse
    sys.path.append(os.path.dirname(os.path.abspath(__file__))+"/../../UI/OpenAPI/python-flask-server/")
    from openapi_server.models.q_edge import QEdge
    from openapi_server.models.q_node import QNode

    # First create a message with the one-hop query graph
    response_1 = ARAXResponse()
    ARAXMessenger().create_envelope(response_1)  # Creates an empty envelope
    message_1 = response_1.envelope.message
    message_1.query_graph.nodes = {"n0": QNode(ids=["CHEBI:49662"]),
                                   "n1": QNode(ids=["NCBIGene:1719"])}
    message_1.query_graph.edges = {"e0": QEdge(subject="n0",
                                               object="n1",
                                               predicates=["biolink:related_to"])}
    # Then use Expand to answer the one-hop query
    expander = ARAXExpander()
    expander.apply(response_1, input_parameters={"kp_timeout": 10})
    # You can then find Expand's answers in the knowledge graph
    print(response_1.envelope.message.knowledge_graph)

    # Then we do the same thing but for a two-hop query
    response_2 = ARAXResponse()
    ARAXMessenger().create_envelope(response_2)
    message_2 = response_2.envelope.message
    message_2.query_graph.nodes = {"n0": QNode(ids=["CHEBI:49662"]),
                                   "nx": QNode(categories=["biolink:NamedThing"]),
                                   "n1": QNode(ids=["NCBIGene:1719"])}
    message_2.query_graph.edges = {"e0": QEdge(subject="n0",
                                               object="nx",
                                               predicates=["biolink:related_to"]),
                                   "e1": QEdge(subject="nx",
                                               object="n1",
                                               predicates=["biolink:related_to"])}
    # Then use Expand to answer the two-hop query
    expander.apply(response_2, input_parameters={"kp_timeout": 30})
    # You can then find Expand's answers in the knowledge graph
    print(response_2.envelope.message.knowledge_graph)

    # Note that the qnode that nodes fulfill is captured in the `qnode_keys` property - for example:
    node = response_2.envelope.message.knowledge_graph.nodes['CHEBI:49662']
    print(f"Node CHEBI:49662 fulfills qnode(s): {node.qnode_keys}")
    # And same with qedge_keys - example:
    edge = next(edge for edge in response_2.envelope.message.knowledge_graph.edges.values())
    print(f"Example edge fulfills qedge(s): {edge.qedge_keys}")

note that you can refine the kp_timeout parameter to whatever works best - this is basically the number of seconds that Expand will wait to get a response back from KPs' APIs (for each qedge). if you don't specify a timeout, it will default to two minutes, which I imagine is far too long.

let me know if you have any questions!

mohsenht commented 2 months ago

Hi @edeutsch,

Is it possible to response with two different query graph in ARAX? One for direct one hop result and the other for the triangle shape results?

edeutsch commented 2 months ago

Hi @mohsenht I may not quite understand the question, but there can only be one query graph coming into ARAX, and the same single query_graph should be repeated in the Response. We used to adjust the query_graph years ago but I think it was decided Translator-wide to stop this behavior.

mohsenht commented 2 months ago

@edeutsch My question is: if we have only one query graph, is it possible for one result to not follow the structure of the query graph (in this case, directly connecting two nodes instead of forming a triangle shape)?

mohsenht commented 2 months ago

@dkoslicki In the All Hands Meeting, we agreed to include one-hop and two-hop results from the expander into the PathFinder results. In this process, we'll first use the expander to generate the one-hop and two-hop results, then feed those nodes into our PathFinder algorithm to further expand and improve the results.

edeutsch commented 2 months ago

@edeutsch My question is: if we have only one query graph, is it possible for one result to not follow the structure of the query graph (in this case, directly connecting two nodes instead of forming a triangle shape)?

I think we should try it. As far as I'm aware it does not violate the schema. But it does violate some policies that we have set forth in the past before we got here. As mentioned in my musings post above, we do have an option_group mechanism that we could try to use.

so I suggest: let's try it. There's a chance that we end up doing something different.

mohsenht commented 1 month ago

Here is a sample DSL query:

create_message add_qnode(ids=PUBCHEM.COMPOUND:445154, key=n00) add_qnode(ids=NCBIGene:2739, key=n01) connect(action=connect_nodes, max_path_length=4) return(message=true, store=false)

mohsenht commented 1 month ago

Here is a sample DSL query for PathFinder with Node Constraint:

create_message add_qnode(ids=PUBCHEM.COMPOUND:445154, key=n00) add_qnode(ids=NCBIGene:2739, key=n01) connect(action=connect_nodes, max_path_length=4, node_category_constraint=biolink:NamedThing) return(message=true, store=false)

edeutsch commented 1 month ago

Hi everyone, I have committed some adjustments to the QueryGraphInterpreter that I hope will do what we want. I opted for a very light touch at first to see how it goes. Please request more adjustments if you want them.

Here's what I did:

I'm not certain this is really what we want, but seemed good. Notably, it is not restricted to the "triangle QG", but rather allows substantially more flexibility in recognizing a potential pathfinder query. I am hoping that this allows experimenting with some of the fancier pathfinder scenarios that were mentioned, beyond a single triangle.

I'm hoping that this will not trigger on any existing non-pathfinder queries. Clearly if the querier really uses knowledge_type="pathfinder", then they really mean it. And I don't think any normal production queries use two or more edges with knowledge_type="inferred", so that catches the "triangle QG" but also allows some additional experimentation.

Notable, also: perhaps against what we may have decided, it does NOT change the QG at all. It does not change the one-hop from "inferred" to "lookup" like we discussed. I decided that connect() can do that as easily as QGI can, so I'll leave that to connect(), giving it a bit more flexibility? Happy to change if you disagree.

Finally, the entire ARAXi program that is generated is:

connect(action=connect_nodes, max_path_length=4)

The expectation is that this runs on the existing query graph, whatever it is. Note that it may be more complicated than the default "triangle QG". But I assume that is a desirable feature, although may require some more effort.

This is merged into master but only minimally tested.

What do you think?

edeutsch commented 1 month ago

So I tried a full test in devED: https://arax.ncats.io/devED/ pasting in the above query graph, and I see this:

...
2024-10-09T21:51:41.549822 INFO: QueryGraphInterpreter recognized query_graph as a 'pathfinder' query: triggering pathfinder subsystem.
...
2024-10-09T21:51:41.579091 INFO: Parsing input actions list
2024-10-09T21:51:41.579120 DEBUG: Parsing action: connect(action=connect_nodes, max_path_length=4)
2024-10-09T21:51:41.708016 INFO: Processing action 'connect' with parameters {'action': 'connect_nodes', 'max_path_length': '4'}
2024-10-09T21:51:41.708110 ERROR: Connect works with just two qnodes. qnode list size: 3
2024-10-09T21:51:41.708149 DEBUG: Applying Connect to Message with parameters {'action': 'connect_nodes', 'max_path_length': 4, 'qnode_keys': ['n2', 'un', 'n0'], 'node_category_constraint': ''}

and then halt.

I assume this is more or less what we want and it is up to connect() to be adjusted by @mohsenht to handle the "triangle QG" from here?

mohsenht commented 1 month ago

@edeutsch

Thank you Eric for your help with this conversion issue.

mohsenht commented 1 month ago

I think, we made a decision for TRAPI and the structure is almost certain.

For ARAX DSL, would it be better to get just parameters (two pinned nodes and on node constraint that every path should have at least one), or the user need to build that TRAPI structure query graph with DSL code.

Something like the below this DSL code:

create_message
add_qnode(ids=MONDO:0004975, key=pinned_node_1)
add_qnode(key=middle_node,categories=biolink:Gene)
add_qnode(ids=UMLS:C1318700, key=pinned_node_2)
add_qedge(key=direct_edge,subject=pinned_node_1,object=pinned_node_2,predicates=biolink:related_to)
add_qedge(key=pinned_node_1_to_middle_node_edge,subject=pinned_node_1,object=middle_node,predicates=biolink:related_to)
add_qedge(key=middle_node_to_pinned_node_2_edge,subject=middle_node,object=pinned_node_2,predicates=biolink:related_to)
connect(action=connect_nodes, max_path_length=4)
return(message=true, store=false)
edeutsch commented 1 month ago

I don't think I really understand the question. I think it is a given that queries will come in via the query_graph like in the middle of this thread, e.g.:

{
  "message": {
    "query_graph": {
      "nodes": {
        "n0": {
          "ids": [
            "CHEBI:49662"
          ]
        },
        "un": {
          "categories": [
            "biolink:NamedThing"
          ]
        },
        "n2": {
          "ids": [
            "NCBIGene:1719"
          ]
        }
      },
      "edges": {
        "e0": {
          "subject": "n0",
          "object": "un",
          "predicates": [
            "biolink:related_to"
          ],
          "knowledge_type": "inferred"
        },
        "e1": {
          "subject": "un",
          "object": "n2",
          "predicates": [
            "biolink:related_to"
          ],
          "knowledge_type": "inferred"
        },
        "e2": {
          "subject": "n0",
          "object": "n2",
          "predicates": [
            "biolink:related_to"
          ],
          "knowledge_type": "inferred"
        }
      }
    }
  }
}

We already have that in hand. So the DSL does not need to create a message and add qnodes and qedges; they're already all there.

I think all that connect() needs to do is work with the query graph as given and create a knowledge graph and results that map back to the query graph, using its own pathfinding algorithm as well as Expand() as appropriate.

So I think the only ARAXi necessary is:

connect(action=connect_nodes, max_path_length=4)

and that is implemented and apparently working.

edeutsch commented 1 month ago

So here's my thoughts on what connect() should be doing in this scenario:

Now if the query graph is more complicated than just the triangle with all three edges being inferred, then some additional strategy needs to be invoked. But I would suggest putting that on the back burner for now until you can get the above working.

dkoslicki commented 1 month ago

I don't think I really understand the question. I think it is a given that queries will come in via the query_graph like in the middle of this thread, e.g.:

@edeutsch In talking with @mohsenht , I think his intention about the ARAXi question is more along the lines of "If someone were not doing a query through the ARS, but wanted to do it through the ARAX UI itself and build a DSL, what should the syntax be? Put parameters in the DSL constructed QG, or put it in the connect command itself?" With the conclusion being "put parameters in the DSL constructed QG (wrt intermediate node constraints and the like" in order to make the paradigm consistent: be it from ARAXi or from an ARS TRAPI message, we'll just call connect() and use the associated query graphs to determine constraints and the like (be they from the TRAPI QG or the DSL QG).

edeutsch commented 1 month ago

AH, I get it now, sorry.

Yes, I think using the DSL to build a query graph that mirrors what would come from the ARS seems easiest. One snag could be that the edges may need to have knowledge_type=inferred tags, which I think the current DSL does not support maybe (but easily added).

In the short term, this may not be needed, but eventually if we want connect() to treat NON knowledge_type=inferred edges in a special way (only send to Expand, not subject to pathfinding) then we may need to add that distinction.

mohsenht commented 1 month ago

Hi @edeutsch,

I apologize for the confusion my comment caused and for taking up your time. Occasionally, I leave comments as personal notes to track my thoughts and things I need to discuss with @dkoslicki.

Going forward, I'll label such comments as "Personal Note" to avoid taking up the time of the ARAX team.

edeutsch commented 1 month ago

no problem. Your note was actually clear enough in the first place. I just misread it/misunderstood it.