Closed mohsenht closed 1 month ago
Yes, you can find them here (https://docs.google.com/spreadsheets/d/1fxEUUOFP3X37KbLRrf7cSxkag1-kFaJN/edit?gid=1669218812#gid=1669218812)
Great, thanks. I'll think about how we could hook that up.
One concern/observation. I was recently manually doing some connect-the-dots queries, and will write my observation here:
I was to connect indomethacin and DHFR If I do a one-hop direct query, I get one result with a direct link from CTD https://arax.ci.transltr.io/?r=292446
If I do a two-hop query, I get 4 results including cataract, angiodema, CRC, and DILI: https://arax.ci.transltr.io/?r=292449 I'm uncertain if these are great answers, but come from translator results
Then in if I do the new Pathfinder query using ARAXi, I get 97 results: https://arax.ci.transltr.io/?r=292582
One concern about these is that as far as I can tell none of the 97 include what is found via direct one hop and two hop queries.
There's also no ranking yet.
Is it related to this issue?
Pathfinder is not completed yet, it is just the first version of it.
Maybe not directly related, but it is partially related in that I am trying to understand the context of the whole issue and what is currently working and not working.
My understanding is the the current Translator-agreed query_graph for this kind of Pathfinder query is this:
{
"message": {
"query_graph": {
"nodes": {
"n0": {
"ids": [
"CHEBI:49662"
]
},
"un": {
"categories": [
"biolink:NamedThing"
]
},
"n2": {
"ids": [
"NCBIGene:1719"
]
}
},
"edges": {
"e0": {
"subject": "n0",
"object": "un",
"predicates": [
"biolink:related_to"
],
"knowledge_type": "inferred"
},
"e1": {
"subject": "un",
"object": "n2",
"predicates": [
"biolink:related_to"
],
"knowledge_type": "inferred"
},
"e2": {
"subject": "n0",
"object": "n2",
"predicates": [
"biolink:related_to"
],
"knowledge_type": "inferred"
}
}
}
}
}
As written, this returns zero results. I'm a bit perplexed why it returns zero results because there are both 1 hop and 2 hop results.
But in any case, what we want is to to short-circuit this format of query into custom Patherfinder ARAXi. OR potentially do 3 queries:
Documenting the current state of musings...
Not done musing yet...
@mohsenht - here's some example code for calling Expand - in this example I'm assuming we want to do pathfinder between CHEBI:49662
and NCBIGene:1719
:
from ARAX_messenger import ARAXMessenger
from ARAX_expander import ARAXExpander
from ARAX_response import ARAXResponse
sys.path.append(os.path.dirname(os.path.abspath(__file__))+"/../../UI/OpenAPI/python-flask-server/")
from openapi_server.models.q_edge import QEdge
from openapi_server.models.q_node import QNode
# First create a message with the one-hop query graph
response_1 = ARAXResponse()
ARAXMessenger().create_envelope(response_1) # Creates an empty envelope
message_1 = response_1.envelope.message
message_1.query_graph.nodes = {"n0": QNode(ids=["CHEBI:49662"]),
"n1": QNode(ids=["NCBIGene:1719"])}
message_1.query_graph.edges = {"e0": QEdge(subject="n0",
object="n1",
predicates=["biolink:related_to"])}
# Then use Expand to answer the one-hop query
expander = ARAXExpander()
expander.apply(response_1, input_parameters={"kp_timeout": 10})
# You can then find Expand's answers in the knowledge graph
print(response_1.envelope.message.knowledge_graph)
# Then we do the same thing but for a two-hop query
response_2 = ARAXResponse()
ARAXMessenger().create_envelope(response_2)
message_2 = response_2.envelope.message
message_2.query_graph.nodes = {"n0": QNode(ids=["CHEBI:49662"]),
"nx": QNode(categories=["biolink:NamedThing"]),
"n1": QNode(ids=["NCBIGene:1719"])}
message_2.query_graph.edges = {"e0": QEdge(subject="n0",
object="nx",
predicates=["biolink:related_to"]),
"e1": QEdge(subject="nx",
object="n1",
predicates=["biolink:related_to"])}
# Then use Expand to answer the two-hop query
expander.apply(response_2, input_parameters={"kp_timeout": 30})
# You can then find Expand's answers in the knowledge graph
print(response_2.envelope.message.knowledge_graph)
# Note that the qnode that nodes fulfill is captured in the `qnode_keys` property - for example:
node = response_2.envelope.message.knowledge_graph.nodes['CHEBI:49662']
print(f"Node CHEBI:49662 fulfills qnode(s): {node.qnode_keys}")
# And same with qedge_keys - example:
edge = next(edge for edge in response_2.envelope.message.knowledge_graph.edges.values())
print(f"Example edge fulfills qedge(s): {edge.qedge_keys}")
note that you can refine the kp_timeout
parameter to whatever works best - this is basically the number of seconds that Expand will wait to get a response back from KPs' APIs (for each qedge). if you don't specify a timeout, it will default to two minutes, which I imagine is far too long.
let me know if you have any questions!
Hi @edeutsch,
Is it possible to response with two different query graph in ARAX? One for direct one hop result and the other for the triangle shape results?
Hi @mohsenht I may not quite understand the question, but there can only be one query graph coming into ARAX, and the same single query_graph should be repeated in the Response. We used to adjust the query_graph years ago but I think it was decided Translator-wide to stop this behavior.
@edeutsch My question is: if we have only one query graph, is it possible for one result to not follow the structure of the query graph (in this case, directly connecting two nodes instead of forming a triangle shape)?
@dkoslicki In the All Hands Meeting, we agreed to include one-hop and two-hop results from the expander into the PathFinder results. In this process, we'll first use the expander to generate the one-hop and two-hop results, then feed those nodes into our PathFinder algorithm to further expand and improve the results.
@edeutsch My question is: if we have only one query graph, is it possible for one result to not follow the structure of the query graph (in this case, directly connecting two nodes instead of forming a triangle shape)?
I think we should try it. As far as I'm aware it does not violate the schema. But it does violate some policies that we have set forth in the past before we got here. As mentioned in my musings post above, we do have an option_group mechanism that we could try to use.
so I suggest: let's try it. There's a chance that we end up doing something different.
Here is a sample DSL query:
create_message add_qnode(ids=PUBCHEM.COMPOUND:445154, key=n00) add_qnode(ids=NCBIGene:2739, key=n01) connect(action=connect_nodes, max_path_length=4) return(message=true, store=false)
Here is a sample DSL query for PathFinder with Node Constraint:
create_message add_qnode(ids=PUBCHEM.COMPOUND:445154, key=n00) add_qnode(ids=NCBIGene:2739, key=n01) connect(action=connect_nodes, max_path_length=4, node_category_constraint=biolink:NamedThing) return(message=true, store=false)
Hi everyone, I have committed some adjustments to the QueryGraphInterpreter that I hope will do what we want. I opted for a very light touch at first to see how it goes. Please request more adjustments if you want them.
Here's what I did:
I'm not certain this is really what we want, but seemed good. Notably, it is not restricted to the "triangle QG", but rather allows substantially more flexibility in recognizing a potential pathfinder query. I am hoping that this allows experimenting with some of the fancier pathfinder scenarios that were mentioned, beyond a single triangle.
I'm hoping that this will not trigger on any existing non-pathfinder queries. Clearly if the querier really uses knowledge_type="pathfinder", then they really mean it. And I don't think any normal production queries use two or more edges with knowledge_type="inferred", so that catches the "triangle QG" but also allows some additional experimentation.
Notable, also: perhaps against what we may have decided, it does NOT change the QG at all. It does not change the one-hop from "inferred" to "lookup" like we discussed. I decided that connect() can do that as easily as QGI can, so I'll leave that to connect(), giving it a bit more flexibility? Happy to change if you disagree.
Finally, the entire ARAXi program that is generated is:
connect(action=connect_nodes, max_path_length=4)
The expectation is that this runs on the existing query graph, whatever it is. Note that it may be more complicated than the default "triangle QG". But I assume that is a desirable feature, although may require some more effort.
This is merged into master
but only minimally tested.
What do you think?
So I tried a full test in devED: https://arax.ncats.io/devED/ pasting in the above query graph, and I see this:
...
2024-10-09T21:51:41.549822 INFO: QueryGraphInterpreter recognized query_graph as a 'pathfinder' query: triggering pathfinder subsystem.
...
2024-10-09T21:51:41.579091 INFO: Parsing input actions list
2024-10-09T21:51:41.579120 DEBUG: Parsing action: connect(action=connect_nodes, max_path_length=4)
2024-10-09T21:51:41.708016 INFO: Processing action 'connect' with parameters {'action': 'connect_nodes', 'max_path_length': '4'}
2024-10-09T21:51:41.708110 ERROR: Connect works with just two qnodes. qnode list size: 3
2024-10-09T21:51:41.708149 DEBUG: Applying Connect to Message with parameters {'action': 'connect_nodes', 'max_path_length': 4, 'qnode_keys': ['n2', 'un', 'n0'], 'node_category_constraint': ''}
and then halt.
I assume this is more or less what we want and it is up to connect() to be adjusted by @mohsenht to handle the "triangle QG" from here?
@edeutsch
Thank you Eric for your help with this conversion issue.
I think, we made a decision for TRAPI and the structure is almost certain.
For ARAX DSL, would it be better to get just parameters (two pinned nodes and on node constraint that every path should have at least one), or the user need to build that TRAPI structure query graph with DSL code.
Something like the below this DSL code:
create_message
add_qnode(ids=MONDO:0004975, key=pinned_node_1)
add_qnode(key=middle_node,categories=biolink:Gene)
add_qnode(ids=UMLS:C1318700, key=pinned_node_2)
add_qedge(key=direct_edge,subject=pinned_node_1,object=pinned_node_2,predicates=biolink:related_to)
add_qedge(key=pinned_node_1_to_middle_node_edge,subject=pinned_node_1,object=middle_node,predicates=biolink:related_to)
add_qedge(key=middle_node_to_pinned_node_2_edge,subject=middle_node,object=pinned_node_2,predicates=biolink:related_to)
connect(action=connect_nodes, max_path_length=4)
return(message=true, store=false)
I don't think I really understand the question. I think it is a given that queries will come in via the query_graph like in the middle of this thread, e.g.:
{
"message": {
"query_graph": {
"nodes": {
"n0": {
"ids": [
"CHEBI:49662"
]
},
"un": {
"categories": [
"biolink:NamedThing"
]
},
"n2": {
"ids": [
"NCBIGene:1719"
]
}
},
"edges": {
"e0": {
"subject": "n0",
"object": "un",
"predicates": [
"biolink:related_to"
],
"knowledge_type": "inferred"
},
"e1": {
"subject": "un",
"object": "n2",
"predicates": [
"biolink:related_to"
],
"knowledge_type": "inferred"
},
"e2": {
"subject": "n0",
"object": "n2",
"predicates": [
"biolink:related_to"
],
"knowledge_type": "inferred"
}
}
}
}
}
We already have that in hand. So the DSL does not need to create a message and add qnodes and qedges; they're already all there.
I think all that connect() needs to do is work with the query graph as given and create a knowledge graph and results that map back to the query graph, using its own pathfinding algorithm as well as Expand() as appropriate.
So I think the only ARAXi necessary is:
connect(action=connect_nodes, max_path_length=4)
and that is implemented and apparently working.
So here's my thoughts on what connect() should be doing in this scenario:
Now if the query graph is more complicated than just the triangle with all three edges being inferred, then some additional strategy needs to be invoked. But I would suggest putting that on the back burner for now until you can get the above working.
I don't think I really understand the question. I think it is a given that queries will come in via the query_graph like in the middle of this thread, e.g.:
@edeutsch In talking with @mohsenht , I think his intention about the ARAXi question is more along the lines of "If someone were not doing a query through the ARS, but wanted to do it through the ARAX UI itself and build a DSL, what should the syntax be? Put parameters in the DSL constructed QG, or put it in the connect
command itself?" With the conclusion being "put parameters in the DSL constructed QG (wrt intermediate node constraints and the like" in order to make the paradigm consistent: be it from ARAXi or from an ARS TRAPI message, we'll just call connect()
and use the associated query graphs to determine constraints and the like (be they from the TRAPI QG or the DSL QG).
AH, I get it now, sorry.
Yes, I think using the DSL to build a query graph that mirrors what would come from the ARS seems easiest. One snag could be that the edges may need to have knowledge_type=inferred tags, which I think the current DSL does not support maybe (but easily added).
In the short term, this may not be needed, but eventually if we want connect() to treat NON knowledge_type=inferred edges in a special way (only send to Expand, not subject to pathfinding) then we may need to add that distinction.
Hi @edeutsch,
I apologize for the confusion my comment caused and for taking up your time. Occasionally, I leave comments as personal notes to track my thoughts and things I need to discuss with @dkoslicki.
Going forward, I'll label such comments as "Personal Note" to avoid taking up the time of the ARAX team.
no problem. Your note was actually clear enough in the first place. I just misread it/misunderstood it.
Is there a Translator-wide agreement on what the TRAPI request for Pathfinder should look like?