RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License
33 stars 21 forks source link

All query_graph qnodes must be ultimately connected to all other qnodes #772

Closed edeutsch closed 4 months ago

edeutsch commented 4 years ago

I propose that ARAX should return an error if any qnode in the query graph is not connected either directly or indirectly to every other node in the query_graph. I don't think such a query_graph should be answerable. See #640 and #771

Such checking is already implicit in the ARAXQueryGraphInterpreter because it will not match an existing template.

However, it might be nice to return a more descriptive error on what's wrong, since not matching a template is quite vague.

What do you think?

dkoslicki commented 4 years ago

Boils down to just checking (eg. With #773 ) if the QG has a single weakly connected component, right?

The only problem I could foresee is if someone wants to run NGD on a pair of nodes that aren’t connected by any KP. Basically wanting to run a DSL like:

Add_qnode(name=foo, id=n1)
Add_qnode(name=bar, id=n2)
Expand() #<— this might not work, but you get the gist
Overlay(action=Compute_ngd, source_qnode_id=n1, target_qnode_id=n2)

Here, the QG isn’t connected until after the overlay step adds the virtual edge.

So I guess it boils down to when it should be checked if the QG has a single weakly connected component.

edeutsch commented 4 years ago

Would you define "weakly connected"? I used "directly connected" and "indirectly connected" and "ultimately connected", but I don't know the terms of art with respect to graphs. Is "weakly connected" the correct term of art for nodes that have some path of 2..N hops between them?

I propose the DSL for that intent should be this:

add_qnode(name=foo, id=n1) add_qnode(name=bar, id=n2) add_qedge(source_id=n1,target_id=n2, id=e1) overlay(action=Compute_ngd, source_qnode_id=n1, target_qnode_id=n2) resultify()

n1 and n2 are posited to be connected somehow, and thus e1 forms that connection. There is no desired expand(). Only to compute NGD.

I would argue that "what is the NGD between foo and bar?" translates to two qnodes and the "between" translates to a qedge of unspecified type between them.

One could envision inserting before the overlay() above: expand(edge_id=e1,kp=ARAX/KG2,continue_if_no_results=True) for an added twist. But not necessary if all the user wants is NGD.

Does this seem reasonable? And preserve the proposed issue title question?

amykglen commented 4 years ago

There is no desired expand().

If Expand isn't called, when will the actual Nodes corresponding to the two QNodes be added to the KnowledgeGraph?

edeutsch commented 4 years ago

Yes, you're quite right right. It seems like expand() is a required step as part of any workflow. My DSL code above is flawed in several ways. See #774 for what I'm currently trying.

In that, I'm thinking the user would have to do: expand(edge_id=e1,kp=ARAX/KG2,continue_if_no_results=True)

Although perhaps an alternative if the user really didn't want any edge expansion is just to expand the nodes:

expand(node_id=[n1,n2],kp=ARAX/KG2)

that syntax seems to work! Nice! So here's a full workflow:

add_qnode(name=acetaminophen, id=n1) add_qnode(name=scabies, id=n2) add_qedge(source_id=n1,target_id=n2, id=e1) expand(node_id=[n1,n2],kp=ARAX/KG2) overlay(action=compute_ngd, virtual_edge_type=NGD, source_qnode_id=n1, target_qnode_id=n2) resultify()

Resultify() still isn't happy because there isn't an e1 edge in the KG.

But aha, this works completely:

add_qnode(name=acetaminophen, id=n1) add_qnode(name=scabies, id=n2) expand(node_id=[n1,n2],kp=ARAX/KG2) overlay(action=compute_ngd, virtual_edge_type=NGD, source_qnode_id=n1, target_qnode_id=n2) resultify()

One result and everything!

I'm not sure what the implications are. I still like there to be an edge e1, but I'm less sure about it.

amykglen commented 4 years ago

Nice! Just realized - would such a query be considered 'expanding using NGD' (#693)? If so, this is kind of out there, but perhaps we would want an edge e1 that we expand using kp=NGD? So (theoretically):

add_qnode(name=acetaminophen, id=n1)
add_qnode(name=scabies, id=n2)
add_qedge(source_id=n1, target_id=n2, id=e1)
expand(kp=NGD)
resultify()
edeutsch commented 4 years ago

hmm, this is not what I had in mind when thinking about expanding using NGD (#693). To do that, we basically need to answer the question: which top N nodes have the shortest NGD from acetaminophen? So the DSL would be this:

add_qnode(name=acetaminophen, id=n1)
add_qnode(id=n2)
add_qedge(source_id=n1, target_id=n2, id=e1)
expand(edge_id=e1,kp=NGD)
resultify()
filter_results(action=limit_number_of_results, max_results=50)

and KP "NGD" (as yet to be built) would return all nodes that have a non-Inf/Nan NGD.

amykglen commented 4 years ago

Ok, yeah. Isn't the above just a version of that where both of the nodes already happen to have curies specified? It still would be adding an edge between the two nodes (assuming it can calculate an NGD value for the pair). Seems like much more straightforward DSL for finding the NGD for a pair of nodes than expanding the two nodes (but no edges) using a random KP and then overlaying the NGD value in a virtual edge.

(Raising this because if this does make sense, then perhaps we don't yet have reason to make disconnected query graphs allowed.)

amykglen commented 4 years ago

But I suppose if you mean that no edge should be returned from expand(kp=NGD) if the NGD value isn't above some set value, then that wouldn't work.

edeutsch commented 4 years ago

ah, I see, I kind of answered a different question. yeah, I think your example:

add_qnode(name=acetaminophen, id=n1)
add_qnode(name=scabies, id=n2)
add_qedge(source_id=n1, target_id=n2, id=e1)
expand(kp=NGD)
resultify()

could indeed be a reasonable thing to try, sure. At the moment, the KP "NGD" would only be able to expand on two nodes of known curie, which is quite limited compared to other KPs. But when/if that limitation is lifted, then I see no reason not to allow this case as well. And by that argument, it should be fine to support it now.

dkoslicki commented 4 years ago

Late over here and won’t be able to reply to the whole thread, but @edeutsch re:

Would you define "weakly connected"?

It is a term of art: “A directed graph is called weakly connected if replacing all of its directed edges with undirected edges produces a connected (undirected) graph” where a “connected (undirected) graph” is one where for all nodes u and v, there exists an undirected path (of any length) from u to v.

edeutsch commented 4 years ago

ah good, thanks. So the title should be: Must each query_graph be a weakly connected graph? right?

dkoslicki commented 4 years ago

ah good, thanks. So the title should be: Must each query_graph be a weakly connected graph? right?

Yup!

finnagin commented 2 years ago

@dkoslicki @edeutsch @amykglen Is this still relevant or should we close?

edeutsch commented 2 years ago

I don't know.

dkoslicki commented 4 months ago

Closing since it hasn't come up in over 2 years (especially since I don't think anyone is querying ARAX with disconnected QG's; definitely not the ARS)