Closed: dkoslicki closed this issue 4 months ago
Oh yeah, and @jaredroach might have some input due to being the originator of #261.
Dan Fischer worked out a way to do this for his project. In that project, we need to construct a knowledge tree, subject to constraints. In short, we feed in a root node, such as "inflammatory", and also input a list of leaf nodes. The output must include all input leaf nodes (although if one cannot be found, it can be dropped) and no other leaf nodes. A subroutine in the workflow to accomplish this is to find the relationship(s) between the root and the leaves. We are currently using the second of David's approaches (i.e., KG2) and not specific types of relationships. But it would also be interesting to only use specific types. However, we might have a lot more failures to find relationships if we limited it to certain edge types. Not every root-leaf pair is connected in COHD. Indeed, maybe very few are.
Dan is using a set of CYPHER commands I believe. Later, we post-process a bit to get rid of cycles and unnecessary internal nodes. But the post processing isn't directly germane to this issue.
Interesting @jaredroach , so basically a spanning tree of sorts. I imagine the constraints are especially important due to many such spanning trees existing. Do you know how Dan does the constraints (and any chance I could take a look at the Cypher commands)? The second idea I proposed would be "return me all spanning trees containing X and the Y's, ranked by ARAX_ranker".
Below is the sort of query I use in KG2:
```cypher
MATCH (m)
WHERE m.id IN [CURIE_ID_1, CURIE_ID_2, ..., CURIE_ID_n]
WITH m
MATCH (n {id: "NCIT:C3137"}), p = shortestPath((n)-[*]->(m))
RETURN p
```
Best,
Dan
Thanks Dan! I sent you an invite in case you want to chime in on issue 1387 in the repo. Thanks for sending the Cypher commands. Seems the constraint boils down to a minimal (in the number of edges) spanning tree.
Thanks,
~David
To reproduce the "BioGraph" experience, my sense is that some kind of greedy (or semi-greedy) algorithm that starts at one (or both) input nodes and explores the tree until it finds the other node would be good. It could even be something like simulated annealing. Obviously this works best if edges are weighted; in that case, prefer to explore more highly weighted edges. Keep multiple solutions, ideally including very different solutions, perhaps even giving a score boost to a path that shares very few nodes with other reported paths. And if the algorithm times out without finding a path, there's no shame in reporting "no paths". However, we could always report the shortest path to guard against a complete absence of any output paths.
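The greedy, weight-preferring exploration described above could be sketched as a best-first search over a plain adjacency dict (a hypothetical representation; ARAX's actual internals are not shown in this thread). It keeps several answers, avoids cycles, and gives up gracefully when an expansion budget runs out:

```python
import heapq

def weighted_paths(graph, start, goal, max_paths=3, max_expansions=10_000):
    """Greedy best-first exploration that prefers highly weighted edges.
    `graph` maps node -> list of (neighbor, weight) pairs. Keeps up to
    `max_paths` answers and stops when the expansion budget runs out."""
    found = []
    frontier = [(0.0, (start,))]  # (negated cumulative weight, path-so-far)
    expansions = 0
    while frontier and len(found) < max_paths and expansions < max_expansions:
        neg_weight, path = heapq.heappop(frontier)
        expansions += 1
        node = path[-1]
        if node == goal:
            found.append((-neg_weight, list(path)))
            continue
        for neighbor, weight in graph.get(node, []):
            if neighbor in path:  # keep paths simple (no cycles)
                continue
            heapq.heappush(frontier, (neg_weight - weight, path + (neighbor,)))
    # If nothing was found, a caller could still fall back to a plain shortest path.
    return sorted(found, key=lambda item: -item[0])
```

Random weights (as suggested below in the thread) would slot in wherever real edge weights are unavailable.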
Pondering how this could work within the query_graph framework, it occurs to me that maybe we can use our existing (but non-standard) option_group_id mechanism. Would the following query_graph capture "How are A and B connected in 1, 2, or 3 hops?":
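The actual query_graph isn't reproduced here, but a guess at its shape might look like the following: a TRAPI-style graph where the 1-, 2-, and 3-hop routes between A and B each sit in their own option group via the non-standard `option_group_id` property. All CURIEs and node/edge keys are placeholders:

```python
# Hypothetical sketch only - not the query_graph from the original comment.
query_graph = {
    "nodes": {
        "nA": {"ids": ["CURIE:A"]},
        "nB": {"ids": ["CURIE:B"]},
        "n1": {},            # intermediate node for the 2-hop route
        "n2": {}, "n3": {},  # intermediate nodes for the 3-hop route
    },
    "edges": {
        # 1-hop route
        "e0": {"subject": "nA", "object": "nB", "option_group_id": "1-hop"},
        # 2-hop route
        "e1": {"subject": "nA", "object": "n1", "option_group_id": "2-hop"},
        "e2": {"subject": "n1", "object": "nB", "option_group_id": "2-hop"},
        # 3-hop route
        "e3": {"subject": "nA", "object": "n2", "option_group_id": "3-hop"},
        "e4": {"subject": "n2", "object": "n3", "option_group_id": "3-hop"},
        "e5": {"subject": "n3", "object": "nB", "option_group_id": "3-hop"},
    },
}
```

Note that this sketch puts every edge into an option group, leaving no required edge at all, which is exactly the wrinkle discussed in the replies.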
One could imagine a GUI helper that would generate such a query_graph when the user gives A and B and the allowed number of hops.
Maybe Expand() could handle this out of the box today. Or maybe there could be some specialized code that recognizes this kind of query and perhaps returns the top answers from each path?
nice! yeah, I had that thought about option groups as well. one problem is that with the way we designed the system, there has to be at least one "required" path in the QG. so this couldn't work quite right currently, I believe. but maybe it wouldn't be too bad to hack around... (I think it's mainly resultify that would need to be tweaked)
the other problem is that I think this could result in wayy too big of answers due to combinatorial explosion for three-hop+ option groups (like, break the system kind of big) - but maybe if each edge expansion was followed by a FET overlay and subsequent filter step (or some other intelligent limiting mechanism), it could work better?
For this type of query, the user is most interested in the list of all nodes that connect the two input nodes, not the list of all paths. So although the number of paths explodes combinatorially, the set of all nodes does not. This insight might allow some algorithmic tweaks to avoid the combinatorial time/memory blowup: one could supply a subgraph over the set of returned nodes without ever enumerating the paths.
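The node-set observation above can be sketched with two BFS passes: a node lies on some a-to-b route of at most k hops (strictly, a walk, since the two shortest segments can overlap) exactly when its distance from a plus its distance to b is at most k. A minimal sketch assuming a plain adjacency-dict graph, undirected here (a directed graph would need a reversed-edge BFS from b):

```python
from collections import deque

def bfs_distances(graph, source):
    """Hop distances from `source` in an adjacency-dict graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def connecting_nodes(graph, a, b, max_hops):
    """Every node on at least one a-b route of <= max_hops edges, computed
    without enumerating the (possibly exponential) set of paths."""
    from_a = bfs_distances(graph, a)
    from_b = bfs_distances(graph, b)  # assumes undirected adjacency
    return {node for node, d in from_a.items()
            if node in from_b and d + from_b[node] <= max_hops}
```

Two linear BFS passes replace the exponential path enumeration; a connecting subgraph is then just the induced subgraph over the returned set.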
True; however, the number of nodes also explodes, even with the relatively low edge degree of nodes. E.g., just yesterday I did a three-hop query and ended up with over 100,000 nodes.
I also was figuring people would want some sort of provenance/confidence for answers, which generally lives on edges at the moment.
provenance can come later, if necessary. E.g., they can ask for provenance of a particular edge between two nodes of interest.
in reply to David, if the number of nodes is growing exponentially, then I see no way around it other than using weighted edges. Could be random weights if no weights are available prior to the query.
Agreed: I liked Amy's FET approach, or an NGD-weighting approach. It remains to be seen how to implement that. As @amykglen mentioned, since 3 hops cause a large explosion of results, perhaps we can do one (or both) of these:
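For reference, the NGD-weighting idea is based on the standard normalized Google distance formula (this is the textbook formula, not necessarily how ARAX implements it):

```python
from math import log

def ngd(f_x, f_y, f_xy, corpus_size):
    """Normalized Google Distance-style score: 0 means the two terms always
    co-occur; larger values mean weaker association. f_x / f_y are the
    occurrence counts of each term, f_xy their co-occurrence count."""
    if f_xy == 0:
        return float("inf")  # never co-occur: maximally distant
    lx, ly, lxy = log(f_x), log(f_y), log(f_xy)
    return (max(lx, ly) - lxy) / (log(corpus_size) - min(lx, ly))
```

An edge weight could then be derived as, say, `1 / (1 + ngd)` so that strongly associated pairs get heavier edges (a hypothetical choice, not an ARAX convention).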
I wonder, does Expand() perform some "start iteratively at all nodes with CURIEs and meet in the middle with a hash join" or something like that? Seems like that could be more efficient than a linear walk? IFF there are multiple nodes with CURIEs, of course..
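Whether or not Expand() works that way, the "start at both pinned nodes and meet in the middle" idea can be sketched as a bidirectional BFS that always grows the smaller frontier and joins on the first shared node (a hypothetical illustration, not ARAX code):

```python
from collections import deque

def _stitch(parents_a, parents_b, meeting_node):
    """Join the two BFS parent chains at the node where the frontiers met."""
    path, node = [], meeting_node
    while node is not None:
        path.append(node)
        node = parents_a[node]
    path.reverse()
    node = parents_b[meeting_node]
    while node is not None:
        path.append(node)
        node = parents_b[node]
    return path

def meet_in_the_middle(graph, a, b, max_rounds):
    """Grow BFS frontiers from both pinned nodes, expanding the smaller one
    each round (the hash-join flavor), stopping at the first shared node."""
    if a == b:
        return [a]
    parents_a, parents_b = {a: None}, {b: None}
    frontier_a, frontier_b = deque([a]), deque([b])
    for _ in range(max_rounds):
        if len(frontier_a) <= len(frontier_b):
            frontier, parents, other = frontier_a, parents_a, parents_b
        else:
            frontier, parents, other = frontier_b, parents_b, parents_a
        next_frontier = deque()
        while frontier:
            node = frontier.popleft()
            for neighbor in graph.get(node, []):
                if neighbor in parents:
                    continue
                parents[neighbor] = node
                if neighbor in other:  # frontiers met: stitch a path together
                    return _stitch(parents_a, parents_b, neighbor)
                next_frontier.append(neighbor)
        frontier.extend(next_frontier)
    return None  # no connection found within the budget
```

The payoff is the usual one: two frontiers of depth k/2 are far smaller than one frontier of depth k when degrees are high.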
After quieting the thousand hairy horsemen shouting at me, I realized you may be right!
I have been fixated on turning on is_set=true for intermediate nodes because otherwise the results blow up, but I wonder if we could have an alternate approach of a "coalescer" as part of "resultify"?
Suppose you have a two-hop query n0-e0-n1-e1-n2 with no is_set=true. Right now you would generate, say, 500 results, each result being a unique path through the KG. What if we could build an aggressive "coalescer" that would see that 50 of the 500 results all go through the same e0 and n1, coalesce those 50 into one result sharing that n0-e0-n1, and then fan the differing e1s and n2s out within that single result? And so on. So, compute over all 500 possible paths which intermediate or end nodes occur most frequently and aggressively coalesce those. This is quite similar to the current Jaccard approach but seems more generalized and could apply to all non-CURIE query graph nodes. So instead of always "is_setting" n1s, you could also "is_set" n2s. Basically, try to compact the 500 into the smallest list by aggressively coalescing and up-weighting the "superhighways".
Not easy. But probably some well-known (except to me) graph theory operation.
What do you think?
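The coalescing step above amounts to a group-by on the shared intermediate node. A minimal sketch (it keys on n1 alone; a real coalescer would presumably key on the edge ids too, and result objects are simplified to tuples):

```python
from collections import defaultdict

def coalesce_two_hop(paths):
    """Coalesce two-hop results (n0, n1, n2) that share the same intermediate
    node n1 into one fanned-out result each, with the most-travelled
    'superhighways' listed first."""
    n0s, n2s, counts = defaultdict(set), defaultdict(set), defaultdict(int)
    for n0, n1, n2 in paths:
        n0s[n1].add(n0)
        n2s[n1].add(n2)
        counts[n1] += 1
    return [
        {"n1": n1, "n0s": sorted(n0s[n1]), "n2s": sorted(n2s[n1])}
        for n1 in sorted(counts, key=lambda k: -counts[k])
    ]
```

In graph-theory terms this is just quotienting the result set by a shared-vertex equivalence; the count per group is the "superhighway" weight.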
summary of plan decided on at mini-hackathon on May 4:

- new ARAXi command `connect()` (corresponding to an `ARAXConnect` class)
- use `expand` to connect the input nodes until it decides they're connected enough (using `expand` means we can use KPs other than just KG2 to find connections)
- results should look like:

@amykglen Is this still relevant?
I think this is still relevant, since we don't currently have such an ARAXi command. @amykglen let me know your thoughts on if you have bandwidth to continue pursuing this, or if I should find others to assign to it.
sorry, I won't have bandwidth for this, at least not until April.
been wondering - is `overlay_connect_knodes` similar to this idea? I'm not at all familiar with that operation but it sounds similar..
No, not really. Connect knodes just does overlays on all node pairs or node triples for the Jaccard index.
Notes:

- `affects` and `affected_by`. We will think later about how to handle other not-so-clear meta-predicates.

From meeting today: Instead of running through the pairs one at a time, try connecting all pairs by a one-hop; if they are all connected, return the result. If not, then try pairing all non-connected nodes with all other nodes with 2-hops and check if all nodes are connected, etc., until max_path length is reached.
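That iterative-deepening plan could look roughly like the following (a sketch over a plain adjacency dict, not the ARAX implementation; each retry re-searches from scratch, which is fine for illustration but wasteful at scale):

```python
from itertools import combinations

def within_hops(graph, source, target, max_hops):
    """Hop-limited BFS: can `target` be reached from `source` in <= max_hops edges?"""
    frontier, seen = {source}, {source}
    for _ in range(max_hops):
        frontier = {nbr for node in frontier for nbr in graph.get(node, [])} - seen
        if target in frontier:
            return True
        seen |= frontier
    return False

def connect_all_pairs(graph, nodes, max_path_length=3):
    """Try joining every pair with one hop first; only still-unconnected pairs
    get retried at 2 hops, then 3, up to max_path_length."""
    unconnected = set(combinations(sorted(nodes), 2))
    hops_used = {}
    for hops in range(1, max_path_length + 1):
        for pair in sorted(unconnected):  # sorted() copies, so removal is safe
            if within_hops(graph, pair[0], pair[1], hops):
                hops_used[pair] = hops
                unconnected.remove(pair)
        if not unconnected:
            break
    return hops_used, unconnected
```

Anything left in `unconnected` after the loop is a pair that couldn't be joined within `max_path_length` hops.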
Idea is to replicate a "find me a minimal spanning tree connecting this set of nodes"
And since this is the Steiner tree problem (and so NP-complete), @finnagin you may try using one of the approaches specified here. Basically: "start at an arbitrary node in the query, and progressively building up a tree by adding the shortest path to a not-yet-connected node" And in our case, it might make sense to add the N shortest paths instead of just one (since we want multiple answers)
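The quoted heuristic can be sketched in plain Python (an approximation, not an exact Steiner solver; `graph` is a hypothetical adjacency dict, and unreachable leaves are simply dropped, as the thread allows):

```python
from collections import deque

def nearest_terminal_path(graph, tree_nodes, terminals):
    """Multi-source BFS from the current tree; returns the path to whichever
    remaining terminal is reached first (i.e. the nearest one), or None."""
    parents = {node: None for node in tree_nodes}
    queue = deque(tree_nodes)
    while queue:
        node = queue.popleft()
        if node in terminals:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]  # runs from a tree node out to the terminal
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None

def greedy_steiner_tree(graph, root, terminals):
    """Approximate Steiner tree: grow from `root`, repeatedly attaching the
    nearest not-yet-connected terminal via a shortest path."""
    tree_nodes, edges = {root}, set()
    remaining = set(terminals) - tree_nodes
    while remaining:
        path = nearest_terminal_path(graph, tree_nodes, remaining)
        if path is None:
            break  # remaining leaves are unreachable: drop them
        for u, v in zip(path, path[1:]):
            edges.add((u, v))
        tree_nodes.update(path)
        remaining -= tree_nodes
    return tree_nodes, edges
```

To keep N answers per terminal instead of one, `nearest_terminal_path` would return the N shortest attachment paths rather than the single first one found.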
We currently have a simple `connect` using #1846 which amounts to: try one hop, then two hops, then three. No fanciness to prune this down into "important" paths. It also terminates after finding any such path (e.g. if one hop connects them, don't bother with 2 hops). We should discuss on the AHM whether `shortest_path=False` should be turned on, with subsequent filtering to decrease the blowup (it might also take too long).
from 6/22 AHM: plan is for Amy (and eventually the new PSU hire) to work on making `connect` more efficient; first step is to brainstorm ways of using `expand` to efficiently get what we want - future mini-hackathon
Update on this: in the queue for @kvnthomas98 to investigate making it more efficient via Trovares xGT (I have notebooks showing how) on RTX-KG2.
Closing as this has been supplanted by Mohsen's pathfinder
For a while now, a few people have brought up the desire to do a search along the lines of "how are X and Y connected/related?" without the need to specify an actual path or set of paths. E.g., #261 mentions how a (now-defunct) service called Biograph could do these sorts of "neighborhood exploration". I imagine there could be a few different ways to do this, so I'll throw these out for feedback:
Tagging @amykglen in case she has thoughts on how to do path finding when, iirc, `expand` only works with one hop at a time. Feedback on this idea welcome from @finnagin , @chunyuma , and @ShaopengLiu1 , as well as @edeutsch since he might have ideas of how this would fit into the DSL (e.g. is it an `expand` mode? A new command? etc.).