RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

New ARAXi command: connectedness/relatedness #1387

Closed dkoslicki closed 4 months ago

dkoslicki commented 3 years ago

For a while now, a few people have brought up the desire to do a search along the lines of “how are X and Y connected/related?” without the need to specify an actual path or set of paths. E.g., #261 mentions how a (now defunct) service called BioGraph could do these sorts of “neighborhood exploration.” I imagine there could be a few different ways to do this, so I’ll throw these out for feedback:

  1. Since we have a variety of different ways to find paths between pairs of nodes, we might consider having modes of operation such as: “how are X and Y related in the literature (use NGD)”, “how are X and Y related in medical records (use COHD)”, “how are X and Y related in KG2 (use Fisher exact test)” etc.
  2. Alternatively, we could “throw the kitchen sink” at it in the following way: find all shortest paths between X and Y in KG2 (or maybe find the shortest ~1000 paths so it doesn’t return a single link if X and Y happen to be connected), decorate with all our overlays, and then let the ranker figure things out.
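Option 2's "shortest ~1000 paths" step can be illustrated with a bounded breadth-first enumeration over a plain adjacency dict (a toy sketch; `paths_up_to` and the graph here are illustrative, not ARAX code):

```python
from collections import deque

def paths_up_to(graph, start, goal, max_hops, max_paths=1000):
    """Enumerate up to max_paths simple start->goal paths, shortest
    first, with at most max_hops edges (BFS over partial paths)."""
    results = []
    queue = deque([[start]])
    while queue and len(results) < max_paths:
        path = queue.popleft()
        node = path[-1]
        if node == goal and len(path) > 1:
            results.append(path)
            continue
        if len(path) - 1 < max_hops:
            for nbr in graph.get(node, []):
                if nbr not in path:  # keep paths simple (no cycles)
                    queue.append(path + [nbr])
    return results

toy = {"X": ["a", "b"], "a": ["Y"], "b": ["a", "Y"], "Y": []}
print(paths_up_to(toy, "X", "Y", max_hops=3))
# → [['X', 'a', 'Y'], ['X', 'b', 'Y'], ['X', 'b', 'a', 'Y']]
```

Each returned path could then be decorated with the existing overlays (NGD, COHD, FET) before handing the batch to the ranker.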

Tagging @amykglen in case she has thoughts on how to do path finding when, iirc, expand only works with one hop at a time. Feedback on this idea is welcome from @finnagin, @chunyuma, and @ShaopengLiu1, as well as @edeutsch, since he might have ideas on how this would fit into the DSL (e.g., is it an expand mode? a new command? etc.).

dkoslicki commented 3 years ago

Oh yeah, and @jaredroach might have some input due to being the originator of #261.

jaredroach commented 3 years ago

Dan Fischer worked out a way to do this for his project. In that project, we need to construct a knowledge tree, subject to constraints. In short, we feed in a root node, such as "inflammatory", and also input a list of leaf nodes. The output must include all input leaf nodes (although if one cannot be found, it can be dropped) and no other leaf nodes. A subroutine in the workflow to accomplish this is to find the relationship(s) between the root and the leaves. We are currently using the second of David's approaches (i.e., KG2) and not specific types of relationships. But it would also be interesting to only use specific types. However, we might have a lot more failures to find relationships if we limited it to certain edge types. Not every root-leaf pair is connected in COHD. Indeed, maybe very few are.

Dan is using a set of CYPHER commands I believe. Later, we post-process a bit to get rid of cycles and unnecessary internal nodes. But the post processing isn't directly germane to this issue.

dkoslicki commented 3 years ago

Interesting, @jaredroach, so basically a spanning tree of sorts. I imagine the constraints are especially important, since many such spanning trees exist. Do you know how Dan does the constraints (and any chance I could take a look at the Cypher commands)? The second idea I proposed would be “return me all spanning trees containing X and the Y’s, ranked by ARAX_ranker.”

jaredroach commented 3 years ago

Below is the sort of query I use in KG2:

```cypher
MATCH (m) WHERE m.id IN [ CURIE_ID_1, CURIE_ID_2, ... CURIE_ID_n ]
WITH m
MATCH (n {id: "NCIT:C3137"}), p = shortestPath((n)-[*]->(m))
RETURN p
```

Best,

Dan

jaredroach commented 3 years ago

Thanks Dan! I sent you an invite in case you want to chime in on issue 1387 on the repo. Thanks for sending the Cypher commands. Seems the constraint boils down to a minimal (in the number of edges) spanning tree.

Thanks,

~David

jaredroach commented 3 years ago

To reproduce the "BioGraph" experience, my sense is that some kind of greedy (or semi-greedy) algorithm that starts at one (or both) input nodes and explores the tree until it finds the other node would be good. It could even be something like simulated annealing. This obviously works best if edges are weighted; in that case, prefer to explore more highly weighted edges. Keep multiple solutions, ideally including very different ones, perhaps even giving a score boost to a path that shares very few nodes with other reported paths. And if the algorithm times out without finding a path, there's no shame in reporting "no paths". However, it could always report the shortest path to guard against a complete absence of output paths.
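That greedy, weight-preferring exploration with a diversity bonus might look roughly like this (pure-Python toy; the graph shape, weights, and penalty are all assumptions, not ARAX internals):

```python
import heapq

def greedy_paths(graph, start, goal, k=3):
    """Best-first search that explores high-weight edges first and
    returns up to k paths, mildly penalizing candidates that reuse
    nodes from already-reported answers.
    graph: {node: [(neighbor, edge_weight), ...]}"""
    found, used_nodes = [], set()
    heap = [(0.0, [start])]  # lower priority value = explored sooner
    while heap and len(found) < k:
        prio, path = heapq.heappop(heap)
        node = path[-1]
        if node == goal:
            found.append(path)
            used_nodes.update(path)
            continue
        for nbr, weight in graph.get(node, []):
            if nbr in path:  # no cycles
                continue
            # heavy edges lower the priority value; reusing a node
            # from a reported path raises it (diversity bonus)
            penalty = 1.0 if nbr in used_nodes else 0.0
            heapq.heappush(heap, (prio - weight + penalty, path + [nbr]))
    return found

toy = {"X": [("a", 5.0), ("b", 1.0)], "a": [("Y", 5.0)],
       "b": [("Y", 1.0)], "Y": []}
print(greedy_paths(toy, "X", "Y"))
# → [['X', 'a', 'Y'], ['X', 'b', 'Y']]
```

A timeout wrapper around the loop (reporting "no paths", or falling back to a plain shortest path) would cover the failure mode described above.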

edeutsch commented 3 years ago

Pondering how this could work within the query_graph framework, it occurs to me that maybe we can use our existing (but non-standard) option_group_id mechanism. Would the following query_graph capture "How are A and B connected in 1, 2, or 3 hops?":

[image: proposed query_graph]

One could imagine a GUI helper that would generate such a query_graph when the user gives A and B and the allowed number of hops.

Maybe Expand() could handle this out of the box today. Or maybe there could be some specialized code that recognizes this kind of query and perhaps returns the top answers from each path?

amykglen commented 3 years ago

nice! yeah, I had that thought about option groups as well. one problem is that with the way we designed the system, there has to be at least one "required" path in the QG. so this couldn't work quite right currently, I believe. but maybe it wouldn't be too bad to hack around... (I think it's mainly resultify that would need to be tweaked)

the other problem is that I think this could result in way too big of answers due to combinatorial explosion for three-hop+ option groups (like, break-the-system kind of big) - but maybe if each edge expansion was followed by a FET overlay and subsequent filter step (or some other intelligent limiting mechanism), it could work better?

jaredroach commented 3 years ago

For this type of query, the user is most interested in the list of all nodes that connect the two input nodes, not the list of all paths. So although the number of paths explodes combinatorially, the list of all nodes does not. This insight might allow some algorithmic tweaks to avoid super-polynomial time/memory problems. One could supply a subgraph over the set of returned nodes in polynomial time/space.
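That observation can be made concrete with two breadth-first sweeps: a node lies on some connecting path of at most k hops exactly when its distance from X plus its distance from Y is at most k, so the node set comes out without enumerating a single path (a toy sketch over an undirected adjacency dict; names are illustrative):

```python
from collections import deque

def bfs_dists(graph, source, max_hops):
    """Hop distances from source, out to max_hops."""
    dists = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dists[node] == max_hops:
            continue
        for nbr in graph.get(node, []):
            if nbr not in dists:
                dists[nbr] = dists[node] + 1
                queue.append(nbr)
    return dists

def connecting_nodes(graph, x, y, max_hops):
    """All nodes on at least one x-y path of <= max_hops edges."""
    fwd = bfs_dists(graph, x, max_hops)
    bwd = bfs_dists(graph, y, max_hops)
    return {n for n in fwd if n in bwd and fwd[n] + bwd[n] <= max_hops}

toy = {"X": ["a", "b"], "a": ["X", "Y"], "b": ["X", "c"],
       "c": ["b", "Y"], "Y": ["a", "c"]}
print(sorted(connecting_nodes(toy, "X", "Y", 2)))  # → ['X', 'Y', 'a']
```

For a directed KG, the backward sweep would run over the reversed edge set instead.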

dkoslicki commented 3 years ago

For this type of query, the user is most interested in the list of all nodes that connect the two input nodes, not the list of all paths. So although the number of paths explodes combinatorically, the list of all nodes does not. This insight might allow some algorithmic tweaks to avoid polynomial time/memory problems. One could supply a subgraph with a set of the returned nodes in less than polynomial time/space.

True, however the number of nodes also explodes, due to the relatively high degree of many nodes. E.g., just yesterday I did a three-hop query and ended up with over 100,000 nodes.

amykglen commented 3 years ago

I also was figuring people would want some sort of provenance/confidence for answers, which generally lives on edges at the moment.

jaredroach commented 3 years ago

Provenance can come later, if necessary. E.g., they can ask for the provenance of a particular edge between two nodes of interest.

In reply to David: if the number of nodes is growing exponentially, then I see no way around it other than using weighted edges. They could be random weights if no weights are available prior to the query.

dkoslicki commented 3 years ago

Agreed: I liked Amy's FET approach, or an NGD-weighting approach. Remains to be seen how to implement that. As @amykglen mentioned, since 3 hops causes a large explosion of results, perhaps we can do one (or both) of these:

  1. Hop along using FET and take the top N each time
  2. Expand first edge, decorate with NGD, take top N, then repeat for subsequent edges
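Both variants reduce to an expand-score-prune loop; a minimal sketch, with `score` as a hypothetical stand-in for an FET or NGD scorer (toy graph, illustrative names, not ARAXi):

```python
def expand_and_prune(graph, frontier, score, top_n):
    """Extend every frontier path by one edge, score each candidate,
    and keep only the top_n: one 'hop' of the pruning idea above."""
    candidates = []
    for path in frontier:
        for nbr in graph.get(path[-1], []):
            if nbr not in path:  # no cycles
                candidates.append(path + [nbr])
    candidates.sort(key=score, reverse=True)
    return candidates[:top_n]

toy = {"X": ["a", "b"], "a": ["Y", "c"], "b": ["c"]}
# hypothetical score: favor paths ending at well-connected nodes
score = lambda path: len(toy.get(path[-1], []))
frontier = [["X"]]
for _ in range(2):  # two hops, pruning to the top path each time
    frontier = expand_and_prune(toy, frontier, score, top_n=1)
print(frontier)  # → [['X', 'a', 'Y']]
```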

edeutsch commented 3 years ago

I wonder, does Expand() perform some "start iteratively at all nodes with CURIEs and meet in the middle with a hash join" strategy, or something like that? Seems like that could be more efficient than a linear walk? IFF there are multiple nodes with CURIEs, of course.
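The "meet in the middle" idea in miniature is a bidirectional BFS that grows the smaller frontier each round and stops when the frontiers touch (a toy sketch over an undirected adjacency dict; not a claim about what Expand() actually does):

```python
def bidirectional_hops(graph, x, y):
    """Explore from both pinned nodes at once, growing the smaller
    frontier each round; returns the number of expansion rounds until
    the frontiers meet (the path length in these toy cases), or None
    if the nodes are unconnected."""
    if x == y:
        return 0
    seen_x, seen_y = {x}, {y}
    frontier_x, frontier_y = {x}, {y}
    hops = 0
    while frontier_x and frontier_y:
        # grow the smaller frontier, like probing with the smaller
        # side of a hash join
        if len(frontier_x) > len(frontier_y):
            frontier_x, frontier_y = frontier_y, frontier_x
            seen_x, seen_y = seen_y, seen_x
        hops += 1
        nxt = set()
        for node in frontier_x:
            for nbr in graph.get(node, []):
                if nbr in seen_y:  # the frontiers met in the middle
                    return hops
                if nbr not in seen_x:
                    seen_x.add(nbr)
                    nxt.add(nbr)
        frontier_x = nxt
    return None

chain = {"X": ["a"], "a": ["X", "b"], "b": ["a", "Y"], "Y": ["b"]}
print(bidirectional_hops(chain, "X", "Y"))  # → 3
```

With two pinned CURIEs, each side only ever explores about half the radius, which is where the savings over a linear walk come from.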

jaredroach commented 3 years ago

Lay some superhighways through the KG, then plot routes through it the way Google Maps provides directions. This is perhaps a specific instantiation of weighted edges, but not necessarily. Basically, when trying to go from point A to point B, use local routes to get to the nearest superhighway, then follow the superhighway as close as possible to the destination, then use local routes again. It's not necessary to use any superhighways if you start really close. It might be a lot of work figuring out the superhighway backbone, but nothing the Consortium as a whole wouldn't be able to provide. The backbone could be something like the best PubMed-abstract-connected edges. It could even be generated empirically over time: record how many times an edge is returned in queries, or something, and then make a backbone from the most traveled edges. And yes, Arthur Dent might be problematic.

edeutsch commented 3 years ago

After quieting the thousand hairy horsemen shouting at me, I realized you may be right!

I have been fixated on turning on is_set=true for intermediate nodes because otherwise the results blow up, but I wonder if we could have an alternate approach of a "coalescer" as part of "resultify"?

Suppose you have a two-hop query n0-e0-n1-e1-n2 with no is_set=true. Right now you would generate, say, 500 results, with each result being a unique path through the KG. What if we could build an aggressive "coalescer" that would see that 50 of the 500 results all go through the same e0 and n1, and coalesce those 50 into one result with the shared n0-e0-n1, fanning out the differing e1s and n2s within it? And so on. So: compute, over all 500 possible paths, which intermediate or end nodes occur most frequently, and aggressively coalesce on those. This is quite similar to the current Jaccard approach but seems more generalized and could apply to all non-CURIE query graph nodes. So instead of always "is_setting" the n1s, you could also "is_set" the n2s. Basically, try to compact the 500 into the smallest list by aggressively coalescing and up-weighting the "superhighways".
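The grouping step of such a coalescer might be approximated by bucketing result paths on their shared prefix (a toy sketch with made-up path tuples, not resultify code):

```python
from collections import defaultdict

def coalesce_on_prefix(paths, prefix_len=2):
    """Group result paths (n0, e0, n1, e1, n2, ...) by their shared
    first prefix_len edges and fan the differing tails into one
    coalesced result per group."""
    groups = defaultdict(list)
    for path in paths:
        prefix = tuple(path[:prefix_len + 1])  # e.g. (n0, e0, n1)
        groups[prefix].append(path[prefix_len + 1:])
    return dict(groups)

paths = [
    ("n0", "e0", "n1", "e1a", "n2a"),
    ("n0", "e0", "n1", "e1b", "n2b"),
    ("n0", "e0x", "n1x", "e1a", "n2a"),
]
print(coalesce_on_prefix(paths)[("n0", "e0", "n1")])
# → [('e1a', 'n2a'), ('e1b', 'n2b')]
```

A real coalescer would presumably pick which position to group on by counting node frequencies first, then coalesce the most frequent ones.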

Not easy. But probably some well-known (except to me) graph theory operation.

What do you think?

amykglen commented 3 years ago

summary of plan decided on at mini-hackathon on May 4:

finnagin commented 2 years ago

@amykglen Is this still relevant?

dkoslicki commented 2 years ago

I think this is still relevant, since we don't currently have such an ARAXi command. @amykglen let me know your thoughts on if you have bandwidth to continue pursuing this, or if I should find others to assign to it.

amykglen commented 2 years ago

sorry, I won't have bandwidth for this, at least not until April.

been wondering - is overlay_connect_knodes similar to this idea? I'm not at all familiar with that operation but it sounds similar..

finnagin commented 2 years ago

No, not really. Connect knodes just does overlays on all node pairs or node triples for the Jaccard index.

dkoslicki commented 2 years ago

Notes:

finnagin commented 2 years ago

From meeting today: instead of running through the pairs one at a time, try connecting all pairs with a one-hop query; if they are all connected, return the result. If not, try pairing all still-unconnected nodes with all other nodes via two hops and check if all nodes are connected, and so on until max_path length is reached.
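That escalating scheme could be sketched as follows (toy undirected adjacency dict; `path_within` and `connect_all` are illustrative names, not ARAXi):

```python
from itertools import combinations

def path_within(graph, a, b, hops):
    """True if a and b are joined by a path of at most `hops` edges."""
    seen, frontier = {a}, {a}
    for _ in range(hops):
        frontier = {n for f in frontier
                    for n in graph.get(f, []) if n not in seen}
        seen |= frontier
        if b in seen:
            return True
    return False

def connect_all(graph, nodes, max_path_len):
    """Test every pair at 1 hop, then retry only the still-unconnected
    pairs at 2 hops, and so on up to max_path_len."""
    pending = set(combinations(sorted(nodes), 2))
    for hops in range(1, max_path_len + 1):
        pending = {pair for pair in pending
                   if not path_within(graph, pair[0], pair[1], hops)}
        if not pending:
            return True  # everything connected within `hops`
    return False

chain = {"X": ["a"], "a": ["X", "Y"], "Y": ["a"]}
print(connect_all(chain, ["X", "Y"], max_path_len=2))  # → True
```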

dkoslicki commented 2 years ago

The idea is to replicate a "find me a minimal spanning tree connecting this set of nodes" query.

dkoslicki commented 2 years ago

And since this is the Steiner tree problem (and so NP-hard), @finnagin you may try using one of the approaches specified here. Basically: "start at an arbitrary node in the query, and progressively building up a tree by adding the shortest path to a not-yet-connected node". And in our case, it might make sense to add the N shortest paths instead of just one (since we want multiple answers).
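That progressive heuristic (grow a tree by repeatedly attaching the shortest path to a not-yet-connected terminal, dropping unreachable ones as described earlier in the thread) looks roughly like this over a plain adjacency dict (a sketch, not ARAX code):

```python
from collections import deque

def shortest_path_to_set(graph, sources, targets):
    """Multi-source BFS: shortest path from any source to any target."""
    seen = set(sources)
    queue = deque([(s, [s]) for s in sources])
    while queue:
        node, path = queue.popleft()
        if node in targets:
            return path
        for nbr in graph.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, path + [nbr]))
    return None

def steiner_approx(graph, terminals):
    """Grow a tree from one terminal by repeatedly attaching the
    shortest path to a not-yet-connected terminal (the classic
    Steiner-tree approximation flavor)."""
    terminals = list(terminals)
    tree_nodes = {terminals[0]}
    remaining = set(terminals[1:])
    edges = set()
    while remaining:
        path = shortest_path_to_set(graph, tree_nodes, remaining)
        if path is None:
            break  # some terminal is unreachable; drop it
        for u, v in zip(path, path[1:]):
            edges.add(frozenset((u, v)))
        tree_nodes.update(path)
        remaining -= set(path)
    return tree_nodes, edges

star = {"c": ["t1", "t2", "t3"], "t1": ["c"], "t2": ["c"], "t3": ["c"]}
nodes, edges = steiner_approx(star, ["t1", "t2", "t3"])
print(sorted(nodes), len(edges))  # → ['c', 't1', 't2', 't3'] 3
```

Keeping the N shortest attaching paths at each step (instead of one) would yield the multiple alternative trees mentioned above.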

dkoslicki commented 2 years ago

We currently have a simple connect via #1846, which amounts to: try one hop, then two hops, then three. No fanciness to prune this down into "important" paths. It also terminates after finding any such path (e.g., if one hop connects them, it doesn't bother with two hops). We should discuss at AHM whether shortest_path=False should be turned on, with subsequent filtering to decrease the blowup (it might also take too long).

amykglen commented 2 years ago

from 6/22 AHM: plan is for Amy (and eventually new PSU hire) to work on making connect more efficient

first step is to brainstorm ways of using expand to efficiently get what we want - future mini-hackathon

dkoslicki commented 1 year ago

Update on this: in the queue for @kvnthomas98 to investigate making it more efficient via Trovares xGT (I have notebooks showing how) on RTX-KG2.

dkoslicki commented 4 months ago

Closing as this has been supplanted by Mohsen's pathfinder.