RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License
33 stars, 21 forks

should we be surprised that this query doesn't return any genes? #1827

Open saramsey opened 2 years ago

saramsey commented 2 years ago

This question came up in today's stand-up meeting, where we looked at Translator's results for a two-hop query (see JSON) about the connection between NCBIGene:3643 (insulin receptor) and HP:0100750 (atelectasis). Apparently, Aragorn returned genes for this query (which it obtained via the ROBOKOP co-occurrence graph, I think?), but for some reason ARAX is mostly returning drugs.

Furthermore, we get a different number of results if we run the above query in ARAX versus if we reverse the e0 and e1 query edges (see these results for the query graph with e0 and e1 reversed; note that we set the timeout to 600 and the pruning limit to 200 for that query). I am wondering if this is by design; in other words, I am wondering whether, due to pruning, the results that ARAX returns depend in general on the order of edges in the query graph. It was posited in today's meeting that if an ARA returns different results depending on the order of query edges in the graph, that should be regarded as a bug; the question of invariance of results under reversal of the order of the query edges is an interesting one! On some level, a graph is (mathematically speaking) a set of edges, so if our results do depend on the order of query edges in the JSON representation, then we are depending on something that is not strictly a query graph semantic, right? So I guess this is two issues for discussion, actually. :-)
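To make the order-dependence concern concrete, here is a minimal toy sketch of an "expand from one pinned node, prune, then connect to the other" strategy. It is not ARAX code: the function name `expand_then_prune`, the neighbor identifiers, and the scores are all made up for illustration, and the real pruning logic may rank nodes quite differently.

```python
# Toy data (hypothetical, not real Translator results):
# for each pinned node, a map of one-hop neighbor -> ranking score.
TOY_NEIGHBORS = {
    "NCBIGene:3643": {"drug:1": 0.9, "drug:2": 0.8, "gene:1": 0.3, "gene:2": 0.2},
    "HP:0100750": {"gene:1": 0.9, "gene:2": 0.8, "drug:1": 0.4},
}


def expand_then_prune(neighbors, start, end, prune_limit):
    """Expand one hop from `start`, keep the top `prune_limit` neighbors by score,
    then keep only those survivors that are also linked to `end`."""
    # Rank by descending score; break ties by name so the result is repeatable.
    ranked = sorted(neighbors[start].items(), key=lambda item: (-item[1], item[0]))
    kept = [node for node, _score in ranked[:prune_limit]]
    return sorted(node for node in kept if node in neighbors[end])


forward = expand_then_prune(TOY_NEIGHBORS, "NCBIGene:3643", "HP:0100750", prune_limit=2)
reverse = expand_then_prune(TOY_NEIGHBORS, "HP:0100750", "NCBIGene:3643", prune_limit=2)
print(forward)  # ['drug:1']
print(reverse)  # ['gene:1', 'gene:2']
```

In this toy setup the two directions agree only when the prune limit is large enough that nothing is discarded, which is one way a sequential strategy could return mostly drugs in one direction and genes in the other.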

edeutsch commented 2 years ago

I think we've discussed this before, and I think it's been on our TODO list for a while, but it would be advantageous if Expand() would expand from all pinned nodes in parallel and do a "hash join" where the paths meet. This query seems like an ideal example: Expand should expand simultaneously from both pinned nodes and correlate where they meet, rather than expand from one, prune, and then try to expand with the pruned set to the other pinned node. This would make the results invariant to the order of the query edges, and be smarter and better. It would be great to do, but I don't know if we have the resources to do it. It would require some substantial re-engineering to intercept cases like this and pursue a different expand strategy.
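As a rough illustration of the "expand from both pinned nodes and hash join where they meet" idea, here is a small self-contained sketch. It is not how Expand() works today; `one_hop`, `hash_join_paths`, and the toy edge list are hypothetical, and a real implementation would query knowledge providers rather than scan an in-memory list.

```python
# Toy knowledge graph as (subject, object) pairs. Only the two pinned CURIEs
# come from the query under discussion; the intermediate nodes are made up.
TOY_EDGES = [
    ("NCBIGene:3643", "drug:1"),
    ("NCBIGene:3643", "gene:1"),
    ("NCBIGene:3643", "gene:2"),
    ("gene:1", "HP:0100750"),
    ("drug:1", "HP:0100750"),
    ("gene:3", "HP:0100750"),
]


def one_hop(edges, pinned):
    """Return every neighbor of `pinned`, ignoring edge direction."""
    found = set()
    for subject, obj in edges:
        if subject == pinned:
            found.add(obj)
        elif obj == pinned:
            found.add(subject)
    return found


def hash_join_paths(edges, pinned_a, pinned_b):
    """Expand one hop from each pinned node and join on the shared middle node."""
    # Build phase: index the expansion from pinned_a by its intermediate node.
    build = {mid: (pinned_a, mid) for mid in one_hop(edges, pinned_a)}
    # Probe phase: look up each intermediate node reached from pinned_b.
    paths = [build[mid] + (pinned_b,) for mid in one_hop(edges, pinned_b) if mid in build]
    return sorted(paths)


paths = hash_join_paths(TOY_EDGES, "NCBIGene:3643", "HP:0100750")
print(paths)
# [('NCBIGene:3643', 'drug:1', 'HP:0100750'), ('NCBIGene:3643', 'gene:1', 'HP:0100750')]
```

Because neither side is pruned before the join, swapping the two pinned nodes yields the same set of middle nodes. With more than two hops, some pruning would presumably still be needed.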

amykglen commented 2 years ago

that does seem ideal to expand from all the pinned qnodes and meet in the middle, though it would require some substantial re-engineering, as Eric mentioned. also, for queries with more than 2 hops, we'd still likely have to do some pruning along the way due to combinatorial explosion.

however, I think the recent addition of the user-controlled pruning threshold likely worsened our responses for this kind of query - it used to be that for a query like this one, the prune threshold for the middle unpinned node would be auto-set to 5,000 (because the final hop is effectively doubly-pinned - a one-hop query with 5,000 curies on one end and 1 curie on the other can be answered pretty speedily). but now the pruning threshold always defaults to 50 via the UI, and it also looks like the UI doesn't let one enter a number for the prune threshold with more than 3 digits?
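One possible way to reconcile the two behaviors described above is sketched here, purely as an assumption: an explicit user value wins, and the larger automatic threshold applies only when the user supplied nothing and the next hop is doubly pinned. The function `choose_prune_threshold` and its parameters are hypothetical and are not taken from the ARAX code base; only the numbers 50 and 5,000 come from the comment above.

```python
UI_DEFAULT_PRUNE_THRESHOLD = 50        # default mentioned in the comment above
DOUBLY_PINNED_PRUNE_THRESHOLD = 5000   # former auto-set value mentioned above


def choose_prune_threshold(user_value, next_hop_is_doubly_pinned):
    """Hypothetical policy: an explicit user value wins; otherwise pick by hop type."""
    if user_value is not None:
        return user_value
    if next_hop_is_doubly_pinned:
        # The next hop is cheap to answer, so a much larger intermediate set is tolerable.
        return DOUBLY_PINNED_PRUNE_THRESHOLD
    return UI_DEFAULT_PRUNE_THRESHOLD


print(choose_prune_threshold(None, next_hop_is_doubly_pinned=True))   # 5000
print(choose_prune_threshold(200, next_hop_is_doubly_pinned=True))    # 200
print(choose_prune_threshold(None, next_hop_is_doubly_pinned=False))  # 50
```

This would require the UI to distinguish "no value entered" from "50", which is itself an assumption about how the request is built.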

saramsey commented 2 years ago

it also looks like the UI doesn't let one enter a number for the prune threshold with more than 3 digits?

@amykglen thank you; would you mind submitting an issue report about that specifically, and maybe tagging Luis?

isbluis commented 2 years ago

Hi @saramsey and @amykglen. I've already made this update in my dev area; it's just pending a git commit.

https://arax.ncats.io/devLM/index.html

Will 5 digits be enough? I certainly don't want to allow super long values to be entered.

amykglen commented 2 years ago

yes, 5 digits should be plenty! thanks, @isbluis

rcpeene commented 2 years ago

After a lot of drawing graph diagrams and testing code, I have implemented a module that does the first essential part of the process @edeutsch mentioned above. graph_splitter.py was pushed to branch issue1827. I also added demographs.py, which has a list of query graphs that may be used as test cases for this module. Currently, the module takes a query graph as input and returns a list of query graph 'fragments', each of which has at least one pinned node and shares at least one unpinned node with at least one other fragment. These fragments could then be expanded in parallel and hash joined into the complete knowledge graph. I believe @amykglen plans to look at it at some point and identify where/how this would be integrated into Expand.
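The fragment property described above can be expressed as a small check. This is only a sketch and is not the code in graph_splitter.py: it assumes a TRAPI-style query graph in which a qnode counts as pinned when it has a non-empty "ids" list, it represents each fragment as a list of query edge keys, and the helper names and the assignment of e0 and e1 to particular nodes are hypothetical.

```python
# Assumed layout of the two-hop query: pinned n0, unpinned n1, pinned n2.
QUERY_GRAPH = {
    "nodes": {
        "n0": {"ids": ["NCBIGene:3643"]},
        "n1": {},
        "n2": {"ids": ["HP:0100750"]},
    },
    "edges": {
        "e0": {"subject": "n0", "object": "n1"},
        "e1": {"subject": "n1", "object": "n2"},
    },
}


def is_pinned(query_graph, node_key):
    """A qnode is treated as pinned when it carries at least one CURIE."""
    return bool(query_graph["nodes"][node_key].get("ids"))


def fragment_nodes(query_graph, edge_keys):
    """Collect the qnode keys touched by a fragment's edges."""
    nodes = set()
    for key in edge_keys:
        edge = query_graph["edges"][key]
        nodes.update((edge["subject"], edge["object"]))
    return nodes


def fragments_satisfy_property(query_graph, fragments):
    """Check that every fragment has a pinned node and shares an unpinned
    node with at least one other fragment."""
    node_sets = [fragment_nodes(query_graph, frag) for frag in fragments]
    for i, nodes in enumerate(node_sets):
        if not any(is_pinned(query_graph, n) for n in nodes):
            return False
        unpinned = {n for n in nodes if not is_pinned(query_graph, n)}
        others = [other for j, other in enumerate(node_sets) if j != i]
        if not any(unpinned & other for other in others):
            return False
    return True


print(fragments_satisfy_property(QUERY_GRAPH, [["e0"], ["e1"]]))  # True
print(fragments_satisfy_property(QUERY_GRAPH, [["e0", "e1"]]))    # False
```

The second call returns False only because a lone fragment has no other fragment to share a node with. Whether the real module treats an unsplit graph that way is not stated in the thread.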

rcpeene commented 2 years ago

Additional note. I took measures to make the fragmentation process 'deterministic'. In other words, if you fragment the same query graph multiple times it should result in the same fragments. (Though it pains me to say that I could not tell you the exact fragmentation pattern that a given graph will have ahead of time.)