RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License
33 stars 21 forks source link

How should we encode lists in QueryGraphs? #717

Closed edeutsch closed 3 years ago

edeutsch commented 4 years ago

It seems like we want to get the Fisher exact test stuff working again, but maybe to get this working usefully we need a way to encode an input list in a QueryGraph. What do we all think of the below? perhaps along the lines of "Which diseases are most relevant to [list of proteins]?"

"query_graph": { "edges": [ { "id": "e00", "negated": null, "relation": null, "source_id": "n00", "target_id": "n01", "type": "physically_interacts_with" }, ], "nodes": [ { "curie": ["UniProtKB:1"",UniProtKB:2"",UniProtKB:3"], "id": "n00", "is_set": true, "type": "protein" }, { "curie": null, "id": "n01", "is_set": null, "type": "disease" } ] }

This is in contrast to the standard thing that we already do successfully: "Which diseases are associated with insulin?"

"query_graph": { "edges": [ { "id": "e00", "negated": null, "relation": null, "source_id": "n00", "target_id": "n01", "type": null }, ], "nodes": [ { "curie": "UniProtKB:P01308", "id": "n00", "is_set": null, "type": "protein" }, { "curie": null, "id": "n01", "is_set": null, "type": "disease" } ] }

chunyuma commented 4 years ago

@edeutsch, I think the first one is suitable for our case.

chunyuma commented 4 years ago

@edeutsch, is it possible to allow either "is_set": true or "is_set": false for the list of n00 in the first QueryGraph? In this way, we can access the sub KG for each element of the list of curie.

edeutsch commented 4 years ago

I am thinking that it would NOT make sense for curie to be a list and is_set=false. So I would probably code up add_qnode() so that if curie is a list, then is_set would be automatically set to true, perhaps with a WARNING if is_set=true was not specified by the user.

chunyuma commented 4 years ago

@edeutsch, I think forcing add_qnode() to have is_set=true should be fine if curie is a list because using message.knowledge_graph to retrieve the results should be no difference. I would support your idea.

amykglen commented 4 years ago

I also agree. Allowing QNode.curie to be a list, but then requiring is_set=true for such QNodes seems to make good sense to me.

edeutsch commented 4 years ago

Proposed DSL: set_qnode(id=n00,curie=[UniProtKB:P14136,UniProtKB:P35579],is_set=true,type=protein) add_qnode(type=biological_process, id=n01) add_qedge(source_id=n00, target_id=n01, id=e00) expand(edge_id=e00,kp=ARAX/KG1) overlay(action=apply_FET, start_node_id=n00, end_node_id=n01, virtual_edge_type=FET) resultify()

Code change requirements:

What do you think?

chunyuma commented 4 years ago

@edeutsch this is great!

amykglen commented 4 years ago

yep, looks good to me!

dkoslicki commented 4 years ago

Looks good to me! Save for:

overlay needs to be expanded to implement "apply_FET" or similarly named functionality o FET would add a new virtual edge with id=FET to query_graph o FET would add new edges to knowledge_graph with type=FET and qedge_id=FET

The above is one envisioned use of FET, but another is to allow FET-expansion: eg. Find me the top 10 biological processes according to FET p-value such that this protein_list is involved_in them. Then the FET would add true involved_in edges in the KG and (depending on how called) would also add an involved_in QG edge as well.

FET p-values can be either added as properties of virtual edges (as @edeutsch suggested), or as node properties (on the biological process nodes in the above example). I don’t much care which is chosen, so going with virtual edges is fine. Just wanted to point out that FET may/will be adding new nodes and real (not virtual) edges to both the KG and QG in many use-cases.

edeutsch commented 4 years ago

[posted before seeing immediately preceding post by David] okay, I have implemented curie lists in add_qnode(). It seems to work so far. More testing is needed, but should be good to move forward. I am testing with this DSL:

add_qnode(id=n00,curie=[UniProtKB:P14136,UniProtKB:P35579],is_set=true) add_qnode(type=biological_process, id=n01) add_qedge(source_id=n00, target_id=n01, id=e00) expand(edge_id=e00,kp=ARAX/KG1) overlay(action=apply_FET, start_node_id=n00, end_node_id=n01, virtual_edge_type=FET) resultify()

Expand seems to work okay.

Currently crashes on: overlay(action=apply_FET, start_node_id=n00, end_node_id=n01, virtual_edge_type=FET)

because that still needs to be implemented.

Checked into master and rolled out to devED, test, and beta. But NOT production.

edeutsch commented 4 years ago

@dkoslicki : regarding your last post: I don't think I fully understand. I don't think FET-expansion is new functionality that should be implemented? Isn't this is just:

dkoslicki commented 4 years ago

@edeutsch

expand() apply_FET filter_by_FET keep top 10?

This may be a good thing to discuss during the AHM meeting since FET has many uses. You just gave one of them, but another is when you don’t know what sort of relationships might connect the nodes of interest, and want FET to find them for you. As long as add_qedge is ok to use without specifying an edge type (which the documentation does not make clear), then I guess you could stick with the paradigm you suggested. This would just result in DLS’s that look like

Another case is when we want to be able to make long path queries (like the SME workflow back in the day: disease->phenotype->protein->pathway->protein->drug, or something like that, with known edge types), but “expand” each edge by only taking the top N according to FET to avoid the combinatorial explosion for long query paths like this. This would require similar interleaving of expand/FET/filter in the DSL.

I was hoping to keep the DSL simpler than this and basically accomplish this with a FET-expand-like paradigm (I.e. Team Expander should make it easier to expand graphs, right?)

edeutsch commented 4 years ago

I can certainly appreciate the desired to simplify things. I think it would be fine to seek a short-hand notation for this. Perhaps expand() could be empowered to run (FET, filter) after each expansion step. However, if the precise filter criteria are not identical after each step, then one would need to fall back to the long form anyway.

I suggest we implement this in its long form first, and then if we find we have too much exact repetition, then we develop a short-hand syntax for chaining multiple steps into one command.

dkoslicki commented 4 years ago

Conclusion for FET issue on 5/27 AHM: just make FET an overlay, not an expanding function

amykglen commented 4 years ago

FYI - expand is now fully adjusted to handle qnode.curie being a list (it was largely working from the get-go, but required some tweaks - like to make synonym functionality work for such qnodes)

edeutsch commented 4 years ago

great! I have rolled out latest master to test, beta, devED but not production. I suggest this is the example query that we want to see run when development is complete:

add_qnode(id=n00,curie=[UniProtKB:Q96RI1,UniProtKB:O15438,UniProtKB:Q92887],is_set=true)
add_qnode(type=biological_process, id=n01)
add_qedge(source_id=n00, target_id=n01, id=e00)
expand(edge_id=e00,kp=ARAX/KG1)
overlay(action=apply_FET, start_node_id=n00, end_node_id=n01, virtual_edge_type=FET)
resultify()

If you comment out the overlay(), it seems to work and seems reasonable. @chunyuma, over to you to make overlay() work?

chunyuma commented 4 years ago

@amykglen thanks!

@edeutsch, I have almost completed the FET on overlay() but a few things need to be further tested. Once the testing is done, it can be merged into overlay. The latest code of FET is here under 'fisher_issue_chunyu' branch. But I will merge the updated expand to 'fisher_issue_chunyu' branch and test this example query.

In addition, I found a problem here that it seems like the resultify requires all node ids in the edges to have their corresponding nodes in the graph even though they are virtual edges. Do you think that I also need to add the virtual nodes to the graph.

dkoslicki commented 4 years ago

@chunyuma

In addition, I found a problem here that it seems like the resultify requires all node ids in the edges to have their corresponding nodes in the graph even though they are virtual edges. Do you think that I also need to add the virtual nodes to the graph.

I'm addressing this now. As @edeutsch mentioned, the imagined use-case is when all nodes are already in the KG, so there should be a straightforward way to address this. I'll post updates on #623.

edeutsch commented 4 years ago

Due to the design of the edge_binding system (which is debatable but what we've got), yes, if you add edges to the KG, you must add a corresponding edge to the QG. This is already done by jaccard and NGD code and all these. So just emulate what they do.

chunyuma commented 4 years ago

thanks @dkoslicki and @edeutsch. Yes, The FET code added a corresponding edge to the QG by a default option. But @edeutsch you remind me that I should not give an option for user to add a corresponding edge to QG. Adding a corresponding edge to QG should always be true. Perhaps this error is due to something wrong I defined the new QueryEdege.

isbluis commented 3 years ago

Hi @edeutsch . Does this issue still need development or testing prior to closing? If so, anything from the UI side?

edeutsch commented 3 years ago

I confess I've lost track of where we are on this topic. This DSL produces some results, although the confidence scores seem uninspired:

add_qnode(id=n00,curie=[UniProtKB:P14136,UniProtKB:P35579],is_set=true,type=protein)
add_qnode(type=biological_process, id=n01)
add_qedge(source_id=n00, target_id=n01, id=e00)
expand(edge_id=e00,kp=ARAX/KG1)
overlay(action=fisher_exact_test, source_qnode_id=n00, target_qnode_id=n01, virtual_relation_label=FET)
resultify()
saramsey commented 3 years ago

I also agree. Allowing QNode.curie to be a list, but then requiring is_set=true for such QNodes seems to make good sense to me.

I agree.

amykglen commented 3 years ago

safe to close this issue, @edeutsch? seems very outdated.

edeutsch commented 3 years ago

It is very outdated, but I'm uncertain that it ever got solved. I tried translating the DSL into modern ARAXi and came up with this:

add_qnode(key=n00,ids=[UniProtKB:P14136,UniProtKB:P35579],is_set=true)
add_qnode(categories=biolink:BiologicalProcess, key=n01)
add_qedge(subject=n00, object=n01, key=e00)
expand()
overlay(action=fisher_exact_test, subject_qnode_key=n00, object_qnode_key=n01, virtual_relation_label=FET)
resultify()

but it doesn't work.

I'll go ahead and close. If anyone has concerns that this may not be working, feel free to reopen.