RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License
33 stars 21 forks source link

should resultify have a force_isset_false option?? #658

Closed edeutsch closed 4 years ago

edeutsch commented 4 years ago

I was just testing ARAXi example 2: request = { "previous_message_processing_plan": { "processing_actions": [ "create_message", "add_qnode(name=DOID:14330, id=n00)", "add_qnode(type=protein, is_set=true, id=n01)", "add_qnode(type=chemical_substance, is_set=true, id=n02)", "add_qedge(source_id=n00, target_id=n01, id=e00)", "add_qedge(source_id=n01, target_id=n02, id=e01, type=physically_interacts_with)", "expand(edge_id=[e00,e01])", "overlay(action=compute_jaccard, start_node_id=n00, intermediate_node_id=n01, end_node_id=n02, virtual_edge_type=J1)", "filter_kg(action=remove_edges_by_attribute, edge_attribute=jaccard_index, direction=below, threshold=.2, remove_connected_nodes=t, qnode_id=n02)", "filter_kg(action=remove_edges_by_property, edge_property=provided_by, property_value=Pharos)", "resultify(ignore_edge_direction=true, force_isset_false=[n02])", "return(message=true, store=true)", ] } }

And I notice that based on my understanding the third qnode: "add_qnode(type=chemical_substance, is_set=true, id=n02)", is incorrectly having is_set=true. This should not be there, right?

and lower down, I see: "resultify(ignore_edge_direction=true, force_isset_false=[n02])", a strange parameter of force_isset_false.

What is the logic here? In my view this looks like there is the addition of a needless parameter to compensate for an earlier error in the code?

Comments?

dkoslicki commented 4 years ago

@edeutsch "add_qnode(type=chemical_substance, is_set=true, id=n02)" is required to identify all found nodes with the qnode_id n02 so the subsequent call: "overlay(action=compute_jaccard, start_node_id=n00, intermediate_node_id=n01, end_node_id=n02, virtual_edge_type=J1)" knows to do this for ALL chemical_substances that correspond with end_node_id == qnode_id==n02. After that, I want the results to be listed "one result for each chemical substance found" hence: "resultify(ignore_edge_direction=true, force_isset_false=[n02])"

So not a needless parameter nor an error, but rather, a very important parameter to get overlay() to work correctly and resultify() to display results in the way we want.

edeutsch commented 4 years ago

okay, well, I got email pushback from Steve as well, so I guess you both know what you're intending to do. But for the record, I don't understand and I think it is abuse of what I understood is_set=true meant.

I will try one final feeble push back to say that I am skeptical that you are doing: "knows to do this for ALL chemical_substances that correspond" You can't compute a jaccard index for the entire set of chemical substances as a set.

I suspect that you are computing the Jaccard index for EACH chemical_substance. Which means that chemical_substance is treated individually rather than as a set. The proteins are treated as sets.

Anyway, I thought I understood, but I am baffled why n02 is is_set=true (validated by the fact that there's a special parameter trying to undo it later). But if you're both happy with it, then I'll go along.

edeutsch commented 4 years ago

in my understanding, the current query as written above should end in an error in line: overlay(action=compute_jaccard, start_node_id=n00, intermediate_node_id=n01, end_node_id=n02, virtual_edge_type=J1)

ERROR: end_node_id=n02 is labeled with is_set=true. Neither the start_node nor the end_node may be sets. All intermediate nodes must be sets.

dkoslicki commented 4 years ago

Fwiw: is_set is mostly used in resultify() at this point: that’s a flag indicating that every result should strive to have each individual result contain a subgraph that contains each (possible) path between nodes with is_set=True. I’ll have to run some tests (at some point), but I’m pretty sure that the jaccard_index may not actually care if something has is_set=True in any of its parameters (as it only needs qnode_id’s). But I do know that once I get to implementing the fisher_exact capability per #623, then what you (@edeutsch) are imagining re: errors being reported if some nodes do/do not have is_set=True becomes quite important. Let me do some tests and let you know if I can simplify this example.

edeutsch commented 4 years ago

I would argue that although jaccard_index "may not actually care" about the is_set setting, it is then implicitly assuming that the start/end nodes are is_set=false and the intermediate nodes are is_set=true. It is not bothered that n02 is (wrongly) is_set=true only because it doesn't check. But it probably should check them and issue warnings or errors if it does not conform to its assumptions. It is placing a virtual edge between two nodes. Neither of those two can be part of a set. The spanned nodes are the sets.

So have we reached a conclusion then? Can this: "add_qnode(type=chemical_substance, is_set=true, id=n02)", "resultify(ignore_edge_direction=true, force_isset_false=[n02])", Be changed to: "add_qnode(type=chemical_substance, id=n02)", "resultify(ignore_edge_direction=true)", ?

As a side note, I am a little concerned that our ARAXi examples are pretty complicated. Anything we can do to make them simpler would be good. I am imagining that Noel, who seems to be quite interested in Example 2 (he presented it at the intro meeting?), will take special interest in the ARAXi for this and will want to understand it completely. IMO the easier it is for him to grasp what the ARAXi is saying, the better our demo will be. (thus my interest in stripping out a pair of unneeded parameters)

saramsey commented 4 years ago

From the perspective of ARAX_resultify.py, the module just needs to know the is_set Boolean values for each of the qnodes. If I can't get that information reliably from the QueryGraph object, then I need to get that information some other way. Thus the optional force_isset_false parameter was born. I agree it complicates the DSL.

From @dkoslicki's comment it sounds like in some legitimate use-cases, overlay needs a qnode to have is_set=true but for which we do not want is_set=true for the purpose of resultification? So, we are computing a statistic based on a set of qnodes of a certain type, but then displaying results in which each of those qnodes appears one at at time? Is that what we are doing here? If so, we are sort of forced into some ugliness: either we (1) create a new and different boolean parameter for each qnode, other than is_set, that would apply to aggregation for the Fisher Exact Test; (2) put a cumbersome force_is_set_true option on the overlay FET code; or (3) stick with the cumbersome force_is_set_false option on resultify. Right? I can't see any way around it, if the above italicized assumption holds for even a single legitimate use-case.

saramsey commented 4 years ago

From the perspective of ARAX_resultify.py, the module just needs to know the is_set Boolean values for each of the qnodes. If I can't get that information reliably from the QueryGraph object, then I need to get that information some other way. Thus the optional force_isset_false parameter was born. I agree it complicates the DSL.

From @dkoslicki's comment it sounds like in some legitimate use-cases, overlay needs a qnode to have is_set=true but for which we do not want is_set=true for the purpose of resultification? So, we are computing a statistic based on a set of qnodes of a certain type, but then displaying results in which each of those qnodes appears one at at time? Is that what we are doing here? If so, we are sort of forced into some ugliness: either we (1) create a new and different boolean parameter for each qnode, other than is_set, that would apply to aggregation for the Fisher Exact Test; (2) put a cumbersome force_is_set_true option on the overlay FET code; or (3) stick with the cumbersome force_is_set_false option on resultify. Right? I can't see any way around it, if the above italicized assumption holds for even a single legitimate use-case.

I can see an argument for using is_set from the get-go (i.e., in the add_qnode statement) to define how we want the results summarized. But that would just mean adding a cumbersome force_is_set_true option to the overlay code for the case of a Fisher test; so there is no net gain of elegance in the DSL.

saramsey commented 4 years ago

Imma mark this issue as high priority. Seems like a good topic for a three-way huddle. What TZ are you in, @dkoslicki ? I'm in CST at the moment.

amykglen commented 4 years ago

I've also been starting to question how is_set=True is being used in expand() (namely when a query graph is being expanded one edge at a time)...

If query graph nodes are intermediary and don't have a 'curie' specified, should they always be considered is_set=True?

edeutsch commented 4 years ago

SAR> From the perspective of ARAX_resultify.py, the module just needs to know the is_set SAR> Boolean values for each of the qnodes.

agreed!

SAR> If I can't get that information reliably from the QueryGraph object,

ah, but I will argue that you must get it reliably from the QueryGraph. If the QueryGraph is wrong, then it should be an error or a fail.

SAR> then I need to get that information some other way.

There should be no other way. If the QueryGraph is wrong, then the QueryGraph is wrong. We don't want encourage users to provide wrong QueryGraphs and then compensate for it with later unnecessary bandaid parameters.

SAR> Thus the optional force_isset_false parameter was born. I agree it complicates the DSL.

Happy to huddle any time that my schedule allows.

edeutsch commented 4 years ago

AKG> If query graph nodes are intermediary and don't have a 'curie' specified, should AKG> they always be considered is_set=True?

no, I don't think so. It would be equally valid (but a different question) for intermediate nodes to be is_set=false

edeutsch commented 4 years ago

SAR> I can see an argument for using is_set from the get-go (i.e., in the add_qnode statement) SAR> to define how we want the results summarized.

I disagree! This is abuse of what is_set=true means. Each drug is considered on its own and how it interacts with a set of proteins. The drugs are not considered in sets. Only the proteins are considered in sets.

SAR> But that would just mean adding a cumbersome force_is_set_true option to the overlay SAR> code for the case of a Fisher test; so there is no net gain of elegance in the DSL.

Also not necessary in my view if one uses is_set=true as originally intended.

dkoslicki commented 4 years ago

@edeutsch @saramsey @amykglen So after a bit of testing, it appears that the jaccard index cares not if anything is_set=True or not. This is actually intended. The logic behind the jaccard index DSL command:

overlay(action=compute_jaccard, start_node_id=n00, intermediate_node_id=n01, end_node_id=n02, virtual_edge_type=J1)

Is, in pseudo-code:

for each KG node X with qnode_id == n00:
    for each KG node Y with qnode_id == n02:
        create a virtual edge X<->Y labeled J1 with the number of intermediate KG nodes with qnode_id==n01 that satisfy the path X--(some node with qnode_id==n01)--Y
    end
end
create a QG edge X--Y labeled J1

This is how I envisioned the jaccard index working, and I do not think it should throw an error if something is not is_set=False/True or whatever. The intent of the Jaccard index is simply to quantify nodes in common.

So as of this moment, is_set appears to only be used in the resultify().

I've updated ARAX_query.py 12 in the demo branch to reflect this, and also in the ARAX DSL Examples Google document to simplify the DSL commands (which was one of Eric's intents, I think). After removing is_set=True for n02 and force_isset_false=[n02] in resultify(), everything still works exactly like intended. So as Eric noticed, it was unnecessary verbosity.

Still, +1 to a multi-way huddle to figure out what exactly we want this is_set thing to do: only affect the results? Or also affect the intermediate DSL commands? For example, I can envision a way to get the fisher_exact to work without any knowledge of is_set, but that would be post-demo thinking.

So, in conclusion:

  1. I recant my previous comment saying that the is_set=True was very important (that's what I get for commenting without testing what I had actually implemented).
  2. There may be some misunderstandings about how the jaccard index is intended to work
  3. A discussion about is_set is still warranted.

FWIW, here's the simplified DSL query for ARAX_query.py 12 (i.e. demo example 2):

query = { "previous_message_processing_plan": { "processing_actions": [
            "create_message",
            "add_qnode(name=DOID:14330, id=n00)",
            "add_qnode(type=protein, is_set=true, id=n01)",
            "add_qnode(type=chemical_substance, id=n02)",
            "add_qedge(source_id=n00, target_id=n01, id=e00)",
            "add_qedge(source_id=n01, target_id=n02, id=e01, type=physically_interacts_with)",
            "expand(edge_id=[e00,e01])",
            "overlay(action=compute_jaccard, start_node_id=n00, intermediate_node_id=n01, end_node_id=n02, virtual_edge_type=J1)",
            "filter_kg(action=remove_edges_by_attribute, edge_attribute=jaccard_index, direction=below, threshold=.2, remove_connected_nodes=t, qnode_id=n02)",
            "resultify(ignore_edge_direction=true)",
            "return(message=true, store=false)",
            ] } }
edeutsch commented 4 years ago

great, I agree with most of the above. I'm pretty open on Thursday. Friday is a bit tricky, although available after 10:40a PST. The conference I was going to be almost all next week has just been cancelled due to COVID-19. So, my availability next week is suddenly looking fantastic!

dkoslicki commented 4 years ago

I'll send my availability via email to relevant parties.

saramsey commented 4 years ago

uh, sorry, did I miss the is_set huddle? I was traveling last Tuesday/Wednesday.

edeutsch commented 4 years ago

There was no in-person huddle. I think we all agreed by email or some other messaging medium that n02 will not have is_set=true for the demo and we can continue to argue over it later if we are not voted off the island.

saramsey commented 4 years ago

OK, marking this issue as Low Priority.

saramsey commented 4 years ago

Folks, I re-read this issue and I am confused as to where we are at. Is this still an issue?

edeutsch commented 4 years ago

I think we still need to have an extended discussion with examples to achieve a documented decision on how we want to use is_set=Bool

Regarding this issue specifically, I feel that force_isset_false is a nonsense option and should be removed from the code. But I am uncertain if we are all agreed on that. I do not understand a scenario where is_set should reasonably be set to true for part of the code execution and then forced false by resultify(). The only times we used it (I think) is to set it false in cases where it should never have been true in the first place. So the error was making it true in the first place.

edeutsch commented 4 years ago

decided 5/27/2020 that it is not needed. Create an issue to remove this feature.

dkoslicki commented 4 years ago

Closing in favor of #779