Closed edeutsch closed 4 years ago
@edeutsch "add_qnode(type=chemical_substance, is_set=true, id=n02)" is required to identify all found nodes with the qnode_id n02 so the subsequent call: "overlay(action=compute_jaccard, start_node_id=n00, intermediate_node_id=n01, end_node_id=n02, virtual_edge_type=J1)" knows to do this for ALL chemical_substances that correspond with end_node_id == qnode_id==n02. After that, I want the results to be listed "one result for each chemical substance found" hence: "resultify(ignore_edge_direction=true, force_isset_false=[n02])"
So not a needless parameter nor an error, but rather, a very important parameter to get overlay() to work correctly and resultify() to display results in the way we want.
okay, well, I got email pushback from Steve as well, so I guess you both know what you're intending to do. But for the record, I don't understand and I think it is abuse of what I understood is_set=true meant.
I will try one final feeble push back to say that I am skeptical that you are doing: "knows to do this for ALL chemical_substances that correspond" You can't compute a jaccard index for the entire set of chemical substances as a set.
I suspect that you are computing the Jaccard index for EACH chemical_substance. Which means that chemical_substance is treated individually rather than as a set. The proteins are treated as sets.
Anyway, I thought I understood, but I am baffled why n02 is is_set=true (validated by the fact that there's a special parameter trying to undo it later). But if you're both happy with it, then I'll go along.
in my understanding, the current query as written above should end in an error in line: overlay(action=compute_jaccard, start_node_id=n00, intermediate_node_id=n01, end_node_id=n02, virtual_edge_type=J1)
ERROR: end_node_id=n02 is labeled with is_set=true. Neither the start_node nor the end_node may be sets. All intermediate nodes must be sets.
Fwiw: is_set
is mostly used in resultify()
at this point: that’s a flag indicating that every result should strive to have each individual result contain a subgraph that contains each (possible) path between nodes with is_set=True
.
I’ll have to run some tests (at some point), but I’m pretty sure that the jaccard_index may not actually care if something has is_set=True
in any of its parameters (as it only needs qnode_id’s).
But I do know that once I get to implementing the fisher_exact
capability per #623, then what you (@edeutsch) are imagining re: errors being reported if some nodes do/do not have is_set=True
becomes quite important.
Let me do some tests and let you know if I can simplify this example.
I would argue that although jaccard_index "may not actually care" about the is_set setting, it is then implicitly assuming that the start/end nodes are is_set=false and the intermediate nodes are is_set=true. It is not bothered that n02 is (wrongly) is_set=true only because it doesn't check. But it probably should check them and issue warnings or errors if it does not conform to its assumptions. It is placing a virtual edge between two nodes. Neither of those two can be part of a set. The spanned nodes are the sets.
So have we reached a conclusion then? Can this: "add_qnode(type=chemical_substance, is_set=true, id=n02)", "resultify(ignore_edge_direction=true, force_isset_false=[n02])", Be changed to: "add_qnode(type=chemical_substance, id=n02)", "resultify(ignore_edge_direction=true)", ?
As a side note, I am a little concerned that our ARAXi examples are pretty complicated. Anything we can do to make them simpler would be good. I am imagining that Noel, who seems to be quite interested in Example 2 (he presented it at the intro meeting?), will take special interest in the ARAXi for this and will want to understand it completely. IMO the easier it is for him to grasp what the ARAXi is saying, the better our demo will be. (thus my interest in stripping out a pair of unneeded parameters)
From the perspective of ARAX_resultify.py
, the module just needs to know the is_set
Boolean values for each of the qnodes. If I can't get that information reliably from the QueryGraph
object, then I need to get that information some other way. Thus the optional force_isset_false
parameter was born. I agree it complicates the DSL.
From @dkoslicki's comment it sounds like in some legitimate use-cases, overlay
needs a qnode to have is_set=true
but for which we do not want is_set=true
for the purpose of resultification? So, we are computing a statistic based on a set of qnodes of a certain type, but then displaying results in which each of those qnodes appears one at at time? Is that what we are doing here? If so, we are sort of forced into some ugliness: either we (1) create a new and different boolean parameter for each qnode, other than is_set
, that would apply to aggregation for the Fisher Exact Test; (2) put a cumbersome force_is_set_true
option on the overlay FET code; or (3) stick with the cumbersome force_is_set_false
option on resultify
. Right? I can't see any way around it, if the above italicized assumption holds for even a single legitimate use-case.
From the perspective of
ARAX_resultify.py
, the module just needs to know theis_set
Boolean values for each of the qnodes. If I can't get that information reliably from theQueryGraph
object, then I need to get that information some other way. Thus the optionalforce_isset_false
parameter was born. I agree it complicates the DSL.From @dkoslicki's comment it sounds like in some legitimate use-cases,
overlay
needs a qnode to haveis_set=true
but for which we do not wantis_set=true
for the purpose of resultification? So, we are computing a statistic based on a set of qnodes of a certain type, but then displaying results in which each of those qnodes appears one at at time? Is that what we are doing here? If so, we are sort of forced into some ugliness: either we (1) create a new and different boolean parameter for each qnode, other thanis_set
, that would apply to aggregation for the Fisher Exact Test; (2) put a cumbersomeforce_is_set_true
option on the overlay FET code; or (3) stick with the cumbersomeforce_is_set_false
option onresultify
. Right? I can't see any way around it, if the above italicized assumption holds for even a single legitimate use-case.
I can see an argument for using is_set
from the get-go (i.e., in the add_qnode
statement) to define how we want the results summarized. But that would just mean adding a cumbersome force_is_set_true
option to the overlay code for the case of a Fisher test; so there is no net gain of elegance in the DSL.
Imma mark this issue as high priority. Seems like a good topic for a three-way huddle. What TZ are you in, @dkoslicki ? I'm in CST at the moment.
I've also been starting to question how is_set=True
is being used in expand()
(namely when a query graph is being expanded one edge at a time)...
If query graph nodes are intermediary and don't have a 'curie' specified, should they always be considered is_set=True
?
SAR> From the perspective of ARAX_resultify.py, the module just needs to know the is_set SAR> Boolean values for each of the qnodes.
agreed!
SAR> If I can't get that information reliably from the QueryGraph object,
ah, but I will argue that you must get it reliably from the QueryGraph. If the QueryGraph is wrong, then it should be an error or a fail.
SAR> then I need to get that information some other way.
There should be no other way. If the QueryGraph is wrong, then the QueryGraph is wrong. We don't want encourage users to provide wrong QueryGraphs and then compensate for it with later unnecessary bandaid parameters.
SAR> Thus the optional force_isset_false parameter was born. I agree it complicates the DSL.
Happy to huddle any time that my schedule allows.
AKG> If query graph nodes are intermediary and don't have a 'curie' specified, should AKG> they always be considered is_set=True?
no, I don't think so. It would be equally valid (but a different question) for intermediate nodes to be is_set=false
SAR> I can see an argument for using is_set from the get-go (i.e., in the add_qnode statement) SAR> to define how we want the results summarized.
I disagree! This is abuse of what is_set=true means. Each drug is considered on its own and how it interacts with a set of proteins. The drugs are not considered in sets. Only the proteins are considered in sets.
SAR> But that would just mean adding a cumbersome force_is_set_true option to the overlay SAR> code for the case of a Fisher test; so there is no net gain of elegance in the DSL.
Also not necessary in my view if one uses is_set=true as originally intended.
@edeutsch @saramsey @amykglen
So after a bit of testing, it appears that the jaccard index cares not if anything is_set=True
or not. This is actually intended. The logic behind the jaccard index DSL command:
overlay(action=compute_jaccard, start_node_id=n00, intermediate_node_id=n01, end_node_id=n02, virtual_edge_type=J1)
Is, in pseudo-code:
for each KG node X with qnode_id == n00:
for each KG node Y with qnode_id == n02:
create a virtual edge X<->Y labeled J1 with the number of intermediate KG nodes with qnode_id==n01 that satisfy the path X--(some node with qnode_id==n01)--Y
end
end
create a QG edge X--Y labeled J1
This is how I envisioned the jaccard index working, and I do not think it should throw an error if something is not is_set=False/True
or whatever. The intent of the Jaccard index is simply to quantify nodes in common.
So as of this moment, is_set
appears to only be used in the resultify()
.
I've updated ARAX_query.py 12
in the demo branch to reflect this, and also in the ARAX DSL Examples Google document to simplify the DSL commands (which was one of Eric's intents, I think). After removing is_set=True
for n02
and force_isset_false=[n02]
in resultify()
, everything still works exactly like intended. So as Eric noticed, it was unnecessary verbosity.
Still, +1 to a multi-way huddle to figure out what exactly we want this is_set
thing to do: only affect the results? Or also affect the intermediate DSL commands? For example, I can envision a way to get the fisher_exact
to work without any knowledge of is_set
, but that would be post-demo thinking.
So, in conclusion:
is_set=True
was very important (that's what I get for commenting without testing what I had actually implemented).is_set
is still warranted.FWIW, here's the simplified DSL query for ARAX_query.py 12
(i.e. demo example 2):
query = { "previous_message_processing_plan": { "processing_actions": [
"create_message",
"add_qnode(name=DOID:14330, id=n00)",
"add_qnode(type=protein, is_set=true, id=n01)",
"add_qnode(type=chemical_substance, id=n02)",
"add_qedge(source_id=n00, target_id=n01, id=e00)",
"add_qedge(source_id=n01, target_id=n02, id=e01, type=physically_interacts_with)",
"expand(edge_id=[e00,e01])",
"overlay(action=compute_jaccard, start_node_id=n00, intermediate_node_id=n01, end_node_id=n02, virtual_edge_type=J1)",
"filter_kg(action=remove_edges_by_attribute, edge_attribute=jaccard_index, direction=below, threshold=.2, remove_connected_nodes=t, qnode_id=n02)",
"resultify(ignore_edge_direction=true)",
"return(message=true, store=false)",
] } }
great, I agree with most of the above. I'm pretty open on Thursday. Friday is a bit tricky, although available after 10:40a PST. The conference I was going to be almost all next week has just been cancelled due to COVID-19. So, my availability next week is suddenly looking fantastic!
I'll send my availability via email to relevant parties.
uh, sorry, did I miss the is_set
huddle? I was traveling last Tuesday/Wednesday.
There was no in-person huddle. I think we all agreed by email or some other messaging medium that n02 will not have is_set=true for the demo and we can continue to argue over it later if we are not voted off the island.
OK, marking this issue as Low Priority.
Folks, I re-read this issue and I am confused as to where we are at. Is this still an issue?
I think we still need to have an extended discussion with examples to achieve a documented decision on how we want to use is_set=Bool
Regarding this issue specifically, I feel that force_isset_false is a nonsense option and should be removed from the code. But I am uncertain if we are all agreed on that. I do not understand a scenario where is_set should reasonably be set to true for part of the code execution and then forced false by resultify(). The only times we used it (I think) is to set it false in cases where it should never have been true in the first place. So the error was making it true in the first place.
decided 5/27/2020 that it is not needed. Create an issue to remove this feature.
Closing in favor of #779
I was just testing ARAXi example 2: request = { "previous_message_processing_plan": { "processing_actions": [ "create_message", "add_qnode(name=DOID:14330, id=n00)", "add_qnode(type=protein, is_set=true, id=n01)", "add_qnode(type=chemical_substance, is_set=true, id=n02)", "add_qedge(source_id=n00, target_id=n01, id=e00)", "add_qedge(source_id=n01, target_id=n02, id=e01, type=physically_interacts_with)", "expand(edge_id=[e00,e01])", "overlay(action=compute_jaccard, start_node_id=n00, intermediate_node_id=n01, end_node_id=n02, virtual_edge_type=J1)", "filter_kg(action=remove_edges_by_attribute, edge_attribute=jaccard_index, direction=below, threshold=.2, remove_connected_nodes=t, qnode_id=n02)", "filter_kg(action=remove_edges_by_property, edge_property=provided_by, property_value=Pharos)", "resultify(ignore_edge_direction=true, force_isset_false=[n02])", "return(message=true, store=true)", ] } }
And I notice that based on my understanding the third qnode: "add_qnode(type=chemical_substance, is_set=true, id=n02)", is incorrectly having is_set=true. This should not be there, right?
and lower down, I see: "resultify(ignore_edge_direction=true, force_isset_false=[n02])", a strange parameter of force_isset_false.
What is the logic here? In my view this looks like there is the addition of a needless parameter to compensate for an earlier error in the code?
Comments?