RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License
33 stars 21 forks source link

Get fisher exact test into ARAX_overlay #623

Closed dkoslicki closed 4 years ago

dkoslicki commented 4 years ago

This may result in use being able to really get demo example 2 to shine (as I think it's what was done with the original tidbit).

dkoslicki commented 4 years ago

Low-prio now as not in milestones and will require discussions with @amykglen if this belongs in expand or overlay as there are options for it to either utilize message KG (if the proper nodes have been added) or KG1 or KG2 and results will be vastly different depending on which is which.

chunyuma commented 4 years ago

Hi @dkoslicki, I updated the code of fisher exact test on fisher_issue_chunyu branch which right now adds virtual edges with FET p-values to KG. I also added a test DSL call to run FET in ARAX_query.py. To run it, just enter python ARAX_query.py 500. For this run, I got an error message below:

  - 2020-05-28 03:12:46.662982 INFO: Processing action 'resultify' with parameters {'': 'true'}
  - 2020-05-28 03:12:46.663863 ERROR: Graph has an edge {'confidence': None,
 'defined_datetime': '2020-05-28 03:12:46',
 'edge_attributes': [{'name': 'Fisher Exact Test P-value',
                      'type': 'float',
                      'url': None,
                      'value': '0.03490995624353869'}],
 'evidence_type': None,
 'id': 'FET1_0',
 'is_defined_by': 'ARAX/KG1',
 'negated': None,
 'provided_by': 'ARAX/RTX',
 'publications': None,
 'qedge_id': 'FET1_e00',
 'qualifiers': None,
 'relation': 'involved_in',
 'source_id': 'UniProtKB:P35354',
 'target_id': 'GO:0034612',
 'type': 'virtual_edge',
 'weight': None} that refers to a node ID (GO:0034612) that is not in the graph

It seems like the resultify requires all node ids in the edges to have their corresponding nodes in the graph even though they are virtual edges.

dkoslicki commented 4 years ago

@chunyuma Main issues to address:

Also mentioned a TODO that is blocking on the ability to get all nodes of a certain type from all KP's we use.

chunyuma commented 4 years ago

@dkoslicki, I went through the issues you mentioned.

For the second-to-last issue about decorating existing node pairs, I saw your detailed comments that if either the KG source_id or the target_id are NOT in the KG, add the appropriate node to the KG (via appending to the message.knowledge_graph.nodes list). I think adding node to KG should be achieved later using other methods within the ComputeFTEST class but not for right now, right? (I saw you mentioned 'they do exist for the time being'). Since we assume all nodes exist, we just need to add virtual edges to KG and QG but don't need to care about the existence of nodes.

For the last issue about this argument, the reason I created this here is that I used parallel running to execute this method here which need to pass a list containing all arguments to each computation node. So this is a list. Actually it has description to describe all the possible arguments here but if needed, I can make more detailed description. In addition, I don't want this method to be called independently outside ComputeFTEST class. So do you think I should set it as private method?

chunyuma commented 4 years ago

@dkoslicki, I'm a little bit confusing about the issue 3 and issue 4. In issue 3, you mentioned adding target_qnode_id into QG, I think it should be an id "n00" or "n01" similar to what we used in add_qnode(), right? But the problem is that if we only add virtual edges but not actual target nodes to KG (as you said we assume all nodes exist in KG in issue 4). What id in KG the target_qnode_id in QG should match to? In the virtual edge, it has target_id attribute which is actually node curie. I know the node instance in KG has qnode_id attribute which can match to QG. However, it seems like we don't want to add any new nodes (target nodes) to KG right now.

dkoslicki commented 4 years ago

@chunyuma

I think adding node to KG should be achieved later using other methods within the ComputeFTEST class but not for right now, right? (I saw you mentioned 'they do exist for the time being'). Since we assume all nodes exist, we just need to add virtual edges to KG and QG but don't need to care about the existence of nodes.

Correct, for now, assume that the nodes are in the KG and QG, but do check to make sure they are and throw an error if not (recall, an edge must be connected to both a source node and a target node; it cannot "dangle"). In the future, we can investigate adding nodes if they aren't already in the KG, but for now, just throw an error if you try to add an edge and one of the nodes doesn't exist.

I'm a little bit confusing about the issue 3 and issue 4.

I am imagining something like this:

query = {"previous_message_processing_plan": {"processing_actions": [
            "create_message",
            "add_qnode(id=n00, curie=CHEMBL.COMPOUND:CHEMBL521)",
            "add_qnode(id=n01, type=protein)",
            "add_qedge(id=e00, source_id=n00, target_id=n01)",
            "add_qnode(id=n02, type=biological_process)",
            "add_qedge(id=e01, source_id=n01, target_id=n02)",
            "expand(edge_id=[e00, e01], kp=ARAX/KG1)",
            "overlay(action=fisher_exact_test, virtual_edge_type=FET1, source_qnode_id=n01, target_qnode_id=n02, cutoff=0.05)",
            "resultify()",
            "return(message=true, store=true)"

since all KG nodes inherit their respective qnode_id's, you can use this to identify which nodes you should be have a virtual edge added between them. Basically, do this, but with the p-value instead of the ngd value.

chunyuma commented 4 years ago

Thanks @dkoslicki, this makes more sense.

chunyuma commented 4 years ago

@dkoslicki, is it possible that there are more than one type of edge between the source nodes with specified qnode_id (eg "n00") and the target nodes with specified qnode_id (eg "n01")? If it is, do we need to separately calculate the FET p-values for different type of edges?

chunyuma commented 4 years ago

@dkoslicki, here are a few questions I encountered when I wrote the code:

  1. Do we need to consider 'source_qedge_id' (this might further narrow down the source nodes. For example, if the ''source_qnode_id' is n02 which presents a bunch of proteins connected to n00 which is pathway via e00 and n01 which is biological process via e01, we could use e00 to narrow down n02 to only connect to pathway)?

  2. Do we need to consider 'target_qedge_id'? This involves my previous question (is it possible that there are more than one type of edge between the source nodes with specified qnode_id (eg "n00") and the target nodes with specified qnode_id (eg "n01")? If it is, do we need to separately calculate the FET p-values for different type of edges?).

assume that the nodes are in the KG and QG, but do check to make sure they are and throw an error if not (recall, an edge must be connected to both a source node and a target node; it cannot "dangle").

  1. Why should we need to check the existence of node? If we confirmed that n01 is our source node and n02 is our target node. Then they must exist in KG and QG. If we don't plan to add any new nodes to KG and QG, all FET calculation should just base on the existence of nodes in KG and QG. So they must have already existed in the QG and KG. For example, previously if we have a source node (a protein) and target type (eg 'biological process'), we first find all target node with type of 'biological process' connect to this protein in KP. But in this case, all these target nodes with type of 'biological process' should already exist in the KG, right? And they must have an actual edge connecting them. What we need to do is to add a virtual edge with FET p-value to KG.

add_qnode(name=acetaminophen, id=n1) add_qnode(name=scabies, id=n2) expand(kp=ARAX/KG2,continue_if_no_results=True) overlay(action=compute_ngd, virtual_edge_type=NGD, source_qnode_id=n1, target_qnode_id=n2) resultify()

Based on the DSL you proposed #774, I'm thinking of a problem that if we have the following DSL, no query edge between two qnode ids, how should we calculate FET for them because we don't know which connects to which?

add_qnode(curie=[protin1, protein2, protein3], id=n1) add_qnode(curie=[bp1,bp2], id=n2) expand(kp=ARAX/KG2,continue_if_no_results=True) overlay(action=fisher_exact_test, virtual_edge_type=FET, source_qnode_id=n1, target_qnode_id=n2) resultify()

Please note that bp1 and bp2 are the pseudo curie name of biological process. This might be a rare case but might happen.

dkoslicki commented 4 years ago

@chunyuma Let's talk in more detail in person, but briefly:

Do we need to consider 'source_qedge_id' Do we need to consider 'target_qedge_id'?

Ah, I see what you mean. Lets discuss in person as I think we may need to make some DSL design choices to consider edge types.

Why should we need to check the existence of node?

Since it's good defensive programming convention to error check. Granularity of error checking is up to you, so try except etc. in the right places will accomplish this. eg here.

Based on the DSL you proposed #774, I'm thinking of a problem that if we have the following DSL, no query edge between two qnode ids, how should we calculate FET for them because we don't know which connects to which?

Let's not worry about this at all right now, since FET needs edges to exist between nodes in order to be calculated, so it won't/shouldn't be run if there are no qedges

chunyuma commented 4 years ago

@dkoslicki, I fixed the most of FIXME problems you proposed previously and also added a few example DSL commands here for testing this FET module. The FET module can pass all these tests. The virtual edge was defined based on some of rules on issue #786. So please take your time to test it and let me know what else I need to fix or add to the FET module.

Regarding the url for the virtual edge attribute you proposed here, where can I find the url for this?

chunyuma commented 4 years ago

@dkoslicki, I ran an example DSL query 6232 by removing overlay() for both ARAX/KG1 and ARAX/KG2. It works for ARAX/KG1 but fails for ARAX/KG2. I also ran it on ARAX UGI and it seems like this DSL has some problems for ARAX/KG2. Did you run it successfully? I guess the reason causing it run forever is not due to fisher_exact_test.

chunyuma commented 4 years ago

@dkoslicki, fisher_issue_chunyu branch is ready to merge into master branch. Before merging, I need to figure out the failure of the following DSL for querying ARAX/KG2.

add_qnode(id=n00, curie=CHEMBL.COMPOUND:CHEMBL521)
add_qnode(id=n01, type=protein)
add_qedge(id=e00, source_id=n00, target_id=n01)
add_qnode(id=n02, type=biological_process)
add_qedge(id=e01, source_id=n01, target_id=n02)
expand(edge_id=[e00, e01], kp=ARAX/KG2)
resultify()
edeutsch commented 4 years ago

The main problem with the DSL is that it is too unrestricted and will just blow up to 100s of results. If that same query_graph comes in pre-written into ARAX, the QueryGraphInterpreter will decide that you really should have is_set=true on n01 and that computing a Jaccard distance is appropriate to rank your biological_processes using this template: https://github.com/RTXteam/RTX/blob/9b1680c0ed79314de57729de6d3f779c7ed1460f/code/ARAX/ARAXQuery/ARAX_query_graph_interpreter_templates.yaml#L84

But as we discussed at today's call regarding #773, if you write the query_graph with DSL, then it bypasses these checks and ARAX will try to execute the DSL as written, even if it sends itself off a cliff.

chunyuma commented 4 years ago

Thanks @edeutsch, I see. So perhaps the problem raised by @dkoslicki that running this DSL with overlay(action=fisher_exact_test) is slow is not due to the fisher_exact_test.

So I think it can merge into master branch.

edeutsch commented 4 years ago

I rolled out the latest code in master to test and tried the following DSL:

add_qnode(curie=DOID:14330, id=n00)
add_qnode(type=protein, is_set=true, id=n01)
add_qnode(type=chemical_substance, id=n02)
add_qedge(source_id=n00, target_id=n01, id=e00)
add_qedge(source_id=n01, target_id=n02, id=e01, type=physically_interacts_with)
expand(edge_id=[e00,e01], kp=ARAX/KG1)
overlay(action=fisher_exact_test, source_node_id=n01, target_node_id=n02, virtual_relation_label=FET1, top_n=20)
filter_kg(action=remove_edges_by_attribute, edge_attribute=Fisher Exact Test p-value, direction=above, threshold=1e-10, remove_connected_nodes=t, qnode_id=n02)
resultify()

I'm not really sure how FET is supposed to work, but this seemed like a reasonable test. It produced a result that might be reasonable. But it took 5 minutes! Is Fisher exact test that expensive?

I also tried this and it did not complete:

add_qnode(id=n00, curie=CHEMBL.COMPOUND:CHEMBL521)
add_qnode(id=n01, type=protein, is_set=true)
add_qedge(id=e00, source_id=n00, target_id=n01)
add_qnode(id=n02, type=biological_process)
add_qedge(id=e01, source_id=n01, target_id=n02)
expand(edge_id=[e00, e01], kp=ARAX/KG2)
overlay(action=fisher_exact_test, source_node_id=n01, target_node_id=n02, virtual_relation_label=FET1)
filter_kg(action=remove_edges_by_attribute, edge_attribute=Fisher Exact Test p-value, direction=above, threshold=1e-10, remove_connected_nodes=t, qnode_id=n02)
resultify()
chunyuma commented 4 years ago

Thanks @edeutsch, I will test these DSLs and give your some feedbacks. If the target node list is not big, then Fisher exact test is fast.

chunyuma commented 4 years ago

@edeutsch and @dkoslicki, I added two more DSL examples which are two-hop and three-hop for fisher exact test to ARAX_query.py in the master branch.

In the two-hop example (its DSL is below) which is essentially the one proposed by @edeutsch today and tries to find (DOID:14330) - (protein) - (chemical_substance), I added overlay(action=fisher_exact_test) and filter_kg(action=remove_edges_by_attribute) before we expand n01(protein nodes) to n02(chemical_substance nodes). This can filter out 22 "non-representative" proteins (FET p-value>0.001) from the original 39 proteins connecting to DOID:14330 and reduce the running time from 5 minutes to around 2 minutes and 20s.

add_qnode(curie=DOID:14330, id=n00)
add_qnode(type=protein, is_set=true, id=n01)
add_qedge(source_id=n00, target_id=n01, id=e00)
expand(edge_id=e00, kp=ARAX/KG1)
overlay(action=fisher_exact_test, source_qnode_id=n00, target_qnode_id=n01, virtual_relation_label=FET1)
filter_kg(action=remove_edges_by_attribute, edge_attribute=fisher_exact_test_p-value, direction=above, threshold=0.001, remove_connected_nodes=t, qnode_id=n01)
add_qnode(type=chemical_substance, id=n02)
add_qedge(source_id=n01, target_id=n02, id=e01, type=physically_interacts_with)
expand(edge_id=e01, kp=ARAX/KG1)
overlay(action=fisher_exact_test, source_qnode_id=n01, target_qnode_id=n02, virtual_relation_label=FET2)
filter_kg(action=remove_edges_by_attribute, edge_attribute=fisher_exact_test_p-value, direction=above, threshold=0.005, remove_connected_nodes=t, qnode_id=n02)
resultify()

@edeutsch , please note that if you use the parameter top_n or cutoff in fisher exact test in your example, it only decorates or adds the virtual edge with the top(smallest) N number of FET p-values or with the p-values less than cutoff. Since the filter_kg(action=remove_edges_by_attribute) only filters the edges with FET p-values and its corresponding nodes, some edges would not be filtered out even though they have FET p-values direction=above or direction=below the threshold used in filter_kg(action=remove_edges_by_attribute) because they don't get return from overlay(action=fisher_exact_test).

overlay(action=fisher_exact_test, source_node_id=n01, target_node_id=n02, virtual_relation_label=FET1, top_n=20)
filter_kg(action=remove_edges_by_attribute, edge_attribute=Fisher Exact Test p-value, direction=above, threshold=1e-10, remove_connected_nodes=t, qnode_id=n02)
chunyuma commented 4 years ago

Hi @edeutsch and @amykglen, I'm testing using a list of source nodes to call one DSL command to find their adjacent nodes with certain type to see if it can be faster than looping over them to call DSL command one by one. But I find some problems.

Let say, I used the latest code of ARAX_query.py to call the following DSL first to find the paths: DOID:14330 - protein - chemical_substance in KG1.

add_qnode(curie=DOID:14330, id=n00)
add_qnode(type=protein, is_set=true, id=n01)
add_qedge(source_id=n00, target_id=n01, id=e00)
expand(edge_id=e00, kp=ARAX/KG1)
add_qnode(type=chemical_substance, id=n02)
add_qedge(source_id=n01, target_id=n02, id=e01, type=physically_interacts_with)
expand(edge_id=e01, kp=ARAX/KG1)
resultify()

This result generated a knowledge graph with protein node list ['UniProtKB:P02675', 'UniProtKB:P01903', 'UniProtKB:P09601', 'UniProtKB:Q02878', 'UniProtKB:P01375', 'UniProtKB:Q9BXM7', 'UniProtKB:P05181', 'UniProtKB:I3WAC9', 'UniProtKB:P50914', 'UniProtKB:P62241', 'UniProtKB:O15217', 'UniProtKB:P09488', 'UniProtKB:P15559', 'UniProtKB:P06213', 'UniProtKB:Q13330', 'UniProtKB:P38646', 'UniProtKB:P05231', 'UniProtKB:Q9BRL8', 'UniProtKB:P01344', 'UniProtKB:Q99683', 'UniProtKB:P11717', 'UniProtKB:P01308', 'UniProtKB:P10635', 'UniProtKB:P23560', 'UniProtKB:Q01959', 'UniProtKB:P21728', 'UniProtKB:O60260', 'UniProtKB:Q9NQ11', 'UniProtKB:P21397', 'UniProtKB:P10636', 'UniProtKB:P08183', 'UniProtKB:P09211', 'UniProtKB:P04062', 'UniProtKB:P14416', 'UniProtKB:Q5S007', 'UniProtKB:Q99497', 'UniProtKB:P27338', 'UniProtKB:P37840', 'UniProtKB:P08069']

Then I tried to find the paths only between the above proteins and chemical_substance in KG1, which is the second hop in above case, by using the following DSL command:

add_qnode(curie=[UniProtKB:P02675,UniProtKB:P01903,UniProtKB:P09601,UniProtKB:Q02878,UniProtKB:P01375,UniProtKB:Q9BXM7,UniProtKB:P05181,UniProtKB:I3WAC9,UniProtKB:P50914,UniProtKB:P62241,UniProtKB:O15217,UniProtKB:P09488,UniProtKB:P15559,UniProtKB:P06213,UniProtKB:Q13330,UniProtKB:P38646,UniProtKB:P05231,UniProtKB:Q9BRL8,UniProtKB:P01344,UniProtKB:Q99683,UniProtKB:P11717,UniProtKB:P01308,UniProtKB:P10635,UniProtKB:P23560,UniProtKB:Q01959,UniProtKB:P21728,UniProtKB:O60260,UniProtKB:Q9NQ11,UniProtKB:P21397,UniProtKB:P10636,UniProtKB:P08183,UniProtKB:P09211,UniProtKB:P04062,UniProtKB:P14416,UniProtKB:Q5S007,UniProtKB:Q99497,UniProtKB:P27338,UniProtKB:P37840,UniProtKB:P08069], is_set=true, id=n00)
add_qnode(type=chemical_substance, id=n01)
add_qedge(source_id=n00, target_id=n01, id=e00, type=physically_interacts_with)
expand(edge_id=e00,kp=ARAX/KG1)
resultify()
return(message=true, store=false)

It failed with error message ERROR: Returned answer KG does not contain any results for QNode n00 and also had information about 'INFO: Could not find any equivalent curies' for multiple curie ids (eg. UniProtKB:P02675, UniProtKB:P01903). I'm wondering why it kept saying no found for these curies because they should be found in KG1 based on the first DSL command.

I also tried just using the first 10 curie ids in the protein node list which is ['UniProtKB:P02675', 'UniProtKB:P01903', 'UniProtKB:P09601', 'UniProtKB:Q02878', 'UniProtKB:P01375', 'UniProtKB:Q9BXM7', 'UniProtKB:P05181', 'UniProtKB:I3WAC9', 'UniProtKB:P50914', 'UniProtKB:P62241'] to call the second DSL and it works.

Could you tell me what is wrong with my second DSL command and how to fix it? Thank you!

amykglen commented 4 years ago

Hmm, where are you running this DSL, @chunyuma? When I run your second DSL:

add_qnode(curie=[UniProtKB:P02675,UniProtKB:P01903,UniProtKB:P09601,UniProtKB:Q02878,UniProtKB:P01375,UniProtKB:Q9BXM7,UniProtKB:P05181,UniProtKB:I3WAC9,UniProtKB:P50914,UniProtKB:P62241,UniProtKB:O15217,UniProtKB:P09488,UniProtKB:P15559,UniProtKB:P06213,UniProtKB:Q13330,UniProtKB:P38646,UniProtKB:P05231,UniProtKB:Q9BRL8,UniProtKB:P01344,UniProtKB:Q99683,UniProtKB:P11717,UniProtKB:P01308,UniProtKB:P10635,UniProtKB:P23560,UniProtKB:Q01959,UniProtKB:P21728,UniProtKB:O60260,UniProtKB:Q9NQ11,UniProtKB:P21397,UniProtKB:P10636,UniProtKB:P08183,UniProtKB:P09211,UniProtKB:P04062,UniProtKB:P14416,UniProtKB:Q5S007,UniProtKB:Q99497,UniProtKB:P27338,UniProtKB:P37840,UniProtKB:P08069], is_set=true, id=n00)
add_qnode(type=chemical_substance, id=n01)
add_qedge(source_id=n00, target_id=n01, id=e00, type=physically_interacts_with)
expand(edge_id=e00,kp=ARAX/KG1)
resultify()

on https://arax.rtx.ai/beta/, it looks like it errors out on add_qnode:

Screen Shot 2020-06-13 at 3 27 16 PM
amykglen commented 4 years ago

And the messages you saw about 'INFO: Could not find any equivalent curies..' just mean that no synonyms were identified for that particular curie (nothing to worry about).

amykglen commented 4 years ago

Although I do see the error you originally reported when I run the code from the terminal in master, @chunyuma . I'll look into what's going on...

chunyuma commented 4 years ago

Hi @amykglen, yes, I ran them on terminal. I added them to ARAX_query.py here for you to test what's going on. Thanks!

amykglen commented 4 years ago

nice, thanks! ok, just pushed a fix for this to master, and those two tests are passing now.

dkoslicki commented 4 years ago

Feel free to add those tests to test\test_ARAX_overlay.py @chunyuma !

chunyuma commented 4 years ago

Awesome @amykglen, thank you!

chunyuma commented 4 years ago

Hi @amykglen, I found one more problem for the curie list.

For the test in this time, I tried to find the diseases associated with these proteins (['UniProtKB:P02675', 'UniProtKB:P01903', 'UniProtKB:P09601', 'UniProtKB:Q02878', 'UniProtKB:P01375', 'UniProtKB:Q9BXM7', 'UniProtKB:P05181', 'UniProtKB:I3WAC9', 'UniProtKB:P50914', 'UniProtKB:P62241', 'UniProtKB:O15217', 'UniProtKB:P09488', 'UniProtKB:P15559', 'UniProtKB:P06213', 'UniProtKB:Q13330', 'UniProtKB:P38646', 'UniProtKB:P05231', 'UniProtKB:Q9BRL8', 'UniProtKB:P01344', 'UniProtKB:Q99683', 'UniProtKB:P11717', 'UniProtKB:P01308', 'UniProtKB:P10635', 'UniProtKB:P23560', 'UniProtKB:Q01959', 'UniProtKB:P21728', 'UniProtKB:O60260', 'UniProtKB:Q9NQ11', 'UniProtKB:P21397', 'UniProtKB:P10636', 'UniProtKB:P08183', 'UniProtKB:P09211', 'UniProtKB:P04062', 'UniProtKB:P14416', 'UniProtKB:Q5S007', 'UniProtKB:Q99497', 'UniProtKB:P27338', 'UniProtKB:P37840', 'UniProtKB:P08069']) by using the following DSL.

add_qnode(curie=[UniProtKB:P02675,UniProtKB:P01903,UniProtKB:P09601,UniProtKB:Q02878,UniProtKB:P01375,UniProtKB:Q9BXM7,UniProtKB:P05181,UniProtKB:I3WAC9,UniProtKB:P50914,UniProtKB:P62241,UniProtKB:O15217,UniProtKB:P09488,UniProtKB:P15559,UniProtKB:P06213,UniProtKB:Q13330,UniProtKB:P38646,UniProtKB:P05231,UniProtKB:Q9BRL8,UniProtKB:P01344,UniProtKB:Q99683,UniProtKB:P11717,UniProtKB:P01308,UniProtKB:P10635,UniProtKB:P23560,UniProtKB:Q01959,UniProtKB:P21728,UniProtKB:O60260,UniProtKB:Q9NQ11,UniProtKB:P21397,UniProtKB:P10636,UniProtKB:P08183,UniProtKB:P09211,UniProtKB:P04062,UniProtKB:P14416,UniProtKB:Q5S007,UniProtKB:Q99497,UniProtKB:P27338,UniProtKB:P37840,UniProtKB:P08069], is_set=true, id=n00)
add_qnode(type=disease, id=n01)
add_qedge(source_id=n00, target_id=n01, id=e00)
expand(edge_id=e00,kp=ARAX/KG1)
resultify()

From the result, I found none disease connected to the protein UniProtKB:Q13330. However, I found three diseases (eg. DOID:12337, DOID:289, DOID:14330) associated with this protein if I only used UniProtKB:Q13330 to call DSL which is as follow:

add_qnode(curie=UniProtKB:Q13330, id=n00)
add_qnode(type=disease, id=n01)
add_qedge(source_id=n00, target_id=n01, id=e00)
expand(edge_id=e00,kp=ARAX/KG1)
resultify()
{'confidence': None,
 'defined_datetime': None,
 'edge_attributes': None,
 'evidence_type': None,
 'id': 'UniProtKB:Q13330--gene_associated_with_condition--DOID:12337',
 'is_defined_by': 'ARAX/KG1',
 'negated': None,
 'provided_by': 'DisGeNet',
 'publications': None,
 'qedge_ids': ['e00'],
 'qualifiers': None,
 'relation': 'gene_associated_with_condition',
 'source_id': 'UniProtKB:Q13330',
 'target_id': 'DOID:12337',
 'type': 'gene_associated_with_condition',
 'weight': None}
{'confidence': None,
 'defined_datetime': None,
 'edge_attributes': None,
 'evidence_type': None,
 'id': 'UniProtKB:Q13330--gene_associated_with_condition--DOID:289',
 'is_defined_by': 'ARAX/KG1',
 'negated': None,
 'provided_by': 'BioLink',
 'publications': None,
 'qedge_ids': ['e00'],
 'qualifiers': None,
 'relation': 'associated_with_disease',
 'source_id': 'UniProtKB:Q13330',
 'target_id': 'DOID:289',
 'type': 'gene_associated_with_condition',
 'weight': None}
{'confidence': None,
 'defined_datetime': None,
 'edge_attributes': None,
 'evidence_type': None,
 'id': 'UniProtKB:Q13330--gene_associated_with_condition--DOID:14330',
 'is_defined_by': 'ARAX/KG1',
 'negated': None,
 'provided_by': 'BioLink',
 'publications': None,
 'qedge_ids': ['e00'],
 'qualifiers': None,
 'relation': 'associated_with_disease',
 'source_id': 'UniProtKB:Q13330',
 'target_id': 'DOID:14330',
 'type': 'gene_associated_with_condition',
 'weight': None}

Could you please help me figure out why this is difference?

chunyuma commented 4 years ago

Hi all, I'm wondering if the ARAX User Interface https://arax.rtx.ai/devLM/ is crashed down. It always says An error was encountered while contacting the server (SyntaxError: Unexpected end of JSON input).

edeutsch commented 4 years ago

I don't recommend using /devLM. This is Luis's dev area and may be mid-development or not synced or whatever. Unless you're testing something of Luis's, I suggest staying /test or /beta

chunyuma commented 4 years ago

@edeutsch I see, thanks! But it seems like /test or /beta still can't accept using curie list for add_qnode

edeutsch commented 4 years ago

They can, and it worked for me. But I did notice that the error message when one of the curies is not in the KG was poor. So I fixed that. This DSL now generates a better error message. and if you remove the offending curie, it runs successfully:

add_qnode(curie=[UniProtKB:P02675,UniProtKB:P01903,UniProtKB:P09601,UniProtKB:Q02878,UniProtKB:P01375,UniProtKB:Q9BXM7,UniProtKB:P05181,UniProtKB:I3WAC9,UniProtKB:P50914,UniProtKB:P62241,UniProtKB:O15217,UniProtKB:P09488,UniProtKB:P15559,UniProtKB:P06213,UniProtKB:Q13330,UniProtKB:P38646,UniProtKB:P05231,UniProtKB:Q9BRL8,UniProtKB:P01344,UniProtKB:Q99683,UniProtKB:P11717,UniProtKB:P01308,UniProtKB:P10635,UniProtKB:P23560,UniProtKB:Q01959,UniProtKB:P21728,UniProtKB:O60260,UniProtKB:Q9NQ11,UniProtKB:P21397,UniProtKB:P10636,UniProtKB:P08183,UniProtKB:P09211,UniProtKB:P04062,UniProtKB:P14416,UniProtKB:Q5S007,UniProtKB:Q99497,UniProtKB:P27338,UniProtKB:P37840,UniProtKB:P08069], is_set=true, id=n00)
add_qnode(type=chemical_substance, id=n01)
add_qedge(source_id=n00, target_id=n01, id=e00, type=physically_interacts_with)
expand(edge_id=e00,kp=ARAX/KG1)
resultify()
filter_results(action=limit_number_of_results, max_results=50)
chunyuma commented 4 years ago

Hi @edeutsch, I tried it and it did work. You're true that I did get an error message saying A node with CURIE UniProtKB:I3WAC9 is not in our knowledge graph. However, if you run the following DSL which can generate a graph with UniProtKB:I3WAC9 from KG1. So I'm wondering why it would say A node with CURIE UniProtKB:I3WAC9 is not in our knowledge graph.

add_qnode(curie=DOID:14330, id=n00)
add_qnode(type=protein,is_set=true, id=n01)
add_qedge(source_id=n00, target_id=n01, id=e00)
expand(edge_id=e00, kp=ARAX/KG1)
resultify()
Screen Shot 2020-06-15 at 1 20 03 AM
amykglen commented 4 years ago

hey @chunyuma - just pushed a fix to master for this bug you reported. that dsl now produces the proper three diseases for UniProtKB:Q13330.

isbluis commented 4 years ago

I don't recommend using /devLM. This is Luis's dev area and may be mid-development or not synced or whatever. Unless you're testing something of Luis's, I suggest staying /test or /beta

devLM should be good now; it was left pointing to a non-working API after my previous commit.

edeutsch commented 4 years ago

@chunyuma (and relevant to others): regarding UniProtKB:I3WAC9. the reason it is not found is odd. add_qnode() always uses KG2 since it is has no guidance on which KG to use. And since KG2 is supposed to include all of KG1 (I thought?) this should be ideal. concepts are mapped to KG1 when possible, or that was the idea.

However, I find that NodeNamesDescriptions_KG1.tsv contains UniProtKB:I3WAC9, but strangely NodeNamesDescriptions_KG2.tsv does NOT contain UniProtKB:I3WAC9. Why is that? that is the source of the problem. I didn't think that should be possible.

amykglen commented 4 years ago

has the server's KGNodeIndex been updated at all recently, @edeutsch? because when I run add_qnode locally with that curie, it works fine. and I see the node is indeed in kg2endpoint2:

Screen Shot 2020-06-15 at 12 12 22 AM

(if it hasn't been updated since #705 was completed, that could explain the discrepancy.)

chunyuma commented 4 years ago

Thanks @amykglen

chunyuma commented 4 years ago

@chunyuma (and relevant to others): regarding UniProtKB:I3WAC9. the reason it is not found is odd. add_qnode() always uses KG2 since it is has no guidance on which KG to use. And since KG2 is supposed to include all of KG1 (I thought?) this should be ideal. concepts are mapped to KG1 when possible, or that was the idea.

However, I find that NodeNamesDescriptions_KG1.tsv contains UniProtKB:I3WAC9, but strangely NodeNamesDescriptions_KG2.tsv does NOT contain UniProtKB:I3WAC9. Why is that? that is the source of the problem. I didn't think that should be possible.

Perhaps we can discuss this problem again on this Wednesday. As we discussed on last Friday mini-hackaton meeting, Fisher's exact test depends on the detected KP in the knowledge graph which has the maximum number of edges connected to both source nodes and target nodes. If the detected KP has one, this should not be a problem. But if more than one, the KP with the max number of edges in the knowledge graph will be used to calculate FET. So if KG2 might include all nodes in KG1, perhaps we should use KG2 always. How do you think @dkoslicki?

edeutsch commented 4 years ago

@amykglen, aha that would explain it. KGNodeIndex has not been updated on the main server since March. it did not occur to me to update again after #720. It would be good if we send around an announcement to everyone to upgrade their KGNodeIndexes whenever the KGs change.

chunyuma commented 4 years ago

Based on the comment by @edeutsch here, since it might have situation that some nodes might exist in one KP but not in other KPs, I'm thinking if in the Fisher Exact Test we should not add an virtual edge for the target nodes not in the KP that we used to calculate the FET but just give a warning saying this node might not in KP.

amykglen commented 4 years ago

@amykglen, aha that would explain it. KGNodeIndex has not been updated on the main server since March. it did not occur to me to update again after #720. It would be good if we send around an announcement to everyone to upgrade their KGNodeIndexes whenever the KGs change.

very true - that should probably be part of the routine for rolling out a new KG2 build to 'production' (to give everyone a head's up to update their KGNodeIndexes)

chunyuma commented 4 years ago

@edeutsch and @amykglen, sorry about keeping reporting problem. But I did notice one more strange problem when I tested Fisher Exact Test.

If I called the following DSL, each disease with prefix "OMIM" have 2 or 3 associated proteins in the returned message KG.

add_qnode(curie=[OMIM:616004,OMIM:202400,OMIM:610424,OMIM:614034,OMIM:606963,OMIM:607136,DOID:14330], id=n00)
add_qnode(type=protein, is_set=true id=n01)
add_qedge(source_id=n00, target_id=n01, id=e00)
expand(edge_id=e00,kp=ARAX/KG1)
resultify()

However, if I called DSL with a single OMIM disease or any combination of them, it reported No paths were found in KG1 satisfying this query graph. For example:

add_qnode(curie=[OMIM:616004,OMIM:202400,OMIM:610424,OMIM:614034,OMIM:606963,OMIM:607136], id=n00)
add_qnode(type=protein, is_set=true id=n01)
add_qedge(source_id=n00, target_id=n01, id=e00)
expand(edge_id=e00,kp=ARAX/KG1)
resultify()

or

add_qnode(curie=OMIM:616004, id=n00)
add_qnode(type=protein, is_set=true id=n01)
add_qedge(source_id=n00, target_id=n01, id=e00)
expand(edge_id=e00,kp=ARAX/KG1)
resultify()
amykglen commented 4 years ago

very interesting... huh, ok - it looks like this has to do with the type that add_qnode assigns to node n00 in the different cases:

when you do this:

add_qnode(curie=[OMIM:616004,OMIM:202400,OMIM:610424,OMIM:614034,OMIM:606963,OMIM:607136,DOID:14330], id=n00)

the created qnode is given a type of disease:

{
   'id':'n00',
   'curie':['OMIM:616004', 'OMIM:202400', 'OMIM:610424', 'OMIM:614034', 'OMIM:606963', 'OMIM:607136', 'DOID:14330'],
   'type':'disease',
   'is_set':True
}

but when you do this (only the OMIM curies):

add_qnode(curie=[OMIM:616004,OMIM:202400,OMIM:610424,OMIM:614034,OMIM:606963,OMIM:607136], id=n00)

the created qnode is given a type of phenotypic_feature:

{
   'id':'n00',
   'curie':['OMIM:616004', 'OMIM:202400', 'OMIM:610424', 'OMIM:614034', 'OMIM:606963', 'OMIM:607136'],
   'type':'phenotypic_feature',
   'is_set':True
}

but those OMIM nodes are diseases in KG1, which is why Expand doesn't find any results for the OMIM-only query:

Screen Shot 2020-06-15 at 5 24 15 PM

and those same OMIM nodes are phenotypic_features in KG2, which explains where add_qnode gets that type from (because it uses KGNodeIndex, which uses KG2):

Screen Shot 2020-06-15 at 5 23 23 PM

(and no worries about reporting issues, @chunyuma! it's been very helpful for working out kinks in the code.)

amykglen commented 4 years ago

so I guess we may need to adjust how add_qnode assigns types to nodes, @edeutsch?

do we really need to assign a type to a qnode if it has a curie specified?

edeutsch commented 4 years ago

It seemed like a good idea at the time. For each add_qnode(), the code looks up the curie in the KGNodeIndex and then sets the type to what it finds. If there's no curie found, then there's no type to add.

I do not recall why it seemed like a good idea to add a type when it was not specified. I seem to recall one error message where BTE said it needed a type.

but in any case, it is an easy fix to NOT insert the type when a curie is found if that helps. do you want me to make that change? I do not know the repercussions of that. Maybe nothing bad, I don't know.

amykglen commented 4 years ago

yeah, if add_qnode no longer adds the type, then I should probably adjust Expand so that it tries to find a type for BTE expansions, since BTE requires them. But KG1/2 Expands don't require a type... though they would be a bit slower without them.

I suppose it would be good to make this change... unless there are other ways to address the problem?

edeutsch commented 4 years ago

okay, new behavior of add_qnode() if a curie is provided is as follows:

This is checked into master and rolled out to /test and /beta

chunyuma commented 4 years ago

Thanks @amykglen and @edeutsch

chunyuma commented 4 years ago

Does the node type become a list? I just updated the latest code and found that node type has already become a list. So a node might have more than one type?

edeutsch commented 4 years ago

knowledge_graph node types have been a list for a long time (at least a year). But query_graph qnode types are not a list, always a scalar. Although it is an interesting question to ask why not.