RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License
33 stars 21 forks source link

ARAX_query.py was successfully executed but returned empty message object to client #735

Closed chunyuma closed 4 years ago

chunyuma commented 4 years ago

Hello David and Eric,

I ran the current version of ARAX_query.py with the command line python ARAX_query.py 500 which queries the answer by

query = {"previous_message_processing_plan": {"processing_actions": [ "create_message", "add_qnode(name=hypertension, id=n00)", "add_qnode(type=protein, is_set=True, id=n01)", "add_qedge(source_id=n01, target_id=n00, id=e00)", "expand(edge_id=e00)", "overlay(action=compute_ngd)", "resultify(ignore_edge_direction=false)", "return(message=true, store=false)", ]}}

I found that the query was successfully executed but not returned to client. I checked the code carefully and it seems like the problem is because the completed message isn't assigned to self.message in executeProcessingPlan function.

In addition, the data structure of message has some problems in line 1038 because x doesn't have essence attribute. Could you please me check these errors?

Thank you!

edeutsch commented 4 years ago

Hi Chunyu, can you check if you're using the latest code in the master branch? Some bugs like the first problem you mention have been fixed already. I suspect (but am not sure) that if you update to the latest code in the master branch, your problem will go away?

Also, line 1038 in the latest code in master is not executed code, so again this points to not using the latest code.

Would you try to update and report? thanks!

chunyuma commented 4 years ago

Hi Eric, thanks for your notice that you updated the code in the master branch yesterday. I just tested the latest code in the master branch but the bugs haven't been fixed yet. Actually, the line 1038 in the fisher_issue_chunyu branch is not the line 1038 in the master but the line 1030. So this line is still executed in the latest code.

edeutsch commented 4 years ago

I found two more locations where the message was not correctly being placed into the parent object. I have pushed those fixes to master. Does that address the problem you're seeing?

chunyuma commented 4 years ago

Eric, I just tested it but it hasn't solved the problem yet. I think the message should be placed into the parent object in line 540, right? Since there is a for loop to create a query graph and return the results, the completed message should be returned to parent object after the for loop.

edeutsch commented 4 years ago

Maybe. It already happens at 467 and shouldn't need to happen again unless there is another message = line, which I don't see. Unless some called method replaces the message instead of changing it. But that would cause all sorts of problems, so we don't want that. So maybe this bug is indicative of something else bad happening?

Feel free to try putting what you think will fix the problem at 540, but I don't see at the moment why that should be needed. For the most part, message should be operated on in place rather than replaced.

chunyuma commented 4 years ago

I don't know if I understand the code correctly. If my understanding is wrong, please help me figure it out. Based on my understanding, the parent message placed at 467 was just initialized. But within each loop, the message was updated but the self.message was not changed. So I'm wondering if self.message should also be updated after the for loop. I tried to print out the message and self.message at 540 and found that message was updated but self.message is the same as 467.

edeutsch commented 4 years ago

So at 467 there is a message that can come from various places and the pointer to message is placed into self.message so they are the same object. From then on, all code should operate on the message contents without changing the pointer. The exceptions are 481 and 513, which is why they are followed by putting message back into self.message.

It's possible that some called method is also doing the same thing. But it shouldn't. If it is, then that's the bug that should be fixed I think.

This is not usually a problem. So I'm wondering if somewhere in your new code or code that you're testing you're replacing message?

It's hard for me to know without wading into the problem and testing. If you can't figure it out, your can check everything into your branch and send me a repro.

chunyuma commented 4 years ago

Thanks @edeutsch. Could you please try running command line python ARAX_query.py 103 at your end using the latest code and see if you can get the correct return? I ran it and got the same problem as I saw in my code. Your result will help me figure out if this is the problem from my new code. Thank you!

edeutsch commented 4 years ago

okay, I ran python3 ARAX_query.py 103 using the latest code in master and see this: ... Number of results: 0 Essence names in the answers: [] All edge attribute names: set() Number of KnowledgeProviders in KG: Counter()

which is not good, I agree. But looking at 103, there is no resultify() command, so naturally there are no results. There should always be a resultify(). So I changed 103 to: query = {"previous_message_processing_plan": {"processing_actions": [ "create_message", "add_qnode(name=DOID:1227, id=n00)", "add_qnode(type=chemical_substance, is_set=true, id=n01)", "add_qedge(source_id=n00, target_id=n01, id=e00)", "expand(edge_id=e00)", "overlay(action=add_node_pmids, max_num=15)", "filter_kg(action=remove_nodes_by_property, node_property=uri, property_value=https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL2111164)", "resultify()", "return(message=true, store=false)" ]}}

I ran it again, and now see this error:

But I suppose that is missing because the filter_kg() call is removing that.

I don't understand what 103 is supposed to be doing. Do we know who wrote 103 and whether it should work?

The DSL looks weirdly flawed to me. Not obvious that the ARAX code itself is flawed.

So, as written 103 completes okay for me and looking at the knowledge_graph it looks okay, but there are no results because resultify() is not run.

Does anyone know more about 103? I don't.

chunyuma commented 4 years ago

Thank you for testing, @edeutsch. It seems like 103 indeed has problems. But I also tested 7 by adding resultify() to it and running it with python ARAX_query.py 7. So it was changed to

query = {"previous_message_processing_plan": {"processing_actions": [ "create_message", "add_qnode(curie=DOID:14330, id=n00)", # parkinsons "add_qnode(type=protein, is_set=True, id=n01)", "add_qnode(type=chemical_substance, is_set=true, id=n02)", "add_qedge(source_id=n01, target_id=n00, id=e00)", "add_qedge(source_id=n01, target_id=n02, id=e01)", "expand(edge_id=[e00,e01])", "overlay(action=compute_jaccard, start_node_id=n00, intermediate_node_id=n01, end_node_id=n02, virtual_edge_type=J1)", "resultify()", "return(message=true, store=false)", ]}}

This DSL should be fine because it has 1138 nodes and 4879 edges.

However, it seems like the following code lines have some problems because of only 1 result and none return.

1028 print(f"Number of results: {len(message.results)}"), 1032 print(f"Essence names in the answers: {[x.essence for x in message.results]}") 1213 print(f"message id: {json.dumps(ast.literal_eval(repr(message.id)), sort_keys=True, indent=2)}")

Here is the return:

Number of results: 1 Essence names in the answers: [None] All edge attribute names: {'probability', 'jaccard_index'} Number of KnowledgeProviders in KG: Counter({'ChEMBL': 3020, 'ARAX/RTX': 1119, 'Pharos': 722, 'DisGeNet': 13, 'BioLink': 5}) message id: null

Could you help me take a look these a few lines? Thank you!

amykglen commented 4 years ago

Hi Chunyu! For python ARAX_query.py 7, I think the DSL may be flawed - I don't think qnode n02 should have is_set=true - so if you adjust the query to look like this:

query = {"previous_message_processing_plan": {"processing_actions": [
"create_message",
"add_qnode(curie=DOID:14330, id=n00)", # parkinsons
"add_qnode(type=protein, is_set=True, id=n01)",
"add_qnode(type=chemical_substance, id=n02)",
"add_qedge(source_id=n01, target_id=n00, id=e00)",
"add_qedge(source_id=n01, target_id=n02, id=e01)",
"expand(edge_id=[e00,e01])",
"overlay(action=compute_jaccard, start_node_id=n00, intermediate_node_id=n01, end_node_id=n02, virtual_edge_type=J1)",
"resultify()",
"return(message=true, store=false)",
]}}

It then produces 1119 results (rather than 1).

amykglen commented 4 years ago

I think a similar issue may be happening with the original query you posted, @chunyuma - I think qnode n01 probably shouldn't have is_set=True. If I remove it so that the query looks like this:

query = {"previous_message_processing_plan": {"processing_actions": [
    "create_message",
    "add_qnode(name=hypertension, id=n00)",
    "add_qnode(type=protein, id=n01)",
    "add_qedge(source_id=n01, target_id=n00, id=e00)",
    "expand(edge_id=e00)",
    "overlay(action=compute_ngd)",
    "resultify(ignore_edge_direction=false)",
    "return(message=true, store=false)", ]}}

resultify produces 79 results (rather than 1). (When I run it in the master branch.)

edeutsch commented 4 years ago

Thanks, @amykglen. This underscores the is_set=true problem. My added code (#737) forces intermediate nodes to be is_set=true (only for incoming query_graphs, NOT for explicit DSL). I wonder if it should remove is_set=true for end nodes, where it probably (almost?) never belongs as well? Unless the user really understands it and uses it properly, it seems very easy to go astray with it.

chunyuma commented 4 years ago

Thanks so much for figuring out the problem, @amykglen. It works right now. I think it will be helpful if the program can automatically determine if it is_set=true I'm wondering if there are any files or visualization graph to check if it should be is_set=true or not.

chunyuma commented 4 years ago

I think this issue can be closed, right?

edeutsch commented 4 years ago

fine to close as far as I'm concerned.