RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

Apparently recoverable InternalError on ARAX "kitchen sink" query #1734

Open saramsey opened 2 years ago

saramsey commented 2 years ago

I ran this DSL query on arax-backup.rtx.ai (the production dev-area) last night:

add_qnode(name=arthritis, key=n00)
add_qnode(categories=biolink:Protein, is_set=true, key=n01)
add_qnode(categories=biolink:ChemicalEntity, key=n02)
add_qedge(subject=n00, object=n01, key=e00)
add_qedge(subject=n01, object=n02, key=e01, predicates=biolink:physically_interacts_with)
expand(edge_key=[e00,e01], kp=RTX-KG2)
overlay(action=overlay_clinical_info, observed_expected_ratio=true, virtual_relation_label=C1, subject_qnode_key=n00, object_qnode_key=n02)
filter_kg(action=remove_edges_by_continuous_attribute, edge_attribute=probably_treats, direction=below, threshold=.8, remove_connected_nodes=t, qnode_keys=[n02])
overlay(action=compute_jaccard, start_node_key=n00, intermediate_node_key=n01, end_node_key=n02, virtual_relation_label=J1)
overlay(action=predict_drug_treats_disease, subject_qnode_key=n02, object_qnode_key=n00, virtual_relation_label=P1)
resultify(ignore_edge_direction=true)
filter_results(action=limit_number_of_results, max_results=15)
return(message=true, store=false)

[I adapted this query from the test test_kitchen_sink_api in the test module RTX/code/ARAX/test_production_api/test_ARAX_api.py.]

So, when I ran it, I saw an interesting entry in the logfile from the query_controller child process (attached), which seemed to occur in the ranker (?):

2021-11-16T06:56:23.056243 WARNING: [] QueryGraph appears to be circular or has a strange geometry. This might cause trouble
2021-11-16T06:56:23.056301 ERROR: [InteralError_F260] Reached loop max: 21

Is this expected? It seemed to recover from the error and continue on its way, though in this case there were no results in the end.

From the interpreted DSL, the query graph was:

2021-11-16T06:56:23.058652 DEBUG: [] Query graph is {'edges': {'C1': {'constraints': [],
                  'exclude': None,
                  'object': 'n02',
                  'option_group_id': None,
                  'predicates': 'biolink:has_real_world_evidence_of_association_with',
                  'subject': 'n00'},
           'J1': {'constraints': [],
                  'exclude': None,
                  'object': 'n02',
                  'option_group_id': None,
                  'predicates': ['biolink:has_jaccard_index_with'],
                  'subject': 'n00'},
           'P1': {'constraints': [],
                  'exclude': None,
                  'object': 'n00',
                  'option_group_id': None,
                  'predicates': 'biolink:probably_treats',
                  'subject': 'n02'},
           'e00': {'constraints': [],
                   'exclude': False,
                   'object': 'n01',
                   'option_group_id': None,
                   'predicates': None,
                   'subject': 'n00'},
           'e01': {'constraints': [],
                   'exclude': False,
                   'object': 'n02',
                   'option_group_id': None,
                   'predicates': ['biolink:physically_interacts_with'],
                   'subject': 'n01'}},
 'nodes': {'n00': {'categories': None,
                   'constraints': [],
                   'ids': ['MONDO:0005578'],
                   'is_set': False,
                   'name': None,
                   'option_group_id': None},
           'n01': {'categories': ['biolink:Protein'],
                   'constraints': [],
                   'ids': None,
                   'is_set': True,
                   'name': None,
                   'option_group_id': None},
           'n02': {'categories': ['biolink:ChemicalEntity'],
                   'constraints': [],
                   'ids': None,
                   'is_set': False,
                   'name': None,
                   'option_group_id': None}}}

arax-query-controller-child-process-gc92r_dt.log.gz

saramsey commented 2 years ago

Tagging @finnagin and @dkoslicki for their take. And since the logfile entry referenced a weird graph geometry, tagging @amykglen as well.

edeutsch commented 2 years ago

This is a problem in the query_graph_info module, but I don't know what the cause is. It looks like a simple query graph.

edeutsch commented 2 years ago

If query_graph_info is invoked after all the overlay steps, it may be getting confused by virtual edge C1. We could add biolink:has_real_world_evidence_of_association_with as an edge to be ignored at line 130. BUT, I vaguely think that biolink:has_real_world_evidence_of_association_with may sometimes be a real edge, too. We may want to devise a more reliable way to annotate virtual edges so that ARAX can ignore them better, both in query_graph_info and query_graph_interpreter.
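
To make the "confused by virtual edges" idea concrete, here is a minimal sketch (not the actual query_graph_info code) that takes the edge keys and endpoints from the query graph logged above and checks for a cycle with and without the overlay edges. The `has_cycle` helper and the union-find approach are assumptions chosen for illustration. Treating e00 and e01 as the only "real" edges follows from the `expand(edge_key=[e00,e01])` step in the DSL.

```python
from typing import Dict, Iterable, Tuple

# Edge keys and (subject, object) pairs copied from the logged query graph.
QEDGES: Dict[str, Tuple[str, str]] = {
    "e00": ("n00", "n01"),
    "e01": ("n01", "n02"),
    "C1": ("n00", "n02"),
    "J1": ("n00", "n02"),
    "P1": ("n02", "n00"),
}


def has_cycle(edges: Iterable[Tuple[str, str]]) -> bool:
    """Return True if the edges, treated as undirected, contain a cycle."""
    parent: Dict[str, str] = {}

    def find(node: str) -> str:
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for subject, obj in edges:
        root_s, root_o = find(subject), find(obj)
        if root_s == root_o:
            return True  # endpoints were already connected, so this edge closes a loop
        parent[root_s] = root_o
    return False


# All five edges, including the overlay edges C1, J1 and P1.
all_edges_cyclic = has_cycle(QEDGES.values())
# Only the edges that were actually expanded.
real_edges_cyclic = has_cycle(v for k, v in QEDGES.items() if k in {"e00", "e01"})
print(all_edges_cyclic, real_edges_cyclic)  # True False
```

With the overlay edges included, n00 and n02 are connected both through n01 and directly, which would be consistent with the "circular or has a strange geometry" warning. With only e00 and e01 the graph is a plain two-hop path.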

saramsey commented 2 years ago

FWIW, KG2.7.4 has 1,027 edges with the biolink:has_real_world_evidence_of_association_with predicate:

edeutsch commented 2 years ago

Ah, if the same predicate can be used for both a virtual edge and a real edge, it will cause some problems here, and it is all the more important to devise a more reliable way to annotate virtual edges.

finnagin commented 2 years ago

So does query_graph_info always get called after the DSL is run? I thought it was something that was called first, before the DSL.

edeutsch commented 2 years ago

I was kinda wondering the same thing. I think I originally wrote it to inspect the incoming query_graph. But then somehow I later inferred that it was being run by resultify, perhaps to decide what the essence of a result is. In short, I don't know, but it seems so.

dkoslicki commented 2 years ago

FWIW, since CHP/ICEES/COHD all want to use the biolink:has_real_world_evidence_of_association_with predicate, and it is also used by our overlay_clinical_info, cases where the same predicate appears on both real and virtual edges will arise. So the questions are:

  1. Are the biolink:has_real_world_evidence_of_association_with edges supposed to be in KG2/C?
  2. If so, the options are a) change the overlay name or b) change the query_graph_info code to ignore virtual edges. I would vote for the latter, b), since virtual edges are "second class citizens".
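
A rough sketch of why option b) is safer when done by edge key rather than by predicate is below. It is an illustration under assumptions, not the query_graph_info implementation: the helper names are made up, and "e02" is a hypothetical real query edge that happens to use the same predicate as the overlay edge C1. The virtual keys are assumed to be known from the `virtual_relation_label` values in the DSL.

```python
from typing import Any, Dict, List, Optional, Set, Union

REAL_WORLD = "biolink:has_real_world_evidence_of_association_with"

# Trimmed-down query edges. "e02" is HYPOTHETICAL: a real edge that uses the
# same predicate as the virtual edge "C1" added by overlay_clinical_info.
qedges: Dict[str, Dict[str, Any]] = {
    "e00": {"subject": "n00", "object": "n01", "predicates": None},
    "e01": {"subject": "n01", "object": "n02",
            "predicates": ["biolink:physically_interacts_with"]},
    "e02": {"subject": "n00", "object": "n02", "predicates": [REAL_WORLD]},
    "C1": {"subject": "n00", "object": "n02", "predicates": REAL_WORLD},
}


def as_list(predicates: Optional[Union[str, List[str]]]) -> List[str]:
    """Normalize predicates, which appear as None, a string, or a list in the log."""
    if predicates is None:
        return []
    if isinstance(predicates, str):
        return [predicates]
    return list(predicates)


def drop_by_predicate(edges: Dict[str, Dict[str, Any]],
                      ignored: Set[str]) -> Dict[str, Dict[str, Any]]:
    """Ignore every edge carrying an ignored predicate (also hits real edges)."""
    return {k: e for k, e in edges.items()
            if not ignored.intersection(as_list(e["predicates"]))}


def drop_by_key(edges: Dict[str, Dict[str, Any]],
                virtual_keys: Set[str]) -> Dict[str, Dict[str, Any]]:
    """Ignore only the edges whose keys are known virtual relation labels."""
    return {k: e for k, e in edges.items() if k not in virtual_keys}


by_predicate = sorted(drop_by_predicate(qedges, {REAL_WORLD}))
by_key = sorted(drop_by_key(qedges, {"C1"}))
print(by_predicate)  # ['e00', 'e01']
print(by_key)        # ['e00', 'e01', 'e02']
```

Filtering on the predicate silently drops the hypothetical real edge e02, while filtering on the virtual relation labels keeps it.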

finnagin commented 2 years ago

Looks like this attribute that Chunyu added could be used to identify virtual edges: https://github.com/RTXteam/RTX/issues/1566#issuecomment-903036486

finnagin commented 2 years ago

From meeting: plan is to work on this at the next mini hackathon on 1/19