NCATSTranslator / minihackathons

Workflow query C.2 help with NGD overlays #290

Closed rtroper closed 2 years ago

rtroper commented 2 years ago

Query C.2 gives best results by making use of normalized google distance overlays. We have used the ARAX DSL to play around with variations on this query to get it working well and giving decent results. Once this was done, we copied the JSON conversion of the DSL query from the ARAX UI (by clicking on the {json} button).

Current queries C.2a, C.2b, and C.2c, as found in the workflow C folder, are structured exactly as we copied them from the ARAX UI. However, in this conversion, the normalized google distance edges are explicitly specified, rather than implicitly specified in the TRAPI via an overlay operation. It would be great if we could get help (e.g. from @dkoslicki or someone else from the workflow/operations group) in getting a faithful "translation" of these queries to proper TRAPI using an overlay section.

Even once we replace the normalized google distance edges in the current queries with an overlay section, it's not clear to me whether this will do exactly the same thing as the original DSL queries. Getting it as close as possible will be the objective. For reference, here's a DSL query that we've found to work well:

add_qnode(key=n0, categories=biolink:Gene, is_set=True)
add_qnode(key=n1, ids=CHEMBL.COMPOUND:CHEMBL941, categories=biolink:SmallMolecule)
add_qedge(key=e01, subject=n0, object=n1, predicates=biolink:interacts_with)
add_qnode(key=n2, ids=MONDO:0005301, categories=biolink:Disease)
add_qedge(key=e02, subject=n0, object=n2, predicates=biolink:genetic_association)
add_qnode(key=n3, categories=biolink:SmallMolecule)
add_qedge(key=e03, subject=n0, object=n3, predicates=biolink:interacts_with)
expand()
overlay(action=compute_ngd,default_value=inf,virtual_relation_label=N1,subject_qnode_key=n3,object_qnode_key=n2)
overlay(action=compute_ngd,default_value=inf,virtual_relation_label=N2,subject_qnode_key=n3,object_qnode_key=n1)
overlay(action=compute_ngd,default_value=inf,virtual_relation_label=N3,subject_qnode_key=n2,object_qnode_key=n3)
resultify()
filter_results(action=limit_number_of_results, max_results=500, prune_kg=true)

edeutsch commented 2 years ago

Hi @rtroper, it is not always possible to translate DSL programs into TRAPI that will do exactly the same thing because of the flexibility of the DSL. However, in this case it looks like you haven't done anything fancy, so it should be fine. Just take the output JSON and remove the NGD edges and you should get something that is equivalent. That won't always be possible, but should be in this case. I tried and get this:

{
  "nodes": {
    "n0": {
      "categories": [
        "biolink:Gene"
      ],
      "is_set": true,
      "constraints": []
    },
    "n1": {
      "ids": [
        "CHEMBL.COMPOUND:CHEMBL941"
      ],
      "categories": [
        "biolink:SmallMolecule"
      ],
      "is_set": false,
      "constraints": []
    },
    "n2": {
      "ids": [
        "MONDO:0005301"
      ],
      "categories": [
        "biolink:Disease"
      ],
      "is_set": false,
      "constraints": []
    },
    "n3": {
      "categories": [
        "biolink:SmallMolecule"
      ],
      "is_set": false,
      "constraints": []
    }
  },
  "edges": {
    "e01": {
      "predicates": [
        "biolink:interacts_with"
      ],
      "subject": "n0",
      "object": "n1",
      "constraints": [],
      "exclude": false
    },
    "e02": {
      "predicates": [
        "biolink:genetic_association"
      ],
      "subject": "n0",
      "object": "n2",
      "constraints": [],
      "exclude": false
    },
    "e03": {
      "predicates": [
        "biolink:interacts_with"
      ],
      "subject": "n0",
      "object": "n3",
      "constraints": [],
      "exclude": false
    }
  }
}

Note that I also fixed your predicates=genetic_association to predicates=biolink:genetic_association.
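
The manual step described here (take the exported JSON and drop the NGD edges) could be scripted roughly as follows. This is only a minimal sketch: it assumes the exported query graph keys each NGD edge by its virtual_relation_label (N1, N2, N3), the helper name strip_virtual_edges is hypothetical, and the N1 edge body in the example is a placeholder rather than real ARAX output.

```python
import copy


def strip_virtual_edges(query_graph, virtual_labels):
    """Return a copy of query_graph without edges keyed by a virtual label.

    Assumption: the exported JSON keys each NGD edge by its
    virtual_relation_label (e.g. "N1"). Adjust the test if your export
    marks those edges differently.
    """
    cleaned = copy.deepcopy(query_graph)
    cleaned["edges"] = {
        key: edge
        for key, edge in cleaned["edges"].items()
        if key not in virtual_labels
    }
    return cleaned


# Trimmed-down example; the N1 edge is a placeholder, not real ARAX output.
exported = {
    "nodes": {"n0": {}, "n2": {}, "n3": {}},
    "edges": {
        "e02": {"subject": "n0", "object": "n2"},
        "e03": {"subject": "n0", "object": "n3"},
        "N1": {"subject": "n3", "object": "n2"},
    },
}
cleaned = strip_virtual_edges(exported, {"N1", "N2", "N3"})
print(sorted(cleaned["edges"]))
```

The input is copied rather than modified, so the original export stays available for comparison.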

rtroper commented 2 years ago

Oops, thanks for pointing that out ("Note that I also fixed your predicates=genetic_association to predicates=biolink:genetic_association"). I've updated it, above.

Okay, I'll just exclude the NGD edges for now. I thought there might be a way to use an overlay in the workflow section of the TRAPI that uses normalized google distance. In the case of this rather complex query, I've found that the NGD edges really have a positive impact on the quality of the results.

dkoslicki commented 2 years ago

@rtroper the trick here is the following: when ARAX is presented with a TRAPI query, if it fits one of the pre-defined templates, we decorate it with some extra stuff (you can take a look here to see exactly what happens) which in your case will auto-decorate it with NGD.

As Eric mentioned, the TRAPI returned by ARAX tells you “here’s what I did” even if it isn’t really a valid TRAPI input JSON query. I’ll translate your above DSL into the workflow language shortly.

dkoslicki commented 2 years ago

@rtroper Here's an exact duplicate of your DSL query in the workflow language (i.e. can be posted to the ARS and ARAX will know what to do with it):

{
    "workflow": [
        {
            "id": "fill"
        },
        {
            "id": "bind"
        },
        {
            "id": "overlay_compute_ngd",
            "parameters": {
                "virtual_relation_label": "N1",
                "qnode_keys": [
                    "n3",
                    "n2"
                ]
            }
        },
        {
            "id": "overlay_compute_ngd",
            "parameters": {
                "virtual_relation_label": "N2",
                "qnode_keys": [
                    "n3",
                    "n1"
                ]
            }
        },
        {
            "id": "overlay_compute_ngd",
            "parameters": {
                "virtual_relation_label": "N3",
                "qnode_keys": [
                    "n2",
                    "n3"
                ]
            }
        },
        {
            "id": "complete_results"
        },
        {
            "id": "score"
        },
        {
            "id": "filter_results_top_n",
            "parameters": {
                "max_results": 500
            }
        }
    ],
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": [
                        "biolink:Gene"
                    ],
                    "is_set": true,
                    "constraints": []
                },
                "n1": {
                    "ids": [
                        "CHEMBL.COMPOUND:CHEMBL941"
                    ],
                    "categories": [
                        "biolink:SmallMolecule"
                    ],
                    "is_set": false,
                    "constraints": []
                },
                "n2": {
                    "ids": [
                        "MONDO:0005301"
                    ],
                    "categories": [
                        "biolink:Disease"
                    ],
                    "is_set": false,
                    "constraints": []
                },
                "n3": {
                    "categories": [
                        "biolink:SmallMolecule"
                    ],
                    "is_set": false,
                    "constraints": []
                }
            },
            "edges": {
                "e01": {
                    "predicates": [
                        "biolink:interacts_with"
                    ],
                    "subject": "n0",
                    "object": "n1",
                    "constraints": [],
                    "exclude": false
                },
                "e02": {
                    "predicates": [
                        "biolink:genetic_association"
                    ],
                    "subject": "n0",
                    "object": "n2",
                    "constraints": [],
                    "exclude": false
                },
                "e03": {
                    "predicates": [
                        "biolink:interacts_with"
                    ],
                    "subject": "n0",
                    "object": "n3",
                    "constraints": [],
                    "exclude": false
                }
            }
        }
    }
}

Note that you are overlaying the NGD edges twice for the same pair of nodes: you have an n2->n3 and an n3->n2. Since NGD is symmetric, this won't really add anything. Perhaps you wanted a different pair of nodes?
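
The symmetry claim can be checked against the standard definition of normalized Google distance. The sketch below uses that textbook formula with made-up counts; it is not ARAX's implementation, and returning infinity when the two terms never co-occur is an assumption meant to play the role of default_value=inf in the DSL.

```python
import math


def ngd(count_x, count_y, count_xy, total):
    """Normalized Google distance from occurrence counts.

    count_x, count_y: number of documents mentioning each term
    count_xy: number of documents mentioning both terms
    total: number of documents in the corpus
    """
    if count_xy == 0:
        # Assumption: no co-occurrence maps to infinity, like default_value=inf.
        return math.inf
    log_x, log_y = math.log(count_x), math.log(count_y)
    numerator = max(log_x, log_y) - math.log(count_xy)
    denominator = math.log(total) - min(log_x, log_y)
    return numerator / denominator


# Made-up counts, for illustration only.
forward = ngd(1200, 300, 45, 30_000_000)
reverse = ngd(300, 1200, 45, 30_000_000)
print(forward == reverse)  # prints True: max and min ignore argument order
```

Swapping the two terms only swaps the arguments to max and min, so the value itself cannot change. Any difference in results would have to come from how the extra edge is used downstream, not from the distance.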

rtroper commented 2 years ago

Thanks, @dkoslicki, this is really helpful! I'll try this query out and update the query json files in the workflow C folder. I had noticed I had two NGD edges between the same two nodes. And I had indeed assumed the NGD relation was symmetric, but at one point I took one out and I thought I ended up with different results. Either I'm remembering wrong, or perhaps duplicating the edge somehow impacts the way the query is processed, even if the predicate is, technically, symmetric. I'll try it again (taking one out) and see if it does in fact impact the result.

rtroper commented 2 years ago

Interesting, I tried the query with (1) all three NGD edges (N1, N2, N3), (2) only NGD edges N1, N2, and (3) only NGD edges N2, N3 and here's what I got (screenshot below). Results are the same for scenarios 1 and 3, and completely different for scenario 2. I guess that decides it. I'll go with scenario 3. Not sure why results would be so different for scenario 2.

[Screenshot: results for the three scenarios]
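
For anyone repeating this comparison, the three workflow variants can be generated instead of hand-edited. This is a sketch only: build_workflow and ALL_NGD are hypothetical names, the operation ids and parameters are copied from the workflow query posted above, and the result still has to be paired with the "message" section of that query before it is submitted.

```python
def build_workflow(ngd_edges, max_results=500):
    """Build the workflow list for a chosen set of NGD overlays.

    ngd_edges maps a virtual relation label to its pair of qnode keys.
    """
    overlays = [
        {
            "id": "overlay_compute_ngd",
            "parameters": {
                "virtual_relation_label": label,
                "qnode_keys": list(pair),
            },
        }
        for label, pair in ngd_edges.items()
    ]
    return (
        [{"id": "fill"}, {"id": "bind"}]
        + overlays
        + [
            {"id": "complete_results"},
            {"id": "score"},
            {
                "id": "filter_results_top_n",
                "parameters": {"max_results": max_results},
            },
        ]
    )


# The three overlays from the workflow query in this thread.
ALL_NGD = {"N1": ("n3", "n2"), "N2": ("n3", "n1"), "N3": ("n2", "n3")}

scenarios = {
    1: build_workflow(ALL_NGD),
    2: build_workflow({k: ALL_NGD[k] for k in ("N1", "N2")}),
    3: build_workflow({k: ALL_NGD[k] for k in ("N2", "N3")}),
}
for number, workflow in scenarios.items():
    labels = [
        op["parameters"]["virtual_relation_label"]
        for op in workflow
        if op["id"] == "overlay_compute_ngd"
    ]
    print(number, labels)
```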

dkoslicki commented 2 years ago

That’s quite curious! Do you mind sharing the link to all 3 of those results? I’m now thinking that the extra edge might impact the max flow, Frobenius norm, and longest path portions of the ranking, but I’ll need to check.

rtroper commented 2 years ago

Yeah, no problem, here you go:

NGD edges N1, N2, N3: https://arax.ncats.io/?r=27472
NGD edges N1, N2: https://arax.ncats.io/?r=27476
NGD edges N2, N3: https://arax.ncats.io/?r=27481

rtroper commented 2 years ago

In today's minihackathon, there was discussion about not including the workflow section with overlays so that other ARAs can have a chance to respond to this query. @edeutsch noted that ARAX should (under most circumstances, I believe) impose the overlays anyway, even if not explicitly included in the query as an overlay operation.

I just tested this: https://arax.ncats.io/?source=ARS&id=9dce8294-186c-4f6f-bacc-e297c62b9b72

Omitting the workflow section, I now see results from both BTE and ARAX (whereas, with the overlays in the workflow section, there are only results from ARAX). However, for this particular query (C.2 from workflow C), I found that ARAX is not automatically doing the overlay. I've included the exact query below.

Without the overlays, the results are not great. In the top results, I see things like reactive oxygen species, ribonucleic acid, peptide, insulin, estrogen, lipids, galactose, calcium, amino acids in the ARAX results (https://arax.ncats.io/?r=27561).

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": [
                        "biolink:Gene"
                    ],
                    "is_set": true,
                    "constraints": []
                },
                "n1": {
                    "ids": [
                        "CHEMBL.COMPOUND:CHEMBL941"
                    ],
                    "categories": [
                        "biolink:SmallMolecule"
                    ],
                    "is_set": false,
                    "constraints": []
                },
                "n2": {
                    "ids": [
                        "MONDO:0005301"
                    ],
                    "categories": [
                        "biolink:Disease"
                    ],
                    "is_set": false,
                    "constraints": []
                },
                "n3": {
                    "categories": [
                        "biolink:SmallMolecule"
                    ],
                    "is_set": false,
                    "constraints": []
                }
            },
            "edges": {
                "e01": {
                    "predicates": [
                        "biolink:interacts_with"
                    ],
                    "subject": "n0",
                    "object": "n1",
                    "constraints": [],
                    "exclude": false
                },
                "e02": {
                    "predicates": [
                        "biolink:genetic_association"
                    ],
                    "subject": "n0",
                    "object": "n2",
                    "constraints": [],
                    "exclude": false
                },
                "e03": {
                    "predicates": [
                        "biolink:interacts_with"
                    ],
                    "subject": "n0",
                    "object": "n3",
                    "constraints": [],
                    "exclude": false
                }
            }
        }
    }
}
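
Omitting the workflow section, as done for the query above, can also be scripted when starting from a workflow-style request. A minimal sketch, with without_workflow as a hypothetical helper name and a heavily trimmed request standing in for the full query:

```python
import copy


def without_workflow(request):
    """Return a copy of a request that keeps only its "message" section."""
    return {"message": copy.deepcopy(request["message"])}


# Heavily trimmed stand-in for the full workflow query.
request = {
    "workflow": [{"id": "fill"}, {"id": "bind"}],
    "message": {
        "query_graph": {
            "nodes": {"n0": {"categories": ["biolink:Gene"]}},
            "edges": {},
        }
    },
}
plain = without_workflow(request)
print(sorted(plain))
```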

edeutsch commented 2 years ago

Thanks, Ryan. The problem is that the ARAX QueryGraphInterpreter is having trouble understanding the graph and where to put the NGD edges, so it doesn't put them anywhere. (I note that in the above conversation the humans are also struggling with the same thing!) So this needs to be solved somehow. The best idea I can come up with is to have the QGI put an NGD edge in parallel with all the existing edges but not where there are no edges. Does that seem sensible? In principle you could put them in other configurations, too. This is why the QGI currently throws up its hands and you have to do it manually via workflows/DSL. I don't have a solution yet.
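
The "NGD edge in parallel with every existing edge" idea could look roughly like this. It is a hypothetical sketch of the proposal, not what the QGI does; parallel_ngd_overlays and the generated N1, N2, ... labels are illustrative names, and the operation shape is copied from the workflow query earlier in this thread.

```python
def parallel_ngd_overlays(query_graph):
    """Emit one overlay_compute_ngd operation per distinct node pair that
    already has an edge in the query graph; unconnected pairs get none."""
    seen = set()
    operations = []
    for edge in query_graph["edges"].values():
        pair = frozenset((edge["subject"], edge["object"]))
        if len(pair) < 2 or pair in seen:
            continue  # skip self-loops and pairs that are already covered
        seen.add(pair)
        operations.append({
            "id": "overlay_compute_ngd",
            "parameters": {
                "virtual_relation_label": f"N{len(operations) + 1}",
                "qnode_keys": [edge["subject"], edge["object"]],
            },
        })
    return operations


# Edges of query C.2, reduced to subject and object.
c2_query_graph = {
    "edges": {
        "e01": {"subject": "n0", "object": "n1"},
        "e02": {"subject": "n0", "object": "n2"},
        "e03": {"subject": "n0", "object": "n3"},
    }
}
for op in parallel_ngd_overlays(c2_query_graph):
    print(op["parameters"])
```

For query C.2 this rule produces overlays on n0-n1, n0-n2 and n0-n3. The hand-written workflow instead connects n3-n2 and n3-n1, pairs that have no edge in the query graph, so the rule would not reproduce it.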

rtroper commented 2 years ago

TL;DR - I'm following up on an idea mentioned in the minihackathon yesterday (which I'm not sure I understood very well) to see if we could pursue the option of having the ARS send out queries to ARAs with NGD overlays specified in a workflow section.

@dkoslicki @edeutsch @MarkDWilliams @cbizon @jh111 - I'm not sure of the best place to have this conversation (or the specific people to ping), so I'm putting it here since it's related to using overlay workflow operations (specifically, normalized google distance).

For workflow C, we'd like to include as many ARAs and KPs as possible. Right now, for query C.2, we rely on the NGD overlay to pare results down to ones that we've found to be quite good. The issue right now is that when the query graph (with the overlay workflow operations) is submitted to the ARS, it appears to die there and not get sent out to ARAs. As we've been developing the narrative, we've just been submitting the query directly to ARAX.

Without the workflow section, when we submit just the query graph through the ARS, we do get results back from both ARAX and BTE. In the past, I think I've also seen Aragorn return results, although it might be timing out, now. As Eric mentioned above, it's possible they could automatically impose overlays (even without specifying overlay operations in the TRAPI itself), which they do for some query structures, but it's not clear where to put the overlays given the complexity of this query.

A good option (which was brought up in the minihackathon yesterday) may be to have the ARS pass the query through and send it out to ARAs, even with the workflow section as-is. ARAX could use the information specified in the workflow section and other ARAs could just ignore and respond to the query graph itself. Not sure if it was Mark or someone else that brought this up. I just wanted to see if we could pursue this option.

I'm not sure how difficult this would be. I'm also not sure if some components (ARAs or KPs) might choke on the workflow section? Or (hopefully) just ignore portions that they don't understand?

jh111 commented 2 years ago

Thanks for the solution. The following is working well through ARS.

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "categories": [
                        "biolink:Gene"
                    ],
                    "is_set": true,
                    "constraints": []
                },
                "n1": {
                    "ids": [
                        "CHEMBL.COMPOUND:CHEMBL1201607"
                    ],
                    "categories": [
                        "biolink:SmallMolecule"
                    ],
                    "is_set": false,
                    "constraints": []
                },
                "n2": {
                    "ids": [
                        "MONDO:0005301"
                    ],
                    "categories": [
                        "biolink:Disease"
                    ],
                    "is_set": false,
                    "constraints": []
                },
                "n3": {
                    "categories": [
                        "biolink:SmallMolecule"
                    ],
                    "is_set": false,
                    "constraints": []
                }
            },
            "edges": {
                "e01": {
                    "predicates": [
                        "biolink:interacts_with"
                    ],
                    "subject": "n0",
                    "object": "n1",
                    "constraints": [],
                    "exclude": false
                },
                "e02": {
                    "predicates": [
                        "biolink:genetic_association"
                    ],
                    "subject": "n0",
                    "object": "n2",
                    "constraints": [],
                    "exclude": false
                },
                "e03": {
                    "predicates": [
                        "biolink:interacts_with"
                    ],
                    "subject": "n0",
                    "object": "n3",
                    "constraints": [],
                    "exclude": false
                }
            }
        }
    }
}