RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

MolPro results with "superclass reasoning" causing issues with ARAX #1886

Closed saramsey closed 1 week ago

saramsey commented 1 year ago

The following query,

{
  "edges": {
    "e0": {
      "object": "n0",
      "subject": "n1"
    }
  },
  "nodes": {
    "n0": {
      "categories": [
        "biolink:PhenotypicFeature",
        "biolink:Disease"
      ],
      "ids": [
        "MONDO:0100231",
        "MONDO:0100232"
      ],
      "is_set": false
    },
    "n1": {
      "is_set": false
    }
  }
}

(which came out of the Translator Question of the Month session today), is generating some results (see ARAX results 55221) that show a loss of semantic precision that I think derives from the "superclass reasoning" that MolPro is doing (which was discussed at length in the Expander Agent all-hands meeting on Aug. 3, 2022; see also RTX issue 1855, which I think has the same root cause as this issue). The aforementioned query is asking for concepts (any concepts) that are biolink:related_to the disease "susceptibility to psoriatic arthritis" (coded as a query node with a pair of CURIEs, MONDO:0100231 and MONDO:0100232). Seems straightforward, but we are seeing a bunch of organic compounds returned that are not related to "susceptibility to psoriatic arthritis" but instead are related to MONDO:0042489 ("disease susceptibility"):

Screen Shot 2022-08-05 at 9 39 49 AM

Note, MONDO:0042489 is two levels higher in the MONDO hierarchy than MONDO:0100232, as shown here:

Screen Shot 2022-08-05 at 9 41 08 AM
saramsey commented 1 year ago

I am tagging @dkoslicki and @edeutsch to see if this can be brought up in the Architecture Committee meeting. I think the best remedy here would be if MolPro stops doing this kind of "superclass reasoning", at least by default. The second-best remedy would be if MolPro makes "superclass reasoning" something that can be turned off by specifying an option somewhere in the TRAPI query graph.

saramsey commented 1 year ago

Another example of MolPro results where superclass reasoning is evident (provided in the Agenda for the Aug. 3 AHM) can be seen in this ARS query result link, where one should go to the ARAX results. We see all kinds of drugs in the result-list that are connected to high-level disease concepts like "rheumatic disorder", "psychiatric disorder", "disease or disorder", "neurodegenerative disease", "immune systems disease", basically the whole gamut.

saramsey commented 1 year ago

Marking this "high priority" now that it has come up in multiple Translator stand-up meetings or Question-of-the-Month sessions.

saramsey commented 1 year ago

It may be that MolPro has already inferred and stored all of these "superclass-reasoned" triples internally, in which case, it may be more difficult for them to filter them out. We should maybe try to ascertain whether or not this is the case, as it may be relevant to what kind of remedy we can hope to get.

amykglen commented 1 year ago

in the short-term should we remove MolePro as a KP? that is a trivial change, easy to undo when a solution is settled on.

saramsey commented 1 year ago

I'm open to the idea. Would like to get David and Eric's take on it as well.

Building on your idea, perhaps we could temporarily disable MolPro in the production system but keep a "with MolPro" version handy (e.g., in /beta or /test or whatever) should we need it, for example, to support ongoing discussions with the MolPro team about the effect of superclass reasoning on ARAX results.

edeutsch commented 1 year ago

It is a bit of a drastic move, but I would support it.

I wonder if a good way to handle this is to compute a relevance score for each received edge? Although it would be a fair bit of work, maybe we could consider each edge we receive and see if it is relevant. If we ask for information about type 1 diabetes and get back answers for immune disorder, perhaps we could keep it, but downweight it substantially as not very relevant. This would be for pinned nodes in queries. If we had easy access to ontologies, we could compute how relevant the returned node is relative to the pinned nodes. Exact matches are highly relevant. Children are less relevant but okay. Ancestors are severely downweighted, perhaps by, say, a factor of 2 in each generation. This could allow such ancestor reasoning to stay, but be downweighted as not very relevant to the question. If there are relevant answers, they get prioritized and these ancestor relationships are way down the rank. If there's nothing relevant, then less relevant things are at the top.
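To make the weighting idea above concrete, here is a minimal sketch in Python. It is not ARAX code. The toy parent map, the `EX:` placeholder CURIEs, the function names, the child weight of 0.8, and the score of 0.0 for unrelated nodes are all assumptions made for illustration. Only the "factor of 2 per generation" rule and the fact that MONDO:0042489 sits two levels above MONDO:0100232 come from this thread.

```python
from collections import deque

# Toy child -> parents map. The intermediate and child terms are hypothetical
# placeholders; a real implementation would query an ontology service.
PARENTS = {
    "MONDO:0100232": ["EX:intermediate"],
    "EX:intermediate": ["MONDO:0042489"],
    "EX:child": ["MONDO:0100232"],
}


def generations_up(start, target, parents):
    """Return the number of parent hops from start up to target, or None."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, depth = queue.popleft()
        if node == target:
            return depth
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return None


def relevance(pinned, returned, parents, child_weight=0.8, ancestor_factor=2.0):
    """Score a returned node against a pinned query node (assumed weights)."""
    if returned == pinned:
        return 1.0  # exact match: highly relevant
    if generations_up(returned, pinned, parents) is not None:
        return child_weight  # descendant of the pinned node: less relevant but okay
    hops = generations_up(pinned, returned, parents)
    if hops is not None:
        return 1.0 / ancestor_factor ** hops  # ancestor: halve per generation
    return 0.0  # unrelated in this toy hierarchy
```

With this toy hierarchy, `relevance("MONDO:0100232", "MONDO:0042489", PARENTS)` evaluates to 0.25, so an edge attached to "disease susceptibility" would stay in the result set but rank well below exact matches.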

Perhaps a good abstract question is: if I ask ARAX about specific disease X, and there's nothing highly relevant for X or children of X, might I be interested in generic things for ancestors of X like immune disorders, or do I want nothing in response? Nothing, or less relevant?

jh111 commented 1 year ago

I'd actually prefer to have something less relevant than nothing.

saramsey commented 1 year ago

What's the status on this? Did we end up removing MolPro from Expand?

amykglen commented 1 year ago

no, no action was taken - we still currently use MolePro

dkoslicki commented 1 year ago

See https://github.com/NCATSTranslator/Feedback/issues/148 for a possible additional issue caused by superclass reasoning

saramsey commented 1 year ago

So the MolePro team says they have fixed the issue with their disease hierarchy reasoning. We might want to double-check if this issue has "gone away" in the latest version of MolePro queried via ARAX.
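One way to double-check could look like the sketch below, which scans a response for pinned-node bindings whose CURIE is not one of the IDs in the query graph. It assumes the usual TRAPI layout, where each result has a `node_bindings` map from query node key to a list of objects with an `id`. The function name and the sample results (including the `EX:` CURIEs) are hypothetical, not real ARAX output. It also flags legitimate subclass bindings, so each hit still needs a look at the ontology to tell a descendant from an ancestor.

```python
def find_off_target_bindings(query_graph, results):
    """Return (result_index, qnode_key, bound_id) for every binding to a
    pinned query node whose id is not among that node's pinned CURIEs."""
    hits = []
    for index, result in enumerate(results):
        for qnode_key, bindings in result.get("node_bindings", {}).items():
            pinned = set(query_graph["nodes"].get(qnode_key, {}).get("ids") or [])
            if not pinned:
                continue  # unpinned query node: any id is acceptable
            for binding in bindings:
                if binding["id"] not in pinned:
                    hits.append((index, qnode_key, binding["id"]))
    return hits


# Query graph from the top of this issue (categories and is_set omitted).
query_graph = {
    "edges": {"e0": {"object": "n0", "subject": "n1"}},
    "nodes": {"n0": {"ids": ["MONDO:0100231", "MONDO:0100232"]}, "n1": {}},
}

# Hypothetical results, for illustration only.
sample_results = [
    {"node_bindings": {"n0": [{"id": "MONDO:0100232"}], "n1": [{"id": "EX:compound-a"}]}},
    {"node_bindings": {"n0": [{"id": "MONDO:0042489"}], "n1": [{"id": "EX:compound-b"}]}},
]

print(find_off_target_bindings(query_graph, sample_results))
# prints [(1, 'n0', 'MONDO:0042489')]
```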

dkoslicki commented 1 year ago

Take a look at: https://arax.ncats.io/?r=135342, specifically things like result 14, where superclass reasoning is used. Maybe their fix hasn't been deployed yet?

jh111 commented 1 year ago

The problem still occurs with https://arax.ncats.io/?r=135342, and answers like 4-aminophenylarsenoxide are not related to the query. You can disregard my earlier note; the answer is not more helpful than having nothing at all. It's just that edges were displaying on top of each other. Here they are spread out.

image

jh111 commented 1 year ago

At the same time, I expect people will query with "Psoriatic Arthritis", not "Susceptibility to Psoriatic Arthritis".

saramsey commented 1 year ago

Thank you @jh111 and @dkoslicki for pointing out this issue is still going on. I have reached out to the MolePro team via Slack DM and via a comment on NCATSTranslator/Feedback issue 148, to find out which ITRB service maturity level their fix was deployed to (test, dev, or prod).

dkoslicki commented 1 year ago

Update: turns out this was a subtle issue between ARAGORN/Automat and MolePro. Apparently an Automat fix will be up within a week, and then MolePro will re-build and push a fix.

saramsey commented 1 year ago

Thank you @dkoslicki!

kvnthomas98 commented 1 week ago

Seems to be fixed now for the test query.