edeutsch opened this issue 1 week ago
Jared wrote:
re melphalan - nemaline myopathy (https://arax.ncats.io/beta/?r=313412): the only direct ARAX connection is
CHEBI:28876----biolink:treats_or_applied_or_studied_to_treat----MONDO:0018958
Id: infores:automat-robokop:CHEBI:28876--biolink:treats_or_applied_or_studied_to_treat--None--None--None--MONDO:0018958--infores:text-mining-provider-targeted
So if "treats_or_applied_or_studied_to_treat" is a synonym for "treats" then we actually pass this one. Maybe we need to have the testing algorithm recognize "treats_or_applied_or_studied_to_treat" and "treats" as equally acceptable.
Perhaps not surprisingly, we get exactly the same "treats_or_applied_or_studied_to_treat" answer directly from the ROBOKOP interface: https://robokop.renci.org/question-builder/answer
So it is quite odd that ARAGORN passes and ARAX does not.
Unless perhaps an older version of ARAX is being tested??
fwiw, the two-hop answers are here: https://arax.ncats.io/beta/?r=313414 - interesting, but no real additional insight from these.
The test harness results for this query are:
https://arax.ci.transltr.io/?r=7930f345-9099-4a76-8e29-dfdf6450048c
14 results, none of which is melphalan. Why is melphalan missing?
Jared points out above that a direct query returns the desired answer: https://arax.ncats.io/beta/?r=313412
When we get an inferred treats query, I think we are supposed to invoke both a direct query to all KPs and merge that with the results of xDTD. It is possible that:
1) This is not happening the way it should. Maybe xDTD does not have it and we are not doing direct lookup. I thought this was working, so probably not the reason.
2) This is an artifact of the edge type. The direct query comes back with edge type biolink:treats_or_applied_or_studied_to_treat, which is different from biolink:treats. Maybe Expand is only performing the query to all KPs using biolink:treats, or otherwise in a way that does not capture biolink:treats_or_applied_or_studied_to_treat. Maybe we need to query both predicates? But if we query for both of these predicates, should they be presented differently in the final results? Should a biolink:treats edge get inserted directly into the results, but perhaps a biolink:treats_or_applied_or_studied_to_treat edge be placed into a support graph behind a manufactured biolink:treats edge?
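For illustration, the "query both predicates" idea would amount to something like this rough TRAPI-style query graph (the node/edge keys and the knowledge_type setting are just my sketch, not what Expand actually sends):

```python
# Rough TRAPI-style query graph listing both treats-related predicates.
query_graph = {
    "nodes": {
        "drug": {"categories": ["biolink:ChemicalEntity"]},
        "disease": {"ids": ["MONDO:0018958"]},  # Nemaline myopathy
    },
    "edges": {
        "t_edge": {
            "subject": "drug",
            "object": "disease",
            "predicates": [
                "biolink:treats",
                "biolink:treats_or_applied_or_studied_to_treat",
            ],
            "knowledge_type": "inferred",
        },
    },
}
```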
This is an interesting case. @amykglen @mohsenht do you have insight into this?
ahh, yes, so I think the answer to this is option 2: `treats_or_applied_or_studied_to_treat` is an ancestor of `treats`, the predicate used in the input query, so `treats_or_applied_or_studied_to_treat` edges (like the melphalan one returned from ROBOKOP) should not be returned according to Biolink.
however, we still have our patch in place that makes an exception for RTX-KG2 - so if an inferred 'treats' query comes in, Expand will query RTX-KG2 with the higher-level `treats_or_applied_or_studied_to_treat` predicate, but it queries all other KPs only with the `treats` predicate. for edges then returned from KG2**, Expand edits their predicate from `treats_or_applied_or_studied_to_treat` (or other ancestral treats predicate) --> `treats`, as applicable, and records the original predicate in a `biolink:original_predicate` attribute on the edge. no one has complained about this approach so far.. :)
so could we/would we want to query all KPs with `treats_or_applied_or_studied_to_treat` for the lookup portion of inferred treats queries? we've been doing it for KG2 for a while, and no one has complained (or maybe no one has caught on, ha). it is an 'inferred' query after all, so I guess we have some creative freedom?
** exception here: Expand filters out any KG2 edges from semmeddb that do not use the direct 'treats' predicate, since semmeddb isn't very trustworthy
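A minimal sketch of what that KG2 patch does, as described above (function and field names here are illustrative, not Expand's actual code):

```python
# Sketch of the described KG2 patch: rewrite the higher-level predicate to
# biolink:treats, record the original predicate in an attribute, and drop
# semmeddb edges that don't use the direct 'treats' predicate.
ANCESTRAL_TREATS_PREDICATES = {
    "biolink:treats_or_applied_or_studied_to_treat",
    # ...plus any other ancestral treats predicates handled the same way
}

def patch_kg2_edge(edge: dict) -> dict | None:
    sources = edge.get("sources", [])
    from_semmeddb = any(s.get("resource_id") == "infores:semmeddb" for s in sources)
    if edge["predicate"] in ANCESTRAL_TREATS_PREDICATES:
        if from_semmeddb:
            return None  # semmeddb edges must use the direct 'treats' predicate
        edge.setdefault("attributes", []).append({
            "attribute_type_id": "biolink:original_predicate",
            "value": edge["predicate"],
        })
        edge["predicate"] = "biolink:treats"
    return edge
```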
I vaguely recall that the Expand editing of `treats_or_applied_or_studied_to_treat` into `treats` was a temporary patch, because all `treats` edges in KG2 were globally changed to `treats_or_applied_or_studied_to_treat`, but we expected that to be fixed to some degree at some point in a newer KG2? Where are we on that? Can we back that out yet, or are we still waiting for a newer KG2? Or is my memory flawed?
I like the approach suggested by @amykglen above. I think we should query all KPs with `treats_or_applied_or_studied_to_treat` (and also `treats`? or do we trust that they'll do that automatically? maybe no harm in being explicit?).
What do we do in the final output? What do you think of preserving all `treats` edges and showing them as such in the output, but then, for all of the `treats_or_applied_or_studied_to_treat` edges returned, creating a single `treats [sg]` edge with a support graph that contains all the non-`treats` edges as support for the `treats` edge? Would that be clearer than the `original_predicate` approach?
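Here's a hedged sketch of what that output structure could look like in TRAPI terms, with the KP-returned non-`treats` edge living in an auxiliary graph referenced from the manufactured edge (keys like `creative_expand_sg_1` are made up, and the exact attribute used to point at the support graph may differ):

```python
# One ARAX-manufactured biolink:treats edge whose support graph holds the
# KP-returned treats_or_applied_or_studied_to_treat edge(s).
support_edge_keys = ["robokop_edge_1"]  # the non-'treats' edges returned by KPs

auxiliary_graphs = {
    "creative_expand_sg_1": {"edges": support_edge_keys},
}

manufactured_edge = {
    "subject": "CHEBI:28876",       # melphalan
    "object": "MONDO:0018958",      # Nemaline myopathy
    "predicate": "biolink:treats",
    "attributes": [
        {
            "attribute_type_id": "biolink:support_graphs",
            "value": ["creative_expand_sg_1"],
        },
    ],
}
```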
yes - we did expect to back the patch out because we expected the CTKP treats edges (recently ingested into KG2) to be enough, but I still haven't thoroughly investigated whether that's true. for our example inferred treats query in the ARAX UI (for Castleman), CTKP produces 1 edge while KG2 (with the patch) produces 11 - notably those edges from KG2 include DrugCentral edges, which technically use the higher-level predicate `applied_to_treat`, but still seem like something we'd want to return? and strangely, KG2 doesn't appear to return any direct 'treats' CTKP edges for that query, even though the CTKP API itself returns 1 - I have not looked into why that would be the case (maybe version differences, or something related to KG2pre's ingest of CTKP)?
so in short, I think it's best not to back the patch out yet, and perhaps never, since now we're thinking of basically applying this patch to all KPs!
cool - sure, we might as well include the 'treats' predicate in the query just in case predicate reasoning isn't happening properly on the KP's end (possible given the complexity of the treats-related predicates, which form a directed acyclic graph instead of a tree)
that's an interesting idea - I suppose that is a little safer in that the non-treats edges will have less of an impact on ranking in that case, and it does seem a little more transparent than simply editing an edge from another KP. so would we list `infores:arax` as the `primary_knowledge_source` on that `treats [sg]` edge? and maybe the contributing KPs as upstream `supporting_data_source`s?
What about predicates "biolink:applied_to_treat" or "biolink:in_clinical_trials_for", do we want to apply all the same handling for those predicates as well?
hm, yes, I think it is best to list 'infores:arax' as the 'primary_knowledge_source' on that 'treats [sg]' edge. I don't think I would add anything else there. Then all the edges inside the support graph would have their usual provenance and those details would all be there. I think that's best?
yes, well, we do already handle all of those - they're descendants of `treats_or_applied_or_studied_to_treat`, so KPs should automatically be returning them (the Multiomics KPs and KG2 do), but we could list them as predicates in the query to be safe..
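(If we wanted to double-check what a KP ought to return automatically, the Biolink Model Toolkit can list the descendants - assuming `get_descendants` accepts the CURIE form, as I recall it does:)

```python
# List the predicates that are descendants of the umbrella predicate,
# i.e. the ones KPs should return for a treats_or_applied_or_studied_to_treat query.
from bmt import Toolkit

tk = Toolkit()
descendants = tk.get_descendants(
    "biolink:treats_or_applied_or_studied_to_treat", formatted=True
)
print(descendants)  # per the discussion above, this should include biolink:treats,
                    # biolink:applied_to_treat, biolink:in_clinical_trials_for, etc.
```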
sure! sounds good
What should the plan be for getting this implemented? Anyone able to do it?
I can implement it - the edge support graph aspect may be a little tricky - aside from pathfinder, I don't think our system has used Edge support graphs (only Result support graphs), so I think Resultify may not play well with them (I think it might blow those edges away from the KG, thinking that they're not used in any results). may be able to get around this with a relatively small tweak to resultify, or other workaround.. will plan on working on this later today/tomorrow
update: turned out I was mistaken about our system not using edge support graphs - DTD/CRG do, and it ended up not being a big deal at all to use them here as well. no tweaks to resultify required.
ok, implemented this and merged it into master - now we get 30 results for this query instead of 14, including Melphalan, with an edge support graph showing the ROBOKOP `treats_or_applied_or_studied_to_treat` edge..
Melphalan isn't ranked particularly high -- at result 16 out of 30 -- which I think may be enough to pass this test, since it's top 30?
I wasn't sure what to list as the `knowledge_level` and `agent_type` for these ARAX-created `treats [sg]` edges - went with `not_provided` and `computational_model` after scanning the Biolink definitions, but this could use some more thought...
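For reference, here's roughly what that metadata looks like on these ARAX-created edges, per the choices above (exact field layout is illustrative - depending on the TRAPI version, knowledge_level/agent_type may be top-level edge properties rather than attributes):

```python
# Provenance and metadata discussed in this thread for the 'treats [sg]' edges.
arax_edge_sources = [
    {"resource_id": "infores:arax", "resource_role": "primary_knowledge_source"},
]

arax_edge_attributes = [
    {"attribute_type_id": "biolink:knowledge_level", "value": "not_provided"},
    {"attribute_type_id": "biolink:agent_type", "value": "computational_model"},
]
```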
I'm a little concerned that our inferred query results have changed significantly due to this - for instance, our 'Example 2' query (castleman) produces only 45 results on TEST, with siltuximab as the first result, while it now gets 94 results on CI, with siltuximab ranked 62nd! I think a tweak to the ranker might be needed?
also worth noting that currently, when both DTD and this creative expand mechanism return the same answer, the support graph for the DTD edge is also appearing on the creative expand edge (as a second SG)... will get that fixed soon..
ok, I fixed that latter support-graph mix-up issue I mentioned.
and as for the Ranker - I'm thinking that these creative expand edges are being given scores that are too high because they have no attributes that the Ranker recognizes as useful for scoring, so I think it defaults to giving them a confidence of 1 (here), which seems way too high for these sorts of edges (which are kind of speculative).
maybe an easy first solution would be to have Expand give these edges an attribute with some sort of confidence score (maybe something simple based on the number of edges in the support graph - like `min(0.2*num_support_edges, 1)`), which the Ranker could then be updated to consider? or is it easier for the Ranker to just somehow recognize these edges and decide what confidence to give them itself?
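That simple support-graph-size heuristic would just be:

```python
# The suggested quick heuristic: confidence grows with the number of support
# edges and caps at 1.0 (illustrative name, not actual Expand code).
def creative_expand_confidence(num_support_edges: int) -> float:
    return min(0.2 * num_support_edges, 1.0)

# e.g. 1 support edge -> 0.2, 3 -> 0.6, 5 or more -> 1.0
```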
Thanks, Amy, this is great! But you're right that we should follow up carefully to see if we make other tests worse by making melphalan better. I think you're right that we'll now pass melphalan, but how many others will switch from PASS to FAIL with this?
I would agree that if the Ranker does not recognize any basis for ranking, it should not default to 1.0, but rather to something lower. Perhaps 0.5 or something?
yeah, something like 0.5 (vs. 1) seems more appropriate to me when the Ranker can't find any scoring-relevant attributes. although making that change may disrupt our rankings even further, ha - I doubt this is the only source without Ranker-recognized attributes. 🙂 may require some careful testing..
So I think the Castleman disease - Siltuximab example is an ideal example to refine the Ranker further https://arax.ci.transltr.io/?r=314984
The desired result siltuximab is 62 out of 94. We will now fail this test. BUT, I would contend that our overall resultset is much improved and we just need better ranking. The ranking is not performing the way I would intuitively want it to.
I think the main issue indeed might be that, after all our agonizing over refining the scoring in #2242, when the ranker doesn't recognize data on which to base its decisions, it just gives the edge a 1.0, thereby overriding all our careful decisions. It seems much better to set things up so that if the ranker doesn't understand the data, it defaults to a low score, not the absolute best score? (if indeed that's what's happening, I'm not certain)
agreed - although I realized another issue here, which was that the 'creative expand' edge support graphs were not being excluded from ranking, as the 'creative DTD' and 'creative CRG' edge support graphs are (I remembered that was a whole topic of its own, and their exclusion for DTD/CRG resulted in a pretty big ranking performance improvement, per @chunyuma).
so I just updated the Ranker so that it also excludes these new edge support graphs from Expand, and that seems to have helped a lot! Siltuximab is back as the 1st result for the castleman query. and melphalan is now the 2nd result for the Nemaline myopathy query!
so maybe that's all that really needed to be adjusted? probably worth investigating more, but at least it doesn't seem so urgent now
outstanding, thanks! The new results look excellent!
But I confess I don't understand what it means that 'creative expand' edge support graphs should be excluded from ranking.
yeah, it's a little bit confusing, but basically when the Ranker is working with the Results, all of the 'support' edges haven't actually been moved into support graphs yet - instead they're just regular edges in the result. so, for instance, for the Nemaline myopathy -- Melphalan result, the Ranker would see three edges: the Expand-created `treats [sg]` edge, the `treats_or_applied_or_studied_to_treat` edge from ROBOKOP, and an NGD edge. Chunyu found that excluding the edges that are really just creative support edges (i.e., the `treats_or_applied_or_studied_to_treat` edge from ROBOKOP) gives us better ranking. I think this is because when they're included, those results look misleadingly connected, giving, for instance, higher max flow vs. if those 'support' edges are ignored. and indeed, after I adapted the exclusion code to also filter out these expand-created support edges before the Ranker gets to work, that seems to have drastically improved the results.
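In code terms, the exclusion is basically a filter like this before the Ranker computes its scores (names here are illustrative, not the Ranker's actual implementation):

```python
# Drop edges that exist only as 'creative' support edges (i.e., edges referenced
# from the support graphs of Expand/DTD/CRG-created edges) before ranking, so
# they don't inflate connectivity-based measures like max flow.
def edges_for_ranking(result_edges: dict, support_edge_keys: set) -> dict:
    return {
        key: edge
        for key, edge in result_edges.items()
        if key not in support_edge_keys
    }
```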
okay, thanks for the explanation. I think I get it mostly, although I am thinking that the weight of the `treats [sg]` edge should be influenced by the contents of the support graph. So my understanding is that at this time the weight of the `treats [sg]` edge is the same regardless of whether the support graph is fantastic or feeble? If true, that's okay, but an opportunity for improvement.
yes, that's right. I agree, it's almost like the support graphs are nested ranking problems - like we could run the ranker on those to assign scores to support graphs, which are then annotated on the main edge. or Expand could just tack some sort of score onto the `treats [sg]` edge when it creates it.
I'm not sure whether the DTD/CRG support graphs have any influence on the score given to the main treats edge in those results - I want to say no, but really not sure.
In the set of tests that ARAX fails but Aragorn passes, there is:
Melphalan treats Nemaline myopathy (must be top 50%)
Why do we fail?