NCATSTranslator / Feedback

A repo for tracking gaps in Translator data and finding ways to fill them.

Why are we seeing significantly fewer support paths returned in the UI now (compared to 9 months ago)? #795

Open mbrush opened 5 months ago

mbrush commented 5 months ago

The Problem:

I don't have hard evidence to support or quantify this, just my impressions and anecdotal comparisons. But I recall when I was doing more QA/testing last fall, seeing very many Results that had 30, 40, 50, even hundreds of support paths. These days, it is uncommon to find results with support paths in the double digits.

I have a few screengrabs from last fall to support my concern, which show the support path numbers we saw at that time. The image below shows the top results for "what may treat Cerebral Palsy?":

image

Compare this to what I see in the top page of results for this query today (in test):

image

This is the most extreme example I had noted - but there are several others I could share that support my concern.

Also, note that there is no overlap in the top 10 results to allow direct comparison of paths for a given result, but here is what the Triamcinolone result looks like today (15 paths, compared to 154 last fall):

image


Given that support paths are the central focus of our UI, and the unique value we bring in Translator, I think it is worth some effort to understand if this drop is real/concerning, determine why it is happening, and figure out if/how we want to address it.

Several possible reasons have been proposed to explain what might be happening:

  1. Introduction of the 5 min limit on results, or other performance targeted restrictions, leads to ARAs not being able to return all support paths they could provide. If this is the case, valuable/legitimate support for results is not being shown to users.
  2. The paths that no longer show up were ones that contained generic/inappropriate concepts that are now being removed by the blocklist.
  3. Semmed improvements have removed many "low quality" edges that were frequently showing up in support paths - resulting in many fewer paths returned for many results. This may be especially true for BTE predictions, which in the past were very common and relied on chemical-treats-phenotype edges from semmed that may now be dropped. @andrewsu could shed more light here.
  4. Reasoners have fine-tuned their algorithms, trimming out 'bad' rules/templates or trimming paths of the same metatype if too many come back - resulting in return of fewer paths.

I suspect that #2, #3, and #4 above are contributing somewhat to the drop - in particular #3, as I recall seeing many results with long lists of paths based on the "ChemtreatsPhenoOfDisease" BTE template (where hop1 always came from semmed).

But I would like to be sure that #1 is not a significant contributor, because this would represent a loss of legitimate and valuable support paths that would help users trust and understand results.


Proposed Action:

Are there some tests that could be run (by ARS folks perhaps) to assess the change in support path numbers over time, and try to understand if we may be losing paths due to time limits or other performance/caching "improvements" that have been implemented in the last year? @MarkDWilliams @ShervinAbd92 Or is this something that each ARA might have to test on their own (@cbizon @andrewsu)?

Also curious what @sstemann has to say about this - given her UI testing experience/expertise.
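To make the comparison asked for in the Proposed Action concrete, here is a minimal Python sketch of how per-result support path counts from two snapshots of the same query could be compared. It assumes the counts have already been extracted into plain dictionaries; the function name and the snapshot format are hypothetical and not part of any Translator tooling. Only the Triamcinolone numbers come from this issue; the other entries are placeholders.

```python
def compare_path_counts(before, after):
    """Compare support-path counts per result between two snapshots.

    `before` and `after` map a result name to its number of support paths.
    Returns (shared, only_before, only_after), where `shared` maps each
    result present in both snapshots to (before, after, after/before ratio).
    """
    shared = {}
    for name in before.keys() & after.keys():
        old, new = before[name], after[name]
        ratio = new / old if old else None  # avoid dividing by zero
        shared[name] = (old, new, ratio)
    only_before = sorted(before.keys() - after.keys())
    only_after = sorted(after.keys() - before.keys())
    return shared, only_before, only_after


# The Triamcinolone counts come from the comparison in this issue;
# the other two entries are made-up placeholders, not real results.
last_fall = {"Triamcinolone": 154, "placeholder_result_A": 40}
today = {"Triamcinolone": 15, "placeholder_result_B": 8}

shared, only_before, only_after = compare_path_counts(last_fall, today)
for name, (old, new, ratio) in sorted(shared.items()):
    print(f"{name}: {old} -> {new} paths")
```

Running the same comparison with and without a time limit in place would be one way to check whether reason #1 is contributing, since results that appear in only one snapshot are reported separately from results whose path counts changed.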

sstemann commented 4 months ago

Based on my recent testing, the ARS is not dropping support paths. Answers with many paths are typically lower in Sugeno and therefore not on the front page. I believe in your screenshot, the sorting may have been by Evidence and was prior to the score implementation.

https://ui.test.transltr.io/main/results?l=Common%20Cold&i=MONDO:0005709&t=0&r=0&q=9aefda7f-5337-4710-8698-182b29e8c1a1 > sort by the evidence column

image

Acebutolol - returned by ARAX only, with a .95 score. I believe since it was only one ARA and no other scoring components it was ranked 81 with sugeno .95 based on the O&O/Appraiser/ARS Sugeno pipeline

Methoxyflurane - returned by ARAX only, with a .93 score. I believe since it was only one ARA and no other scoring components it was ranked 107 with sugeno .93 based on the O&O/Appraiser/ARS Sugeno pipeline

the same applies to asthma: the results with more evidence/paths are not on the first page

UI: https://ui.test.transltr.io/main/results?l=Asthma&i=MONDO:0004979&t=0&r=0&q=5c2b8789-9ae8-43ce-b11c-116578d863a8 > sort by the Evidence Column

image

however, it's even less clear how they were ranked, given that they were returned by multiple ARAs and those ARAs' scores are over a range:

troleandomycin - ARAX score .63, BTE score .82, sugeno .93, rank 145

Triamcinolone - ARAX score .84, Improving score .39, BTE score .82, sugeno .98, rank 36

i get to the non-UI scores by using the "ARS merge result summary" Colab (https://colab.research.google.com/drive/1kKC0rCnL18z3sgDsD7P2bLpNoI-wtN6C?usp=sharing), which produced the spreadsheets here: https://drive.google.com/drive/folders/16BzfiYzDfFrNE1hB3DSggHY1qKRS_Bhj?usp=drive_link

after sorting in the UI by evidence, i took the top evidence substances and did a "find" search in the Excel spreadsheets and looked across the row.
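the manual "find and look across the row" step could also be scripted. below is a rough Python sketch, assuming a spreadsheet has been exported to CSV. the column names are hypothetical (i have not matched them to the actual Colab output), and the two rows are just the scores quoted in this comment.

```python
import csv
import io

# Stand-in for one exported spreadsheet. The column names are hypothetical;
# the values are the ones quoted in this comment.
SUMMARY_CSV = """name,arax_score,improving_score,bte_score,sugeno,rank
troleandomycin,0.63,,0.82,0.93,145
Triamcinolone,0.84,0.39,0.82,0.98,36
"""


def find_rows(csv_text, names):
    """Return the rows whose `name` matches any of `names`, ignoring case."""
    wanted = {n.lower() for n in names}
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["name"].lower() in wanted]


# Look up the top-evidence substances and read across each row.
for row in find_rows(SUMMARY_CSV, ["Triamcinolone", "Troleandomycin"]):
    print(row["name"], "sugeno:", row["sugeno"], "rank:", row["rank"])
```

an empty cell (here, the missing Improving score for troleandomycin) comes back as an empty string, which makes it easy to see which ARAs did not return a given result.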

so, systematically, i think this is all working as designed. do i understand the score/rank? no

another way to look at it is by viewing the ARAs' top results:

ARAX: top result Corticotropin, this is the support graph

image

BTE: top result Fluticasone, this is the support graph

image

Based on the above, I don't believe this is an ARS merge issue.

sstemann commented 4 months ago

it may be different for Prod. attempting to test all of this in Prod now.