RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

Ensure more creative results are shown in our results #2157

Open kvnthomas98 opened 12 months ago

kvnthomas98 commented 12 months ago

Currently, lookup results may dominate the result list. If we run a creative query, we need to ensure that the creative results don't get filtered out.

Suggestions proposed by @dkoslicki: (i) manually place the creative results on top; (ii) interleave the creative results between the lookup results.
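
A minimal sketch of what (ii) could look like, assuming `creative_results` and `lookup_results` are already-ranked lists of result dicts and 500 is the result cutoff (the names and the cutoff here are illustrative, not actual ARAX code):

```python
from itertools import zip_longest

def interleave_results(creative_results, lookup_results, cutoff=500):
    """Alternate creative (xDTD) and lookup results so creative hits aren't
    pushed below the cutoff; leftovers from the longer list follow at the end."""
    merged = []
    for creative, lookup in zip_longest(creative_results, lookup_results):
        if creative is not None:
            merged.append(creative)
        if lookup is not None:
            merged.append(lookup)
    return merged[:cutoff]
```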

saramsey commented 11 months ago

Slotting this on the agenda for tomorrow's AHM. Good topic for group discussion.

saramsey commented 11 months ago

I am intrigued by the "interleave" idea. While xDTD is awesome, I have concerns about a modification such that creative-mode results are always (and only) at the top; I can expound on that tomorrow at the meeting.

dkoslicki commented 11 months ago

Thoughts from the AHM:

Different possible approaches include:

  1. (easy, but xDTD could end up on later pages of the UI) Rank the lookups, take the xDTD results, and then take n from the former and m from the latter, where n + m < 500 (or whatever the specified cutoff is); a rough sketch follows this list.
  2. (harder, but the proper fix) Get rid of the noise/cruft in the lookup results. There is no way there are >500 drugs that treat a given disease; I suspect this is due to SemMedDB. A naive approach would be to impose "we never return more than N lookups", where N is small (e.g., 25). A more nuanced approach would be to figure out what's causing the explosion of lookup results and downrank those results in the ranker.
  3. (not ideal, but fast) Just interleave the results and check with the UI team whether this will make the UI show the creative results.
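
A rough sketch of option 1 (the function name, the even split, and the result-dict assumptions are illustrative only; both lists are assumed to already be sorted by the ranker):

```python
def blend_top_results(lookup_results, xdtd_results, cutoff=500, max_lookups=None):
    """Keep the top n lookups and the top m xDTD results with n + m <= cutoff,
    so creative results are guaranteed space in the returned set."""
    if max_lookups is None:
        max_lookups = cutoff // 2            # arbitrary even split, just for illustration
    n = min(len(lookup_results), max_lookups)
    m = min(len(xdtd_results), cutoff - n)
    # Whether to concatenate (as here) or interleave the two slices is exactly
    # the open question from option 3 and the UI discussion above.
    return lookup_results[:n] + xdtd_results[:m]
```
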
dkoslicki commented 11 months ago

An example of a bazillion lookups: https://arax.ncats.io/?r=174764 (any common(ish) disease will do).

saramsey commented 11 months ago

@dkoslicki As a test (and since there was a TRAPI query for it in #2187) I ran "what drugs treat multiple sclerosis" through ARAX (the arax.ncats.io/beta endpoint) using knowledge_type="inferred". I got 500 results. The first 50 results look pretty reasonable, with a minority of experimental/investigational treatments in there (vitamin D, epigallocatechin, cannabidiol, estriol, ibudilast, melatonin, biotin, etc.). Below the first 50, we start to get some really broad categories like "Antibodies" or "Interferons" or "Vaccines", or "Vitamins" or "immunomodulators". We also start to get some puzzling results like "ethylene glycol" (which may reflect text-mining getting confused by text about PEGylation of some other therapeutic agent). Below the first 150 results, we do start to see increased frequency of crazy stuff like "caffeine", "fish oils", "ketamine", "nicotine", "tadalafil", and so forth.

I think there are four things driving such a large number of lookup results:

  1. Insufficient canonicalization. I see a lot of essentially the same results repeated under slightly different names, like "glatiramer" and "glatiramer acetate", that kind of thing.
  2. We have general terms like "Vaccines", "Antibodies", "Cannabinoids", "Immunoglobulins", etc. that are cluttering up the results.
  3. We have a lot of drugs that are on the list because they are being intensively studied for efficacy in MS. Theoretically, if we were to filter to get only the drugs that are marketed (i.e., indicated) for MS (and I'm not suggesting we do that in practice), we'd see the result list length drop by probably 8X to 10X.
  4. Drugs that are used to treat other comorbidities of MS, but that are not really treatments for MS itself (e.g., tadalafil or what have you).

I think our scores are, overall, a bit too high for the drugs that are not indicated for MS (e.g., the investigational treatments). For the drugs that are indicated for MS, the scores are fine.

Our scores are way too high for overly general concepts like "Vaccines". Ideally, those should be either filtered out or have their scores reduced due to the concepts' generality. I know we've talked about this a lot; I'm just echoing the feeling here that it would be good if we weren't seeing "antibodies" and "vaccines" and "vitamins" in the results.
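
A hypothetical sketch of the "reduce the score due to generality" idea; the blocklist, the `node_degree` field, and the thresholds are all made up for illustration, not existing ARAX code:

```python
GENERAL_CONCEPTS = {"vaccines", "antibodies", "vitamins", "interferons", "immunomodulators"}

def penalize_general_concepts(results, degree_threshold=5000, penalty=0.25):
    """Drop results whose drug node is a broad umbrella term, and downrank
    results whose node touches an implausibly large number of KG edges.
    Each result is assumed to be a dict with 'name', 'score', and a
    precomputed 'node_degree'."""
    adjusted = []
    for res in results:
        if res["name"].lower() in GENERAL_CONCEPTS:
            continue                                          # filter outright
        if res.get("node_degree", 0) > degree_threshold:
            res = {**res, "score": res["score"] * penalty}    # or: just downrank
        adjusted.append(res)
    return sorted(adjusted, key=lambda r: r["score"], reverse=True)
```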

saramsey commented 11 months ago

So in conclusion, I concur: there really aren't 500 different treatments for MS. But there are probably at least 60-70 that are used to manage MS (remember, it's a complex, multi-faceted disease for which there is, AFAIK, no cure), plus another 100 to 150 being actively investigated.

dkoslicki commented 11 months ago

@saramsey do we have a KP or an edge property that we can use explicitly for "indicated for"? IIRC, when we ask for treats edges, KPs don't distinguish between "investigational" and "indicated for". Perhaps there's something in KG2 we could use to cross-check?

saramsey commented 11 months ago

@dkoslicki I am not sure. It is a problem that biolink:treats is being used for investigational/experimental therapies like vitamin D. I think the Biolink people and the Predicates WG are working on "refactoring" the biolink:treats predicate to allow more precise statements for such cases.

In the meantime, I like the idea of trying to pull in that information from somewhere. I am not sure where we could get it, though. I guess if someone were to go through all 500 results and label them as "indicated", "investigational", or "neither" (this would take an afternoon, though!), we could try to find which sources are contributing to the "indicated" vs. "investigational" results. I suspect there will be a bias towards certain sources.
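
If someone does do that labeling pass, something as simple as a cross-tab of label vs. knowledge source would show the bias. A sketch, assuming each labeled result is a dict with a 'label' and the list of 'sources' behind its supporting edges (hypothetical fields, not a real ARAX structure):

```python
from collections import Counter

def label_source_crosstab(labeled_results):
    """Count how often each knowledge source contributes to 'indicated',
    'investigational', and 'neither' results, to spot source-level bias."""
    counts = Counter()
    for res in labeled_results:
        for source in res["sources"]:
            counts[(res["label"], source)] += 1
    return counts
```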

dkoslicki commented 11 months ago

Perhaps https://www.fda.gov/drugs/drug-approvals-and-databases/drugsfda-data-files? It doesn't cover biologics and the like, though.

amykglen commented 11 months ago

While I'm not aware of something like indicated_for edges that capture which drugs are FDA-approved to treat which conditions (that seems very useful), we do have the ability to constrain queries on FDA approval status (#1599, which makes use of KG2 data (#1497)). It wouldn't let us filter the result set down to drugs approved specifically for MS, but maybe it would at least get rid of general terms and drugs not yet approved for anything?
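
For reference, a hedged sketch of what that constraint might look like on the query graph; the exact attribute id/value strings and the MS CURIE are my guesses here, so check #1599 for what ARAX/KG2 actually support:

```python
# Sketch of a TRAPI query graph with an FDA-approval attribute constraint on the
# drug node. The constraint id/value strings are assumptions, not confirmed ARAX behavior.
query_graph = {
    "nodes": {
        "n0": {
            "categories": ["biolink:ChemicalEntity"],
            "constraints": [
                {
                    "id": "biolink:highest_FDA_approval_status",
                    "name": "highest FDA approval status",
                    "operator": "==",
                    "value": "regular approval",
                }
            ],
        },
        "n1": {"ids": ["MONDO:0005301"]},  # multiple sclerosis
    },
    "edges": {
        "e0": {
            "subject": "n0",
            "object": "n1",
            "predicates": ["biolink:treats"],
            "knowledge_type": "inferred",
        }
    },
}
```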

dkoslicki commented 11 months ago

Ah, I wasn't aware of that. It should be a good first pass, so @kvnthomas98, please make note of Amy's comment once you start working on this.

saramsey commented 11 months ago

Thank you @amykglen, good suggestion

dkoslicki commented 2 months ago

Related: #2327