NCATSTranslator / minihackathons

MIT License

Filtering and ranking Connection Hypothesis Provider results #301

Closed jh111 closed 2 years ago

jh111 commented 2 years ago

In Demo C, it will be valuable for the user to either rank cytotoxic drugs lower, or filter them out. @rtroper @dkoslicki May we get your assistance setting this up through the ARS?

jh111 commented 2 years ago

@edeutsch Do you have suggestions on how we might proceed?

edeutsch commented 2 years ago

How can we know from the knowledge graphs that a SmallMolecule or Drug is cytotoxic? Or if not from the knowledge providers, then from any source?

jh111 commented 2 years ago

I have renamed this to separate out multiple issues.

We want to include/highlight Connection Hypothesis Provider (CHP) results in Demo C.

  1. However, we have a temporary EPC workaround of using the predicate has_real_world_evidence_of_association_with to mean "probably_treats" (supported by EHR Risk KP supervised ML and COHD attribute relative frequency). We've put an item on the agenda for the EPC meeting to understand the recommended long-term solution for qualifiers for real world evidence (RWE), and will continue to work on data modeling in the Clinical Data Committee.

  2. We want to create a query in Demo C that highlights Connection Hypothesis Provider results and ranks them highly. Conversely, CHP drugs for breast cancer are currently ranked high, and it would be helpful to have interpretable provenance and/or a lower ranking for them.

GregHydeDartmouth commented 2 years ago

Hey @jh111, I've been doing some investigating on the Expander Agent front with @edeutsch. I suspect the results you are referring to are for workflow C2b_Etanercept_MultSclerosis_GeneSet_and_SmallMolecule.json. I've included ARAX results here. Typically ARAX uses normalized Google distance (NGD) or Jaccard measures to rank results; however, from my discussion with Eric, it appears that these measures aren't computed when queries use is_set=True operations. Using is_set results in forked queries, which makes applying NGD and Jaccard a little less straightforward. Instead, ARAX ranks results by which responses yield the largest groupings over the node marked as a set (in this case, genes). If you click through the results from the link I provided, you'll see this behavior on the genes.

To me this explains why our results are ranked so highly: we have a limited scope of drugs, and when we rank gene-drug associations there is the possibility of high overlap among the resulting drugs (i.e. the set of drugs we find sensitive to gene_1 could be somewhat similar to the set of drugs we find sensitive to gene_2). This more fully connected behavior of our results means that we will yield high gene set groupings. Eric is going to bring it up with his team to incorporate some form of NGD or Jaccard (or other ranking) over set nodes.

I feel reasonably confident this is what causes the cytotoxans to rank so high. For instance, when we query drugs -> has_real_world_evidence_of_association_with -> MS, the cytotoxans we return (result graph for ARAX here) are ranked more intuitively. You'll notice that Cyclophosphamide (a result we return) is ranked 21. Importantly, NGD seems to handle this ranking appropriately, as the NGD edge includes papers like: https://pubmed.ncbi.nlm.nih.gov/3332608/. Feel free to correct anything I may have wrong, Eric!
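For readers unfamiliar with the two measures mentioned in this comment, here is a minimal sketch of the standard NGD and Jaccard formulas. It is an illustration only, not ARAX's actual implementation; the function names, the counts, and the drug identifiers are hypothetical.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Standard NGD from occurrence counts.

    fx, fy: number of documents mentioning each term
    fxy:    number of documents mentioning both terms
    n:      total number of documents in the corpus
    Smaller values mean the two terms co-occur more than expected.
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # The terms never co-occur (or a term is absent): treat as maximally distant.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections, as a float in [0, 1]."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    # Hypothetical drug sets returned for two genes of the query gene set.
    drugs_for_gene_1 = {"drug_a", "drug_b", "drug_c"}
    drugs_for_gene_2 = {"drug_b", "drug_c", "drug_d"}
    # A high overlap here corresponds to the "fully connected" behavior described above.
    print(jaccard(drugs_for_gene_1, drugs_for_gene_2))
    # Hypothetical literature counts for a drug/disease pair.
    print(normalized_google_distance(fx=1000, fy=10, fxy=10, n=100_000))
```

A heavily overlapping pair of drug sets gives a Jaccard index close to 1, which is the situation in which set-based grouping would favor the same small pool of drugs across many genes.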

edeutsch commented 2 years ago

The result of the meeting is that it would be best to use workflows directly to explore some possible overlays that give you what you're looking for. Once you have figured out an overlay that you like, we can potentially adjust the back end to use that by default.

jh111 commented 2 years ago

@edeutsch I think the note above is for the other issue (278, finding nimodipine). This doesn't have to be fixed before the December demo.

jh111 commented 2 years ago

Note from David Koslicki regarding filtering for cytotoxic drugs.

RTX-KG2 has nodes with the label “Information content entity”, and one of these is “cytotoxic” (UMLS:C1511636). I have previously used this to filter out drugs connected to it, though this relies on the ARAXi ability to ask for things that are not connected to a given node.
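The idea in this note can be sketched in plain Python (this is not ARAXi syntax). The sketch assumes a hypothetical list of (subject, object) edge pairs and made-up drug identifiers; only the CURIE UMLS:C1511636 comes from the note above. It shows both options raised at the top of this issue: filtering flagged drugs out, or keeping them but ranking them lower.

```python
CYTOTOXIC_CURIE = "UMLS:C1511636"  # the "cytotoxic" node named in the note


def connected_to(edges, node):
    """Return every identifier that shares an edge (in either direction) with `node`."""
    flagged = set()
    for subject, obj in edges:
        if obj == node:
            flagged.add(subject)
        elif subject == node:
            flagged.add(obj)
    return flagged


def filter_not_connected(candidates, edges, excluded_node=CYTOTOXIC_CURIE):
    """Drop candidates connected to `excluded_node`, preserving the original order."""
    flagged = connected_to(edges, excluded_node)
    return [c for c in candidates if c not in flagged]


def demote_connected(ranked, edges, excluded_node=CYTOTOXIC_CURIE):
    """Keep every candidate but move flagged ones to the end (stable sort)."""
    flagged = connected_to(edges, excluded_node)
    return sorted(ranked, key=lambda c: c in flagged)


if __name__ == "__main__":
    # Hypothetical edges and ranked results, for illustration only.
    edges = [
        ("EX:drug_1", CYTOTOXIC_CURIE),
        ("EX:drug_2", "EX:gene_1"),
        (CYTOTOXIC_CURIE, "EX:drug_3"),
    ]
    ranked = ["EX:drug_1", "EX:drug_2", "EX:drug_3", "EX:drug_4"]
    print(filter_not_connected(ranked, edges))
    print(demote_connected(ranked, edges))
```

Demoting rather than filtering keeps the cytotoxic drugs visible to the user, which may be preferable when provenance matters more than a clean top-ranked list.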