RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

Experimental ranker really needs some fine tuning #941

Closed dkoslicki closed 2 years ago

dkoslicki commented 4 years ago

New issue: since #783 introduced the ranker, we can now fiddle with it. Here's an example of it needing fine-tuning. Query:

{
   "edges": [
      {
         "id": "e00",
         "source_id": "n00",
         "target_id": "n01",
         "type": null
      }
   ],
   "nodes": [
      {
         "id": "n00",
         "curie": "DOID:11830",
         "type": "disease"
      },
      {
         "id": "n01",
         "type": "chemical_substance"
      }
   ]
}

Top result by ranker: RNA, Messenger (score 0.99)

However, we have another result: Zinc acetate (score 0.57)

Note that the edges include prevented_by, treated_by, etc. and also contain much more info (e.g. a high observed/expected ratio, a high drug-predicts-disease probability, and a low NGD).

We need to fiddle with the ranker so things like this bubble up to the top.

@edeutsch Have you put the ranker in a separate class so we can start experimenting with how to fine tune the rankings?

edeutsch commented 4 years ago

I started the process and have a half-finished class in my dev area. I will try to get it finished and integrated very soon so you can refine it.

dkoslicki commented 4 years ago

Copying over #900 and closing #900, since this issue seems more relevant:

This is related to #783, I think we need:

  1. Attribute names dynamically mapped to functions that handle their weightings (for ease of renaming attribute names)
  2. Weighting of individual attributes (e.g. "probability" should be trusted MUCH less than "probability_treats")
  3. Auto-handling of normalizing scores to be in [0,1] (e.g. observed_expected ratio \in (-inf, inf) while probability \in (0,1))
  4. Auto-thresholding of values (e.g. if chi_square < 0.05, penalize the most; if probability_treats < 0.8, penalize the most; etc.)
  5. Allow for ranked answers (e.g. observed_expected can have a single, huge value, skewing the rest of them)

Proposition:

  1. Take all the edge_attributes
  2. Order them in their correct order (ascending or descending)
  3. Weight their significance by some arbitrary weighting scheme to be determined
  4. Return results in descending order of results from 3.
  5. Figure out how to get the numbers to be between 0 and 1
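
The proposition above can be sketched in code. A minimal normalize-then-weight scheme might look like the following; note that the attribute names, weights, and normalizing functions are illustrative assumptions, not the actual ARAX implementation:

```python
import math

# Hypothetical per-attribute normalizers: each maps a raw attribute
# value into [0, 1], with higher always better after normalization.
NORMALIZERS = {
    # probability-like attributes are already in (0, 1)
    "probability": lambda v: v,
    "probability_treats": lambda v: v,
    # an observed/expected ratio can range over (-inf, inf); squash it
    "observed_expected_ratio": lambda v: 1.0 / (1.0 + math.exp(-v)),
    # a lower normalized Google distance (NGD) is better, so invert it
    "normalized_google_distance": lambda v: max(0.0, 1.0 - v),
}

# Hypothetical trust weights (e.g. trust "probability" much less
# than "probability_treats", per item 2 above)
WEIGHTS = {
    "probability": 0.2,
    "probability_treats": 1.0,
    "observed_expected_ratio": 0.8,
    "normalized_google_distance": 0.5,
}

def edge_confidence(attributes):
    """Weighted mean of the normalized attribute scores for one edge."""
    total = weight_sum = 0.0
    for name, value in attributes.items():
        if name in NORMALIZERS and name in WEIGHTS:
            total += WEIGHTS[name] * NORMALIZERS[name](value)
            weight_sum += WEIGHTS[name]
    return total / weight_sum if weight_sum else 0.0
```

Dispatching on attribute name through a dict (rather than hard-coding each attribute) also addresses item 1: renaming an attribute only requires changing one key.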

Discuss during the AHM, so that @dkoslicki can give an example comprehensible to @edeutsch.

edeutsch commented 4 years ago

@dkoslicki I have just committed to master a new class ARAXRanker. I did my best to port current functionality in the Messenger method into this class. There are some things that were changed by DMK or others since my original implementation, a few of which I'm a little leery of. I tried to leave a lot more comments in the code. I added a few inline comments about things that concern me, with FIXMEs attached. Please feel free to delete those after you have examined them.

Maybe we should walk through the code together sometime soon.

dkoslicki commented 4 years ago

@edeutsch added some comments and am looking forward to what you find concerning/dubious (most of the stuff I added was dubious by my own assessment anyway).

I think this would be a great idea for a mini-hackathon on a Friday some time!

dkoslicki commented 4 years ago

Examples of results where ranking can be refined: https://arax.rtx.ai/?m=2687

{
  "edges": [
    {
      "id": "e00",
      "source_id": "n00",
      "target_id": "n01"
    }
  ],
  "nodes": [
    {
      "curie": "DOID:1227",
      "id": "n00",
      "type": "disease"
    },
    {
      "id": "n01",
      "type": "chemical_substance"
    }
  ]
}

https://arax.rtx.ai/?m=2688

{
  "edges": [
    {
      "id": "e00",
      "source_id": "n00",
      "target_id": "n01"
    }
  ],
  "nodes": [
    {
      "curie": "DOID:11830",
      "id": "n00",
      "type": "disease"
    },
    {
      "id": "n01",
      "type": "chemical_substance"
    }
  ]
}
dkoslicki commented 4 years ago

@dkoslicki To do:

jaredroach commented 4 years ago

What data are we feeding into rankings? Do we have a series of quality and relevance metrics for each edge? For example, in a conversation with @kvarforl, she found four drugs that interacted with ACE2. I manually "ranked" them.

  1. n-(2-aminoethyl)-1-aziridine ethanamine
  2. Spp1148
  3. hydroxychloroquine
  4. chloroquine

The first two were pretty much only from PubChem. One of them (Spp1148) had little metadata and no references, so should get an exceptionally low edge quality. The other (n-(2-aminoethyl)-1-aziridine ethanamine) had a bit more metadata, but still no references, so should get a very low edge quality. Hydroxychloroquine & chloroquine have a lot of references, but most of them loop back to original papers which largely rely on speculation to posit an interaction between them and ACE2, so should get low edge quality. I am not sure how to get Expander to assign such qualities. Something like CHOD scores. Or co-occurrence in PubMed abstracts... Or character counts in records from originating databases (e.g., PubChem).

Perhaps this is the same conversation as that of #723.

saramsey commented 4 years ago

My suggestion is that we define a "Hamiltonian path norm" like this (sorry for the R code, LOL):

library(expm)  # provides the %^% matrix-power operator
score_matrix_of_probs <- function(matrix_of_probs) {
    N <- nrow(matrix_of_probs)
    stopifnot(N == ncol(matrix_of_probs))
    # symmetrize the matrix of edge probabilities
    matrix_of_probs <- 0.5*(matrix_of_probs + t(matrix_of_probs))
    # take the (N-1)th matrix power, normalize by (N-1)!, keep the max entry
    score_hamiltonian <- max((matrix_of_probs)%^%(N-1))/factorial(N-1)
    score_hamiltonian
}

and then combine the Hamiltonian path norm with the Frobenius norm, so that we can also reward a high-scoring spanning tree, per this paper's elucidation of a connection between the Frobenius norm and spanning trees: http://www.math.tau.ac.il/~haimav/AvronBoutsidis13-SIMAX.pdf
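
For those working in the Python codebase, the R sketch above translates roughly to NumPy as follows (a rough equivalent for illustration, not code from the repo):

```python
import numpy as np
from math import factorial

def score_matrix_of_probs(matrix_of_probs):
    """Score a result graph from its adjacency matrix of edge
    probabilities: symmetrize, take the (N-1)th matrix power,
    normalize by (N-1)!, and return the largest entry."""
    n = matrix_of_probs.shape[0]
    assert matrix_of_probs.shape == (n, n)
    sym = 0.5 * (matrix_of_probs + matrix_of_probs.T)
    powered = np.linalg.matrix_power(sym, n - 1)
    return powered.max() / factorial(n - 1)
```

On a two-node graph with a single edge of probability p this just returns p, matching the intuition that the score is "like the combined score of the highest-scoring Hamiltonian path."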

dkoslicki commented 4 years ago

For the record (email thread in reverse chronological order):

 True, but I was hoping that max flow would “account” for the many-edges problem. i.e. if we have a weakest link in the path that (upon removal) would split the subgraph/answer into two disconnected components, then it seems right to rank this lower. If, however, we have many other edges that could be taken in the flow from source to sink with higher capacity, then the “weak links” wouldn’t really matter, right?

Thanks,

~David

——
Associate Professor of Computer Science and Engineering, Biology, and the Huck Institute of the Life Sciences
The Pennsylvania State University
W354 Westgate Building, University Park, PA 16802
(p) 814-865-1611 / (e) dmk333@psu.edu

From: Stephen Ramsey <sar@furamsey.org> 
Sent: Wednesday, August 26, 2020 10:27 PM
To: Koslicki, David <dmk333@psu.edu>
Subject: Re: simple scoring function?

Only problem is that I think in a weighted graph, max-flow for any single path should be dominated by the weakest edge in the path, right?

On Aug 26, 2020, at 7:25 PM, Stephen Ramsey <sar@furamsey.org> wrote:

I'm liking the "max over all-pairs, max-flow/min-cut" idea. I bet NetworkX has an implementation of the Edmonds-Karp augmenting paths algorithm.

On Aug 26, 2020, at 7:18 PM, Koslicki, David <dmk333@psu.edu> wrote:

Good thinking: directionality is important (in some cases at least)!

Perhaps maybe an even easier solution could be had with: treating the problem as a “max flow” problem (since we are really only currently considering “source-sink” kinds of graphs at the moment), and ranking the results by their max flow value (1 on source, -1 on sink) with edge capacity equal to edge_confidence.

Thanks,

~David


From: Stephen Ramsey <sar@furamsey.org> 
Sent: Wednesday, August 26, 2020 9:23 PM
To: Koslicki, David <dmk333@psu.edu>
Cc: Amy Glen <glena@oregonstate.edu>; Eric Deutsch <edeutsch@systemsbiology.org>; Finn Womack <womackf@oregonstate.edu>
Subject: Re: simple scoring function?

Ah wait, maybe don't symmetrize after all. I think symmetrizing may be a mistake b/c the entrywise max of the (N-1)th power of the adj matrix may be dominated by paths that go X->Y->X->Y... via an edge with 1.0.

On Aug 26, 2020, at 3:22 PM, Koslicki, David <dmk333@psu.edu> wrote:

“the combined score of the highest-scoring Hamiltonian path"
I like it!! Better than multiplying all edges confidences/probs together for sure!

Would this work for multi-graphs as well (since we can have multiple edges between two nodes)? Off the top of my head I think it could, but the resulting matrix_of_probs could get quite large…

Eventually, I had wanted to take “conflicting information” into account. For example, if the answer/subgraph has both
(n1:drug)-[:treats]-(n2:disease) and
(n1:drug)-[:contraindicated_for]-(n2:disease)
In it, then we should probably down-weight this one.

But that’s perhaps an issue for a later time…

Thanks,

~David


From: Ramsey, Stephen <Stephen.Ramsey@oregonstate.edu> 
Sent: Wednesday, August 26, 2020 6:09 PM
To: Koslicki, David <dmk333@psu.edu>
Cc: Glen, Amy <glena@oregonstate.edu>; Eric Deutsch <edeutsch@systemsbiology.org>; Womack, Finn <womackf@oregonstate.edu>
Subject: simple scoring function?

Hi all, 

Is this potential scoring function useful?  (Sorry for the R code, LOL):

library(expm)
score_matrix_of_probs <- function(matrix_of_probs) {
    N <- nrow(matrix_of_probs)
    matrix_of_probs <- 0.5*(matrix_of_probs + t(matrix_of_probs))
    stopifnot(N == ncol(matrix_of_probs))
    score_hamiltonian <- max(max((matrix_of_probs)%^%(N-1)))/factorial(N-1)
    score_hamiltonian   
}

Assumption: no self-loops; graph is connected; and all off-diagonal elements are on the closed unit interval.

Procedure:  We symmetrize the matrix of probabilities and then take the matrix to the N-1 power and then normalize by dividing by (N-1)!.  Then take the maximum matrix element.

It's sort of like finding the combined score of the highest-scoring Hamiltonian path.

Cheers,
Steve

-----------------------------------------------------
Stephen Ramsey, PhD
Associate Professor, Oregon State University
* Department of Biomedical Sciences
* School of Electrical Engineering and Computer Science

stephen.ramsey@oregonstate.edu
https://lab.saramsey.org
dkoslicki commented 4 years ago

Note: I am officially handing off this issue to someone else for the next ~4 weeks. Volunteers welcome!

saramsey commented 4 years ago

Proposal: let's call the current ranker (as implemented by EWD) "Alpha Ranker". We'll assume that Alpha Ranker produces a normalized rank (in the range (0,1]) for each result subgraph; if the code is instead currently producing integer ranks, we can trivially divide the ranks by the number of subgraphs to get a "normalized Alpha-Ranker rank". We'll denote the normalized Alpha-Ranker rank by A.

I propose to create a "Beta Ranker" which will produce a different normalized rank \in (0,1] for each subgraph, by summing the (0,1] edge-probabilities for each edge in the subgraph to get a Beta-Ranker Subgraph Score, and then computing the normalized ranks of the Beta-Ranker Subgraph Scores. We'll denote the normalized Beta-Ranker rank by B.

I propose that for each subgraph, we compute a final score:

Final_Score = w_1 A + w_2 B

and rank the result subgraphs based on Final_Score, where w_1 and w_2 are mixing coefficients that we must pick, that satisfy:

w_1 + w_2 = 1
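
In code, this mixture might look like the sketch below. The rank-normalization here is a simple stand-in (ties are broken by input order), and the default weights are arbitrary placeholders:

```python
def normalized_ranks(scores):
    """Rank scores ascending (higher score -> higher rank) and map
    ranks into (0, 1] by dividing by the number of subgraphs."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    for rank_pos, idx in enumerate(order, start=1):
        ranks[idx] = rank_pos / n
    return ranks

def final_scores(alpha_scores, beta_scores, w1=0.5, w2=0.5):
    """Final_Score = w_1 * A + w_2 * B, with w_1 + w_2 = 1."""
    assert abs(w1 + w2 - 1.0) < 1e-9
    A = normalized_ranks(alpha_scores)
    B = normalized_ranks(beta_scores)
    return [w1 * a + w2 * b for a, b in zip(A, B)]
```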

dkoslicki commented 4 years ago

The problem with this is that the "Alpha Ranker" as implemented is not complete (it only utilizes a few edge properties), simply multiplies the scores (penalizing the longer paths), and does not consider the ranges of the edge property values. I do not think Eric imagined this to be a "final ranker" (but correct me if I'm wrong), so it shouldn't be included in the final score.

The "Beta Ranker" (i.e., the summing approach) seems to only make sense when each subgraph has the same number of edges (which is not the case).

Was there a reason why your previous idea (or mine with max flow) was abandoned?

Also, does my implementation of the ranker (which produces edge confidences for every edge in the KG) not get incorporated into the Final_Score? The point of that ranker was to address the incompleteness of Eric's ranker and to consider the ranges of the edge property values...

edeutsch commented 4 years ago

Indeed I did not envision it to be a "final ranker", but rather something simple that seemed to do a pretty good job on the basic queries that I commonly test with (and improved the ranking on those A LOT). However, David then quickly demonstrated some additional queries/query results on which it stumbled badly. Clearly something better is needed. If I had time to work on this (which I do not this week), I would start by collecting a small zoo of examples that we can test an eventual ranker against. Some good ideas were tossed around today, but there is more to do. I have a few higher-priority fires to put out before I open this up again.

dkoslicki commented 4 years ago

For future reference (sent to @saramsey as he volunteered to implement his ideas):

All references are to the branch ranker-refine at commit d601964b62832aa4bd765c6e789f0561a6cf4d9c.

The main code is found here in the “ranker-refine” branch: https://github.com/RTXteam/RTX/blob/ranker-refine/code/ARAX/ARAXQuery/ARAX_ranker.py

`edge_attribute_score_normalizer` does the brunt of the work to normalize each edge property to be a value between 0 and 1.

`aggregate_scores_dmk` is the code that (after line 296) has each edge in the KG populated with a confidence.

Your code can probably go in at line 45, `result_confidence_maker`, which currently does the same thing as Eric's code (multiply all edge confidences together, a not-too-good idea in general).

The property you will be needing is `self.kg_edge_id_to_edge[kg_edge_id].confidence` (that's the value between 0 and 1 for each edge in the result=self).

Example graphs are in lines 578-585. I have more complicated examples, but they are stored in my local database (you can see the general idea in lines 589-621).
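
For context, the multiply-all-edge-confidences behavior described above amounts to something like this (a paraphrase for illustration, not the actual result_confidence_maker code):

```python
from functools import reduce
import operator

def result_confidence_by_product(edge_confidences):
    """Multiply per-edge confidences (each in [0, 1]) together. As
    noted above, this penalizes longer paths: every additional edge
    contributes another factor at most 1."""
    return reduce(operator.mul, edge_confidences, 1.0)
```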
saramsey commented 4 years ago

Just an FYI, I am working on a max-flow ranker, using as a starting point the module ARAX_ranker.py in branch ranker-refine.
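
A possible shape for such a max-flow score, using NetworkX (suggested earlier in the thread for its Edmonds-Karp implementation); the edge-list format and the use of edge confidence as capacity are assumptions, not the committed code:

```python
import networkx as nx

def max_flow_score(edges, source, sink):
    """Score a result graph by the maximum flow from source to sink,
    with each edge's confidence used as its capacity. Parallel edges
    between the same node pair simply add their capacities."""
    g = nx.DiGraph()
    for u, v, confidence in edges:
        if g.has_edge(u, v):
            g[u][v]["capacity"] += confidence
        else:
            g.add_edge(u, v, capacity=confidence)
    flow_value, _ = nx.maximum_flow(g, source, sink)
    return flow_value
```

This captures the "weakest link" discussion from the email thread: a single low-confidence bridge edge caps the score, but alternative high-capacity routes between source and sink raise it.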

saramsey commented 4 years ago

OK, commit 72bb37b updates the ARAX_ranker.py module as requested. It should be considered a prototype (perhaps even a straw-man) at this point; testing on more nontrivial query graphs is of course needed.

In the aforementioned commit, I've added a bunch of functions at module-level scope, and some new package dependencies. The critical change to the ARAXRanker class is that in the method aggregate_scores_dmk, instead of

for result in message.results:
    self.result_confidence_maker(result)

I am now generating three different types of ranks and averaging them:

[screenshot: the replacement code that generates and averages the three rank types]

The three different types of ranks are (1) max-flow, (2) Hamiltonian path (not sure what to call it; it involves taking the adjacency matrix to the N-1 power and then taking np.max), and (3) Frobenius norm. They are all (I believe) equivalent on the two-node, single-edge test graph, but where things should get interesting is on multigraphs or graphs with two or more edges.

This is just a prototype (I hope it is OK to check prototype-level code directly into this branch; I'm assuming that is the case and that no one is depending on this branch being in a fully-tested and functionally-validated state at all times). What is needed?

  1. testing on more nontrivial cases
  2. probably an informal code review
  3. code cleanup & commenting, informed by the code review

saramsey commented 4 years ago

Just an FYI, I've adjusted the ranker formerly known as _rank_networkx_graphs_by_hamiltonian_path so that it now (for each result graph):

  1. finds the longest geodesic path length L
  2. finds the set S of all pairs of nodes whose geodesic path length is L
  3. for weighted adjacency matrix A, computes A^L / L!
  4. computes the mean of the entries of A^L / L! that correspond to the node pairs in S; this is the result graph-specific score

Then, it computes the fractional rank of the result graph-specific scores.
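
The steps above can be sketched as follows (a reconstruction under stated assumptions, not the committed _rank_networkx_graphs_by_longest_path code; geodesic lengths are hop counts, and edges without a weight attribute default to weight 1):

```python
import numpy as np
import networkx as nx
from math import factorial

def longest_path_score(graph):
    """Find the longest geodesic length L, collect the node pairs at
    distance L, raise the weighted adjacency matrix to the Lth power,
    normalize by L!, and average the entries for those pairs."""
    nodes = list(graph.nodes)
    index = {n: i for i, n in enumerate(nodes)}
    lengths = dict(nx.all_pairs_shortest_path_length(graph))
    L = max(d for src in lengths.values() for d in src.values())
    pairs = [(u, v) for u, src in lengths.items()
             for v, d in src.items() if d == L]
    A = nx.to_numpy_array(graph, nodelist=nodes, weight="weight")
    powered = np.linalg.matrix_power(A, L) / factorial(L)
    return float(np.mean([powered[index[u], index[v]] for u, v in pairs]))
```

On a Hamiltonian-path-shaped result graph L equals N-1 and this reduces to the earlier score; on branched query graphs such as E = {X->Y, X->Z} it does not.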

So it's no longer accurate to call it "Hamiltonian path", since with the new algorithm shown above, for some types of query graphs (e.g., E = {X->Y, X->Z}) L is not equal to the number of nodes N. I have therefore renamed this ranker _rank_networkx_graphs_by_longest_path. See commit aa11dbb. (Note: commit aa11dbb also had a bunch of code cleanup for PEP8, to appease flymake and make it easier for me to lint-check my code.)

saramsey commented 4 years ago

Commit f53624c does a bit of refactoring on the code, using a double map call so that we are not explicitly calling _quantile_rank_list within each of the three subgraph scoring functions (this just centralizes the call to _quantile_rank_list in a single place in the code, so it seems cleaner). Also fixed a bunch of PEP8-type things to appease flymake.
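
The centralizing pattern described might look like this sketch (the function names and signatures here are hypothetical, not the actual commit):

```python
def rank_all(result_graphs, scoring_functions, quantile_rank_list):
    """Apply every scoring function to every result graph, then convert
    each list of raw scores to ranks in one central place, instead of
    calling the rank function inside each scoring function."""
    score_lists = map(lambda f: list(map(f, result_graphs)), scoring_functions)
    return [quantile_rank_list(scores) for scores in score_lists]
```

Keeping the rank conversion out of the scoring functions means each scorer only has to return raw per-graph scores, which also makes them easier to unit-test.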

dkoslicki commented 2 years ago

Re-ran this query (an updated version of it) and the ranker appears to be returning things that make sense. Closing.