RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

Should we completely trust `infores:drugcentral`? #2334

Closed · chunyuma closed this 2 months ago

chunyuma commented 3 months ago

@dkoslicki @edeutsch,

I found that the infores:drugcentral edge can also contain publication information (see the infores:drugcentral edge in result 69), though it seems that not all infores:drugcentral edges have it.

My question is: for the new ranking algorithm, apart from SemMedDB, if an edge from another knowledge provider also contains 'attribute_type_id': 'biolink:publications', should we consider its publication count in the algorithm?

For example:

  {'attribute_source': 'infores:drugcentral',
   'attribute_type_id': 'biolink:publications',
   'attributes': None,
   'description': None,
   'original_attribute_name': None,
   'value': ['PMID:https://pubmed.ncbi.nlm.nih.gov/24786236'],
   'value_type_id': 'biolink:Uriorcurie',
   'value_url': None}

Currently, the weight for publications is 0.5, for text-mining-provider it is 0.8, and for drugcentral it is 1.

The confidence is calculated as: normalized score x weight. The normalized score for both drugcentral and text-mining-provider is 1, while the normalized score for publications is based on the publication count.

edeutsch commented 3 months ago

Hi @chunyuma I don't think I fully understand the "publications" part of this question.

I think it does make sense to refine our weights a bit in a way that conveys a "trust factor". Maybe that trust factor can include multiple components, so here are some ideas. I'm not certain this really answers your question, but maybe you can elaborate if not:

Maybe the base trust factor (weight) is based on where the edge comes from:

- SemMedDB: 0.5
- text-mining-provider: 0.8
- gold databases like drugcentral: 0.95

And then maybe we can have modifiers based on the number of publications:

- SemMedDB: 0.5 + 0.01/pub
- text-mining-provider: 0.8 + 0.005/pub
- gold databases like drugcentral: 0.95 + 0.005/pub

Maybe cap the number of pubs at 20 or something, so more than 20 just counts as 20 (is 50 pubs really better than 20?). Plus an absolute cap at 1.0: if the total ever goes higher than 1.0, just limit it to 1.0 as "totally solid".

So this means that a SemMed edge with 20+ publications has a weight of 0.7, a t-m-p edge with 20+ publications would be 0.9, and a drugcentral edge with 20+ publications would be 1.0 (because of the cap). I have no hard data to back this up, but it feels sorta right. This allows the number of publications to boost the base weight within a limited range, while still letting drugcentral edges dominate.
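A minimal sketch of that scheme in Python (the numbers are just the guesses above, and the function name is illustrative):

  def edge_weight(base: float, per_pub: float, n_pubs: int, pub_cap: int = 20) -> float:
      """Base trust weight plus a capped per-publication bonus,
      with an absolute ceiling of 1.0 ("totally solid")."""
      bonus = per_pub * min(n_pubs, pub_cap)
      return min(base + bonus, 1.0)

  # The examples above:
  print(round(edge_weight(0.50, 0.010, 25), 3))  # SemMedDB with 20+ pubs     -> 0.7
  print(round(edge_weight(0.80, 0.005, 25), 3))  # text-mining-provider       -> 0.9
  print(round(edge_weight(0.95, 0.005, 25), 3))  # drugcentral, hits the cap  -> 1.0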

There could be some other modifiers that we could establish that would either boost or penalize these weights. I don't know what they might be, but we could potentially look at some data and decide what feels right about improving or reducing our confidence.

One other thing to consider is: should 2 SemMed edges beat out a drugcentral edge? Probably not. So multiple edges should only contribute minimally over the best edge weight. So maybe for a given node pair, determine the best weight, use that as the base, and then only minimally boost the final weight for each additional edge. (How many SemMedDB edges would you need to beat out a single DrugCentral edge? I dunno. Maybe never.)

Ideally we would have a training set with true positives and false positives to guide this, but putting that together would be hard. So for now I can only offer something that "feels right".

Does that answer the question?

chunyuma commented 3 months ago

Thanks so much for your ideas and answers @edeutsch. I get it now.

If we consider drugcentral a gold database, then I know how to treat it.

Question: besides drugcentral and drugbank, are there any other databases we should treat as "gold databases"?

> One other thing to consider is: should 2 SemMed edges beat out a drugcentral edge? Probably not. So multiple edges should only contribute minimally over the best edge weight. So maybe for a given node pair, determine the best weight, use that as the base, and then only minimally boost the final weight for each additional edge. (How many SemMedDB edges would you need to beat out a single DrugCentral edge? I dunno. Maybe never.)

I think the quality of the publications in drugcentral edges is much higher than that in SemMed edges, so I would not use the same strategy for both. Also, since not all drugcentral edges contain publication info (most of them don't), I would probably ignore this info from drugcentral and always treat this database as a "gold database". Does this make sense to you @edeutsch?

chunyuma commented 3 months ago

Hi KG2 team (@saramsey, @amykglen, @sundareswarpullela) or @edeutsch, could you please let me know how we define an edge with biolink:knowledge_level, or which edges can contain biolink:knowledge_level in their attributes? Am I correct in thinking that edges with biolink:knowledge_level among their original attributes are high-quality or gold edges that we tend to trust?

edeutsch commented 3 months ago

It is my understanding that eventually all edges should have knowledge_level and agent_type attributes, so their mere presence will not be remarkable. However, their values will be very valuable. Here is the implementation guidance:

https://github.com/NCATSTranslator/ReasonerAPI/blob/master/ImplementationGuidance/Specifications/knowledge_level_agent_type_specification.md

Supporting both the short-term and long-term implementations would be good. Maybe we should get together to discuss how to weight the different levels, and also find out what was implemented in KG2... is that documented somewhere @saramsey?

edeutsch commented 3 months ago

> I think the quality of the publications in drugcentral edges is much higher than that in SemMed edges, so I would not use the same strategy for both. Also, since not all drugcentral edges contain publication info (most of them don't), I would probably ignore this info from drugcentral and always treat this database as a "gold database". Does this make sense to you @edeutsch?

Seems reasonable. It would be great to have a document from the KG2 team on how they view the trustworthiness of all their knowledge sources.

And the question is: does a drugcentral edge with publications lend more confidence than a drugcentral edge without? I don't know the answer. If yes, then that can be a factor. If not, then, you're right, no.

chunyuma commented 3 months ago

> And the question is: does a drugcentral edge with publications lend more confidence than a drugcentral edge without? I don't know the answer. If yes, then that can be a factor. If not, then, you're right, no.

I have no idea about this question either. For now, I will trust all drugcentral edges with the same weight.

chunyuma commented 3 months ago

@edeutsch, I have updated the ranker algorithm to treat drugcentral as a gold database; please see this PR.

saramsey commented 3 months ago

See this YAML file: https://github.com/RTXteam/RTX-KG2/blob/master/maps/knowledge-level-agent-type-map.yaml

edeutsch commented 3 months ago

I wonder if it would be appropriate/useful to insert weights into this YAML table, with an estimate of the default weight that should be applied to edges from each source? Maybe DrugBank is 0.99, SemMedDB is 0.5, and DrugCentral 0.92. These could then be used by the ranker?

Although the Ranker also needs to consider edges from other sources in the Translator ecosystem, not just KG2.

chunyuma commented 3 months ago

Hi @edeutsch,

The weight table we currently use for different KPs is here. Right now, only semmeddb and text-mining-provider are given explicit weights; the other KPs use the default weight, which is 1. I have opened another GitHub issue #2338 for a better weight table.

Another thing I will do is remove drugcentral from the reliable KP list (aka "gold" databases). Currently only drugbank and drugcentral are in this list.

What do you think of these?

edeutsch commented 3 months ago

Hi @chunyuma, I think this sounds good, but I'm thinking it would be even better NOT to have a "reliable KP list" but rather a single weights table where drugbank gets a 0.99 (near-perfect gold), drugcentral gets maybe a 0.93, text-mining-provider gets a 0.8, SemMedDB a 0.5, etc. Ideally there would be an entry in the table for all the primary sources that we know about. Maybe we can crowdsource (teamsource?) the initial weights for all the sources with input from the team members who know the sources best, and then the ranking algorithm would use the weights from that teamsourced table. Also, rather than a default of 1, maybe the default should be 0.75, to downweight sources that we don't explicitly weight.

chunyuma commented 3 months ago

Hi @edeutsch, yes, I will use a single weight table for assessing each individual edge. However, some ARAX results contain multiple edges from different data sources, so we also need a way to comprehensively assess a result based on all of its edges.

Below are the three parts I use to calculate the final score of an ARAX result in the algorithm:

  1. For each individual edge, a confidence score is assigned mainly based on that single weight table, the number of publications (mainly for semmeddb), and other metric attributes (e.g., Fisher's exact test, NGD, etc.). If a result has multiple supporting edges, we take the max among the edge scores.
  2. We also consider the number of supporting edges. If a result has more non-semmeddb, non-virtual edges, it should be given a higher score.
  3. A reliable KP list. If there are more "reliable" data-source edges, the result should be given a higher score. (This is where the "reliable KP list" is used, but as you see, the weight of this score in the final score is only 0.1.)

So the final score = 0.8 x (score of 1) + 0.1 x (score of 2) + 0.1 x (score of 3).
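As a sketch, with each component score assumed to be already scaled to [0, 1] (hypothetical function, not the actual ranker code):

  def final_score(best_edge_score: float, edge_count_score: float, reliable_kp_score: float) -> float:
      # 1) max confidence over supporting edges,
      # 2) normalized count of non-semmeddb, non-virtual edges,
      # 3) normalized count of edges from the "reliable KP" list
      return 0.8 * best_edge_score + 0.1 * edge_count_score + 0.1 * reliable_kp_score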

@edeutsch @saramsey @dkoslicki, what do you think of this method for calculating ranking scores?

edeutsch commented 3 months ago

Thanks, @chunyuma. I guess I don't fully understand this, especially item 2. But I think best in examples, so maybe we can walk through one. How about this query: https://arax.ncats.io/?r=b3da1551-d518-445c-a256-f9a2482bc50d which is TestCase13, which we've discussed before in the very nice thread at #2300.

Let's focus on result 2, enalapril. What should its rank score be?

It's just a single pair of nodes with 4 edges. If I were designing something, this is how I would do it:

1) Compute the raw weight of each of the 4 edges between the node pair based on their primary_knowledge_source:
1a) The SemMedDB edge: base weight 0.5 + 0.01 * 18 pubs = 0.68 (note that pubs are capped at 20 as discussed above)
1b) The dgidb edge: base weight 0.85
1c) The drugbank edge: base weight 0.99 + 0.001 * 4 pubs = 0.994 (pubs capped at 20)
1d) The drugcentral edge: base weight 0.93

The base weights come from our planned table; I just guessed here. The table might also contain the pub-count bonus factor.

2) Now we have 4 edges, each with a weight. How do we combine them? I can't think of a tidy formula, but I can think of a crazy looping algorithm (sorry, I am notorious for coming up with such ridiculous things):
2a) Sort all the edges in descending order of raw weight: 0.994, 0.93, 0.85, 0.68
2b) Start with the first running weight: W_r = 0.994
2c) Loop through the rest of the weights W_i (i = 2..n), updating the running weight as W_r = W_r + 0.25 * (1 - W_r) * W_i. Conceptually, each additional edge weight can close the distance 1 - W_r by a quarter times W_i; this ensures you never get to 1, but get progressively closer to it the more edges you have.
2d) If you end up over 1.0, cap at 1.0.
2e) So in this example, you get:

W_i      W_r
0.994    0.994
0.93     0.995395        (W_r + 0.25 * (1 - W_r) * W_i)
0.85     0.996373563
0.68     0.996990057

The best edge always dominates, but it can be further boosted by more edges in a way that gets you closer to 1.0. The 0.25 is arbitrary; maybe 0.333 is better, I don't know, it seemed like a good number at the time. You want each edge to give you a boost, but not too much. (One high-quality DrugBank edge should still beat out several SemMedDB edges.)
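As a sketch, the loop above in Python (it reproduces the table):

  def combine_edge_weights(weights: list[float], factor: float = 0.25) -> float:
      """The best edge dominates; each additional edge closes the remaining
      distance to 1.0 by `factor` times its own weight."""
      ordered = sorted(weights, reverse=True)
      w_r = ordered[0]
      for w_i in ordered[1:]:
          w_r = w_r + factor * (1.0 - w_r) * w_i
      return min(w_r, 1.0)

  print(combine_edge_weights([0.994, 0.93, 0.85, 0.68]))  # ~0.996990057, matching the table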

What about perindoprilat in the same resultset with 2 edges?

This is all good for a one-hop query. What about 2 hops or N hops? How about just multiplying the final W_r's for each hop and calling it done? That penalizes more hops in a way that seems okay (as long as the weights are less than 1.0).
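A sketch of that multi-hop combination, assuming each hop's combined W_r has already been computed:

  import math

  def multihop_score(per_hop_weights: list[float]) -> float:
      # Multiplying the hops' combined weights penalizes longer paths,
      # as long as each weight stays below 1.0.
      return math.prod(per_hop_weights)

  print(round(multihop_score([0.95, 0.90]), 3))  # a hypothetical 2-hop result -> 0.855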

I think this is a good way to do it, but maybe others think this is lunacy. The only remaining trick then is to set up a table/matrix of values:

PrimarySrc   base      perPub    PubLim
SemMedDB     0.5       0.01      20
dgidb        0.85      0.005     20
DrugCentral  0.93      0.001     20
DrugBank     0.99      0.001     10
default      0.75      0.008     20

This is the numerology part. We need to assign a confidence weight to each source, and decide how much each publication can boost the score for each source. Kinda arbitrary, but I bet we can teamsource some good numbers here. And if anyone ever does a study to compute the true FDR of knowledge in these sources, we could replace these with 1 - FDR or something. In the meantime we just go with what feels right. We can then easily tweak the ranking algorithm just by tweaking the source base weights and pub-modifier weights (and potentially come up with other modifiers for specific sources as appropriate).
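As a sketch, that table could live in code (or the YAML above) as a simple lookup with a default row; the infores curie spellings here are assumed:

  # (base weight, per-publication bonus, publication cap) per primary source;
  # the numbers are the guesses from the table above
  SOURCE_WEIGHTS = {
      "infores:semmeddb":    (0.50, 0.010, 20),
      "infores:dgidb":       (0.85, 0.005, 20),
      "infores:drugcentral": (0.93, 0.001, 20),
      "infores:drugbank":    (0.99, 0.001, 10),
  }
  DEFAULT_WEIGHTS = (0.75, 0.008, 20)

  def raw_edge_weight(primary_source: str, n_pubs: int = 0) -> float:
      base, per_pub, pub_cap = SOURCE_WEIGHTS.get(primary_source, DEFAULT_WEIGHTS)
      return min(base + per_pub * min(n_pubs, pub_cap), 1.0)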

Soon we're going to get knowledge_level and agent_type information. I think we can easily augment this system with base weights and/or weight modifiers based on KL and AT, but we probably need to understand more about what values we'll be seeing before we can add that. It should be easy to integrate.

How crazy is that?

chunyuma commented 3 months ago

Hi @edeutsch, thanks so much for proposing this creative idea. I like your looping algorithm. I can see that it combines both the raw individual weights and the edge count, but it relies more on the edge count.

For example, say we have a result with 1 high-quality edge and 4 "normal" edges, with weights 0.994, 0.75, 0.75, 0.75, 0.68. Based on your looping algorithm, its combined score is 0.997328. If we have another result with 4 high-quality edges with weights 0.994, 0.99, 0.93, 0.85, its combined score is 0.9972, which is lower. But I think in this case the second result should be ranked higher.

My suggested modification is to combine the looping algorithm with a sigmoid function that considers high-quality edges only.

Take the above example to illustrate:

  1. Based on your looping algorithm, the first case scores 0.9973 and the second case 0.9972.

  2. We can set a threshold, say 0.8, and a sigmoid function over the count n of edges whose raw weight exceeds the threshold; from the worked numbers below, this is consistent with sigmoid(n) = 1 / (1 + e^(-n/2)).

Note that if a result has >= 4 high-quality edges (raw weight > 0.8), it will get a very high score in this part.

Since only one edge in the first case has weight > 0.8, its normalized edge-count-based score is 0.6225. In the second case, there are 4 edges with weight > 0.8, so its normalized score is 0.88079.

Based on my previous method:

The final score of case 1 is 0.8 x 0.9973 + (1 - 0.8) x 0.6225 = 0.9223.
The final score of case 2 is 0.8 x 0.9972 + (1 - 0.8) x 0.88079 = 0.9739.
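As a sketch, reproducing those numbers (the sigmoid over the high-quality-edge count is the reconstruction noted in step 2):

  import math

  def loop_combine(weights: list[float], factor: float = 0.25) -> float:
      ordered = sorted(weights, reverse=True)
      w_r = ordered[0]
      for w_i in ordered[1:]:
          w_r += factor * (1.0 - w_r) * w_i
      return w_r

  def edge_count_score(weights: list[float], threshold: float = 0.8) -> float:
      # Sigmoid over the number of high-quality edges (raw weight > threshold)
      n = sum(1 for w in weights if w > threshold)
      return 1.0 / (1.0 + math.exp(-n / 2.0))

  def blended_score(weights: list[float]) -> float:
      return 0.8 * loop_combine(weights) + 0.2 * edge_count_score(weights)

  case1 = [0.994, 0.75, 0.75, 0.75, 0.68]  # 1 high-quality edge + 4 "normal" edges
  case2 = [0.994, 0.99, 0.93, 0.85]        # 4 high-quality edges
  print(round(blended_score(case1), 4))  # 0.9224 (the 0.9223 above used rounded intermediates)
  print(round(blended_score(case2), 4))  # 0.974  (the 0.9739 above used rounded intermediates)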

Does this modification look good?

chunyuma commented 3 months ago

> Compute the raw weight of each of the 4 edges between the node pair based on their primary_knowledge_source: 1a) The SemMedDB edge: base weight 0.5 + 0.01 * 18 pubs = 0.68 (note that pubs are capped at 20 as discussed above) 1b) The dgidb edge: base weight 0.85 1c) The drugbank edge: base weight 0.99 + 0.001 * 4 pubs = 0.994 (pubs capped at 20) 1d) The drugcentral edge: base weight 0.93. The base weights come from our planned table, I just guessed here. The table might also contain the pub-count bonus factor.

I think we probably also need to consider the scores of some edge attributes such as:

'probability': 0.5,
'normalized_google_distance': 0.8,
'jaccard_index': 0.5,
'probability_treats': 0.8,
'paired_concept_frequency': 0.5,
'observed_expected_ratio': 0.8,
'chi_square': 0.8,
'chi_square_pvalue': 0.8,
'MAGMA-pvalue': 1.0,
'Genetics-quantile': 1.0,
'pValue': 1.0,
'fisher_exact_test_p-value': 0.8,
'Richards-effector-genes': 0.5,
'feature_coefficient': 1.0,
'CMAP similarity score': 1.0,

Some edges may have more than one of the above attributes, so my idea is:

raw weight of an individual edge = sigmoid_function(base weight + attribute1 weight x attribute1 score + attribute2 weight x attribute2 score + ...). These attributes also include the publication count. One limitation of this calculation is that if an edge has more such attributes, its raw weight will be higher. Perhaps we can consider using your looping algorithm to combine these attribute scores instead.
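A sketch of that per-edge formula (the plain logistic shape is an assumption; the attribute weights would come from the table above):

  import math

  def raw_edge_weight(base: float, attributes: dict[str, float],
                      attribute_weights: dict[str, float]) -> float:
      """sigmoid(base + sum of attribute_weight x attribute_score); note the
      limitation mentioned above: more attributes push the raw weight higher."""
      total = base + sum(attribute_weights.get(name, 0.0) * score
                         for name, score in attributes.items())
      return 1.0 / (1.0 + math.exp(-total))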

edeutsch commented 3 months ago

Hi @chunyuma, great, thanks for thinking about this. I agree that my formula has a somewhat unpleasant side effect of preferring more edges over higher-weight edges. I think I might understand your proposed fix, but I confess I have trouble grappling with such multipart scores with sigmoids.

I will propose an alternate solution, which is simply to remove my foolish "0.25" idea. It is that factor that causes the bonus for more edges, so how about we simply remove it? That simplifies the equation and removes the unwanted side effect. It does accelerate the race to 1.0, which is what I was originally trying to avoid, but that seems like a lesser problem. We could compensate for this by doing the calculations in 1 - W space. So I'm proposing now simply:

invW_r = (1 - W_r) * (1 - W_i)

W_i     invW_r        W_r
0.994   0.006         0.994
0.93    0.00042       0.99958
0.85    0.000063      0.999937
0.68    0.00002016    0.99997984

W_i     invW_r        W_r
0.994   0.006         0.994
0.75    0.0015        0.9985
0.75    0.000375      0.999625
0.68    0.00012       0.99988

W_i     invW_r        W_r
0.994   0.006         0.994
0.99    0.00006       0.99994
0.93    0.0000042     0.9999958

W_i     invW_r        W_r
0.994   0.006         0.994
0.99    0.00006       0.99994
0.93    0.0000042     0.9999958
0.75    0.00000105    0.99999895
0.75    2.625E-07     0.999999738
0.68    8.4E-08       0.999999916
0.65    2.94E-08      0.999999971
0.6     1.176E-08     0.999999988
0.55    5.292E-09     0.999999995
0.5     2.646E-09     0.999999997

W_i     invW_r        W_r
0.994   0.006         0.994
0.75    0.0015        0.9985
0.75    0.000375      0.999625
0.68    0.00012       0.99988
0.65    0.000042      0.999958
0.6     0.0000168     0.9999832
0.55    0.00000756    0.99999244
0.5     0.00000378    0.99999622

W_i     invW_r        W_r
0.75    0.25          0.75

W_i     invW_r        W_r
0.5     0.5           0.5
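In code, the 1 - W space recursion is just a running product of complements (a noisy-OR-style combination); a sketch that reproduces the first table:

  def combine_edge_weights(weights: list[float]) -> float:
      # W_r = 1 - prod(1 - W_i): each additional edge shrinks the remaining
      # "distance to 1.0" by a factor of (1 - W_i)
      inv_w_r = 1.0
      for w_i in weights:
          inv_w_r *= (1.0 - w_i)
      return 1.0 - inv_w_r

  print(combine_edge_weights([0.994, 0.93, 0.85, 0.68]))  # ~0.99997984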

Anyway, I don't feel strongly about it.

As for the other edge attributes, I would treat those as edge-weight modifiers to come up with a final weight for each edge, and then just combine as normal. Probabilities could just stand in for the edge weight, if we believed them. P-values could be treated as a weight of 1 - pvalue, again, if we believed them. I'm not sure they're all to be taken at face value; maybe they should be scaled to 0.9 or something. It might be useful to create a catalog of examples (in a wiki page, for example) and then we could decide how to handle all of them. Or maybe you already did.

I think this is very encouraging!

chunyuma commented 3 months ago

Hi @edeutsch, thanks for continuing to propose new ideas.

Sorry, I might not understand some of the calculations you did. For example, in this case:

> W_i     invW_r        W_r
> 0.994   0.006         0.994
> 0.93    0.00042       0.99958
> 0.85    0.000063      0.999937
> 0.68    0.00002016    0.99997984

How did you get 0.99958 from the first row? Could you please clarify this for me? Thank you!

I am sorry, but it is difficult for me to determine which one we should choose, although I may still vote for my method. It seems like your method is simpler, though. Perhaps we can get some thoughts from @dkoslicki @saramsey.

edeutsch commented 3 months ago

Hi @chunyuma, sure, either approach is fine.

The equation is just W_r = W_r + (1 - W_r) * W_i

So 0.99958 = 0.994 + (1 - 0.994) * 0.93

chunyuma commented 3 months ago

Thanks @edeutsch. Sorry for the late response.

I have updated the ranking algorithm via this commit in this PR.

Basically, the update is based on a combination of my original method and your looping algorithm.

Here is what we do in the ranking algorithm:

  1. For each individual edge, we first calculate its confidence score via the looping equation you proposed (i.e., W_r = W_r + (1 - W_r) * W_i):

    • For each non-virtual edge (i.e., its edge_key contains infores), the base score is as follows:
      # how much we trust each data source
      self.data_source_base_weights = {'infores:semmeddb': 0.5,  # downweight semmeddb
                                       'infores:text-mining-provider': 0.85,
                                       'infores:drugcentral': 0.93,
                                       'infores:drugbank': 0.99,
                                       # we can define more customized weights for other data sources here later if needed
                                       }
    • For virtual edges (i.e., NGD virtual edges or inferred edges from xDTD), the base score is 0. That means the confidence depends heavily on the attribute score, which avoids giving too much weight to bad virtual edges. For example, if an inferred edge's score is only 0.3, we should use that score to value the edge instead of a fixed base score (e.g., 0.5 or 0.75).
    • The attribute score is based on a sigmoid transformation and an attribute weight. For example, if an xDTD score is 0.7, its normalized score is around 0.8 based on the sigmoid function (from the numbers here, approximately sigmoid(x) = 1 / (1 + e^(-14(x - 0.6)))). Let me explain this function: the transformation gives a very low normalized score to original xDTD scores below 0.4; the half point (normalized score = 0.5) is set at 0.6; and once the original score is larger than 0.8, the normalized score is very high (>0.9). This means we put more value on xDTD scores of at least 0.8.

    The attribute weights are:

    self.known_attributes_to_trust = {'probability': 0.8,
                                      'normalized_google_distance': 0.8,
                                      'jaccard_index': 0.5,
                                      'probability_treats': 0.8,
                                      'paired_concept_frequency': 0.5,
                                      'observed_expected_ratio': 0.8,
                                      'chi_square': 0.8,
                                      'chi_square_pvalue': 0.8,
                                      'MAGMA-pvalue': 1.0,
                                      'Genetics-quantile': 1.0,
                                      'pValue': 1.0,
                                      'fisher_exact_test_p-value': 0.8,
                                      'Richards-effector-genes': 0.5,
                                      'feature_coefficient': 1.0,
                                      'CMAP similarity score': 1.0,
                                      }

    This means that if an inferred edge from xDTD has an original score of 0.7, its final confidence is sigmoid_function(0.7) x 0.8 = 0.8 x 0.8 = 0.64.

  2. For each result with one or more supporting edges with confidence scores, the final result score again uses the looping equation W_r = W_r + (1 - W_r) * W_i. Example: given the edge confidence score list 0.994, 0.93, 0.85, 0.68, we have:

        Round   W_i      W_r
        1       0.994    0.994
        2       0.93     0.99958
        3       0.85     0.999937
        4       0.68     0.99997984

        Final result score = 0.99997984
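Putting these pieces together, a condensed sketch of the scheme (the sigmoid parameters follow the reconstruction above; the names are illustrative, not the actual ranker code):

  import math

  DATA_SOURCE_BASE_WEIGHTS = {"infores:semmeddb": 0.5,
                              "infores:text-mining-provider": 0.85,
                              "infores:drugcentral": 0.93,
                              "infores:drugbank": 0.99}

  def sigmoid_transform(score: float, midpoint: float = 0.6, steepness: float = 14.0) -> float:
      # ~0 below 0.4, 0.5 at the midpoint, >0.9 above 0.8
      return 1.0 / (1.0 + math.exp(-steepness * (score - midpoint)))

  def edge_confidence(source: str, attributes: dict[str, float],
                      attribute_weights: dict[str, float]) -> float:
      # Non-virtual edges start from their data-source base weight; virtual or
      # inferred edges (no infores source) start from 0, so their confidence
      # rests entirely on the transformed attribute scores.
      w_r = DATA_SOURCE_BASE_WEIGHTS.get(source, 0.0)
      for name, score in attributes.items():
          w_i = sigmoid_transform(score) * attribute_weights.get(name, 0.0)
          w_r = w_r + (1.0 - w_r) * w_i
      return w_r

  def result_score(edge_confidences: list[float]) -> float:
      # The same looping equation across a result's supporting edges
      w_r = 0.0
      for w_i in sorted(edge_confidences, reverse=True):
          w_r = w_r + (1.0 - w_r) * w_i
      return w_r

  print(result_score([0.994, 0.93, 0.85, 0.68]))  # ~0.99997984, matching the table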

Hope this makes sense.

edeutsch commented 3 months ago

Hi @chunyuma, okay, thanks. I think I understand most of it except for how the attributes fit into the equation; I don't see that anywhere. Also, I wonder why some attribute weights are 1.0? It seems like they should all be at least a little below 1.0. I don't see how they are applied. Does a MAGMA p-value of 0.1 still get a weight of 1.0?

chunyuma commented 3 months ago

Hi @edeutsch, sorry again for the late response.

Please see these lines for how the attributes fit into the equation. Basically, for each attribute score, we first put it through a corresponding sigmoid transformation function to scale it between 0 and 1, where 0 is worse and 1 is better. Since some original attribute scores are better when lower, the transformation function makes the scores consistent. Then, once we have a transformed score, the confidence score is calculated by multiplying the transformed score by its corresponding weight.

Say an edge has only one attribute, a MAGMA p-value of 0.1. After transformation, its transformed score may be around 0.15 (or maybe still 0.1); the confidence score is then that transformed score x 1.0, so it stays roughly 0.1. Does this make sense?

edeutsch commented 3 months ago

Hi @chunyuma, thanks. I confess it is not entirely clear to me, but it's probably good.

chunyuma commented 2 months ago

Closing, as this has been completed.