Hi @chunyuma I don't think I fully understand the "publications" part of this question.
I think it does make sense to refine our weights a bit in a way that conveys a "trust factor". Maybe that trust factor can include multiple components. So here're some ideas. I'm not certain this really answers your question, but maybe you can elaborate if not:
Maybe the base trust factor (weight) is based on where the edge comes from:
- SemMed: 0.5
- text-mining-provider: 0.8
- gold databases like drugcentral: 0.95
And then maybe we can have modifiers based on the number of publications:
- SemMed: 0.5 + 0.01/pub
- text-mining-provider: 0.8 + 0.005/pub
- gold databases like drugcentral: 0.95 + 0.005/pub

Maybe cap the number of pubs at 20 or something, so more than 20 just counts as 20 (is 50 pubs really better than 20?). Plus an absolute cap at 1.0. If the total ever goes higher than 1.0, just limit it to 1.0 as "totally solid".
So this means that a SemMed with 20+ publications is a weight of 0.7, a t-m-p with 20+ publications would be 0.9, and a drugcentral edge with 20+ publications would be 1.0 (because of cap). I have no hard data to back this up, but it feels sorta right. This allows the number of publications to boost the base weight in a limited scope, while still letting drugcentral edges dominate.
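To make that concrete, here is a minimal sketch of the arithmetic in Python (function and parameter names are hypothetical, not from any existing module):

```python
def edge_weight(base, per_pub=0.0, n_pubs=0, pub_cap=20):
    """Base source trust plus a capped per-publication bonus; total capped at 1.0."""
    return min(base + per_pub * min(n_pubs, pub_cap), 1.0)

print(edge_weight(0.5, 0.01, 25))    # SemMed with 20+ pubs -> 0.7
print(edge_weight(0.8, 0.005, 25))   # text-mining-provider -> 0.9
print(edge_weight(0.95, 0.005, 25))  # drugcentral -> 1.05, capped to 1.0
```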
There could be some other modifiers that we could establish that would either boost or penalize these weights. I don't know what they might be, but we could potentially look at some data and decide what feels right about improving or reducing our confidence.
One other thing to consider is: should 2 SemMed edges beat out a drugcentral edge? Probably not. So multiple edges would only contribute minimally over the best edge weight. So maybe for a given node pair, determine the best weight and use that as the base and then only minimally boost the final weight for each additional edge. (how many SemMedDB edges would you need to beat out a single DrugCentral edge? I dunno. Maybe never)
Ideally we would have a training set with true positives and false positives that would guide this, but putting that together would be hard. So for now I can only offer something that "feels right".
Does that answer the question?
Thanks so much for your idea and answers @edeutsch. I get it now.
If we think drugcentral is a gold database, then I know how to treat it.
Question: besides drugcentral and drugbank, are there any other databases we should treat as "gold databases"?
> One other thing to consider is: should 2 SemMed edges beat out a drugcentral edge? Probably not. So multiple edges would only contribute minimally over the best edge weight. So maybe for a given node pair, determine the best weight and use that as the base and then only minimally boost the final weight for each additional edge. (how many SemMedDB edges would you need to beat out a single DrugCentral edge? I dunno. Maybe never)
I think the quality of publications in a drugcentral edge is much higher than that in a SemMed edge, so I would not use the same strategy for both. Also, not all drugcentral edges contain publication info; in fact, most of them don't. So I would probably ignore this info from drugcentral and always treat this database as a "gold database". Does this make sense to you @edeutsch?
Hi KG2 team (@saramsey, @amykglen, @sundareswarpullela) or @edeutsch, could you please let me know how we define an edge with biolink:knowledge_level, or which edges can contain biolink:knowledge_level in their attributes? Am I correct to consider edges with biolink:knowledge_level in their original attributes as high-quality or gold edges that we tend to trust?
It is my understanding that eventually all edges should have an attribute knowledge_level and agent_type. So their mere presence will not be remarkable. However, their values will be very valuable. Here is the implementation guidance:
Supporting both the short-term and long-term implementations would be good. Maybe we should get together to discuss how to weight the different levels, and also find out what was implemented in KG2... is that documented somewhere, @saramsey?
> I think the quality of publications in a drugcentral edge is much higher than that in a SemMed edge, so I would not use the same strategy for both. Also, not all drugcentral edges contain publication info; in fact, most of them don't. So I would probably ignore this info from drugcentral and always treat this database as a "gold database". Does this make sense to you @edeutsch?
Seems reasonable. It would be great to have a document from the KG2 team on how they view the trustworthiness of all their knowledge sources.
And the question is: does a drugcentral edge with publications lend more confidence than a drugcentral edge without? I don't know the answer. If yes, then that can be a factor. If not, then, you're right, no.
> And the question is: does a drugcentral edge with publications lend more confidence than a drugcentral edge without? I don't know the answer. If yes, then that can be a factor. If not, then, you're right, no.
I have no idea about this question either. For now, I will trust all drugcentral edges with the same weight.
@edeutsch, I have already updated the ranker algorithm to treat drugcentral as a gold database; please see this PR.
I wonder if it is appropriate/useful to insert weights into this yaml table, with an estimate of what default weight should be applied to edges from each source? Maybe DrugBank is 0.99, SemMedDB is 0.5, and DrugCentral 0.92. These could then be used by the ranker?
although the Ranker also needs to consider edges from other sources in the Translator ecosystem, not just KG2
Hi @edeutsch,
The weight table we currently use for different KPs is here. Right now, only semmeddb and text-mining-provider are given weights; other KPs use the default weight, which is 1. I have opened another github issue #2338 for a better weight table.
Another thing I will do is remove drugcentral from the reliable KP list (aka "gold" databases). Currently only drugbank and drugcentral are in this list.
What do you think of these changes?
Hi @chunyuma I think this sounds good, but I'm thinking it would be even better to NOT have a "reliable KP list" but rather have a single weights table where drugbank gets a 0.99 (near perfect gold), drugcentral gets a 0.93 maybe, text-mining provider gets a 0.8, SemMedDB a 0.5, etc. ideally an entry in the table for all the primary sources that we know about. Maybe we can crowdsource (teamsource?) the initial weights for all the sources with input from the team members that know the sources the best. And then the ranking algorithm would use those weights from the teamsourced table? Maybe rather than a default of 1, it should be a default of 0.75 to downweight sources that we don't explicitly weight.
Hi @edeutsch, yes, I will use a single weight table for assessing each individual edge. However, some ARAX results contain multiple edges from different data sources, so we also need a way to comprehensively assess a result based on its edges.
Below are the three parts I am using to calculate the final score of an ARAX result in the algorithm:
So the final score = 0.8 x (score of 1) + 0.1 x (score of 2) + 0.1 x (score of 3).
@edeutsch @saramsey @dkoslicki, what do you think of this method for calculating ranking scores?
Thanks, @chunyuma I guess I don't really fully understand this, especially item 2. But I think best in examples, so maybe we can walk through an example. How about this query: https://arax.ncats.io/?r=b3da1551-d518-445c-a256-f9a2482bc50d which is TestCase13, which we've discussed before in the very nice thread at #2300
Let's focus on result 2 enalapril. What should its rank score be?
it's just a single pair of nodes with 4 edges. If I were designing something, I think this is how I would do it:
1) Compute the raw weight of each of the 4 edges between the node pair based on their primary_knowledge_source:
1a) The SemMedDB edge: base weight 0.5 + 0.01 * 18 pubs = 0.68 (note that pubs are capped at 20 as discussed above)
1b) The dgidb edge: base weight 0.85
1c) The drugbank edge: base weight 0.99 + 0.001 * 4 pubs = 0.994 (pubs capped at 20)
1d) The drugcentral edge: base weight 0.93
The base weights come from our planned table, I just guessed here. The table might also contain the pub count bonus factor.
2) Now we have 4 edges, each with a weight. How do we combine them? I can't think of a tidy formula, but I can think of a crazy looping algorithm (sorry, I am notorious for coming up with such ridiculous things):
2a) Sort all the edges in descending order of raw weight: 0.994, 0.93, 0.85, 0.68
2b) Start with the first running weight: W_r = 0.994
2c) Loop through the rest of the weights W_i (i = 2..n), computing a running weight as follows: W_r = W_r + 0.25 * (1 - W_r) * W_i. Conceptually, each additional edge weight can close the distance 1 - W_r by a quarter times W_i; this ensures you never get to 1, but get progressively closer to it the more edges you have.
2d) If you end up over 1.0, cap at 1.0
2e) So in this example, you get:
| W_i | W_r |
| --- | --- |
| 0.994 | 0.994 |
| 0.93 | 0.995395 |
| 0.85 | 0.996373563 |
| 0.68 | 0.996990057 |

(each new W_r = W_r + 0.25 * (1 - W_r) * W_i)
The best edge always dominates, but can be further boosted by more edges in a way that gets you closer to 1.0. The 0.25 is arbitrary; maybe 0.333 is better, I don't know, it seemed like a good number at the time. You want each edge to give you a boost, but not too much. One high-quality DrugBank edge should still beat out several SemMedDB edges.
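For concreteness, here is a minimal Python sketch of this looping algorithm (the function name and the closing_factor parameter are mine, not from any ARAX module):

```python
def combine_edge_weights(weights, closing_factor=0.25):
    """The best edge dominates; each additional edge closes a fraction
    (closing_factor * W_i) of the remaining distance to 1.0."""
    ordered = sorted(weights, reverse=True)
    w_r = ordered[0]
    for w_i in ordered[1:]:
        w_r += closing_factor * (1.0 - w_r) * w_i
    return min(w_r, 1.0)

print(combine_edge_weights([0.994, 0.93, 0.85, 0.68]))  # ~0.996990057
```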
What about perindoprilat in the same result set with 2 edges?
This is all good for a one-hop query. What about 2 hops or N hops? How about just multiplying the final W_r's for each hop and calling it done? This penalizes more hops in a way that seems okay (as long as the weights are less than 1.0).
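A sketch of that multi-hop rule, assuming one combined weight per hop (the numbers below are made up for illustration):

```python
from math import prod

# Hypothetical two-hop result: combine the edges within each hop first
# (e.g., with combine_edge_weights above), then multiply across hops.
hop_weights = [0.997, 0.85]
print(prod(hop_weights))  # 0.84745 -- each extra hop pulls the score down
```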
I think this is a good way to do it, but maybe others think this is lunacy. The only remaining trick then is to set a table/matrix of values:
| PrimarySrc | base | perPub | PubLim |
| --- | --- | --- | --- |
| SemMedDB | 0.5 | 0.01 | 20 |
| dgidb | 0.85 | 0.005 | 20 |
| DrugCentral | 0.93 | 0.001 | 20 |
| DrugBank | 0.99 | 0.001 | 10 |
| default | 0.75 | 0.008 | 20 |
This is the numerology part. We need to assign a confidence weight to each source, and decide how much each publication can boost the score for each source. Kinda arbitrary, but I bet we can teamsource some good numbers here. And if anyone ever does a study to compute the true FDR of knowledge in these sources, we could replace these with 1 - FDR or something. In the meantime we just go with what feels right. And then we can easily tweak the ranking algorithm just by tweaking the source base weights and pub modifier weights (and potentially coming up with other modifiers for specific sources as appropriate).
Soon we're going to get knowledge_level and agent_type information. I'm thinking that we can easily augment this system with base weights and/or weight modifiers based on KL and AT. But we probably need to understand more about what values we'll be seeing before we can add that. It should be easy to integrate.
How crazy is that?
Hi @edeutsch, thanks so much for proposing this creative idea. I like your looping algorithm. I can see that this idea combines both raw individual weight and edge count, but it relies more on the edge count.
For example, say we have a result with 1 high-quality edge and 4 "normal" edges, whose weights are respectively 0.994, 0.75, 0.75, 0.75, 0.68. Based on your looping algorithm, its combined score is 0.997328. If we have another result with 4 high-quality edges with weights 0.994, 0.99, 0.93, 0.85, its combined score is 0.9972, which is lower. But I think in this case, the second result should be ranked higher.
My suggested modification is that we can combine the loop algorithm and a sigmoid function that considers high-quality edges only.
Take the above example to illustrate:
Based on your loop algorithm, the first case is 0.9973, the second case is 0.9972.
We can set a threshold, say 0.8, and a sigmoid function like sigmoid(n/2) = 1 / (1 + e^(-n/2)), where n is the number of edges with raw weight above the threshold.
Note that if a result has >= 4 high-quality edges (raw weight > 0.8), it will have a very high score in this part.
Since there is only one edge with weight > 0.8 in the first case, its normalized edge-count-based score is sigmoid(0.5) = 0.6225. In the second case, there are 4 edges with weight > 0.8, so its normalized score is sigmoid(2) = 0.88079.
Based on my previous method:
The final score of case 1 is 0.8 x 0.9973 + (1 - 0.8) x 0.6225 = 0.9223. The final score of case 2 is 0.8 x 0.9972 + (1 - 0.8) x 0.88079 = 0.9739.
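Putting the two parts together, here is a sketch of this blended score under my reading of the numbers above (the n/2 scaling inside the sigmoid is inferred from the 0.6225 and 0.88079 values; the function names and the 0.8 mixing weight follow the example):

```python
import math

def combine_edge_weights(weights, closing_factor=0.25):
    """Looping combination from earlier in the thread."""
    ordered = sorted(weights, reverse=True)
    w_r = ordered[0]
    for w_i in ordered[1:]:
        w_r += closing_factor * (1.0 - w_r) * w_i
    return min(w_r, 1.0)

def blended_result_score(edge_weights, threshold=0.8, mix=0.8):
    """Mix the looping score with a sigmoid of the high-quality edge count."""
    loop_score = combine_edge_weights(edge_weights)
    n_high = sum(1 for w in edge_weights if w > threshold)
    count_score = 1.0 / (1.0 + math.exp(-n_high / 2))
    return mix * loop_score + (1 - mix) * count_score

print(blended_result_score([0.994, 0.75, 0.75, 0.75, 0.68]))  # ~0.9223
print(blended_result_score([0.994, 0.99, 0.93, 0.85]))        # ~0.9739
```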
Does this modification look good?
> 1) Compute the raw weight of each of the 4 edges between the node pair based on their primary_knowledge_source:
> 1a) The SemMedDB edge: base weight 0.5 + 0.01 * 18 pubs = 0.68 (note that pubs are capped at 20 as discussed above)
> 1b) The dgidb edge: base weight 0.85
> 1c) The drugbank edge: base weight 0.99 + 0.001 * 4 pubs = 0.994 (pubs capped at 20)
> 1d) The drugcentral edge: base weight 0.93
> The base weights come from our planned table, I just guessed here. The table might also contain the pub count bonus factor.
I think we probably also need to consider the scores of some edge attributes such as:
```python
'probability': 0.5,
'normalized_google_distance': 0.8,
'jaccard_index': 0.5,
'probability_treats': 0.8,
'paired_concept_frequency': 0.5,
'observed_expected_ratio': 0.8,
'chi_square': 0.8,
'chi_square_pvalue': 0.8,
'MAGMA-pvalue': 1.0,
'Genetics-quantile': 1.0,
'pValue': 1.0,
'fisher_exact_test_p-value': 0.8,
'Richards-effector-genes': 0.5,
'feature_coefficient': 1.0,
'CMAP similarity score': 1.0,
```
Some edges may have more than one of the above attributes, so my idea is:
raw weight of individual edge = sigmoid_function(base weight + attribute1 weight * attribute1 score + attribute2 weight * attribute2 score + ...). These attributes also include publication count. One limitation of this calculation is that if an edge has more such attributes, its raw weight will be higher. Perhaps we can consider using your loop algorithm to combine these attribute scores instead.
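If I read this right, a minimal sketch of that raw-weight formula looks like the following (the sigmoid and all names are stand-ins; the attribute weights come from the list above):

```python
import math

def raw_edge_weight(base_weight, attribute_scores, attribute_weights):
    """Squash the base weight plus the weighted attribute scores through a sigmoid."""
    total = base_weight + sum(attribute_weights.get(name, 0.0) * score
                              for name, score in attribute_scores.items())
    return 1.0 / (1.0 + math.exp(-total))

# Illustrative values only:
print(raw_edge_weight(0.5,
                      {'probability': 0.7, 'jaccard_index': 0.4},
                      {'probability': 0.5, 'jaccard_index': 0.5}))
```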
Hi @chunyuma great, thanks for thinking about this. I agree that my formula has a somewhat unpleasant side effect of preferring more edges over higher weight edges. I think I might understand your proposed fix, but I confess I have trouble grappling with such multipart scores with sigmoids.
I will propose an alternate solution, which is simply to remove my foolish "0.25" idea. It is this factor that causes the bonus for more edges. How about we simply remove it? That simplifies the equation and removes that unwanted side effect. It does accelerate the race to 1.0, which is what I was originally trying to avoid, but that seems like a lesser problem. We could compensate for this by doing the calculations in 1 - W space. So I'm proposing now simply:
invW_r = (1 - W_r) * (1 - W_i)
| W_i | invW_r | W_r |
| --- | --- | --- |
| 0.994 | 0.006 | 0.994 |
| 0.93 | 0.00042 | 0.99958 |
| 0.85 | 0.000063 | 0.999937 |
| 0.68 | 0.00002016 | 0.99997984 |

| W_i | invW_r | W_r |
| --- | --- | --- |
| 0.994 | 0.006 | 0.994 |
| 0.75 | 0.0015 | 0.9985 |
| 0.75 | 0.000375 | 0.999625 |
| 0.68 | 0.00012 | 0.99988 |

| W_i | invW_r | W_r |
| --- | --- | --- |
| 0.994 | 0.006 | 0.994 |
| 0.99 | 6E-05 | 0.99994 |
| 0.93 | 0.0000042 | 0.9999958 |

| W_i | invW_r | W_r |
| --- | --- | --- |
| 0.994 | 0.006 | 0.994 |
| 0.99 | 6E-05 | 0.99994 |
| 0.93 | 0.0000042 | 0.9999958 |
| 0.75 | 0.00000105 | 0.99999895 |
| 0.75 | 2.625E-07 | 0.999999738 |
| 0.68 | 8.4E-08 | 0.999999916 |
| 0.65 | 2.94E-08 | 0.999999971 |
| 0.6 | 1.176E-08 | 0.999999988 |
| 0.55 | 5.292E-09 | 0.999999995 |
| 0.5 | 2.646E-09 | 0.999999997 |

| W_i | invW_r | W_r |
| --- | --- | --- |
| 0.994 | 0.006 | 0.994 |
| 0.75 | 0.0015 | 0.9985 |
| 0.75 | 0.000375 | 0.999625 |
| 0.68 | 0.00012 | 0.99988 |
| 0.65 | 0.000042 | 0.999958 |
| 0.6 | 0.0000168 | 0.9999832 |
| 0.55 | 0.00000756 | 0.99999244 |
| 0.5 | 0.00000378 | 0.99999622 |

| W_i | invW_r | W_r |
| --- | --- | --- |
| 0.75 | 0.25 | 0.75 |

| W_i | invW_r | W_r |
| --- | --- | --- |
| 0.5 | 0.5 | 0.5 |
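In code, this simplified rule is even shorter than the looping version; a minimal sketch (names mine):

```python
def combine_edge_weights_inv(weights):
    """Work in (1 - W) space: each edge multiplies the remaining distance
    to 1.0 by (1 - W_i), so the result never quite reaches 1.0."""
    inv_w_r = 1.0
    for w_i in weights:
        inv_w_r *= (1.0 - w_i)
    return 1.0 - inv_w_r

print(combine_edge_weights_inv([0.994, 0.93, 0.85, 0.68]))  # ~0.99997984
```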
Anyway, I don't feel strongly about it.
As far as the other edge attributes, I would treat those as edge weight modifiers to come up with a final weight for each edge, and then just combine as normal. Probabilities could just stand in for the edge weight, if we believed them. p-values could be treated as a weight of 1 - pvalue, again if we believed them. I'm not sure they're all to be taken at face value; maybe they should be scaled to 0.9 or something. It might be useful to create a catalog of examples (in a wiki page for example) and then we could decide how to handle all of them. Or maybe you already did.
I think this is very encouraging!
Hi @edeutsch, thanks for continuing to propose new ideas.
Sorry, I might not understand some of the calculations you did. For example, in this case:
> | W_i | invW_r | W_r |
> | --- | --- | --- |
> | 0.994 | 0.006 | 0.994 |
> | 0.93 | 0.00042 | 0.99958 |
> | 0.85 | 0.000063 | 0.999937 |
> | 0.68 | 0.00002016 | 0.99997984 |
How did you get 0.99958 from the first row? Could you please clarify this for me? Thank you!
I am sorry that it is difficult for me to determine which one we should choose, although I may still vote for my method. But it seems like your method is simpler. Probably we can get some thoughts from @dkoslicki @saramsey.
Hi @chunyuma, sure, either approach is fine.
The equation is just W_r = W_r + (1 - W_r) * W_i.
So 0.99958 = 0.994 + (1 - 0.994) * 0.93.
Thanks @edeutsch. Sorry for the late response.
I have updated the ranking algorithm via this commit in this PR.
Basically, the update is based on the combination of my original method and your looping algorithm.
Here is what the ranking algorithm does.
For each individual edge, we first calculate its confidence score via the looping equation you proposed (i.e., W_r = W_r + (1 - W_r) * W_i). If the edge comes from a data source with a customized weight (i.e., its edge_key contains the infores curie), its base score follows:
```python
# how much we trust each data source
self.data_source_base_weights = {'infores:semmeddb': 0.5,  # downweight semmeddb
                                 'infores:text-mining-provider': 0.85,
                                 'infores:drugcentral': 0.93,
                                 'infores:drugbank': 0.99
                                 # we can define more customized weights for other data sources here later if needed.
                                 }
```
The attribute weights are:
```python
self.known_attributes_to_trust = {'probability': 0.8,
                                  'normalized_google_distance': 0.8,
                                  'jaccard_index': 0.5,
                                  'probability_treats': 0.8,
                                  'paired_concept_frequency': 0.5,
                                  'observed_expected_ratio': 0.8,
                                  'chi_square': 0.8,
                                  'chi_square_pvalue': 0.8,
                                  'MAGMA-pvalue': 1.0,
                                  'Genetics-quantile': 1.0,
                                  'pValue': 1.0,
                                  'fisher_exact_test_p-value': 0.8,
                                  'Richards-effector-genes': 0.5,
                                  'feature_coefficient': 1.0,
                                  'CMAP similarity score': 1.0,
                                  }
```
This means if an inferred edge from xDTD has an original score of 0.7, its final confidence is sigmoid_function(0.7) x 0.8 = 0.8 x 0.8 = 0.64.
For each result with one or more supporting edges with confidence scores, its final result score still uses the looping equation W_r = W_r + (1 - W_r) * W_i.
Example: given the edge confidence score list 0.994, 0.93, 0.85, 0.68, we have:

| Round | W_i | W_r |
| --- | --- | --- |
| 1 | 0.994 | 0.994 |
| 2 | 0.93 | 0.99958 |
| 3 | 0.85 | 0.999937 |
| 4 | 0.68 | 0.99997984 |
Final result score = 0.99997984
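For reference, this is the earlier looping sketch with the 0.25 damping removed; a minimal version (the function name is mine):

```python
def final_result_score(edge_confidences):
    """W_r = W_r + (1 - W_r) * W_i, applied over all supporting edges."""
    w_r = edge_confidences[0]
    for w_i in edge_confidences[1:]:
        w_r += (1.0 - w_r) * w_i
    return w_r

print(final_result_score([0.994, 0.93, 0.85, 0.68]))  # ~0.99997984
```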
Hope this makes sense.
Hi @chunyuma, okay, thanks. I think I understand most of it except for how the attributes fit into the equation. I don't see that anywhere. Also, I wonder why some attribute weights are 1.0? Seems like they should all be below 1.0, at least a little bit? I don't see how they are applied. Does a MAGMA p-value of 0.1 still get a weight of 1.0?
Hi @edeutsch, sorry again for the late response.
Please see these lines for how the attributes fit into the equation. Basically, for each attribute score, we first put it through a corresponding sigmoid transformation function to scale it between 0 and 1, where 0 is worse and 1 is better. Some original attribute scores have the property that a lower score is better, so I think using the transformation function here is better because it makes the scores consistent. Then, once we have a transformed score, the confidence score is calculated by multiplying the transformed score by its corresponding weight.
Say an edge has only one attribute, a MAGMA p-value of 0.1. After transformation, its transformed score may be 0.15 (or maybe still 0.1); the confidence score is then the transformed score x 1.0, so it stays essentially the same. Does this make sense?
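As I understand the described pattern, the per-attribute confidence looks something like the sketch below (the transform function is a stand-in; the actual curves live in the linked ARAX code):

```python
def attribute_confidence(name, value, weights, transforms):
    """Map the raw attribute value into [0, 1] (higher = better) with a
    per-attribute transform, then scale by the attribute's trust weight."""
    return transforms[name](value) * weights.get(name, 0.0)

# Constant stand-in transform, for illustration only:
demo_transforms = {'MAGMA-pvalue': lambda p: 0.15}
print(attribute_confidence('MAGMA-pvalue', 0.1,
                           {'MAGMA-pvalue': 1.0}, demo_transforms))  # 0.15
```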
Hi @chunyuma, thanks. I confess it is not entirely clear to me, but it's probably good.
Closing this, as it has been completed.
@dkoslicki @edeutsch,

I found that an infores:drugcentral edge can also contain publication information (see the infores:drugcentral edge in result 69). It seems like not all infores:drugcentral edges have publication information.

My question is: for the new ranking algorithm, besides SemMedDB, if an edge from another knowledge provider also contains 'attribute_type_id': 'biolink:publications', should we consider its publication count in the algorithm?

For example: currently, the weight for publications is 0.5, for text-mining-provider it is 0.8, and for drugcentral it is 1. The confidence is calculated as: normalized score x weight. The normalized score for both drugcentral and text-mining-provider is 1, while the normalized score for publications is based on the publication count.