Closed bentsherman closed 4 years ago
I'm 100% supportive of this.
The nice thing here is that we only need (1) the correlation and (2) the sample size to perform this significance test, so it can be implemented as a separate analytic rather than as part of similarity. The challenge is that we have to compute this incomplete beta integral:
df = n - 2
prob = special.betainc(0.5 * df, 0.5, df/(df + t_squared))
After doing some digging, it doesn't look like betainc
is implemented in Python, but rather in some low-level numerical library.
So one really easy solution would be to write a Python script that reads the network file and performs the proper significance test on each edge. We also might be able to do it in KINC via statslib (actually it looks like GCEM has an incomplete_beta
function here).
@bentsherman we have a power analysis filter (the corrpower
annaltyic) built into KINC that will remove edges that don't have sufficient samples to justify the correlation score given a desired power (1-Beta) and significance level (alpha). It sounds quite similar to what you're trying to do here.
@spficklin I think you're right, and that corrpower analytic makes a lot more sense now... closing.
From studying gene pair plots in the past I've noticed that when looking at plots of gene pair correlations in a CMX, the p-value of the correlation is an extremely good indicator of whether a correlation is "good" or not. That is, strong correlations typically had p-values on the order of
1e-40
while bad correlations had p-values much closer to 1. I'd be curious to see if using a p-value threshold rather than a correlation threshold produces a scale-free network all on its own!To that end, the
similarity.py
has a--pvalue
option which allows you to filter out gene pairs by p-value in addition to correlation. I'd like to try and implement this p-value in KINC, and then see if it's effective. I think the p-value is based on Student's t-test but I'll have to study thescipy
code to be sure. We could also test this idea now withsimilarity.py
, if we have a small GEM that can produce a realistic network. Based on what I saw, I think1e-20
would be a good p-value threshold for starters.