SystemsGenetics / KINC

Knowledge Independent Network Construction
MIT License
11 stars 4 forks source link

p-value test in Similarity analytic #64

Closed bentsherman closed 4 years ago

bentsherman commented 5 years ago

From studying gene pair plots in the past I've noticed that when looking at plots of gene pair correlations in a CMX, the p-value of the correlation is an extremely good indicator of whether a correlation is "good" or not. That is, strong correlations typically had p-values on the order of 1e-40 while bad correlations had p-values much closer to 1. I'd be curious to see if using a p-value threshold rather than a correlation threshold produces a scale-free network all on its own!

To that end, the similarity.py has a --pvalue option which allows you to filter out gene pairs by p-value in addition to correlation. I'd like to try and implement this p-value in KINC, and then see if it's effective. I think the p-value is based on Student's t-test but I'll have to study the scipy code to be sure. We could also test this idea now with similarity.py, if we have a small GEM that can produce a realistic network. Based on what I saw, I think 1e-20 would be a good p-value threshold for starters.

spficklin commented 5 years ago

I'm 100% supportive of this.

bentsherman commented 4 years ago

For Pearson and Spearman correlation we can use the following relation:

t_obs^2 = r^2 (n - 2) / (1 - r^2)

To get the associated p-value, we have to compute alpha such that t_obs = t(alpha, n - 2)

Using the scipy source code as a reference:

bentsherman commented 4 years ago

The nice thing here is that we only need (1) the correlation and (2) the sample size to perform this significance test, so it can be implemented as a separate analytic rather than as part of similarity. The challenge is that we have to compute this incomplete beta integral:

df = n - 2
prob = special.betainc(0.5 * df, 0.5, df/(df + t_squared))

After doing some digging, it doesn't look like betainc is implemented in Python, but rather in some low-level numerical library.

So one really easy solution would be to write a Python script that reads the network file and performs the proper significance test on each edge. We also might be able to do it in KINC via statslib (actually it looks like GCEM has an incomplete_beta function here).

spficklin commented 4 years ago

@bentsherman we have a power analysis filter (the corrpower annaltyic) built into KINC that will remove edges that don't have sufficient samples to justify the correlation score given a desired power (1-Beta) and significance level (alpha). It sounds quite similar to what you're trying to do here.

bentsherman commented 4 years ago

@spficklin I think you're right, and that corrpower analytic makes a lot more sense now... closing.