SystemsGenetics / KINC

Knowledge Independent Network Construction

Removal of Biased Edges #127

Closed: spficklin closed this issue 4 years ago

spficklin commented 4 years ago

When Gaussian mixture models (GMMs) are applied to 2D scatterplots of gene expression data, two types of bias can lead to false condition-specific discoveries:

The first is a lack of Differential Cluster Expression (DCE) in one of the two genes of an edge. A gene shows DCE when its mean expression in the in-cluster samples differs significantly from its mean expression in the out-cluster samples. If DCE is present in one gene but not the other, the gene with DCE can produce a condition-specific cluster on its own, simply because it shifts the expression of the condition-specific samples. For a cluster to be considered condition-specific, DCE must be present in both genes.

The second is a difference in the pattern of missingness between the two genes. If one gene has missing samples as a result of condition-specific expression, the resulting clusters will always be condition-specific, regardless of whether the other gene is expressed in those samples. This is because a sample is thrown out before correlation is performed if it is missing in either gene.
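To make the missingness bias concrete, here is a minimal Python sketch of pairwise deletion. The sample labels, expression values, and the helper name `complete_pairs` are all made up for illustration and are not taken from KINC; missing values are assumed to be encoded as `None`.

```python
def complete_pairs(gene_a, gene_b):
    """Keep only samples where both genes have a value (pairwise deletion)."""
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(gene_a, gene_b))
        if a is not None and b is not None
    ]

# Hypothetical data: gene_a is not detected in any control sample.
conditions = ["control"] * 4 + ["treated"] * 4
gene_a = [None, None, None, None, 7.1, 7.4, 6.9, 7.3]
gene_b = [3.0, 3.2, 2.9, 3.1, 3.3, 3.0, 3.2, 3.1]

kept = complete_pairs(gene_a, gene_b)
surviving_conditions = {conditions[i] for i, _, _ in kept}

# Only treated samples survive, so any cluster found in the remaining
# points looks "treated-specific" no matter how gene_b behaves.
print(len(kept), surviving_conditions)
```

Here gene_b is expressed at about the same level in every sample, yet every point that reaches the correlation step comes from the treated condition.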

Current Implementation in KINC.R

Currently, the KINC.R code tests for DCE using Welch's ANOVA. Any result with a p-value < 1e-3 is considered differentially expressed.
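The DCE check could be sketched roughly as follows. This is not the KINC.R code: it is a standard-library Python illustration that relies on the fact that, with only two groups (in-cluster vs out-cluster), Welch's ANOVA is equivalent to Welch's t-test (F = t²). The function names (`t_two_sided_p`, `welch_p_value`, `edge_has_dce`) and the data are hypothetical, and the p-value is obtained by simple numerical integration of the t density as a stand-in for a statistics library.

```python
import math
import statistics

def t_two_sided_p(t, df, steps=20000):
    """Two-sided Student's t p-value via Simpson integration of the density."""
    x = abs(t)
    if x == 0.0:
        return 1.0
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    h = x / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3  # P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * area)

def welch_p_value(in_cluster, out_cluster):
    """Welch's unequal-variance test for a difference in means."""
    na, nb = len(in_cluster), len(out_cluster)
    va = statistics.variance(in_cluster) / na
    vb = statistics.variance(out_cluster) / nb
    t = (statistics.mean(in_cluster) - statistics.mean(out_cluster)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t_two_sided_p(t, df)

def edge_has_dce(gene1_in, gene1_out, gene2_in, gene2_out, alpha=1e-3):
    """A cluster counts as condition-specific only if both genes show DCE."""
    return (welch_p_value(gene1_in, gene1_out) < alpha
            and welch_p_value(gene2_in, gene2_out) < alpha)

# Hypothetical expression values: gene 1 shifts in the cluster, gene 2 does not.
g1_in, g1_out = [8.1, 8.4, 7.9, 8.3, 8.0, 8.2], [5.0, 5.3, 4.8, 5.1, 4.9, 5.2]
g2_in, g2_out = [6.0, 6.4, 5.8, 6.2, 6.1, 5.9], [6.1, 5.9, 6.3, 6.0, 6.2, 5.8]

print(edge_has_dce(g1_in, g1_out, g2_in, g2_out))  # gene 2 lacks DCE
```

In this example only gene 1 passes the 1e-3 threshold, so the cluster would be treated as biased rather than condition-specific.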

We perform a paired t-test to check whether the pattern of missingness differs between the two genes in an edge. If the p-value > 0.1 we keep the edge; otherwise there is sufficient evidence that the patterns may differ, and we throw the edge out.
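One way such a missingness test might look is sketched below in standard-library Python. It assumes missingness is encoded as a 0/1 indicator per sample and that the paired t-test is run on those indicators; that encoding is an assumption for illustration, not something confirmed from KINC.R. The names `missingness_p_value` and `keep_edge` and the data are hypothetical.

```python
import math
import statistics

def t_two_sided_p(t, df, steps=20000):
    """Two-sided Student's t p-value via Simpson integration of the density."""
    x = abs(t)
    if x == 0.0:
        return 1.0
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(u):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(u * u / df))

    h = x / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return max(0.0, 1.0 - 2.0 * total * h / 3)

def missingness_p_value(gene_a, gene_b):
    """Paired t-test on per-sample missingness indicators (1 = missing)."""
    diffs = [int(a is None) - int(b is None) for a, b in zip(gene_a, gene_b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)
    if sd_d == 0.0:
        # Every sample has the same difference: identical patterns (all 0)
        # or one gene missing everywhere the other is present.
        return 1.0 if mean_d == 0 else 0.0
    t = mean_d / (sd_d / math.sqrt(n))
    return t_two_sided_p(t, n - 1)

def keep_edge(gene_a, gene_b, threshold=0.1):
    """Keep the edge only when there is no evidence the patterns differ."""
    return missingness_p_value(gene_a, gene_b) > threshold

# Hypothetical data: gene_a is missing in 8 of 20 samples, gene_b in none.
gene_a = [None] * 8 + [5.0 + 0.1 * i for i in range(12)]
gene_b = [6.0 + 0.1 * i for i in range(20)]

# Hypothetical data: both genes have one missing value that the other lacks.
gene_c = [None, None, 4.0, 4.2, 4.1, 3.9, 4.3, 4.0]
gene_d = [3.9, None, None, 4.1, 4.0, 4.2, 3.8, 4.1]

print(keep_edge(gene_a, gene_b), keep_edge(gene_c, gene_d))
```

With these inputs the first pair is discarded and the second is kept, which matches the p-value > 0.1 rule described above.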

We should implement these in KINC so the end-user does not have to take the extracted network and load it up in KINC.R.

Missingness test:

There are two possible ways to do this:

  1. In the similarity analytic, we could perform a missingness test before fitting GMMs, so that GMM fitting can be skipped when the missingness patterns of the two genes appear to differ.
  2. In a new analytic named filter-bias, we could perform the test using the results from the similarity step.

DCE test:

There are two possible ways to do this:

  1. In the similarity analytic, after GMMs have been found, we can test each good cluster with Welch's ANOVA and keep only the clusters that pass.
  2. In a new analytic named filter-bias, we could perform the test using the results from the similarity step.

spficklin commented 4 years ago

With the temporary Rscript solution now in the scripts folder, I'm going to close this out and tag it with "For the future".