SystemsGenetics / KINC

Knowledge Independent Network Construction
MIT License

Hypergeometric test performs poorly for sub-categories. #166

Closed spficklin closed 4 years ago

spficklin commented 4 years ago

Consider a case where there is a subspecies and two genotypes that belong to that subspecies. Suppose an edge is highly specific to the subspecies and contains samples from both genotypes. The current use of the hypergeometric test will report a condition-specific edge for the subspecies and for both of the genotypes that belong to the edge. Unfortunately, if the genotypes have different numbers of samples in the cluster, then one or the other may result in a significant edge but not both. This makes it seem as if one genotype has a unique edge, which may be misleading because the condition-specificity is at the subspecies level, not the genotype level. The hypergeometric test cannot easily detect this problem.
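To make the failure mode concrete, here is a minimal Python sketch with made-up sample counts (100 samples in total, a 30-sample cluster, and two genotypes of 20 samples each). None of these numbers come from KINC, and `hypergeom_upper_tail` is a hypothetical helper, not a KINC function.

```python
from math import comb


def hypergeom_upper_tail(population, labelled, cluster, observed):
    """P(X >= observed) when `cluster` samples are drawn without replacement
    from `population` samples, `labelled` of which carry the label."""
    total = comb(population, cluster)
    upper = min(labelled, cluster)
    hits = sum(
        comb(labelled, i) * comb(population - labelled, cluster - i)
        for i in range(observed, upper + 1)
    )
    return hits / total


# Hypothetical data: the cluster holds 18 genotype-A samples, 8 genotype-B
# samples and 4 samples from outside the subspecies.
POPULATION, CLUSTER = 100, 30

p_subspecies = hypergeom_upper_tail(POPULATION, 40, CLUSTER, 26)
p_genotype_a = hypergeom_upper_tail(POPULATION, 20, CLUSTER, 18)
p_genotype_b = hypergeom_upper_tail(POPULATION, 20, CLUSTER, 8)

print("subspecies:", p_subspecies)
print("genotype A:", p_genotype_a)
print("genotype B:", p_genotype_b)
```

With these counts the subspecies and genotype A come out strongly enriched while genotype B does not reach a 0.05 threshold, so the edge looks unique to genotype A even though it is really a subspecies-level edge.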

Additionally, such an edge may be genuinely condition-specific for a genotype if the genotype samples are "phased", that is, they have a different mean and variance. Distinguishing these "phased" edges from those that are not phased can be of interest.

To fix this, the hypergeometric test should be swapped for another test. One idea is to use a z-score test of proportions in two tests:

Test 1.
Ho: the samples with label x in the cluster have a proportion of 0.5 or less.
Ha: the samples with label x in the cluster have a proportion greater than 0.5.
Method: Use a bootstrap approach that randomly selects 30 samples from the cluster. Repeat this for n iterations and take the average proportion. This test ensures that most of the samples in the cluster are specific to the categorical label.

Test 2.
Ho: the samples in the cluster with label x have a proportion less than or equal to the label's proportion in the population of all samples.
Ha: the samples in the cluster with label x have a proportion greater than the label's proportion in the population of all samples.
Method: Use the bootstrap method as before. This test ensures that the edge contains almost all of the samples with label x.
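The two tests could be sketched roughly as follows in Python. This is an illustration, not the code in KINC. It assumes the 30 samples are drawn with replacement, that the averaged bootstrap proportion is fed into a one-sided one-sample z-test with n = 30, and that the label's population proportion is strictly between 0 and 1. The names `bootstrap_proportion`, `upper_tail_p` and `proportion_tests` are hypothetical.

```python
import math
import random


def bootstrap_proportion(cluster_flags, draw_size, iterations, rng):
    """Average proportion of label-x samples over repeated random draws.
    `cluster_flags` holds True for each cluster sample that carries label x."""
    total = 0.0
    for _ in range(iterations):
        draw = rng.choices(cluster_flags, k=draw_size)  # assumption: with replacement
        total += sum(draw) / draw_size
    return total / iterations


def upper_tail_p(p_hat, p0, n):
    """One-sided z-test of a proportion: Ho p <= p0 versus Ha p > p0."""
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    z = (p_hat - p0) / math.sqrt(p0 * (1.0 - p0) / n)
    return 0.5 * math.erfc(z / math.sqrt(2.0))  # upper tail of the standard normal


def proportion_tests(cluster_flags, population_proportion,
                     draw_size=30, iterations=2000, seed=0):
    """Return (p-value of Test 1, p-value of Test 2) for one label."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    p_hat = bootstrap_proportion(cluster_flags, draw_size, iterations, rng)
    p_test1 = upper_tail_p(p_hat, 0.5, draw_size)
    p_test2 = upper_tail_p(p_hat, population_proportion, draw_size)
    return p_test1, p_test2


# Hypothetical cluster of 30 samples: 26 belong to the subspecies
# (40% of all samples), 18 of them to genotype A (20% of all samples).
subspecies_p = proportion_tests([True] * 26 + [False] * 4, 0.40)
genotype_a_p = proportion_tests([True] * 18 + [False] * 12, 0.20)

print("subspecies:", subspecies_p, "combined:", max(subspecies_p))
print("genotype A:", genotype_a_p, "combined:", max(genotype_a_p))
```

Taking `max` of the two p-values follows the later comment in this thread about what the committed fix returns. In this made-up example genotype A passes Test 2 but not Test 1, so only the subspecies label is reported as significant.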

Test for phased edges: Use a Hotelling 2D t-test (as suggested by @JohnHadish) for a given factor (class) of labels. Thus, for the example discussed above, if an edge is specific to a subspecies, then we can test it for phased genotype edges using this test.
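A minimal sketch of a two-sample Hotelling T-squared test in two dimensions is shown below, in plain Python. It is not the R script that was added to KINC. It assumes each sample is an (x, y) pair of expression values for the two genes of the edge and that both groups share a covariance matrix, so it only detects a shift in the mean. The function names and the data are hypothetical. For two dimensions the F statistic has 2 numerator degrees of freedom, so its upper tail has the closed form (1 + 2F/d)^(-d/2) with d = n1 + n2 - 3.

```python
def mean_and_scatter(points):
    """Mean vector and sums of squared/cross deviations of 2D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return (mx, my), (sxx, sxy, syy)


def hotelling_t2_2d(group_a, group_b):
    """Two-sample Hotelling T^2 test for 2D points; returns (T^2, F, p-value)."""
    n1, n2 = len(group_a), len(group_b)
    (ax, ay), (axx, axy, ayy) = mean_and_scatter(group_a)
    (bx, by), (bxx, bxy, byy) = mean_and_scatter(group_b)

    # Pooled covariance matrix [[sxx, sxy], [sxy, syy]].
    dof = n1 + n2 - 2
    sxx, sxy, syy = (axx + bxx) / dof, (axy + bxy) / dof, (ayy + byy) / dof
    det = sxx * syy - sxy * sxy
    if det <= 0.0:
        raise ValueError("pooled covariance matrix is singular")

    # Quadratic form d' S^-1 d using the explicit 2x2 inverse.
    dx, dy = ax - bx, ay - by
    quad = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    t2 = (n1 * n2) / (n1 + n2) * quad

    d2 = n1 + n2 - 3                       # denominator degrees of freedom
    f_stat = d2 / (2.0 * dof) * t2         # F ~ F(2, d2) under Ho
    p_value = (1.0 + 2.0 * f_stat / d2) ** (-d2 / 2.0)
    return t2, f_stat, p_value


# Hypothetical expression pairs for two genotypes within one edge.
genotype_a = [(1.0, 2.1), (1.2, 1.9), (0.9, 2.0), (1.1, 2.2), (1.0, 1.8)]
genotype_b = [(x + 2.0, y + 2.0) for x, y in genotype_a]  # shifted copy

t2, f_stat, p_value = hotelling_t2_2d(genotype_a, genotype_b)
print("T^2:", t2, "F:", f_stat, "p:", p_value)
```

A small p-value here suggests the two genotypes sit at different positions within the edge, which is one reading of "phased". A difference in variance alone would not be picked up by this sketch.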

spficklin commented 4 years ago

Whoops. I accidentally committed a fix for this directly to the develop branch instead of to a new branch for PR review. @JohnHadish can you test the develop branch as if it were a PR, just as a sanity check? Thanks...

spficklin commented 4 years ago

With this fix, the hypergeometric test is replaced in KINC with a function that performs the two tests of proportions described above. The maximum p-value from the two tests is returned. The code for the hypergeometric test is still in KINC in case we want to go back to it, but it's not being called.

Also, a new Rscript was added for finding phased edges. It requires KINC.R v1.2, so you have to upgrade KINC.R to test this.

JohnHadish commented 4 years ago

Approved