AlineTalhouk / diceR

Diverse Cluster Ensemble in R
https://alinetalhouk.github.io/diceR/

Framework for hypothesis testing #18

Open AlineTalhouk opened 8 years ago

AlineTalhouk commented 8 years ago

Derive a probabilistic assessment of cluster assignment.

dchiu911 commented 7 years ago

How do we do this exactly?

dchiu911 commented 7 years ago

@AlineTalhouk please let me know when you are free to discuss this

AlineTalhouk commented 7 years ago

Hi @dchiu911, life has been a little crazy. How about this afternoon around 4? You should also come to the seminar at 1pm.

dchiu911 commented 7 years ago

What's the seminar?

dchiu911 commented 7 years ago

Not too much headway so far.

The pcNormal description in the Nature paper (Senbabaoglu et al., 2014) doesn't really describe a hypothesis testing framework.

The sigclust package originally tests the statistical significance of splitting a data set into two clusters using k-means. I have modified it to handle k > 2, but I am still unsure whether it is doing what we think. Experimental code here.

There is also a bayesclust package that I haven't investigated. Download the source tarball from the last link below and install it with:

install.packages("path/to/file/bayesclust_3.1.tar.gz", repos = NULL, type = "source")

References:
http://www.tandfonline.com/doi/pdf/10.1198/016214508000000454?needAccess=true&
https://github.com/pkimes/sigclust2
https://arxiv.org/pdf/1610.01424.pdf
http://www.stat.ufl.edu/archived/casella/Papers/FuentesandCasella.pdf
https://www.jstatsoft.org/article/view/v047i14
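For discussion, here is a minimal base-R sketch of the general idea behind a two-cluster significance test of the kind sigclust performs: compute the 2-means cluster index (within-cluster sum of squares over total sum of squares) and compare it with the same index on data simulated from a single Gaussian. This is not the sigclust implementation. The function names `cluster_index` and `mc_cluster_test` are hypothetical, and the null covariance is assumed to use the raw sample eigenvalues with no thresholding.

```r
# Two-means cluster index: within-cluster SS divided by total SS.
# Smaller values mean the two-cluster split explains more of the variation.
cluster_index <- function(x) {
  km <- kmeans(x, centers = 2, nstart = 10)
  km$tot.withinss / km$totss
}

# Monte Carlo test of "one Gaussian" against "two clusters".
# Assumption: the null is a Gaussian whose covariance has the sample
# eigenvalues of x (no soft or hard thresholding is applied here).
mc_cluster_test <- function(x, nsim = 200) {
  x <- as.matrix(x)
  n <- nrow(x)
  p <- ncol(x)
  observed <- cluster_index(x)
  ev <- eigen(cov(x), symmetric = TRUE, only.values = TRUE)$values
  sds <- sqrt(pmax(ev, 0))  # guard against tiny negative eigenvalues
  null_ci <- replicate(nsim, {
    z <- sweep(matrix(rnorm(n * p), n, p), 2, sds, "*")
    cluster_index(z)
  })
  # Empirical p-value: share of null indices at least as small as observed
  list(observed = observed,
       p_value = (1 + sum(null_ci <= observed)) / (nsim + 1))
}

# Toy example: two well-separated groups in two dimensions
set.seed(1)
two_groups <- rbind(matrix(rnorm(100, mean = 0), 50, 2),
                    matrix(rnorm(100, mean = 6), 50, 2))
res <- mc_cluster_test(two_groups, nsim = 100)
res$p_value
```

With an empirical p-value of this form, the smallest attainable value is 1 / (nsim + 1), so `nsim` bounds how small a reported p-value can get.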

dchiu911 commented 7 years ago

Setting the parameter icovest = 2 in my_sigclust() seems to yield more reasonable p-values: values between 0 and 1 rather than just 0 or 1. The description of the options is:

There are three options for estimating the eigenvalues of the covariance matrix:
1. Soft Thresholding (recommended for high dimensions, when the diagnostics indicate assumptions are met).
2. Sample eigenvalues (recommended for low dimensions, and when assumptions, such as Gaussianity fail, but known to be generally conservative).
3. Hard Thresholding.

Since we have n > p for data(hgsc), option 2 seems to work better than option 1, which matches what I had noticed.

Update: Option 1 seems more robust.

dchiu911 commented 7 years ago

@AlineTalhouk please review again regarding the hypothesis testing

AlineTalhouk commented 7 years ago

@dchiu911 icovest = 2 seems reasonable. I just read https://arxiv.org/pdf/1610.01424.pdf and that makes sense to me as a framework.

AlineTalhouk commented 7 years ago

We will probably need to run some simulations to see whether we are actually detecting anything or not.
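One possible shape for such a simulation, as a rough base-R sketch: generate data with and without a planted two-group shift, then record how often a test rejects at a given level. Everything here is an assumption for illustration. The helpers `pval_two_means`, `simulate_data`, and `rejection_rate` are hypothetical names, the test is a simple Monte Carlo comparison of the 2-means cluster index against a fitted Gaussian null (a stand-in, not my_sigclust()), and the sample sizes and shift are arbitrary.

```r
# Hypothetical stand-in test: p-value for "one Gaussian" vs. "two clusters",
# comparing the 2-means cluster index with draws from a Gaussian null that
# uses the sample eigenvalues of x.
pval_two_means <- function(x, nsim = 50) {
  ci <- function(m) {
    km <- kmeans(m, centers = 2, nstart = 5)
    km$tot.withinss / km$totss
  }
  ev <- eigen(cov(x), symmetric = TRUE, only.values = TRUE)$values
  sds <- sqrt(pmax(ev, 0))
  null_ci <- replicate(
    nsim,
    ci(sweep(matrix(rnorm(length(x)), nrow(x)), 2, sds, "*"))
  )
  (1 + sum(null_ci <= ci(x))) / (nsim + 1)
}

# n points in p dimensions; the first half is shifted by `delta`
# along the first coordinate (delta = 0 gives the null case).
simulate_data <- function(n, p, delta) {
  x <- matrix(rnorm(n * p), n, p)
  shifted <- seq_len(n) <= n / 2
  x[shifted, 1] <- x[shifted, 1] + delta
  x
}

# Proportion of simulated data sets on which the test rejects at level alpha
rejection_rate <- function(delta, n = 60, p = 5, nrep = 20, alpha = 0.05) {
  mean(replicate(nrep, pval_two_means(simulate_data(n, p, delta)) < alpha))
}

set.seed(42)
rates <- sapply(c(null = 0, separated = 6), rejection_rate)
rates
```

The `null` entry estimates the false positive rate and the `separated` entry estimates power for that shift. Swapping `pval_two_means` for the real test and varying `delta`, `n`, and `p` would give a picture of when clusters are detected.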