Methylation data imputation

yuqimiao commented 2 years ago

I benchmarked the two imputation methods using the following procedure:

Run a small imputation benchmark on KNN and softImpute(SVD)
1. Subset the first 10,000 CpG sites and the 300 samples with complete data, giving a full 10,000 × 300 matrix.
2. Randomly sample 100 rows and 50 columns to set to NA.
3. KNN
  1. Set the grid k = 5, 15, ..., 45.
  2. Impute and calculate the MSE for each k.
4. softImpute
  1. Use rank (maximum rank in the SVD) = 50 and lambda (hyperparameter for the nuclear norm) = 30 (default); see [here](https://cran.r-project.org/web/packages/softImpute/softImpute.pdf), page 11.
  2. Complete the data.
5. Repeat the missing-value sampling 30 times and build a comparison matrix, with one row per sampling run and one column per imputation setting, holding its MSE.
Impute the methylation data using softImpute
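
For readers who want to reproduce the shape of this benchmark, here is a minimal sketch in Python with NumPy. It is not the R code used above: `knn_impute` and `soft_impute` are simplified stand-ins for `impute.knn` and `softImpute`, all function names are hypothetical, the data are a synthetic low-rank matrix rather than methylation values, and step 2 is read as masking the 100 × 50 sub-block where the sampled rows and columns intersect (an assumption).

```python
import numpy as np


def mask_block(full, n_rows, n_cols, rng):
    """Set a random n_rows x n_cols sub-block to NaN (assumed reading of step 2)."""
    rows = rng.choice(full.shape[0], size=n_rows, replace=False)
    cols = rng.choice(full.shape[1], size=n_cols, replace=False)
    masked = np.array(full, dtype=float)  # float copy so NaN can be stored
    masked[np.ix_(rows, cols)] = np.nan
    return masked


def knn_impute(x, k):
    """Fill each incomplete row from its k nearest complete rows (simplified KNN)."""
    missing = np.isnan(x)
    out = x.copy()
    complete = np.where(~missing.any(axis=1))[0]
    for i in np.where(missing.any(axis=1))[0]:
        obs = ~missing[i]
        # Distance to every complete row, using only the columns observed in row i.
        diff = x[complete][:, obs] - x[i, obs]
        dist = np.sqrt(np.mean(diff ** 2, axis=1))
        nearest = complete[np.argsort(dist)[:k]]
        out[i, ~obs] = x[nearest][:, ~obs].mean(axis=0)
    return out


def soft_impute(x, rank_max, lam, n_iter=100, tol=1e-6):
    """Iterative soft-thresholded SVD, a simplified stand-in for softImpute."""
    missing = np.isnan(x)
    filled = np.where(missing, 0.0, x)  # start with zeros at the missing entries
    for _ in range(n_iter):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s[:rank_max] - lam, 0.0)  # shrink the singular values by lam
        low_rank = (u[:, :rank_max] * s) @ vt[:rank_max]
        new = np.where(missing, low_rank, x)  # observed entries stay fixed
        change = np.sum((new - filled) ** 2) / max(np.sum(filled ** 2), 1e-12)
        filled = new
        if change < tol:
            break
    return filled


def mse(full, imputed):
    """MSE between the full matrix and the imputed matrix."""
    return float(np.mean((full - imputed) ** 2))


def benchmark(full, n_trials, k_grid, rank_max, lam, n_rows, n_cols, seed=0):
    """Rows: sampling runs. Columns: one per k in k_grid, then the SVD method."""
    rng = np.random.default_rng(seed)
    results = np.empty((n_trials, len(k_grid) + 1))
    for t in range(n_trials):
        masked = mask_block(full, n_rows, n_cols, rng)
        for j, k in enumerate(k_grid):
            results[t, j] = mse(full, knn_impute(masked, k))
        results[t, -1] = mse(full, soft_impute(masked, rank_max, lam))
    return results


# Toy usage on a synthetic rank-5 matrix (200 "sites" x 60 "samples").
rng = np.random.default_rng(1)
toy = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 60))
results = benchmark(toy, n_trials=2, k_grid=[5, 15], rank_max=5, lam=1.0,
                    n_rows=20, n_cols=10)
```

Because the toy matrix is exactly low rank, it favors the SVD-based method by construction, so this only illustrates the bookkeeping and says nothing about real methylation data.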

Across the 30 trials, softImpute works much better than the KNN-based methods (I'm not sure why yet).

The problem is that we have ~400K rows in the methylation matrix, which is more than either algorithm can handle. Should I split the rows into blocks and impute block by block (the strategy used here: https://www.rdocumentation.org/packages/impute/versions/1.46.0/topics/impute.knn, though it would need further blocking)? Or can I just use the per-row mean for an easy and fast imputation?
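
The two options in this question can be sketched as follows. This is an illustrative Python/NumPy sketch with hypothetical names (`row_mean_impute`, `impute_in_row_blocks`), not code from the `impute` or `softImpute` packages; the block boundaries here are simply consecutive row ranges, which is an assumption.

```python
import numpy as np


def row_mean_impute(x):
    """Replace each NaN with the mean of the observed values in its row."""
    row_means = np.nanmean(x, axis=1, keepdims=True)
    return np.where(np.isnan(x), row_means, x)


def impute_in_row_blocks(x, block_size, impute_fn):
    """Apply impute_fn independently to consecutive blocks of rows."""
    out = np.empty(x.shape, dtype=float)
    for start in range(0, x.shape[0], block_size):
        stop = min(start + block_size, x.shape[0])
        out[start:stop] = impute_fn(x[start:stop])
    return out


# Tiny example: 3 rows imputed in blocks of 2 rows.
x = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [np.nan, 2.0, 2.0]])
imputed = impute_in_row_blocks(x, block_size=2, impute_fn=row_mean_impute)
```

Any function that takes a matrix with NaNs and returns a completed matrix of the same shape could be passed as `impute_fn`, so the row-mean baseline and a low-rank method can share the same blocking code.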

gaow commented 2 years ago

Thanks @yuqimiao, this is a great outline. Are there more detailed notes on the procedure and the results? E.g., what do you mean by "much better"?

Since you've already explored this much, I think we should stick to softImpute and do it per chrom if you have to partition it. Without partition, I wonder if a smarter Python implementation of softImpute would work for larger data?
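
A per-chromosome partition could look roughly like the sketch below (Python/NumPy). The function name `impute_per_group`, the `chrom` label vector, and the row-mean placeholder imputer are all hypothetical; in practice the placeholder would be swapped for a softImpute call on each chromosome's sub-matrix.

```python
import numpy as np


def impute_per_group(x, groups, impute_fn):
    """Impute the rows of each group (e.g. chromosome) separately, keeping row order."""
    groups = np.asarray(groups)
    out = np.empty(x.shape, dtype=float)
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]  # rows (CpG sites) belonging to this group
        out[idx] = impute_fn(x[idx])
    return out


# Placeholder imputer: fill NaNs with the row mean of the observed values.
placeholder = lambda b: np.where(np.isnan(b), np.nanmean(b, axis=1, keepdims=True), b)

# Tiny example: 4 CpG rows spread over two chromosomes.
x = np.array([[1.0, np.nan],
              [np.nan, 4.0],
              [3.0, 3.0],
              [2.0, np.nan]])
chrom = ["chr1", "chr2", "chr1", "chr2"]
imputed = impute_per_group(x, chrom, placeholder)
```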

yuqimiao commented 2 years ago

The comparison matrix:

Here the numeric column names are the k grid for KNN, and the measure is the MSE between the full matrix and the imputed matrix.

I can try the per-chromosome imputation first and try Python in the meantime.

gaow commented 2 years ago

@yuqimiao Indeed softImpute is a lot better! Please make a notebook (Rmd or ipynb) to formally document the comparison and push it to GitHub.

cumc / xqtl-protocol

Methylation data imputation #381