Open yuqimiao opened 2 years ago
Thanks @yuqimiao, this is a great outline. Are there more detailed notes on the procedure and the results? E.g., what do you mean by "much better"?
Since you've already explored this much, I think we should stick to softImpute and run it per chromosome if you have to partition the data. Without partitioning, I wonder whether a smarter Python implementation of softImpute would work for larger data?
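For reference, a minimal NumPy sketch of the soft-impute idea (iterative soft-thresholded SVD) might look like the following. This is not the code of the R `softImpute` package; the function name `soft_impute` and the parameters `lam`, `max_iter` and `tol` are assumptions made for illustration. It runs a full SVD on every iteration, so it would likely not scale to the full matrix as written and would need a truncated or sparse SVD for larger data.

```python
import numpy as np

def soft_impute(X, lam=1.0, max_iter=100, tol=1e-5):
    """Fill NaNs in X by iterative soft-thresholded SVD (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    Z = np.zeros_like(X)  # current low-rank estimate of the full matrix
    for _ in range(max_iter):
        # Keep observed values; use the current estimate for missing ones.
        filled = np.where(observed, X, Z)
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        # Soft-threshold the singular values by lam.
        s = np.maximum(s - lam, 0.0)
        Z_new = (U * s) @ Vt
        # Relative change between successive estimates.
        change = np.linalg.norm(Z_new - Z) / max(np.linalg.norm(Z), 1e-12)
        Z = Z_new
        if change < tol:
            break
    # Observed entries are returned unchanged; only NaNs are replaced.
    return np.where(observed, X, Z)
```

The penalty `lam` controls how strongly singular values are shrunk; very small values make the iteration converge slowly, so in practice it would have to be tuned for the data.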
The comparison matrix:
Here the numeric column names are the k grid for KNN, and the measure is the MSE between the full matrix and the imputed matrix. I can try the per-chromosome imputation first and try Python in the meantime.
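As a rough illustration of this kind of benchmark, the sketch below hides a random fraction of entries from a complete matrix, imputes them, and reports the MSE averaged over trials. Everything here is an assumption for illustration: the helper names, the 10% hidden fraction, scoring only the hidden entries, and the two stand-in imputers (column mean and zero fill), which are placeholders for KNN and softImpute rather than the methods actually compared.

```python
import numpy as np

def hide_entries(full, frac, rng):
    """Copy `full` and set a random fraction of its entries to NaN."""
    mask = rng.random(full.shape) < frac
    X = full.astype(float).copy()
    X[mask] = np.nan
    return X, mask

def masked_mse(full, imputed, mask):
    """MSE between true and imputed values on the hidden entries only."""
    return float(np.mean((full[mask] - imputed[mask]) ** 2))

def benchmark(full, imputers, frac=0.1, n_trials=30, seed=0):
    """Average masked MSE per imputer over repeated random maskings."""
    rng = np.random.default_rng(seed)
    scores = {name: [] for name in imputers}
    for _ in range(n_trials):
        X, mask = hide_entries(full, frac, rng)
        for name, fn in imputers.items():
            scores[name].append(masked_mse(full, fn(X), mask))
    return {name: float(np.mean(vals)) for name, vals in scores.items()}

# Stand-in imputers (hypothetical placeholders, not KNN or softImpute).
def col_mean_impute(X):
    return np.where(np.isnan(X), np.nanmean(X, axis=0), X)

def zero_impute(X):
    return np.where(np.isnan(X), 0.0, X)

if __name__ == "__main__":
    full = np.tile(np.arange(10.0), (20, 1))  # toy complete matrix
    print(benchmark(full, {"col_mean": col_mean_impute, "zero": zero_impute}))
```

Any function that takes a matrix with NaNs and returns a filled matrix of the same shape can be passed in the `imputers` dictionary, so the same loop could wrap one KNN imputer per value of k.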
@yuqimiao Indeed softImpute is a lot better! Please make a notebook (Rmd or ipynb) to formally document the comparison and push it to GitHub.
I ran a benchmark of the 2 imputation methods using the following procedure:
1. Run a small imputation benchmark on KNN and softImpute (SVD)
2. Impute the methylation data using softImpute
Across the 30 trials, softImpute works much better than the KNN-based method (I'm not sure of the reason yet).
The problem is that we have ~400K rows in the methylation matrix, which is more than either algorithm can handle. Should I split the rows into blocks and impute block by block (the strategy used here: https://www.rdocumentation.org/packages/impute/versions/1.46.0/topics/impute.knn, though it would need further blocking)? Or should I just use the per-row mean for an easy and fast imputation?
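Both options can be sketched in a few lines of NumPy. The names `row_mean_impute` and `impute_by_blocks`, and the per-row block labels (e.g. one chromosome label per row), are hypothetical and only for illustration. The row-mean function doubles as a stand-in for whichever per-block imputer is actually used.

```python
import numpy as np

def row_mean_impute(X):
    """Replace each NaN with the mean of the observed values in its row."""
    X = np.asarray(X, dtype=float)
    counts = (~np.isnan(X)).sum(axis=1)
    sums = np.nansum(X, axis=1)
    # Rows with no observed values keep NaN as their fill value.
    row_means = np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)
    return np.where(np.isnan(X), row_means[:, None], X)

def impute_by_blocks(X, block_labels, impute_fn):
    """Apply `impute_fn` separately to the rows of each block."""
    X = np.asarray(X, dtype=float)
    block_labels = np.asarray(block_labels)
    out = np.empty_like(X)
    for label in np.unique(block_labels):
        rows = np.flatnonzero(block_labels == label)
        out[rows] = impute_fn(X[rows])
    return out

if __name__ == "__main__":
    X = np.array([[1.0, np.nan, 3.0],
                  [np.nan, 4.0, 4.0],
                  [2.0, 2.0, np.nan]])
    labels = ["chr1", "chr1", "chr2"]  # hypothetical per-row block labels
    print(impute_by_blocks(X, labels, row_mean_impute))
```

Because each block is imputed independently, blocks could also be processed in parallel or further split if a single chromosome is still too large.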