WangLab-MSSM / DreamAI

Imputation of missing values of a matrix or data.frame using iterative prediction model
Apache License 2.0

DreamAI

Authors

Shrabanti Chowdhury1, Weiping Ma1, Sunkyu Kim2, Zhi Li3, Thomas Yu4, Mi Yang5,6, Francesca Petralia1, Jeremy Jacobsen7, Jingyi Jessica Li8, Xinzhou Ge8, Kexin Li9, Nathan Edwards10, Samuel Payne11, Henry Rodriguez12, Paul Boutros13, Gustavo Stolovitzky14, Jaewoo Kang2, David Fenyo3, Julio Saez-Rodriguez,6,15, Pei Wang1

1Icahn School of Medicine at Mount Sinai (USA), 2Department of Computer Science and Engineering, Korea University (South Korea), 3New York University (USA), 4Sage Bionetworks (USA), 5Heidelberg University, Faculty of Biosciences (Germany), 6RWTH Aachen University (Germany), 7University of Colorado (USA), 8Department of Statistics, University of California (USA), 9Department of Mathematics, Tsinghua University (China), 10Georgetown University (USA), 11Pacific Northwest National Laboratory (USA), 12National Cancer Institute (USA), 13Ontario Institute of Cancer Research (Canada), 14IBM Research & Mount Sinai (USA), 15European Molecular Biology Laboratory-European Bioinformatics Institute (UK)

Overview

To develop powerful computational tools that extract the most information from the proteome, the Clinical Proteomic Tumor Analysis Consortium (CPTAC) and the DREAM organization launched the NCI-CPTAC DREAM Proteogenomics Challenge in 2016. One of its subchallenges was to impute missing values in proteomics data given the observed proteins.

In this challenge, participants were invited to develop imputation algorithms suited to proteomics data. With their help, an optimal imputation method, DreamAI, was assembled as an ensemble as an outcome of the challenge.

In DreamAI, the ensemble imputation matrix is obtained by averaging the results of six imputation algorithms: those of the top three teams in the challenge (SpectroFM: Team DMIS_PTG; RegImpute: Team Jeremy Jacobsen; Birnn: Team BruinGo) and three baseline algorithms (KNN, missForest, ADMIN). Bootstrap aggregating (bagging) is also adopted to stabilize estimation and improve the accuracy of the machine learning algorithms.
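The averaging step can be illustrated with a minimal sketch in base R. It assumes each algorithm has already returned a fully imputed matrix of the same dimensions. The helper name ensemble_average, the fill_with function and the toy matrices are hypothetical and are not part of the DreamAI package, whose internal implementation may differ.

```r
# Hypothetical helper: element-wise average of several imputed matrices.
# Assumes all matrices share the same dimensions and contain no NAs.
ensemble_average <- function(imputed_list) {
  dims <- lapply(imputed_list, dim)
  stopifnot(length(imputed_list) > 0,
            all(vapply(dims, function(d) identical(d, dims[[1]]), logical(1))))
  Reduce(`+`, imputed_list) / length(imputed_list)
}

# Toy example: one matrix with a single missing entry, filled in by three
# made-up "algorithms" that disagree on the imputed value.
observed <- matrix(c(1, 2, 3, NA), nrow = 2)
fill_with <- function(value) {
  m <- observed
  m[is.na(m)] <- value
  m
}
imputed <- list(A = fill_with(3.0), B = fill_with(4.0), C = fill_with(5.0))

# Observed entries are identical across methods, so averaging leaves them
# unchanged; only the imputed entry is a blend of the three estimates.
ensemble <- ensemble_average(imputed)
print(ensemble)
```

Because every method keeps the observed values as they are, the average differs between methods only at the positions that were originally missing.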

Through the out argument, the function gives the user the flexibility to select the imputed matrix from the ensemble method or from each individual algorithm.

Installation

Packages required prior to installing DreamAI

require("cluster")
require("survival")
require("randomForest")
require("missForest")
require("glmnet")
require("Rcpp")
require("foreach")
require("itertools")
require("iterators")
require("Matrix")
require("devtools")
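# "impute" is distributed through Bioconductor rather than CRAN
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
# Installs from the Bioconductor release that matches the running R version
BiocManager::install("impute")
require("impute")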

Install DreamAI

require("remotes")
install_github("WangLab-MSSM/DreamAI/Code")

Usage

DreamAI(data, k = 10, maxiter_MF = 10, ntree = 100,
  maxnodes = NULL, maxiter_ADMIN = 30, tol = 10^(-2),
  gamma_ADMIN = NA, gamma = 50, CV = FALSE,
  fillmethod = "row_mean", maxiter_RegImpute = 10,
  conv_nrmse = 1e-06, iter_SpectroFM = 40, method = c("KNN",
  "MissForest", "ADMIN", "Birnn", "SpectroFM", "RegImpute"),
  out = c("Ensemble"))

Arguments

| Parameter | Default | Description |
|---|---|---|
| data | | dataset in the form of a matrix or data frame with missing values (NAs). The function throws an error and stops if any row or column of the dataset is missing all values |
| k | 10 | number of neighbors used in the imputation by "KNN" and "ADMIN" |
| maxiter_MF | 10 | maximum number of iterations performed in the imputation by "MissForest" if the stopping criterion is not met beforehand |
| ntree | 100 | number of trees to grow in each forest in "MissForest" |
| maxnodes | NULL | maximum number of terminal nodes for trees in the forest in "MissForest"; must be at least the number of columns in the given data |
| maxiter_ADMIN | 30 | maximum number of iterations performed in the imputation by "ADMIN" if the stopping criterion is not met beforehand |
| tol | 10^(-2) | convergence threshold for "ADMIN" |
| gamma_ADMIN | NA | parameter for "ADMIN" that controls abundance-dependent missingness. Set gamma_ADMIN=0 for log-ratio intensity data. For abundance data, set gamma_ADMIN=NA and it will be estimated accordingly |
| gamma | 50 | parameter of the supergradients of popular nonconvex surrogate functions of the L0-norm (e.g. SCAD and MCP) for "Birnn" |
| CV | FALSE | a logical value indicating whether to fit the best gamma with cross-validation for "Birnn". If CV=FALSE, the default gamma=50 is used; if CV=TRUE, gamma is calculated using cross-validation |
| fillmethod | "row_mean" | a string identifying the simple imputation method used to initially fill the missing values for "RegImpute". It can be "row_mean" or "zeros", with "row_mean" being the default. A warning is thrown if "row_median" is used |
| maxiter_RegImpute | 10 | maximum number of iterations to reach convergence in the imputation by "RegImpute" |
| conv_nrmse | 1e-06 | convergence threshold for "RegImpute" |
| iter_SpectroFM | 40 | number of iterations for "SpectroFM" |
| method | c("KNN", "MissForest", "ADMIN", "Birnn", "SpectroFM", "RegImpute", "Ensemble") | a vector of imputation methods: ("KNN", "MissForest", "ADMIN", "Birnn", "SpectroFM", "RegImpute", "Ensemble"). Default is "Ensemble" if nothing is specified |
| out | c("Ensemble") | a vector of imputation methods for which the function will output the imputed matrices |
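
The requirement on data (no row or column may be missing all values) can be checked before starting a long run. The sketch below uses only base R. The helper name find_all_missing and the toy data frame are hypothetical and are not part of DreamAI, whose own check may differ in detail.

```r
# Hypothetical pre-check: report rows and columns that are entirely NA.
find_all_missing <- function(data) {
  na_mask <- is.na(as.matrix(data))
  list(rows = which(rowSums(na_mask) == ncol(na_mask)),
       cols = which(colSums(na_mask) == nrow(na_mask)))
}

# Toy data: row 2 is completely missing, so it violates the requirement.
toy <- data.frame(s1 = c(1.2, NA, 0.7),
                  s2 = c(NA, NA, 1.1),
                  s3 = c(0.4, NA, 0.9))
bad <- find_all_missing(toy)
print(bad)

# Drop the offending rows and columns, then re-run the check, since removing
# rows can in general leave a column with no observed values.
clean <- toy[setdiff(seq_len(nrow(toy)), bad$rows),
             setdiff(seq_len(ncol(toy)), bad$cols), drop = FALSE]
stopifnot(length(unlist(find_all_missing(clean))) == 0)
```

Whether to drop such rows or to handle them another way is a choice for the analyst; the sketch only shows how to locate them.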

Value

A list of imputed datasets, one for each method specified by the user. The data imputed by "Ensemble" is always returned.

Note

If all methods are used to obtain the "Ensemble" imputed matrix, imputing a dataset of dimension 26000 x 200 takes approximately 50 hours.

Examples

library(DreamAI)
data(datapnnl)
data <- datapnnl.rm.ref[1:100, 1:21]  # first 100 rows and 21 columns
impute <- DreamAI(data, k = 10, maxiter_MF = 10, ntree = 100, maxnodes = NULL,
  maxiter_ADMIN = 30, tol = 10^(-2), gamma_ADMIN = NA, gamma = 50, CV = FALSE,
  fillmethod = "row_mean", maxiter_RegImpute = 10, conv_nrmse = 1e-6, iter_SpectroFM = 40,
  method = c("KNN", "MissForest", "ADMIN", "Birnn", "SpectroFM", "RegImpute"), out = "Ensemble")
impute$Ensemble  # imputed matrix from the ensemble method

Contributions

If you find small bugs or larger issues, or have suggestions, please email the maintainers at shrabanti.chowdhury@mssm.edu or weiping.ma@mssm.edu. Contributions (via pull requests or otherwise) are welcome.