We need a preprocessing pipeline that assesses data quality and cleans the data before the predictor is run. Currently there is no such mechanism in place. The pipeline would run the following operations:
Identify structure in the missingness of the data.
Identify and flag outlier samples.
Run some unsupervised analyses on the samples, e.g. PCA and hierarchical clustering.
For continuous-valued data, compare several similarity metrics to find the one that best separates the classes, e.g. RNAcorr.R, written by SP for PanCancer.
Run hierarchical clustering and PCA on the classes, following the same idea.
Run a univariate test to prune the matrix of variables that goes into netDx.
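
As a rough illustration of three of these steps (missingness summary, outlier flagging, univariate pruning), here is a minimal sketch in plain Python. Everything in it is an assumption made for discussion: the function names are hypothetical and not part of netDx, the data layout is a list of sample rows with `None` for missing values, the outlier rule is a median/MAD modified z-score on per-sample means, and the univariate test is Welch's t statistic for exactly two classes. The thresholds are placeholders, not recommendations.

```python
import math
from statistics import mean, median, variance


def missing_fractions(matrix):
    """Return (per-sample, per-variable) fractions of missing (None) values."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    per_sample = [sum(v is None for v in row) / n_cols for row in matrix]
    per_variable = [
        sum(row[j] is None for row in matrix) / n_rows for j in range(n_cols)
    ]
    return per_sample, per_variable


def flag_outlier_samples(matrix, threshold=3.5):
    """Flag samples whose mean observed value is far from the cohort median.

    Uses a modified z-score based on the median absolute deviation (MAD).
    Assumes every sample has at least one observed value.
    """
    sample_means = [mean(v for v in row if v is not None) for row in matrix]
    med = median(sample_means)
    mad = median(abs(m - med) for m in sample_means)
    if mad == 0:
        # No spread to measure against, so nothing is flagged.
        return [False] * len(matrix)
    return [abs(0.6745 * (m - med) / mad) > threshold for m in sample_means]


def welch_t(a, b):
    """Welch's t statistic for two groups with at least two values each."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def prune_variables(matrix, labels, min_abs_t=2.0):
    """Return indices of variables whose |t| between two classes is large."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    kept = []
    for j in range(len(matrix[0])):
        groups = [
            [row[j] for row, lab in zip(matrix, labels)
             if lab == c and row[j] is not None]
            for c in classes
        ]
        # Skip variables with too few observed values to estimate variance.
        if min(len(g) for g in groups) < 2:
            continue
        if abs(welch_t(groups[0], groups[1])) >= min_abs_t:
            kept.append(j)
    return kept
```

A real pipeline would likely do this in R alongside netDx and would need multi-class tests and multiple-testing correction; the sketch is only meant to make the shape of each step concrete.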