Imputation questions - Githubissues

mmulvahill commented 6 years ago

@kechrisk

One imputation method is the half-minimum approach, where we impute 0's with half the minimum value for that compound. Do we actually want to use the minimum across all patients & replicates, or should we do this using the minimum value within each patient?
In the old code, when BPCA imputes a number < 0, we replace that negative value with the half-minimum imputation -- resulting in a combined BPCA/half-min imputation method. Is this what we want to do? Or should these negative values be considered 'true missing'? Or should these be two separate options? (1. BPCA, with negative values assumed to be 0, and 2. BPCA, with negative values assumed to be below threshold)

For reference, from the manuscript:

Missing Data: There are three primary modes of missing data in metabolomics datasets and each mode has different implications for subsequent analysis; therefore, different imputation routines and statistical methods are required and three are offered in the MSPrep package. The three modes are truly not present, present below the detectable limit of the instrument and absent owing to error in pre-processing algorithms. The MSPrep package implements three methods of managing missing data: (i) No imputation assumes the mode of missing is true zeros and therefore assigns the missing values as zeros. This dataset could be useful for PCA analysis, cluster analysis and methods that account for clustering at zero. Unless a stringent filter is applied, normalization routines may have poor performance, as most have assumptions about underlying distributions that are not valid with zero clustered data. (ii) The second option assumes missing compounds were below the detectable limit and imputes a value of one half of the minimum observed value for that compound (Xia et al., 2009). (iii) The final method is a call to the Bayesian PCA (BPCA) imputation algorithm (Oba et al., 2003) from the PCAMethods R package (Stacklies et al., 2007) and assumes that the compound is present but failed to be accurately detected. This algorithm estimates the missing value by a linear combination of principal axis vectors, where the parameters of the model are identified by a Bayesian estimation method and is not sensitive to the quantity of missing data.

mmulvahill commented 6 years ago

via KK

For the imputation method, isn't this after the summarization step? In that case, you won't have multiple replicates per subject? If so, then use minimum across all subjects for that metabolite that have data (excluding zeros). If you want to separate the steps, then what do you think? Take the minimum of the replicates for the subject, then take the minimum of those that are not zero across subjects? I don't think there are hard established rules.
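
A minimal sketch of the half-minimum rule described above, written in Python for illustration rather than in the package's R. It assumes summarized data (one row per subject, one column per metabolite) in which a zero marks a missing value. The function name `half_min_impute` and the list-of-lists layout are hypothetical and are not part of MSPrep.

```python
def half_min_impute(abundances):
    """Replace zeros with half the smallest non-zero value in each column.

    `abundances` is a list of rows (subjects); each row holds one value per
    metabolite, with 0 marking a missing value (an assumption of this sketch).
    """
    n_compounds = len(abundances[0])
    imputed = [list(row) for row in abundances]  # copy so the input is untouched
    for j in range(n_compounds):
        # Minimum across all subjects that have data, excluding zeros.
        observed = [row[j] for row in abundances if row[j] > 0]
        if not observed:
            continue  # metabolite missing everywhere: no minimum to halve
        fill = min(observed) / 2
        for row in imputed:
            if row[j] == 0:
                row[j] = fill
    return imputed


# Made-up example values: 3 subjects, 3 metabolites.
data = [
    [10.0, 0.0, 4.0],
    [0.0, 6.0, 8.0],
    [20.0, 2.0, 0.0],
]
result = half_min_impute(data)
```

Computing the minimum per patient instead would only change which rows feed `observed`; with one summarized row per subject there is nothing to take a within-patient minimum over.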

I didn't remember this BPCA/half-min approach. How often does this happen? Is it frequent? I need to think about what's best in this scenario.

My responses

You're correct. What we do have in the example dataset is multiple spikes for each patient. I'm guessing we should average across patients and not spikes, but I'm not entirely clear on the role of spikes in mass spec. For now I don't necessarily want to separate the steps. A secondary goal is to define valid paths through the pipeline and to throw errors/warnings if the functions are used in an invalid order. For this goal we may want to separate them out.
So this actually occurs in the old code for both kNN and BPCA. I'm only working with one dataset, but it occurred for 2 of 200 missing values with BPCA. I've coded both a BPCA + set negative = 0 option (the default) and a BPCA + half-min option for now. I haven't come across any negative values for kNN.

mmulvahill commented 6 years ago

Remove spike from the pipeline -- users will have to create an ID that incorporates spike to differentiate a sample from others of the same patient.
Just use the bpca + half-min, taking the half-min excluding the imputed BPCA values. Get rid of the bpca + 0 approach.
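
The decision above can be sketched as a post-processing step on BPCA output. This is an illustrative Python sketch, not the package's R code: it assumes zeros mark missing values in the original matrix, takes an already-imputed matrix as input (BPCA itself is not run here), and uses the hypothetical name `replace_negative_imputations`. The half-minimum comes from the originally observed values only, so BPCA-imputed values never influence it.

```python
def replace_negative_imputations(original, bpca_imputed):
    """Swap negative BPCA imputations for half the observed column minimum.

    `original` uses 0 to mark missing values (an assumption of this sketch);
    `bpca_imputed` has the same shape with those cells filled in by BPCA.
    """
    n_compounds = len(original[0])
    result = [list(row) for row in bpca_imputed]
    for j in range(n_compounds):
        # Only values observed before imputation feed the half-minimum.
        observed = [row[j] for row in original if row[j] > 0]
        if not observed:
            continue  # no observed values: negative imputations are left as-is
        half_min = min(observed) / 2
        for i, row in enumerate(original):
            was_missing = row[j] == 0
            if was_missing and result[i][j] < 0:
                result[i][j] = half_min
    return result


# Made-up example values: 4 samples, 2 compounds.
original = [
    [10.0, 0.0],
    [0.0, 6.0],
    [20.0, 2.0],
    [0.0, 4.0],
]
bpca_output = [
    [10.0, 3.5],
    [-1.2, 6.0],   # negative imputation, to be replaced
    [20.0, 2.0],
    [12.0, 4.0],   # non-negative imputation, kept
]
fixed = replace_negative_imputations(original, bpca_output)
```

Non-negative imputed values pass through unchanged; only the negative ones fall back to the half-minimum.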

GET NEW DATASET -- ask Dominick for raw data NGH131 and Emory datasets

Emory dataset has no replicates, so 'summarization' not entirely necessary.

mmulvahill commented 6 years ago

Closing issue -- removed spike from pipeline (build broken due to other changes though) and removed bpca+0 approach

Also got new dataset from Dominik

KechrisLab / MSPrep

Imputation questions #14