biodavidjm / artMS

Analytical R Tools for Mass Spectrometry
GNU General Public License v3.0
14 stars 7 forks source link

Handling of missing values in PCA plots #171

Open bpolacco opened 3 years ago

bpolacco commented 3 years ago

I noticed recently that the artMS PCA plots were very different from my own PCA plots on the same dataset. The difference was in how missing values were treated. I remove rows with missing values. It looks like artMS sets missing values to 0. This has major consequences when the mean log2.intensity is about 25 and standard deviation on the order of ~1. Adding zeros injects huge and un-interesting variance that the PCA will work to display, to the detriment of displaying actual interesting variance in the data. Here's some toy code using random data to demonstrate

plotPCA  <- function(mat){
  pca <- prcomp(t(mat))
  pca.dt <- as.data.table(pca$x, keep.rownames = TRUE)
  print (ggplot (pca.dt, aes(x = PC1, y = PC2, label = rn)) + geom_label())
}

# set up a toy matrix
mat <- sapply (1:10, function(x)rnorm(26, mean = 20+x))
colnames(mat) <- LETTERS[1:10]
rownames(mat) <- paste0("protein.", letters)

# inject some missing data in an alternative matrix
mat.w0 <- mat
for (rn in sample(rownames(mat), 13)){
  mat.w0[rn, sample(colnames(mat), 3)] <- 0
}

# no missing data
# this looks as expected with a clear trend A->H in PC1
plotPCA(mat)

# with missing data
# this probably looks random in both PC1 and PC2 because its describing mostly the pattern of missing values
plotPCA(mat.w0)

I think the easiest way to deal with this is to limit the matrix to complete cases; complete.cases function is handy for this. This may discard too much data on large or noisy datasets or a dataset with one very sparse run, so a check and possible warning would be good. Alternatively, there are packages to impute missing values in PCA, but I don't know enough about them to recommend a single package, especially not one robust to all cases.

bpolacco commented 3 years ago

From my code above, here's run with no missing values:

image

Same but with missing values added: image

bpolacco commented 3 years ago

Another possible solution is to "center" the log2.intensity at zero before setting missing to zero, then the effect will be less. Looks good with my toy data, not sure with real world data yet.

mat <- t(scale(t(mat), center = TRUE, scale = FALSE))

biodavidjm commented 3 years ago

Thanks @bpolacco Let me carefully study it and get back to you