Open bpolacco opened 3 years ago
From my code above, here's a run with no missing values:
Same but with missing values added:
Another possible solution is to center each row's log2.intensity at zero before setting missing values to zero. A zero then sits at the row mean instead of ~25 units below it, so the effect is much smaller. It looks good with my toy data; I'm not sure about real-world data yet.
mat <- t(scale(t(mat), center = TRUE, scale = FALSE))
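A minimal sketch of that center-then-fill idea on simulated data, assuming `mat` has features in rows and runs in columns. The toy matrix and the names `centered` and `pca_centered` are made up for illustration and are not part of artMS. `scale()` skips NAs when it computes the column means, so the centering works before the NAs are filled in.

```r
set.seed(1)

# Toy data (assumption): 200 features x 6 runs, log2 intensities around 25
mat <- matrix(rnorm(200 * 6, mean = 25, sd = 1), nrow = 200)
mat[sample(length(mat), 60)] <- NA   # knock out 5% of the values

# Center each row (feature) at zero; NAs are ignored when computing the means
centered <- t(scale(t(mat), center = TRUE, scale = FALSE))

# A missing value now becomes the row mean (0) rather than an extreme outlier
centered[is.na(centered)] <- 0

# PCA with runs as observations
pca_centered <- prcomp(t(centered))
summary(pca_centered)
```

Note that this is equivalent to imputing each missing value with its row mean, which shrinks the variance of sparse rows; whether that is acceptable depends on the dataset.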
Thanks, @bpolacco. Let me study it carefully and get back to you.
I noticed recently that the artMS PCA plots were very different from my own PCA plots on the same dataset. The difference was in how missing values were treated: I remove rows with missing values, while artMS appears to set missing values to 0. This has major consequences when the mean log2.intensity is about 25 and the standard deviation is on the order of 1. Adding zeros injects huge, uninteresting variance that the PCA will work to display, to the detriment of the actual interesting variance in the data. Here's some toy code using random data to demonstrate:
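The original toy code did not survive in this copy of the thread. The following is an illustrative reconstruction of the effect, not the author's script: all object names, the group effect, and the 5% missingness rate are assumptions chosen for the demonstration.

```r
set.seed(1)

# Toy data (assumption): 500 features x 6 runs, mean 25, sd 1
n_features <- 500
group <- rep(c("A", "B"), each = 3)
mat <- matrix(rnorm(n_features * length(group), mean = 25, sd = 1),
              nrow = n_features)

# Give the first 50 features a real difference between the two groups
mat[1:50, group == "B"] <- mat[1:50, group == "B"] + 2

# PCA on the complete matrix (runs as observations)
pca_complete <- prcomp(t(mat))

# Remove 5% of values at random, then fill with 0 as described above
mat_zero <- mat
mat_zero[sample(length(mat_zero), round(0.05 * length(mat_zero)))] <- NA
mat_zero[is.na(mat_zero)] <- 0
pca_zero <- prcomp(t(mat_zero))

# Total variance the PCA has to explain in each case
total_var_complete <- sum(pca_complete$sdev^2)
total_var_zero <- sum(pca_zero$sdev^2)
c(complete = total_var_complete, zero_filled = total_var_zero)
```

With values near 25, a single zero in a row of six adds roughly 100 units of variance to that feature, so the zero-filled total is dominated by the missingness pattern rather than by the group effect.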
I think the easiest way to deal with this is to limit the matrix to complete cases; the `complete.cases` function is handy for this. This may discard too much data on large or noisy datasets, or on a dataset with one very sparse run, so a check and possible warning would be good. Alternatively, there are packages to impute missing values in PCA, but I don't know enough about them to recommend a single package, especially not one robust to all cases.
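A sketch of what the complete-cases filter with a warning might look like. `filter_complete_rows` and its `min_fraction` threshold are hypothetical names invented here, not an artMS function, and the 50% default is an arbitrary assumption.

```r
# Hypothetical helper: keep only rows with no missing values and warn
# when that discards more than (1 - min_fraction) of the rows.
filter_complete_rows <- function(mat, min_fraction = 0.5) {
  keep <- complete.cases(mat)
  frac_kept <- mean(keep)
  if (frac_kept < min_fraction) {
    warning(sprintf(
      "Only %.1f%% of rows are complete; check for sparse runs before PCA.",
      100 * frac_kept
    ))
  }
  mat[keep, , drop = FALSE]
}

# Toy usage (assumption): 200 features x 6 runs with 5% missing values
set.seed(1)
mat <- matrix(rnorm(200 * 6, mean = 25, sd = 1), nrow = 200)
mat[sample(length(mat), 60)] <- NA

missing_per_run <- colSums(is.na(mat))   # helps spot one very sparse run
mat_complete <- filter_complete_rows(mat)
pca <- prcomp(t(mat_complete))
```

Reporting `missing_per_run` alongside the warning would make it easier to see whether a single run is responsible for most of the discarded rows.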