PLN-team / PLNmodels

A collection of Poisson lognormal models for multivariate count data analysis
https://pln-team.github.io/PLNmodels
GNU General Public License v3.0
54 stars 18 forks source link

Predict newdata into PCA space #82

Closed maddyduran closed 3 years ago

maddyduran commented 3 years ago

Is there currently a way to predict new data into the PCA space returned by PLNPCA? I can find a rotation matrix but I can't seem to identify a center or scaling vector.

mahendra-mariadassou commented 3 years ago

Hello,

There is currently no proper way to do that. There are a few hacky ones but they lack sound statistical grounding.

To expand a bit, the main problem is that you need to compute the latent position of the newdata into the latent space (the so called variational parameters) while keeping the model parameters before applying the rotation matrix. For the PCA case, the current functions optimize both the variational parameters and the model parameters but not one or the other. However, It should be fairly doable as we do something similar for other analyses (PLN, PLNLDA) so I'll give a try and keep you updated.

mahendra-mariadassou commented 3 years ago

Hello,

I pushed a new development version of the package with the requested feature on the dev branch. You can install it on your computer with

remotes::install_github("pln-team/PLNmodels@dev")

And here is a short demo to show how it works (it's not super user-friendly yet, I'll work on that before inclusion in the next CRAN release but you may not want to wait until then)

library(PLNmodels)
library(ggplot2)
data("oaks")
## Learning set for PCA
learn <- sample(x = 1:nrow(oaks), size = ceiling(0.8*nrow(oaks)), replace = FALSE)
## newdata
newdata <- oaks[-learn, ]

## Compute PCA with 15 latent dimensions
myPCAmodels <- PLNPCA(Abundance ~ 1 + offset(log(Offset)), data = oaks[learn, ], ranks = 15)
myPCA <- myPCAmodels$getBestModel()

## Predict centered latent position of newdata
model <- model.frame(Abundance ~ 1 + offset(log(Offset)), data = newdata)
covariates <- model.matrix(terms(model), model)
offsets    <- model.offset(model)
responses  <- model[[1L]]
newM <- myPCA$VEstep(covariates = covariates, offsets = offsets, responses = responses)$M
## Compute scores positions in PCA space
new_latent_pos <- t(myPCA$model_par$B %*% t(newM)) %>% scale(center = TRUE, scale = FALSE)
new_scores <- new_latent_pos %*% myPCA$rotation
## Add dimnames and metadata for easier plotting.
dimnames(new_scores) <- list(rownames(newdata), paste0("PC", 1:ncol(new_scores)))
new_scores <- cbind(newdata[ , c("Sample", "tree")], as.data.frame(new_scores))

## fviz_pca for original data + 
p <- factoextra::fviz_pca_ind(myPCA, habillage = oaks[learn, "tree"], geom = "point", alpha.ind = 0.5, )
p +
  geom_point(data = new_scores, aes(x = PC1, y = PC2, color = tree)) +
  geom_text(data = new_scores, aes(x = PC1, y = PC2, color = tree, label = Sample), vjust = 0)

with the following results (original data with transparent colors, newdata with labels and solid colors) image

mahendra-mariadassou commented 3 years ago

Hi,

Just to let you know, the code is now slightly more user-friendly and available in the master branch of the github repo, available via:

remotes::install_github("pln-team/PLNmodels")

And the new code to project newdata into PCA space is:

library(PLNmodels)
library(ggplot2)
data("oaks")
## Learning set for PCA (80% of the samples)
set.seed(20210421)
learn <- sample(x = 1:nrow(oaks), size = ceiling(0.8*nrow(oaks)), replace = FALSE)
## newdata (remaining 20%)
newdata <- oaks[-learn, ]

## Compute PCA with 15 latent dimensions
myPCAmodels <- PLNPCA(Abundance ~ 1 + offset(log(Offset)), data = oaks[learn, ], ranks = 15)
myPCA <- myPCAmodels$getBestModel()

## Project newdata in PCA space and add metadata for easier plotting
scores <- myPCA$project(newdata)
scores <- cbind(newdata[ , c("Sample", "tree")], as.data.frame(scores))

## Plot original and newdata
p <- factoextra::fviz_pca_ind(myPCA, habillage = oaks[learn, "tree"], geom = "point", alpha.ind = 0.5)
p + geom_point(data = scores, aes(x = PC1, y = PC2, color = tree)) +
  geom_text(data = scores, aes(x = PC1, y = PC2, color = tree, label = Sample), vjust = 0)

image