Clustering at the sample level

aadamk commented 3 years ago

Hello, Is coseq appropriate for clustering at the sample level? If so, what quantitative adjustments can one make to ensure a valid analysis (e.g. in terms of data transformations that one can use as input for GMM)?

andreamrau commented 3 years ago

Hello,

Thanks for your question. To use coseq for sample clustering, I think we can split the answer in two according to the algorithm used:

(1) For GMM, for me the main issue that requires some careful thought is the relatively large number of variables p (i.e. genes).

=> First, I would think that in such a case it would not be useful to try to make use of the compositional nature of the data (i.e., clustering profiles x/rowSums(x) or transformations of profiles like arcsin, logit, clr, logclr). I would think it would be better to work with normalized (for library size differences) and variance-stabilized (i.e. log or vst) counts.

=> Second, the default form of GMM as defined by Rmixmod used by coseq is "Gaussian_pk_Lk_Ck", a general family model with free proportions, free volume, free shape, and free orientation. Such a model is probably way overparameterized for clustering RNA-seq samples (where n << p). You could try using a simpler form of GMM instead, like "Gaussian_pk_Lk_I", which assumes spherical covariance matrices but varying proportions between clusters (see Table 1 in https://www.jstatsoft.org/article/view/v067i06).

(2) A GMM with spherical covariance matrices and identical proportions is actually roughly equivalent to a K-means clustering, so you could also try using that on the normalized variance-stabilized counts as well.

library(coseq)
set.seed(12345)
countmat <- matrix(runif(300*20, min=0, max=500), nrow=300, ncol=20)
countmat <- round(countmat[which(rowSums(countmat) > 0),])

lognorm <- transformRNAseq(y = countmat,
                           normFactors = "TMM",
                           transformation = "log")$tcounts
tlognorm <- t(lognorm)

run_gmm <- coseq(object=tlognorm,
                 K=2:15,
                 transformation="none",
                 normFactors="none",
                 model="Normal",
                 GaussianModel = "Gaussian_pk_Lk_I",
                 seed=12345)

run_kmeans <- coseq(object=tlognorm,
                 K=2:15,
                 transformation="none",
                 normFactors="none",
                 model="kmeans")

Either way, it is worth thinking carefully about whether you need to filter weakly expressed genes before filtering, and whether a hierarchical clustering on inter-sample distances would in fact be more adapted to your question.

AR

aadamk commented 3 years ago

Thank you very much for your detailed response.

andreamrau / coseq

Clustering at the sample level #4