Closed aadamk closed 3 years ago
Hello,
Thanks for your question. To use coseq for sample clustering, I think we can split the answer in two according to the algorithm used:
(1) For GMM, for me the main issue that requires some careful thought is the relatively large number of variables p (i.e. genes).
=> First, I would think that in such a case it would not be useful to try to make use of the compositional nature of the data (i.e., clustering profiles x/rowSums(x) or transformations of profiles like arcsin, logit, clr, logclr). I would think it would be better to work with normalized (for library size differences) and variance-stabilized (i.e. log or vst) counts.
=> Second, the default form of GMM as defined by Rmixmod used by coseq is "Gaussian_pk_Lk_Ck"
, a general family model with free proportions, free volume, free shape, and free orientation. Such a model is probably way overparameterized for clustering RNA-seq samples (where n << p). You could try using a simpler form of GMM instead, like "Gaussian_pk_Lk_I"
, which assumes spherical covariance matrices but varying proportions between clusters (see Table 1 in https://www.jstatsoft.org/article/view/v067i06).
(2) A GMM with spherical covariance matrices and identical proportions is actually roughly equivalent to a K-means clustering, so you could also try using that on the normalized variance-stabilized counts as well.
library(coseq)
set.seed(12345)
countmat <- matrix(runif(300*20, min=0, max=500), nrow=300, ncol=20)
countmat <- round(countmat[which(rowSums(countmat) > 0),])
lognorm <- transformRNAseq(y = countmat,
normFactors = "TMM",
transformation = "log")$tcounts
tlognorm <- t(lognorm)
run_gmm <- coseq(object=tlognorm,
K=2:15,
transformation="none",
normFactors="none",
model="Normal",
GaussianModel = "Gaussian_pk_Lk_I",
seed=12345)
run_kmeans <- coseq(object=tlognorm,
K=2:15,
transformation="none",
normFactors="none",
model="kmeans")
Either way, it is worth thinking carefully about whether you need to filter weakly expressed genes before filtering, and whether a hierarchical clustering on inter-sample distances would in fact be more adapted to your question.
AR
Thank you very much for your detailed response.
Hello, Is coseq appropriate for clustering at the sample level? If so, what quantitative adjustments can one make to ensure a valid analysis (e.g. in terms of data transformations that one can use as input for GMM)?