const-ae / lemur

Latent Embedding Multivariate Regression
https://www.bioconductor.org/packages/lemur/

Expectations for UMAP of embeddings #10

Open Thapeachydude opened 10 months ago

Thapeachydude commented 10 months ago

Hi,

interesting tool (and the use of the SingleCellExperiment class is much appreciated). We have single-cell RNA-seq data from multiple samples under various conditions. So far, I've seen the most sensible results from creating pseudobulks per cell type, sample, and condition and fitting a linear model ~ sample + Condition with edgeR.

So I was very curious when I saw your tool. So far, however, I'm not sure I understand the output. I ran:

fit <- lemur(sce, design = ~ sampleID + Treatment, n_embedding = 20)

set.seed(100)
fit <- runUMAP(fit, dimred = "embedding", n_neighbors = 15, min_dist = 0.25, name = "UMAP_embedding", BPPARAM = mcparam)

to get an overview of the embeddings. I would expect at least some separation based on the conditions (since for some of them the pseudo-bulk results are quite strong, and we can even appreciate them in a UMAP of PCA loadings). But I see a big blob of cells, and some very small individual groups. But no "Treatment-shifts".

Is this what you would expect?

Btw. is there a way to limit the memory of the lemur function call? It is very fast but super memory intensive. Ideally, I don't always need to run it on an HPC.

Best, M

const-ae commented 10 months ago

Hi M,

thanks for your interest and for reaching out.

> I would expect at least some separation based on the conditions (since for some of them the pseudo-bulk results are quite strong, and we can even appreciate them in a UMAP of PCA loadings)

LEMUR tries to absorb as much as possible of the variation associated with the known covariates into $R(x)$ and $S(x)$, so that the embedding ($Z$) shows the residual variation (i.e., everything that varies for reasons other than sampleID or Treatment). This typically corresponds to different cell states.
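To make the point concrete, here is a toy analogy in base R. It is not the LEMUR algorithm: it is an ordinary linear model on one simulated gene, with made-up effect sizes and the hypothetical names `treatment`, `cell_state`, and `fit_lm`. It only illustrates that once a known covariate has been accounted for, what remains no longer separates by that covariate but still reflects unmodelled structure.

```r
# Toy analogy (NOT the LEMUR algorithm): one gene, two cell states, two treatments.
set.seed(1)
n <- 400
treatment  <- factor(rep(c("ctrl", "treated"), each = n / 2))
cell_state <- factor(rep(c("A", "B"), times = n / 2))

# Simulated expression: treatment shifts the mean by 3, cell state by 2 (assumed values).
y <- 3 * (treatment == "treated") + 2 * (cell_state == "B") + rnorm(n, sd = 0.5)

# Account for the known covariate only; the cell state is treated as unknown.
fit_lm <- lm(y ~ treatment)
res <- residuals(fit_lm)

# Mean residual per treatment: both are zero up to rounding, so the treatment shift is gone.
res_by_treatment <- tapply(res, treatment, mean)
print(res_by_treatment)

# Mean residual per cell state: still about 2 apart, so the latent structure remains.
res_by_state <- tapply(res, cell_state, mean)
print(res_by_state)
```

In this analogy the residuals play the role of the embedding: they mix the treatments and separate only by the latent cell state.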

> But I see a big blob of cells, and some very small individual groups. But no "Treatment-shifts".

Depending on your data, this might be a reasonable outcome. If there is not much latent heterogeneity (is your data from a cell line, for example?), you would expect to see one big blob with cells from all conditions intermixed.

Best, Constantin

const-ae commented 10 months ago

> Btw. is there a way to limit the memory of the lemur function call? It is very fast but super memory intensive. Ideally, I don't always need to run it on an HPC.

There is currently no easy way to limit the memory requirements beyond subsampling your cells and subsetting to a reasonable set of highly variable genes.
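A minimal base-R sketch of those two steps, run on a simulated genes x cells matrix so that it stands on its own. The matrix `expr`, the sizes `n_genes` and `n_cells`, and the plain variance ranking are all assumptions made for illustration; they are not part of lemur.

```r
# Toy stand-in for a genes x cells count matrix (simulated, hypothetical data).
set.seed(1)
expr <- matrix(rpois(2000 * 500, lambda = 2), nrow = 2000, ncol = 500,
               dimnames = list(paste0("gene", 1:2000), paste0("cell", 1:500)))

n_genes <- 200   # assumed number of highly variable genes to keep
n_cells <- 100   # assumed number of cells to keep

# Rank genes by the variance of log-transformed counts and keep the top n_genes.
gene_var <- apply(log1p(expr), 1, var)
hvg <- names(sort(gene_var, decreasing = TRUE))[seq_len(n_genes)]

# Draw a random subset of cells without replacement.
cells <- sample(colnames(expr), n_cells)

# Subset to the selected genes (rows) and cells (columns).
expr_small <- expr[hvg, cells]
print(dim(expr_small))
```

A SingleCellExperiment supports the same `sce[genes, cells]` indexing, so the subsetting step should carry over directly. On real data a dedicated method such as scran's `modelGeneVar()` followed by `getTopHVGs()` would probably replace the plain variance ranking, and it may be worth sampling cells within each sampleID so that no condition drops out.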