alanocallaghan / scater

Clone of the Bioconductor repository for the scater package.
https://bioconductor.org/packages/devel/bioc/html/scater.html

runPCA out of memory #213

Closed Thapeachydude closed 1 week ago

Thapeachydude commented 2 weeks ago

Hello,

I was hoping for some wisdom on how to handle some very large datasets. The dataset I'm struggling with is a recently acquired 10x Xenium dataset with roughly 3 million cells and 5000 detected genes.

I tried running the following with 25 cores and a total of 500 GB of RAM, but even so it ran out of memory.

library(BiocParallel)   # MulticoreParam(), register()
library(BiocSingular)   # IrlbaParam()
library(scran)          # modelGeneVar(), getTopHVGs()
library(scater)         # runPCA()

ncores <- 25
mcparam <- MulticoreParam(workers=ncores)
register(mcparam)

set.seed(100)
dec <- modelGeneVar(sce.x) # model per-gene variance to identify highly variable genes
set.seed(100)
sce.x <- runPCA(sce.x, subset_row=getTopHVGs(dec), BPPARAM=mcparam, BSPARAM=IrlbaParam(), ntop=2000)

Of course I can reduce the number of features, but I was hoping for an alternative that lets me keep as much of the information in the dataset as possible.

Any insights are highly welcome!

Best

LTLA commented 2 weeks ago

An unfortunate aspect of the BPPARAM= argument with IrlbaParam() is that it may not actually make things run faster. Setting BPPARAM= instructs BiocSingular to switch from the fast (serial) C++ code to the slower R code, where parallelized matrix multiplication can be used. That multiplication is based on block processing, so the memory usage scales with the number of cores, as each core processes a different block. This gets particularly out of hand when the process-based nature of R's parallelization interacts with memory management: each process thinks it can still use all of the available memory, so the job as a whole exceeds its allocation and dies with an OOM error.

If your matrix is a dgCMatrix or an ordinary matrix, it is often better to just use BSPARAM = SerialParam(), so that the fast C++ code path can be used. Otherwise, if your matrix is a bit more exotic, you might consider using scrapper::runPca(), which avoids BiocParallel's memory usage scaling when the number of cores increases.
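
Untested, but the scrapper route would be something along these lines (I'm assuming logcounts and the same HVG set; see ?scrapper::runPca for threading options and the exact output format):

library(scrapper)

hvgs <- getTopHVGs(dec, n=2000)   # top HVGs, as in your original call
mat <- logcounts(sce.x)[hvgs, ]   # genes in rows, cells in columns
pca <- runPca(mat)                # see ?runPca for threading and other options
str(pca)                          # then copy the components into reducedDim(sce.x, "PCA")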

Thapeachydude commented 2 weeks ago

Hi, thanks for the quick reply. BSPARAM = SerialParam() threw an error, but BPPARAM = SerialParam() works. I'm running the following combo at the moment:

runPCA(sce.x, subset_row=getTopHVGs(dec), BPPARAM = SerialParam(), BSPARAM = IrlbaParam())

It hasn't finished yet, but it seems to use a lot less memory while still using all the CPUs. Thanks!

Is there a similar easy trick to avoid running out of memory during clustering? In the past I used to run:

clusterCells(sce.x, use.dimred = "PCA", BLUSPARAM = SNNGraphParam(k = opt$gexk, cluster.fun = "walktrap", BPPARAM=mcparam, type = "rank"))

But that already proved challenging with standard single-cell datasets. Would BPPARAM=SerialParam() + BNPARAM=AnnoyParam() also be the "fastest" version here?

LTLA commented 2 weeks ago

BSPARAM = SerialParam() threw an error

Oops. Sorry, that should have read BPPARAM=.

It hasn't finished yet, but it seems to use a lot less memory while still using all the CPUs. Thanks!

I'm not actually sure why it's using all the CPUs; if it's on the C++ path, the matrix multiplication should be single-threaded. Perhaps you have a parallelized BLAS library.
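
If you want to check, RhpcBLASctl (not a scater dependency, just a convenient utility) can report and cap the number of BLAS threads:

library(RhpcBLASctl)
blas_get_num_procs()     # how many threads the BLAS is currently allowed to use
blas_set_num_threads(1)  # pin the BLAS to one thread to rule it out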

Is there a similar easy trick to avoid running out of memory during clustering?

I wouldn't think that graph-based clustering would use all that much memory, but that depends on the size of your graph, which is in turn determined by the number of neighbors.

Switching to BNPARAM=AnnoyParam() will certainly speed things up. I don't think it will save much memory though. Perhaps also switching to cluster.fun="louvain" will help. You could then decrease the resolution=, which would achieve the same effect as increasing the number of neighbors without the extra memory usage.
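
For reference, roughly something like this (untested; passing resolution= through cluster.args= assumes a reasonably recent igraph where cluster_louvain() accepts it):

library(scran)          # clusterCells()
library(bluster)        # SNNGraphParam()
library(BiocNeighbors)  # AnnoyParam()
library(BiocParallel)   # SerialParam()

clusterCells(sce.x, use.dimred = "PCA",
    BLUSPARAM = SNNGraphParam(
        k = opt$gexk,                        # same k as before
        type = "rank",
        cluster.fun = "louvain",             # much cheaper than walktrap on large graphs
        cluster.args = list(resolution = 1), # lower this for broader clusters
        BNPARAM = AnnoyParam(),              # approximate neighbor search
        BPPARAM = SerialParam()))            # avoid per-worker memory scaling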

Thapeachydude commented 1 week ago

Indeed, cluster.fun="louvain" massively reduced the amount of memory/time needed. I think for now this can be closed.

Thanks a lot!