Closed LTLA closed 1 year ago
FYI, this is my current blurb.
The
r Biocpkg("scMerge")
package [@lin2019scmerge] takes a different approach to batch correction. First, $k$-means clustering is used to group together cells within a batch, which is followed by the construction of pseudo-bulk samples by summing counts across all cells in each cluster. Mutually neighboring pairs of pseudo-bulk samples are identified for every pair of batches, enabling the identification of sets of "pseudo-replicate" clusters across batches. Stably expressed genes (SEGs) are also identified that do not differ between cells within each batch, akin to lowly variable genes. A factor analysis model is then constructed using the pseudo-replicate relationships and applied to the SEGs to identify factors of unwanted variation, i.e., the batch effects.
Just putting it here to avoid writing it again, because I need to remove this section until the deal-breakers are resolved so that I can compile the rest of the chapter.
Hi Aaron,
I have updated scMerge
to version 1.1.6 on the GitHub master branch. And it currently accepts matrix
, dgCMatrix
and DelayedArray
.
Running the following codes with matrix
class takes about 6 minutes. However, for if the matrices are kept as HDF5Array
, the computation is so slow I didn't manage to get an output for hours. Setting the BPPARAM
argument to BiocParallel::MulticoreParam(workers = 12)
(which gets passed off to BiocSingular::runSVD
) did not make the computation any faster.
Do you have any suggestions with possible configurations with respect to the BiocSingular::runSVD
?
library(TENxPBMCData)
library(SummarizedExperiment)
library(scMerge)
library(scater)
pbmc3k <- TENxPBMCData('pbmc3k')
pbmc4k <- TENxPBMCData('pbmc4k')
unfiltered <- pbmc3k
is.mito <- grep("MT", rowData(pbmc3k)$Symbol_TENx)
stats <- perCellQCMetrics(pbmc3k, subsets=list(Mito=is.mito))
high.mito <- isOutlier(stats$subsets_Mito_percent, nmads=3, type="higher")
pbmc3k <- pbmc3k[,!high.mito]
pbmc3k <- logNormCounts(pbmc3k)
unfiltered <- pbmc4k
is.mito <- grep("MT", rowData(pbmc4k)$Symbol_TENx)
stats <- perCellQCMetrics(pbmc4k, subsets=list(Mito=is.mito))
high.mito <- isOutlier(stats$subsets_Mito_percent, nmads=3, type="higher")
pbmc4k <- pbmc4k[,!high.mito]
pbmc4k <- logNormCounts(pbmc4k)
pbmc_combine = scMerge::sce_cbind(list(pbmc3k = pbmc3k, pbmc4k = pbmc4k),
method = "intersect")
dim(pbmc_combine)
data("segList_ensemblGeneID")
colnames(pbmc_combine) = paste0("gene", seq_len(ncol(pbmc_combine)))
SummarizedExperiment::assay(pbmc_combine, "counts") = as.matrix(SummarizedExperiment::assay(pbmc_combine, "counts"))
SummarizedExperiment::assay(pbmc_combine, "logcounts") = as.matrix(SummarizedExperiment::assay(pbmc_combine, "logcounts"))
scMerge::scMerge(
sce_combine = pbmc_combine,
ctl = segList_ensemblGeneID$human$human_scSEG,
kmeansK = c(10, 10),
assay_name = "scMerge_unsupervised",
verbose = TRUE,
BPPARAM = BiocParallel::SerialParam(),
BSPARAM = BiocSingular::IrlbaParam(fold = 5))
This takes about 6 minutes.
scMerge::scMerge(
sce_combine = pbmc_combine,
ctl = segList_ensemblGeneID$human$human_scSEG,
kmeansK = c(10, 10),
assay_name = "scMerge_unsupervised",
verbose = TRUE,
BPPARAM = BiocParallel::MulticoreParam(workers = 12),
BSPARAM = BiocSingular::IrlbaParam(fold = 5))
Don't set fold
, for starters. This will compute a cross-product, which is slow.
Also don't use IRLBA for file-backed matrices, RSVD is much faster as it touches the disk less.
Great, thanks @LTLA.
The latest version 1.1.6 on the master branch of the GitHub repo can process the HDF5Array
backend in about 11 minutes and the matrix
backend in about 1 minute.
I believe the latest version resolves most issues you raised, however, the issue with scSEGIndex
remains. I will keep working on this function in the coming days. Particularly the inconsistent parameter names in both camel/snake case.
library(TENxPBMCData)
library(SummarizedExperiment)
library(scMerge)
library(scater)
pbmc3k <- TENxPBMCData('pbmc3k')
pbmc4k <- TENxPBMCData('pbmc4k')
unfiltered <- pbmc3k
is.mito <- grep("MT", rowData(pbmc3k)$Symbol_TENx)
stats <- perCellQCMetrics(pbmc3k, subsets=list(Mito=is.mito))
high.mito <- isOutlier(stats$subsets_Mito_percent, nmads=3, type="higher")
pbmc3k <- pbmc3k[,!high.mito]
pbmc3k <- logNormCounts(pbmc3k)
unfiltered <- pbmc4k
is.mito <- grep("MT", rowData(pbmc4k)$Symbol_TENx)
stats <- perCellQCMetrics(pbmc4k, subsets=list(Mito=is.mito))
high.mito <- isOutlier(stats$subsets_Mito_percent, nmads=3, type="higher")
pbmc4k <- pbmc4k[,!high.mito]
pbmc4k <- logNormCounts(pbmc4k)
pbmc_combine = scMerge::sce_cbind(list(pbmc3k = pbmc3k, pbmc4k = pbmc4k),
method = "intersect")
dim(pbmc_combine)
data("segList_ensemblGeneID")
colnames(pbmc_combine) = paste0("gene", seq_len(ncol(pbmc_combine)))
##################################
system.time(
obj1 <- scMerge::scMerge(
sce_combine = pbmc_combine,
ctl = segList_ensemblGeneID$human$human_scSEG,
kmeansK = c(10, 10),
assay_name = "scMerge_unsupervised2",
verbose = TRUE,
BPPARAM = BiocParallel::SerialParam(),
svd_k = 50,
BSPARAM = BiocSingular::RandomParam(fold = Inf),
BACKEND = "HDF5Array")
)
SummarizedExperiment::assay(pbmc_combine, "counts") = as.matrix(SummarizedExperiment::assay(pbmc_combine, "counts"))
SummarizedExperiment::assay(pbmc_combine, "logcounts") = as.matrix(SummarizedExperiment::assay(pbmc_combine, "logcounts"))
system.time(
obj2 <- scMerge::scMerge(
sce_combine = pbmc_combine,
ctl = segList_ensemblGeneID$human$human_scSEG,
kmeansK = c(10, 10),
assay_name = "scMerge_unsupervised2",
verbose = TRUE,
BPPARAM = BiocParallel::SerialParam(),
svd_k = 50,
BSPARAM = BiocSingular::RandomParam(fold = Inf))
)
@LTLA, let me know if you are happy with the description of the scSEGIndex
function here or if you have any other suggestions.
One thing that I found to be particular strange is the speed associated with DelayedArray multiplication, which I opened up an issue here
Let me know your thoughts, cheers, Kevin
I tried to run the HDF5 example and the scMerge()
command used so much memory that my computer forcibly logged me out. And I have 16 GB of RAM on this baby, so that's not a good sign.
That aside, immediate impressions are that I shouldn't have to set BACKEND = "HDF5Array"
.
I ran this script on Google Cloud (VM with 15GB memory and another with 30GB) under a docker container. Both machines managed to handle both the DelayedArray+HDF5Array
and the matrix
version of scMerge
. The memory allocation for the top functions is here.
It seems like the main culprit of the large memory usage is the runSVD
call in running the HDF5 example. This blowing up memory usage essentially killed off any advantage offered by out-of-memory computations since the data when realize as a full matrix
is less than 1GB.
The time plot is here
For starters, computePCA_byHVGMarker
has BSPARAM = BiocSingular::ExactParam(fold=5)
rather than using the specified RandomParam
, so that's not good. This instructs the function to compute the cross-product, which is painfully slow for file-backed matrices (effectively needs to read the matrix into memory twice to achieve a result that is agnostic to the parallelization scheme).
Revisiting this. Should I try again with the latest version?
@YingxinLin is now the maintainer of this package and will investigate this. I remember the last time I looked into this, the part that was making the computation slow was fastRUVIII
, where the SVD is performed on the whole matrix. This also happened to be the part that takes the most RAM.
With the new updates in the DelayedArray
(and the BioC in general), it will be good to test if this is still the case. If this is true, then we need to think of some ways to speed this up. e.g. I am not sure of the speed gains by invoking the BPPARAM
parameter in the DelayedArray::runSVD
function.
Currently testing out scMerge for inclusion in the OSCA book chapter on data integration. Here are some inconveniences and deal-breakers that I've encountered:
Setup
You'll need to run the code in https://osca.bioconductor.org/integrating-datasets-1.html#setting-up-the-data-1 to create
pbmc3k
andpbmc4k
. There's quite a lot of code, so I won't bother copying that here, but my points should be fairly apparent without them.Executed code example
This is the code chunk that I actually ran to do the work:
Deal-breakers
as.matrix()
calls, which defeats the purpose of having special memory-efficient representations for large data sets.fast_svd=TRUE
. It was taking about 10 minutes on the PBMC example above (~8000 cells). I was hoping for 1-2 minute runtime.Inconveniences
cbind
the SCEs together (which then requires matchingrowData
). Allowing people to pass in a series of separate SCEs is one less barrier, check out the allowed inputs in various batchelor functions.scSEGIndex
was not obvious, had to actually ask @kevinwang09 for directions.