LTLA commented 5 years ago

Currently testing out scMerge for inclusion in the OSCA book chapter on data integration. Here are some inconveniences and deal-breakers that I've encountered:

Setup

You'll need to run the code in https://osca.bioconductor.org/integrating-datasets-1.html#setting-up-the-data-1 to create pbmc3k and pbmc4k. There's quite a lot of code, so I won't bother copying that here, but my points should be fairly apparent without them.

Executed code example

This is the code chunk that I actually ran to do the work:

library(scMerge)

# Identifying SEGs:
seg.3k <- scSEGIndex(logcounts(pbmc3k))
seg.4k <- scSEGIndex(logcounts(pbmc4k))
both <- intersect(rownames(seg.3k), rownames(seg.4k))

# Performing the rest of the correction:
combined <- cbind(pbmc3k, pbmc4k)

# TODO: shouldn't have to do this.
counts(combined) <- as.matrix(counts(combined))
logcounts(combined) <- as.matrix(logcounts(combined))
colnames(combined) <- seq_len(ncol(combined))
combined <- combined[rowSums(counts(combined)) > 0,]

sc.merged <- scMerge(sce_combine = combined, 
    ctl = both, kmeansK = c(10, 10), fast_svd=TRUE,
    assay_name = "scMerge_unsupervised")

Deal-breakers

No direct support for non-ordinary matrices. See the as.matrix() calls, which defeats the purpose of having special memory-efficient representations for large data sets.
Returns a dense matrix and blows up my memory, and I already have 16 GB RAM on this machine. This is factor analysis, so surely there's an option to return a low-dimensional representation at some point if I don't want the full expression matrix?
It takes a while, even with fast_svd=TRUE. It was taking about 10 minutes on the PBMC example above (~8000 cells). I was hoping for 1-2 minute runtime.

Inconveniences

Requires column names when that doesn’t seem necessary. The code should be geared to work without requiring the user to set column names, which usually aren't useful anyway.
Requires the user to remove of zero-count genes. Just do it for me, please.
Requires all cells to be in a single SCE. This is also inconvenient as it assumes that I could cbind the SCEs together (which then requires matching rowData). Allowing people to pass in a series of separate SCEs is one less barrier, check out the allowed inputs in various batchelor functions.
The use of scSEGIndex was not obvious, had to actually ask @kevinwang09 for directions.

LTLA commented 5 years ago

FYI, this is my current blurb.

The r Biocpkg("scMerge") package [@lin2019scmerge] takes a different approach to batch correction. First, $k$-means clustering is used to group together cells within a batch, which is followed by the construction of pseudo-bulk samples by summing counts across all cells in each cluster. Mutually neighboring pairs of pseudo-bulk samples are identified for every pair of batches, enabling the identification of sets of "pseudo-replicate" clusters across batches. Stably expressed genes (SEGs) are also identified that do not differ between cells within each batch, akin to lowly variable genes. A factor analysis model is then constructed using the pseudo-replicate relationships and applied to the SEGs to identify factors of unwanted variation, i.e., the batch effects.

Just putting it here to avoid writing it again, because I need to remove this section until the deal-breakers are resolved so that I can compile the rest of the chapter.

kevinwang09 commented 5 years ago

Hi Aaron, I have updated scMerge to version 1.1.6 on the GitHub master branch. And it currently accepts matrix, dgCMatrix and DelayedArray.

Running the following codes with matrix class takes about 6 minutes. However, for if the matrices are kept as HDF5Array, the computation is so slow I didn't manage to get an output for hours. Setting the BPPARAM argument to BiocParallel::MulticoreParam(workers = 12) (which gets passed off to BiocSingular::runSVD) did not make the computation any faster.

Do you have any suggestions with possible configurations with respect to the BiocSingular::runSVD?

Code that works

library(TENxPBMCData)
library(SummarizedExperiment)
library(scMerge)
library(scater)

pbmc3k <- TENxPBMCData('pbmc3k')
pbmc4k <- TENxPBMCData('pbmc4k')

unfiltered <- pbmc3k
is.mito <- grep("MT", rowData(pbmc3k)$Symbol_TENx)
stats <- perCellQCMetrics(pbmc3k, subsets=list(Mito=is.mito))
high.mito <- isOutlier(stats$subsets_Mito_percent, nmads=3, type="higher")
pbmc3k <- pbmc3k[,!high.mito]
pbmc3k <- logNormCounts(pbmc3k)

unfiltered <- pbmc4k
is.mito <- grep("MT", rowData(pbmc4k)$Symbol_TENx)
stats <- perCellQCMetrics(pbmc4k, subsets=list(Mito=is.mito))
high.mito <- isOutlier(stats$subsets_Mito_percent, nmads=3, type="higher")
pbmc4k <- pbmc4k[,!high.mito]
pbmc4k <- logNormCounts(pbmc4k)

pbmc_combine = scMerge::sce_cbind(list(pbmc3k = pbmc3k, pbmc4k = pbmc4k),
                                  method = "intersect")
dim(pbmc_combine)
data("segList_ensemblGeneID")
colnames(pbmc_combine) = paste0("gene", seq_len(ncol(pbmc_combine)))
SummarizedExperiment::assay(pbmc_combine, "counts") = as.matrix(SummarizedExperiment::assay(pbmc_combine, "counts"))
SummarizedExperiment::assay(pbmc_combine, "logcounts") = as.matrix(SummarizedExperiment::assay(pbmc_combine, "logcounts"))

scMerge::scMerge(
    sce_combine = pbmc_combine,
    ctl = segList_ensemblGeneID$human$human_scSEG,
    kmeansK = c(10, 10),
    assay_name = "scMerge_unsupervised", 
    verbose = TRUE,
    BPPARAM = BiocParallel::SerialParam(),
    BSPARAM = BiocSingular::IrlbaParam(fold = 5))

This takes about 6 minutes.

Code that takes too long

scMerge::scMerge(
  sce_combine = pbmc_combine,
  ctl = segList_ensemblGeneID$human$human_scSEG,
  kmeansK = c(10, 10),
  assay_name = "scMerge_unsupervised", 
  verbose = TRUE,
  BPPARAM = BiocParallel::MulticoreParam(workers = 12),
  BSPARAM = BiocSingular::IrlbaParam(fold = 5))

LTLA commented 5 years ago

Don't set fold, for starters. This will compute a cross-product, which is slow.

Also don't use IRLBA for file-backed matrices, RSVD is much faster as it touches the disk less.

kevinwang09 commented 5 years ago

Great, thanks @LTLA.

The latest version 1.1.6 on the master branch of the GitHub repo can process the HDF5Array backend in about 11 minutes and the matrix backend in about 1 minute.

I believe the latest version resolves most issues you raised, however, the issue with scSEGIndex remains. I will keep working on this function in the coming days. Particularly the inconsistent parameter names in both camel/snake case.

library(TENxPBMCData)
library(SummarizedExperiment)
library(scMerge)
library(scater)

pbmc3k <- TENxPBMCData('pbmc3k')
pbmc4k <- TENxPBMCData('pbmc4k')

unfiltered <- pbmc3k
is.mito <- grep("MT", rowData(pbmc3k)$Symbol_TENx)
stats <- perCellQCMetrics(pbmc3k, subsets=list(Mito=is.mito))
high.mito <- isOutlier(stats$subsets_Mito_percent, nmads=3, type="higher")
pbmc3k <- pbmc3k[,!high.mito]
pbmc3k <- logNormCounts(pbmc3k)

unfiltered <- pbmc4k
is.mito <- grep("MT", rowData(pbmc4k)$Symbol_TENx)
stats <- perCellQCMetrics(pbmc4k, subsets=list(Mito=is.mito))
high.mito <- isOutlier(stats$subsets_Mito_percent, nmads=3, type="higher")
pbmc4k <- pbmc4k[,!high.mito]
pbmc4k <- logNormCounts(pbmc4k)

pbmc_combine = scMerge::sce_cbind(list(pbmc3k = pbmc3k, pbmc4k = pbmc4k),
                                  method = "intersect")
dim(pbmc_combine)
data("segList_ensemblGeneID")
colnames(pbmc_combine) = paste0("gene", seq_len(ncol(pbmc_combine)))
##################################
system.time(
  obj1 <- scMerge::scMerge(
    sce_combine = pbmc_combine,
    ctl = segList_ensemblGeneID$human$human_scSEG,
    kmeansK = c(10, 10),
    assay_name = "scMerge_unsupervised2", 
    verbose = TRUE,
    BPPARAM = BiocParallel::SerialParam(),
    svd_k = 50,
    BSPARAM = BiocSingular::RandomParam(fold = Inf),
    BACKEND = "HDF5Array")
)

SummarizedExperiment::assay(pbmc_combine, "counts") = as.matrix(SummarizedExperiment::assay(pbmc_combine, "counts"))
SummarizedExperiment::assay(pbmc_combine, "logcounts") = as.matrix(SummarizedExperiment::assay(pbmc_combine, "logcounts"))

system.time(
  obj2 <- scMerge::scMerge(
    sce_combine = pbmc_combine,
    ctl = segList_ensemblGeneID$human$human_scSEG,
    kmeansK = c(10, 10),
    assay_name = "scMerge_unsupervised2", 
    verbose = TRUE,
    BPPARAM = BiocParallel::SerialParam(),
    svd_k = 50,
    BSPARAM = BiocSingular::RandomParam(fold = Inf))
)

kevinwang09 commented 5 years ago

@LTLA, let me know if you are happy with the description of the scSEGIndex function here or if you have any other suggestions.

One thing that I found to be particular strange is the speed associated with DelayedArray multiplication, which I opened up an issue here

Let me know your thoughts, cheers, Kevin

LTLA commented 5 years ago

I tried to run the HDF5 example and the scMerge() command used so much memory that my computer forcibly logged me out. And I have 16 GB of RAM on this baby, so that's not a good sign.

That aside, immediate impressions are that I shouldn't have to set BACKEND = "HDF5Array".

kevinwang09 commented 5 years ago

I ran this script on Google Cloud (VM with 15GB memory and another with 30GB) under a docker container. Both machines managed to handle both the DelayedArray+HDF5Array and the matrix version of scMerge. The memory allocation for the top functions is here.

It seems like the main culprit of the large memory usage is the runSVD call in running the HDF5 example. This blowing up memory usage essentially killed off any advantage offered by out-of-memory computations since the data when realize as a full matrix is less than 1GB.

The time plot is here

LTLA commented 5 years ago

For starters, computePCA_byHVGMarker has BSPARAM = BiocSingular::ExactParam(fold=5) rather than using the specified RandomParam, so that's not good. This instructs the function to compute the cross-product, which is painfully slow for file-backed matrices (effectively needs to read the matrix into memory twice to achieve a result that is agnostic to the parallelization scheme).

LTLA commented 3 years ago

Revisiting this. Should I try again with the latest version?

kevinwang09 commented 3 years ago

@YingxinLin is now the maintainer of this package and will investigate this. I remember the last time I looked into this, the part that was making the computation slow was fastRUVIII, where the SVD is performed on the whole matrix. This also happened to be the part that takes the most RAM.

With the new updates in the DelayedArray (and the BioC in general), it will be good to test if this is still the case. If this is true, then we need to think of some ways to speed this up. e.g. I am not sure of the speed gains by invoking the BPPARAM parameter in the DelayedArray::runSVD function.

SydneyBioX / scMerge

Tire kicking for use in the OSCA book #14

Setup

Executed code example

Deal-breakers

Inconveniences

Code that works

Code that takes too long