aertslab / AUCell

AUCell: score single cells with gene regulatory networks
110 stars 26 forks source link

Added prototype of block processing algorithm. #16

Closed LTLA closed 2 years ago

LTLA commented 3 years ago

Closes #14. To illustrate:

library(scRNAseq)
sce <- ZeiselBrainData()

library(AUCell)
library(GSEABase)
gmtFile <- paste(file.path(system.file('examples', package='AUCell')), "geneSignatures.gmt", sep="/")
geneSets <- getGmt(gmtFile)

output <- AUCell:::blocked_AUCell(counts(sce), geneIds(geneSets))
output
## AUC for 6 gene sets (rows) and 3005 cells (columns).
## 
## Top-left corner of the AUC matrix:
##                        cells
## gene sets               1772071015_C02 1772071017_G12 1772071017_A05
##   Astrocyte_Cahoy           0.13612188     0.12130869     0.13154845
##   Neuron_Cahoy              0.31817982     0.29461738     0.32486713
##   Oligodendrocyte_Cahoy     0.12325075     0.10638761     0.11422777
##   Astrocyte_Lein            0.07844790     0.06150269     0.07769220
##   Neuron_Lein               0.36950026     0.36965482     0.41504379
##   Microglia_lavin           0.06656563     0.05830827     0.06800066
##                        cells
## gene sets               1772071014_B06 1772067065_H06
##   Astrocyte_Cahoy           0.13260939      0.1306693
##   Neuron_Cahoy              0.32144855      0.3322178
##   Oligodendrocyte_Cahoy     0.10032967      0.1203936
##   Astrocyte_Lein            0.07882575      0.0415056
##   Neuron_Lein               0.41665808      0.4015628
##   Microglia_lavin           0.06911838      0.0645007

We can easily handle large matrices without coercing them into an ordinary matrix - or, heaven forbid, a data.table. One may also consider using DelayedMatrixStats to generate a ranking in a more efficient manner, especially for sparse data.

y <- as(counts(sce), "dgCMatrix")
output <- AUCell:::blocked_AUCell(y, geneIds(geneSets))
dim(output)

Parallelization on a variety of backends (forking, SNOW, Slurm, etc.) is easily supported, though there are some tricky issues with differences in results due to how the RNG stream is set up in each core. Is ties.method="random" really necessary? I also used the same approach for scran::correlatePairs in the past, but it was more trouble than it's worth.

output <- AUCell:::blocked_AUCell(counts(sce), geneIds(geneSets), 
    BPPARAM=BiocParallel::MulticoreParam(5))

In any case, this PR contains the minimal components required to achieve block processing. The count matrix above can be substituted with any matrix representation, including arbitrarily large file-backed HDF5Arrays, and it should continue to work. I will leave it to you to decide how you would like to integrate this into the other AUCell functions, if at all. (Note that DelayedArray is already a dependency of SummarizedExperiment, so this does not add any further dependencies.)

s-aibar commented 2 years ago

Thank you!

ptdtan commented 1 year ago

Was it merged into the main branch?