Improving memory efficiency for CITE-seq data analysis

LTLA commented 5 years ago

I would like this to work better than it currently does:

## ------------------------------------------------------------------------
# Caching it locally with BiocFileCache to avoid repeating it.
library(BiocFileCache)
bfc <- BiocFileCache(ask=FALSE)
stuff <- bfcrpath(bfc, file.path("http://cf.10xgenomics.com",
    "samples/cell-exp/3.0.0/pbmc_10k_protein_v3",
    "pbmc_10k_protein_v3_filtered_feature_bc_matrix.tar.gz"))
untar(stuff, exdir=tempdir())

# Loading it in as a SingleCellExperiment object.
library(DropletUtils)
sce <- read10xCounts(file.path(tempdir(), "filtered_feature_bc_matrix"))

## ------------------------------------------------------------------------
sce <- splitAltExps(sce, rowData(sce)$Type)
altExpNames(sce)
altExp(sce) # Can be used like any other SingleCellExperiment. 

counts(altExp(sce)) <- as.matrix(counts(altExp(sce)))
counts(altExp(sce))[,1:10] # sneak peek

## ------------------------------------------------------------------------
library(scater)
mito <- grep("^MT-", rowData(sce)$Symbol)
df <- perCellQCMetrics(sce, subsets=list(Mito=mito))
mito.discard <- isOutlier(df$subsets_Mito_percent, type="higher")

ab.detected <- df$`altexps_Antibody Capture_detected`
med.detected <- median(ab.detected)
threshold <- med.detected/2
ab.discard <- ab.detected < threshold

discard <- ab.discard | mito.discard
sce <- sce[,!discard]

## ------------------------------------------------------------------------
library(DelayedMatrixStats)
# TODO: move into the DropletUtils package.
ambient <- rowMeans(counts(altExp(sce)))
sf.amb <- colMedians(counts(altExp(sce))/ambient)
sf.amb <- sf.amb/mean(sf.amb)

## ------------------------------------------------------------------------
sizeFactors(altExp(sce)) <- sf.amb
sce <- logNormCounts(sce, use_altexps=TRUE)

## ------------------------------------------------------------------------
library(MOFA)
mobj <- createMOFAobject(list(genes=logcounts(sce),
    tags=logcounts(altExp(sce))))
mobj <- prepareMOFA(mobj)
mobj <- runMOFA(mobj)

This is hitting swap on my 16 GB laptop, and there's only 8000 cells involved. I'm a bit bemused about why this is occurring - even densifying the larger matrix should only cost ~2GB.

Session info

``` R version 3.6.0 Patched (2019-05-02 r76458) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.3 LTS Matrix products: default BLAS: /home/luna/Software/R/R-3-6-branch-dev/lib/libRblas.so LAPACK: /home/luna/Software/R/R-3-6-branch-dev/lib/libRlapack.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] MOFA_1.1.1 DelayedMatrixStats_1.7.2 [3] scater_1.13.27 ggplot2_3.2.1 [5] DropletUtils_1.5.12 SingleCellExperiment_1.7.11 [7] SummarizedExperiment_1.15.10 DelayedArray_0.11.8 [9] BiocParallel_1.19.5 matrixStats_0.55.0 [11] Biobase_2.45.1 GenomicRanges_1.37.17 [13] GenomeInfoDb_1.21.2 IRanges_2.19.18 [15] S4Vectors_0.23.25 BiocGenerics_0.31.6 [17] BiocFileCache_1.9.1 dbplyr_1.4.2 loaded via a namespace (and not attached): [1] viridis_0.5.1 httr_1.4.1 [3] edgeR_3.27.14 BiocSingular_1.1.7 [5] jsonlite_1.6 foreach_1.4.7 [7] bit64_0.9-7 viridisLite_0.3.0 [9] R.utils_2.9.0 assertthat_0.2.1 [11] dqrng_0.2.1 blob_1.2.0 [13] GenomeInfoDbData_1.2.2 vipor_0.4.5 [15] ggrepel_0.8.1 corrplot_0.84 [17] pillar_1.4.2 RSQLite_2.1.2 [19] backports_1.1.5 lattice_0.20-38 [21] reticulate_1.13 glue_1.3.1 [23] limma_3.41.18 digest_0.6.22 [25] RColorBrewer_1.1-2 XVector_0.25.0 [27] colorspace_1.4-1 cowplot_1.0.0 [29] plyr_1.8.4 Matrix_1.2-17 [31] R.oo_1.22.0 pkgconfig_2.0.3 [33] pheatmap_1.0.12 zlibbioc_1.31.0 [35] purrr_0.3.3 scales_1.0.0 [37] HDF5Array_1.13.11 MultiAssayExperiment_1.11.9 [39] tibble_2.1.3 withr_2.1.2 [41] lazyeval_0.2.2 magrittr_1.5 [43] crayon_1.3.4 memoise_1.1.0 [45] R.methodsS3_1.7.1 doParallel_1.0.15 [47] beeswarm_0.2.3 tools_3.6.0 [49] stringr_1.4.0 Rhdf5lib_1.7.6 [51] munsell_0.5.0 locfit_1.5-9.1 [53] irlba_2.3.3 compiler_3.6.0 [55] rsvd_1.0.2 rlang_0.4.0 [57] rhdf5_2.29.6 grid_3.6.0 [59] RCurl_1.95-4.12 iterators_1.0.12 [61] BiocNeighbors_1.3.5 rappdirs_0.3.1 [63] bitops_1.0-6 codetools_0.2-16 [65] gtable_0.3.0 DBI_1.0.0 [67] curl_4.2 reshape2_1.4.3 [69] R6_2.4.0 gridExtra_2.3 [71] dplyr_0.8.3 bit_1.1-14 [73] zeallot_0.1.0 stringi_1.4.3 [75] ggbeeswarm_0.6.0 Rcpp_1.0.2 [77] vctrs_0.2.0 tidyselect_0.2.5 ```

rargelaguet commented 5 years ago

Hi Aaron, MOFA v1 was not aimed at >1000 samples. You will find not just memory but also speed issues. We are about to release the new version that is tailored to single-cell data and which should significantly improve this. I'll invite you to the repository

LTLA commented 5 years ago

I'll continue this discussion off-line.

wyattmcdonnell commented 4 years ago

@rargelaguet could you invite me to that repo as well? thanks!

rargelaguet commented 4 years ago

I am still preparing the vignettes/documentation, I will release it in the next couple of days. Can it wait until Monday next week?

wyattmcdonnell commented 4 years ago

how cool! yes, I'd be happy to wait until Monday of next week. thanks again @rargelaguet!

bioFAM / MOFA

Improving memory efficiency for CITE-seq data analysis #49