NathanSkene / EWCE

Expression Weighted Celltype Enrichment. See the package website for up-to-date instructions on usage.
https://nathanskene.github.io/EWCE/index.html

Get it working with large input matrices. #8

Closed NathanSkene closed 2 years ago

prashanthsama1 commented 4 years ago

Hi,

Thanks for the great package.

I am interested to know whether there are any updates on this issue, as I recently started working with a large dataset and ran into errors because of the large matrix file.

NathanSkene commented 4 years ago

Thanks for filing an issue! We're working on pushing a new version of the generate.celltype.data function now, which should handle large matrices better and run faster. It will be a few more days before it's pushed to master though!

marieniemi commented 4 years ago

Hello Nathan,

Thanks for the package! I'm also trying to use this on a large dataset, and appear to be experiencing an issue related to dataset size.

When using either drop.uninformative.genes or generate.celltype.data, it runs for about 4 seconds and then exits with an error:

Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

My file size is 4.1G. The max memory on my VM is 98.3G, so I wonder if it's trying to reserve more memory than is available. Would you recommend increasing the VM memory? On a 59M test file the program works well; it's only with this larger file that I'm experiencing issues.

Is this related to the same issue that prashanthsama1 raised, or should I raise a new issue? The version I'm using is f840d130d3c105542753b3a99f4bdf762af50167

NathanSkene commented 4 years ago

Hi @marieniemi, I still haven't gotten around to 'packaging' the functions for large matrices (sorry, trying to get my lab up and running). For now, convert the matrix to a sparse matrix, get the per-celltype means with the approach in https://github.com/NathanSkene/EWCE/issues/13, then divide each row by its sum: that's the specificity matrix. I'll need to think about the drop.uninformative.genes function; there should be a way to get that working with large matrices as well (at the simplest level, it could just drop genes with too few reads).
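The workaround described above can be sketched in a few lines of R with the Matrix package. This is a minimal illustration of the idea (mean per cell type, then row-normalize), not the actual EWCE code; all object names and the toy data are made up:

```r
library(Matrix)

# Toy data: 4 genes x 6 cells, two cell types (illustrative only)
exp <- Matrix(matrix(c(0, 2, 4, 0, 1, 1,
                       5, 5, 0, 0, 0, 0,
                       1, 0, 1, 2, 2, 2,
                       0, 0, 0, 3, 3, 3),
                     nrow = 4, byrow = TRUE), sparse = TRUE)
annot <- factor(c("A", "A", "A", "B", "B", "B"))

# Mean expression of each gene per cell type, without densifying the
# full matrix: sum the columns belonging to each cell type, then
# divide by the number of cells of that type.
ct_sums  <- sapply(levels(annot),
                   function(ct) Matrix::rowSums(exp[, annot == ct, drop = FALSE]))
ct_means <- sweep(ct_sums, 2, as.numeric(table(annot)), "/")

# Specificity: divide each row (gene) by its sum across cell types,
# so each row sums to 1.
specificity <- ct_means / rowSums(ct_means)
```

Here rowSums on a sparse matrix stays sparse-aware, so the only dense object ever created is the small genes x celltypes matrix of means.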

marieniemi commented 4 years ago

Thanks, Nathan! And congrats on the new lab

kleurless commented 4 years ago

I'm running into the same problem. @NathanSkene, what exactly do you mean by "get the means"? Get the mean expression of each gene per cell type, and then divide each of those means by the sum of that gene's expression across all cells? Besides this, do you build the specificity matrix from the raw counts or from normalized, log-transformed expression? I can't find this in the documentation.

bschilder commented 3 years ago

Hi there,

I seem to be running into the same issue at both steps, even though my matrix is already sparse. I'm using the expression matrix from the Linnarsson lab's developmental mouse brain dataset (LaManno2020).

Same error occurs on both HPC (qsub -I -l select=01:ncpus=8:mem=96gb -l walltime=08:00:00) and the Threadripper (64 cores).

> exp_drop <- EWCE::drop.uninformative.genes(exp=X,
                                            # Must be a factor
                                            level2annot =as.factor(meta[[level2]]))
Error in asMethod(object) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 102

Checking matrix info:

> dim(X)
[1]  16474 510193

> class(X)
[1] "dgRMatrix"
attr(,"package")
[1] "Matrix"
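For context, my understanding of why this particular matrix trips the error (an interpretation, not from the EWCE docs): the "problem too large" message comes from Cholmod when a sparse-to-dense coercion would exceed its 32-bit element-count limit of 2^31 - 1 entries. A quick sanity check:

```r
# dim(X) reported above; check how many entries a dense copy would need
dims    <- c(16474, 510193)
n_dense <- prod(dims)              # ~8.4 billion entries if densified

# Cholmod's dense format is limited to .Machine$integer.max entries,
# so this coercion is roughly 4x over the limit.
n_dense > .Machine$integer.max     # TRUE -> "problem too large"
```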
bschilder commented 3 years ago

I've developed a DelayedArray implementation of both functions that were running into memory allocation issues. Basically, it solves this problem by chunking the matrix and parallelizing operations across these chunks.

drop.uninformative.genes()

generate.celltype.data()

See pull request here, which includes various other enhancements and new helper functions.
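The chunking strategy described above can be sketched with plain Matrix code (a simplified illustration of the idea, not the DelayedArray implementation in the PR; the toy data and all names are made up):

```r
library(Matrix)

# Toy stand-ins for the real genes x cells matrix and cell annotations
exp   <- Matrix::rsparsematrix(100, 600, density = 0.1)
annot <- factor(rep(c("A", "B", "C"), each = 200))

chunk_size <- 200   # cells per chunk; real code would tune this to RAM
chunks <- split(seq_len(ncol(exp)),
                ceiling(seq_len(ncol(exp)) / chunk_size))

# Accumulate per-celltype sums one chunk at a time, so no single
# operation ever touches (or densifies) the whole matrix at once.
ct_sums <- matrix(0, nrow(exp), nlevels(annot),
                  dimnames = list(NULL, levels(annot)))
for (idx in chunks) {
  block <- exp[, idx, drop = FALSE]
  for (ct in levels(annot)) {
    keep <- annot[idx] == ct
    if (any(keep))
      ct_sums[, ct] <- ct_sums[, ct] +
        Matrix::rowSums(block[, keep, drop = FALSE])
  }
}
ct_means <- sweep(ct_sums, 2, as.numeric(table(annot)), "/")
```

Since the chunks are independent, the per-chunk sums are exactly what gets parallelized in the DelayedArray version.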

malosreet commented 2 years ago

Hello! Thank you for making a solution that will work for large datasets. Should I install a specific branch from Github to get this version? Which branch should I use?

bschilder commented 2 years ago

Hi @malosreet, this is currently implemented in the bschilder_dev branch. So you would install it with

remotes::install_github("NathanSkene/EWCE@bschilder_dev")

The one caveat atm is that you need to first store your sc dataset as an HDF5SummarizedExperiment, because this prevents the entire dataset from being realised into memory at once. scKirby, another tool from our group, should be able to help with this conversion process (though it's still in alpha development, so it hasn't been tested extensively just yet!).
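For anyone unfamiliar with that step, saving a dataset as an HDF5-backed SummarizedExperiment looks roughly like this (a sketch with made-up object names, using the standard HDF5Array functions rather than scKirby):

```r
library(SummarizedExperiment)
library(HDF5Array)

# Stand-in for the real single-cell count matrix
mat <- matrix(rpois(200, 5), nrow = 20)
se  <- SummarizedExperiment(assays = list(counts = mat))

# Write the assay data to HDF5 on disk
dir <- file.path(tempdir(), "sce_h5")
saveHDF5SummarizedExperiment(se, dir = dir, replace = TRUE)

# Reloading gives an object whose assay is an HDF5-backed DelayedMatrix,
# read from disk in chunks rather than held fully in memory.
se2 <- loadHDF5SummarizedExperiment(dir)
class(assay(se2))
```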

We plan to merge this into the master branch very soon.