KlugerLab / ALRA

Imputation method for scRNA-seq based on low-rank approximation
MIT License

Tips for larger matrices? #11

Open PedroMilanezAlmeida opened 3 years ago

PedroMilanezAlmeida commented 3 years ago

I am working with a matrix that has 53201 cells and 20245 genes.

Its size in memory is only 482 MB as a dgCMatrix, but 8.62 GB after as.matrix().

When I try RunALRA from Seurat, I get:

Error: vector memory exhausted (limit reached?)

The same happens if I run alra(A_norm = as.matrix(normRNA)) with use.mkl = FALSE or use.mkl = TRUE (with use.mkl = TRUE it just takes a lot longer before showing the error).

Do you have any suggestions for how to run on large matrices on a laptop?
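
For context, that 8.62 GB is roughly what a single dense double-precision copy of this matrix costs, since a base R matrix stores every entry as an 8-byte double, zeros included:

n_cells <- 53201
n_genes <- 20245
n_cells * n_genes * 8 / 1e9   # ~8.6 GB for one dense copy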

linqiaozhi commented 3 years ago

Hi Pedro, thanks for your interest in ALRA.

The ALRA function produces multiple copies of the matrix, which can be problematic when memory is limited. The matrices are duplicated in memory because we originally thought people would want to access the imputed matrix before scaling and thresholding. That does not seem to be the case; in practice, we are pretty much only interested in the final matrix.

Please see this branch, where I added a function called alra.low.memory(). That should reduce the memory footprint. Can you try that function? See here.
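
The intended usage is roughly the following. This is only a sketch: it assumes you have sourced alra.R from that branch, and that alra.low.memory() takes the same input as alra(), i.e. a normalized, cells-by-genes dense matrix; please check the branch for the exact arguments and return value.

source("alra.R")                    # the version from the branch above
A_norm <- as.matrix(normRNA)        # your normalized matrix, cells in rows, genes in columns
result <- alra.low.memory(A_norm)   # should keep only the final imputed matrix in memory
# The exact shape of `result` may differ from alra(); inspect it before use.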

If you are still having trouble, can you tell me at which step you actually get the error? Also, how much memory does your laptop have? Are any of these steps helpful?

PedroMilanezAlmeida commented 3 years ago

Hi George, thanks for the quick feedback!

If I understand correctly, the change in alra.low.memory is at line 271 (don't return all the matrices), right? However, when I tried to run alra step by step last night, memory was already exhausted at line 232 (A_norm_rank_k is already another (approximate) copy of A_norm, occupying an additional 8.6 GB of memory).

While going through alra step by step, I tried to convert A_norm_rank_k to a dgCMatrix, but, probably because A_norm_rank_k is not sparse, the conversion also exhausted the memory. I also tried to force the matrix multiplications at line 227 to return a dgCMatrix by converting fastDecomp_noc$u, fastDecomp_noc$v and diag(fastDecomp_noc$d) to dgCMatrix, but the multiplications as dgCMatrix blew up memory anyway and never finished.
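
That makes sense in hindsight: a rank-k reconstruction U D V' has essentially no exact zeros, so a sparse format cannot help. A tiny illustration with toy matrices (not the ALRA code itself):

library(Matrix)
set.seed(1)
U <- matrix(rnorm(1000 * 5), 1000, 5)
V <- matrix(rnorm(200 * 5), 200, 5)
d <- abs(rnorm(5))
A_rank_k <- U %*% diag(d) %*% t(V)            # dense 1000 x 200 reconstruction
mean(A_rank_k == 0)                           # ~0: essentially no exact zeros
object.size(A_rank_k)                         # dense storage
object.size(as(A_rank_k, "CsparseMatrix"))    # not smaller; sparse formats need zeros to pay off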

My workaround for the moment is to run alra only on the 2,000 variable genes instead of the entire matrix, which now runs pretty smoothly and quickly, but I haven't yet checked whether the results are any good.
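
Roughly what I did, as a sketch: it assumes a Seurat object seu that has already been normalized, and the calls are illustrative rather than the exact code I ran.

library(Seurat)
seu <- FindVariableFeatures(seu, nfeatures = 2000)
hvg <- VariableFeatures(seu)

# ALRA expects cells in rows and genes in columns, so transpose the
# genes-by-cells matrix that Seurat stores in the "data" slot.
A_norm <- t(as.matrix(GetAssayData(seu, slot = "data")[hvg, ]))
result <- alra(A_norm)
A_completed <- result[[3]]   # alra() returns a list; the third element is the final scaled and thresholded matrix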

By the way, my laptop has 16 GB of memory and I have not tried changing R_MAX_VSIZE in .Renviron yet.
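
If I do try that, my understanding is that the "vector memory exhausted" error comes from the macOS vector-allocation cap in R, and raising it would just be a line like this in ~/.Renviron (example value only; it obviously cannot exceed physical RAM plus swap):

# ~/.Renviron, then restart R
R_MAX_VSIZE=32Gb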

Biomiha commented 6 months ago

Hi @linqiaozhi,

I've recently come across this issue as well, having hit the memory limits due to the large number of cells we are analysing. I noticed a bug in the alra.low.memory function: there is a line that checks whether the class of the input A_norm is a matrix, but the if statement only tests class(A_norm) == "matrix". In my case the output of class(A_norm) is the vector c("matrix", "array"), so the comparison has length two and generates an error that prevents the function from completing. I have submitted a pull request where the if statement checks whether "matrix" is present in the class vector and ignores any additional classes.
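
For reference, since R 4.0 class() on a base matrix returns c("matrix", "array"), which is why the scalar comparison breaks. A small illustration of robust alternatives (not the exact diff in the PR):

A_norm <- matrix(rnorm(6), nrow = 2)
class(A_norm)                  # "matrix" "array" in R >= 4.0

# class(A_norm) == "matrix" is therefore a length-2 logical, which warns
# (and, in recent R versions, errors) when used as an if() condition.
is.matrix(A_norm)              # TRUE
inherits(A_norm, "matrix")     # TRUE
"matrix" %in% class(A_norm)    # TRUE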

Thanks