ChangSuBiostats / CS-CORE_python

Python package for CS-CORE, a statistical method for cell-type-specific co-expression inference from single cell RNA-sequencing data
MIT License

running out of memory #2

Closed daniel-spies closed 9 months ago

daniel-spies commented 9 months ago

Hi there, thanks for the great tool.

I followed the Python IRLS example, but after a short while I get an out-of-memory error and Python is killed (I am working on a cluster). Even requesting up to 512 GB of RAM did not help. I used a cluster subset of 500 cells with 3000 genes, so memory should not be a problem at this size.

When I import the same count table into R and run it there, the results are ready within a few seconds, so something in your Python implementation is exhausting the memory. I'm running CS-CORE with the following package versions:

python 3.9.16, numpy 1.23.5, scipy 1.7.2

Looking forward to using it in the future!

Best, Daniel

ChangSuBiostats commented 9 months ago

Hi Daniel,

Thanks for your interest in our work, and thank you for bringing this issue to our attention!

This is likely because our previous implementation was designed to take the count matrix as a NumPy array, with matrix multiplication implemented using NumPy functions. However, np.dot behaves unexpectedly when the count matrix is a SciPy csr_matrix: instead of computing a matrix product, it generates a large number of p*p matrices (p being the number of genes), which may explain the memory issue.
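For readers who hit the same problem in their own code, here is a minimal sketch of one way to guard against it: check whether the input is sparse and, if so, use SciPy's own sparse-aware product (the `@` operator) rather than np.dot. The function name `weighted_gram` and the weighted cross-product it computes are illustrative assumptions, not the actual CS-CORE code; see CSCORE_IRLS.py for the real implementation.

```python
import numpy as np
from scipy import sparse


def weighted_gram(counts, weights):
    """Return the dense p x p matrix X^T diag(w) X for an n x p matrix X.

    Hypothetical helper for illustration only (not part of CS-CORE).
    `counts` may be a NumPy array or a SciPy sparse matrix;
    `weights` is a 1-D array with one weight per row (cell).
    """
    weights = np.asarray(weights, dtype=float)
    if sparse.issparse(counts):
        # Stay inside scipy.sparse: `@` dispatches to the sparse matrix
        # product, so NumPy never sees the sparse object as an opaque scalar.
        X = sparse.csr_matrix(counts)
        D = sparse.diags(weights)      # n x n diagonal weight matrix
        return (X.T @ (D @ X)).toarray()
    # Dense input: plain NumPy broadcasting and matmul are fine.
    X = np.asarray(counts, dtype=float)
    return X.T @ (X * weights[:, None])


# Small usage example on simulated counts (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
dense_counts = rng.poisson(0.3, size=(50, 8)).astype(float)
cell_weights = rng.uniform(0.5, 2.0, size=50)
gram_sparse = weighted_gram(sparse.csr_matrix(dense_counts), cell_weights)
gram_dense = weighted_gram(dense_counts, cell_weights)
```

Both calls should return the same 8 x 8 NumPy array. Only the final p x p result is densified, so the intermediate products stay sparse.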

We have updated the implementation to take AnnData / csr_matrix as input. You can follow this notebook for an example, or see CSCORE_IRLS.py for the actual implementation. We have also benchmarked the time and memory usage of the new implementation in this notebook. It should run at a speed comparable to the R version.

I hope this helps! Feel free to leave a comment if you have more questions.

Best, Chang