aertslab / pycisTopic

pycisTopic is a Python module to simultaneously identify cell states and cis-regulatory topics from single cell epigenomics data.

Running pycisTopic on large datasets? #66

Open klgoss opened 1 year ago

klgoss commented 1 year ago
Any update on this? I'm trying to run pycisTopic on a large dataset (~114,000 cells) and have maxed out the memory available on our HPC (700 GB), and it still crashes. Any insight appreciated!

Originally posted by @klgoss in https://github.com/aertslab/pycisTopic/issues/14#issuecomment-1412586687

dburkhardt commented 1 year ago

@klgoss you might have more luck if you post a full bug report:

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Commands relevant to reproduce the error.

**Error output**
Paste the entire output of the command, including log information prior to the error.

**Expected behavior**
A clear and concise description of what you expected to happen.

**Screenshots**
If applicable, add screenshots to help explain your problem or show the format of the input data for the command/s.

**Version (please complete the following information):**
 - Python: [e.g. Python 3.7.4]
 - If a bug is related to another module [e.g. matplotlib 3.3.9]

**Additional context**
Add any other context about the problem here.

Where does it crash? What's the last step? What's the error message?

klgoss commented 1 year ago

It crashes at this step: cistopic_obj.add_cell_data(cell_data). This is with 12 nodes and 700 GB of memory. The error I'm getting is:

/var/lib/slurm/slurmd/job609736/slurm_script: line 17: 86137 Killed python pycisTopic.py
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=609736.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

dburkhardt commented 1 year ago

Hmm, this is a strange step to run out of memory at; all you're doing is adding metadata to the object.

What is the size of cell_data and cistopic_obj if you write to disk?

Can you share details of what's in cistopic_obj?

klgoss commented 1 year ago

The files are pretty large: the ATAC sparse matrix is around 55 GB and the cell data is 23432314 bytes (about 23 MB). This dataset is actually 13 samples, so I guess I could run pycisTopic on each sample separately if needed.

For more context, here is the full code up until the error:

# Create cisTopic object
import os
import pickle
import warnings

import pandas as pd  # imported explicitly rather than relying on the star import below
import pycisTopic
from pycisTopic.cistopic_class import *

warnings.simplefilter(action='ignore')

# Read the count table; pd.read_csv returns a dense DataFrame
count_matrix = pd.read_csv('final_atac_sparse.tsv', sep='\t')

path_to_blacklist = 'blacklist/hg38-blacklist.v2.bed'

cistopic_obj = create_cistopic_object(fragment_matrix=count_matrix, path_to_blacklist=path_to_blacklist)

# Adding cell information
cell_data = pd.read_csv('meta.tsv', sep='\t')
cistopic_obj.add_cell_data(cell_data)

dburkhardt commented 1 year ago

So first off, pd.read_csv('final_atac_sparse.tsv', sep='\t') doesn't return a sparse matrix (unless I'm very confused -- you can check using scipy.sparse.issparse()). You need to call scipy.sparse.csr_matrix on the data to make it sparse. Try that first. In the future, I recommend storing sparse matrices in either an AnnData object or at least as an mtx file.
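
For illustration, here is a minimal sketch of the sparsity check and conversion described above. The toy table, region labels, and cell names are made up for the example and are not from the dataset in this issue; only `scipy.sparse.issparse` and `scipy.sparse.csr_matrix` come from the comment.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-in for the real count table (hypothetical values):
# rows are regions, columns are cells.
dense_df = pd.DataFrame(
    np.array([[0, 2, 0], [0, 0, 0], [1, 0, 0]]),
    index=["chr1:100-200", "chr1:300-400", "chr2:50-150"],
    columns=["cell_a", "cell_b", "cell_c"],
)

# A DataFrame read from a text file is dense, so this check is False.
is_sparse_before = sparse.issparse(dense_df)

# Convert the values to CSR; the labels have to be kept separately
# because a SciPy sparse matrix carries no row or column names.
counts_csr = sparse.csr_matrix(dense_df.to_numpy())
region_names = dense_df.index.to_list()
cell_names = dense_df.columns.to_list()

is_sparse_after = sparse.issparse(counts_csr)
print(is_sparse_before, is_sparse_after, counts_csr.nnz)
```

Note that converting after `pd.read_csv` still requires the dense table to fit in memory once, so this lowers memory use for later steps but not the peak during loading.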

Second, something doesn't add up here. There's no way you should be running out of 700GB of RAM when your datasets are only 56GB on disk.
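
One possible explanation, offered only as a rough sketch: a text file that stores mostly zeros needs about two bytes per entry on disk, while a dense 64-bit array needs eight bytes per entry in memory, before counting any intermediate copies. The region count below is a hypothetical placeholder, since the thread does not state it; only the cell count comes from the issue.

```python
def dense_bytes(n_rows: int, n_cols: int, itemsize: int = 8) -> int:
    """Bytes needed to hold an n_rows x n_cols dense array of fixed-size items."""
    return n_rows * n_cols * itemsize


n_cells = 114_000    # cell count reported in the issue
n_regions = 200_000  # hypothetical region count, for illustration only

# Size of a single dense int64/float64 copy, in decimal gigabytes.
estimate_gb = dense_bytes(n_regions, n_cells) / 1e9
print(f"One dense copy: {estimate_gb:.1f} GB")
```

Under these assumed numbers a single dense copy is already a large fraction of the 700 GB allocation, and operations that copy the table would multiply that.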

Are you sure that this is the only script you're running on the node? There's absolutely nothing else there?

klgoss commented 1 year ago

@dburkhardt I am following the pycisTopic tutorial here: https://pycistopic.readthedocs.io/en/latest/Toy_melanoma-RTD.html# which is why I tried using pd.read_csv(). I created my sparse matrix from my Seurat object in R with the following code:

DefaultAssay(obj) <- "peaks"
a <- GetAssayData(object = obj, slot = "counts", assay = "peaks") %>% as.sparse()

write.csv(a, file = "test.txt", sep = "\t")

Is there another (more efficient) way to get the sparse matrix? Any insight is greatly appreciated!!
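
One option, following the earlier suggestion to store the matrix as an mtx file, is the Matrix Market format, which stores only the non-zero entries. On the R side the `Matrix` package provides `writeMM()` for this. The sketch below shows the Python side as a round trip on a toy matrix; the file name `counts.mtx` and the values are made up for the example, and passing the result to pycisTopic is left out because it would also need the region and cell names supplied separately.

```python
import os
import tempfile

import numpy as np
from scipy import io, sparse

# Toy sparse count matrix (hypothetical values): regions x cells.
counts = sparse.csr_matrix(np.array([[0, 2, 0], [0, 0, 0], [1, 0, 0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    mtx_path = os.path.join(tmp_dir, "counts.mtx")

    # Write only the non-zero entries in Matrix Market coordinate format.
    io.mmwrite(mtx_path, counts)

    # mmread returns a COO matrix; convert to CSR for row slicing and arithmetic.
    loaded = io.mmread(mtx_path).tocsr()

print(loaded.shape, loaded.nnz)
```

Because the matrix is never expanded into a dense table, memory use scales with the number of non-zero counts and not with regions times cells.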

ghuls commented 1 year ago

This pull request might help with your memory problems: https://github.com/aertslab/pycisTopic/pull/77

ghuls commented 1 year ago

Pull request #77 is merged in master now.