klgoss opened 1 year ago
@klgoss you might have more luck if you post a full bug report:
**Describe the bug**
A clear and concise description of what the bug is.
**To Reproduce**
Commands relevant to reproduce the error.
**Error output**
Paste the entire output of the command, including log information prior to the error.
**Expected behavior**
A clear and concise description of what you expected to happen.
**Screenshots**
If applicable, add screenshots to help explain your problem or show the format of the input data for the command/s.
**Version (please complete the following information):**
- Python: [e.g. Python 3.7.4]
- Any other module the bug is related to: [e.g. matplotlib 3.3.9]
**Additional context**
Add any other context about the problem here.
Where does it crash? What's the last step? What's the error message?
It crashes at this step: `cistopic_obj.add_cell_data(cell_data)`. This is with 12 nodes and 700 GB of memory. The error I'm getting is:

```
/var/lib/slurm/slurmd/job609736/slurm_script: line 17: 86137 Killed    python pycisTopic.py
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=609736.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
```
Hmm, this is a strange step to run out of memory; all you're doing is adding metadata to the object. What is the size of `cell_data` and `cistopic_obj` if you write them to disk? Can you share details of what's in `cistopic_obj`?
The files are pretty large: the ATAC sparse matrix is around 55 GB and the cell data is 23432314 bytes (about 23 MB). This dataset is actually 13 samples, so I guess I could run pycisTopic on each sample separately if needed.
For more context, here is the full code up until the error:
```python
# Create cisTopic object
import pycisTopic
from pycisTopic.cistopic_class import *
import warnings
warnings.simplefilter(action='ignore')
import os
import pickle
import pandas as pd  # required for pd.read_csv

count_matrix = pd.read_csv('final_atac_sparse.tsv', sep='\t')
path_to_blacklist = 'blacklist/hg38-blacklist.v2.bed'
cistopic_obj = create_cistopic_object(fragment_matrix=count_matrix, path_to_blacklist=path_to_blacklist)

# Adding cell information
cell_data = pd.read_csv('meta.tsv', sep='\t')
cistopic_obj.add_cell_data(cell_data)
```
So first off, `pd.read_csv('final_atac_sparse.tsv', sep='\t')` doesn't return a sparse matrix (unless I'm very confused; you can check using `scipy.sparse.issparse()`). You need to call `scipy.sparse.csr_matrix` on the data to make it sparse. Try that first. In the future, I recommend storing sparse matrices in either an AnnData object or at least as an `mtx` file.
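A minimal sketch of that check and conversion, using a toy DataFrame standing in for the count table loaded with `pd.read_csv`:

```python
import pandas as pd
import scipy.sparse

# Toy stand-in for the region-by-cell count table read from the TSV
count_matrix = pd.DataFrame(
    [[0, 3, 0], [1, 0, 0]],
    index=['chr1:100-600', 'chr1:1000-1500'],  # regions
    columns=['cellA', 'cellB', 'cellC'],
)

# pd.read_csv produces a dense DataFrame, not a sparse matrix
print(scipy.sparse.issparse(count_matrix.to_numpy()))  # False

# Convert to compressed sparse row format so only nonzero counts are stored
sparse_counts = scipy.sparse.csr_matrix(count_matrix.to_numpy())
print(scipy.sparse.issparse(sparse_counts))  # True
```

For a mostly-zero ATAC count matrix, the CSR representation stores only the nonzero entries, which is where the memory savings come from.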
Second, something doesn't add up here. There's no way you should be running out of 700 GB of RAM when your datasets are only about 56 GB on disk. Are you sure that this is the only script you're running on the node? There's absolutely nothing else there?
@dburkhardt I am following the pycisTopic tutorial here: https://pycistopic.readthedocs.io/en/latest/Toy_melanoma-RTD.html# which is why I tried using `pd.read_csv()`. I created my sparse matrix from my Seurat object in R with the following code:
```r
DefaultAssay(obj) <- "peaks"
a <- GetAssayData(object = obj, slot = "counts", assay = "peaks") %>% as.sparse()
# Note: write.csv() ignores the sep argument; write.table() is needed for tab separation
write.table(as.matrix(a), file = "test.txt", sep = "\t")
```
Is there another (more efficient) way to get the sparse matrix? Any insight is greatly appreciated!!
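(For reference, one lower-memory route is the `mtx` format suggested above: write the sparse matrix from R with `Matrix::writeMM()` and read it back in Python with `scipy.io.mmread`, which keeps the data sparse throughout. A minimal round-trip sketch in Python, with an illustrative file name:)

```python
import scipy.io
import scipy.sparse

# Toy sparse matrix standing in for the peak-by-cell counts
counts = scipy.sparse.csr_matrix([[0, 3, 0], [1, 0, 0]])

# Round-trip through Matrix Market format, the same format
# Matrix::writeMM() produces on the R side
scipy.io.mmwrite('test.mtx', counts)
loaded = scipy.sparse.csr_matrix(scipy.io.mmread('test.mtx'))

print(scipy.sparse.issparse(loaded))  # True
print((loaded != counts).nnz)         # 0, i.e. identical matrices
```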
This pull request might help with your memory problems: https://github.com/aertslab/pycisTopic/pull/77
Pull request #77 is merged into master now.
Originally posted by @klgoss in https://github.com/aertslab/pycisTopic/issues/14#issuecomment-1412586687